Error Analysis to Evaluate LLM Applications
A practical guide to identifying, categorizing, and analyzing failure modes in LLM applications using Langfuse.
To improve your LLM app, you must understand how it fails. Aggregate metrics won’t tell you if your system retrieves the wrong documents or if the model’s tone alienates users. Error analysis provides this crucial context.
The framework in this guide is adapted from Hamel Husain’s Eval FAQ.
This guide describes a four-step process to identify, categorize, and quantify your application’s unique failure modes. The result is a specific evaluation framework that is far more useful than generic metrics:
- Gather a diverse dataset of traces
- Open code to surface failure patterns
- Structure failure modes
- Label and quantify
I’ll demonstrate this process using the demo chatbot embedded in the Langfuse documentation. It is built with the Vercel AI SDK and uses a RAG tool to retrieve relevant pages from the Langfuse docs. The demo app logs its traces to the public Langfuse demo project and has answered 19k user queries in the past year.
Here’s the chat interface (you can find the demo chat app here):
1. Gather a Diverse Dataset
To start our error analysis, we assemble a representative dataset of 50-100 traces produced by the demo chat app. The quality of your analysis depends on the diversity of this initial data.
Existing Production Traces: If you already have real user traces, as in our example, create your dataset based on them. I recommend first manually clicking through your traces, focusing only on the user input, and adding a diverse set of traces to an annotation queue.
You can also query for traces with negative user feedback, long conversations, high latency, or specific user metadata. The goal is not a random sample, but a set that covers a wide range of user intents and potential edge cases.
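As a sketch of what such a query could look like programmatically, the snippet below pulls a page of recent traces from the Langfuse public API with Python and `requests`; the environment variable names and filter parameters are assumptions you would adapt to your own project:

```python
import os
import requests

# Basic auth: Langfuse public key as username, secret key as password.
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

# Pull a page of recent traces; add filters (userId, tags, timestamps, ...)
# to bias the sample toward interesting edge cases.
resp = requests.get(f"{host}/api/public/traces", auth=auth, params={"limit": 100})
resp.raise_for_status()

# Skim the user inputs and shortlist a diverse subset for the annotation queue.
for trace in resp.json()["data"]:
    print(trace["id"], str(trace.get("input"))[:80])
```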
In Langfuse, you can add traces to an annotation queue by clicking the “Add to Annotation Queue” button:
Synthetic Dataset: If you lack production data, generate a synthetic dataset covering anticipated user behaviors and potential failure points. We have a Python cookbook that shows how to do this here. Once created, add these traces to a Langfuse Annotation Queue. Note that the quality of your dataset matters a lot for the success of your error analysis; it needs to be diverse and representative of the real world.
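The cookbook covers synthetic data generation in depth; as a rough illustration, you could prompt an LLM for candidate user questions along these lines (the model name and prompt wording are placeholders, and you would still need to run the questions through your app to produce traces):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

prompt = (
    "Generate 20 realistic user questions for a documentation chatbot about an "
    "LLM observability platform. Mix simple definition questions, comparisons, "
    "troubleshooting requests, and ambiguous or off-topic queries. "
    "Return one question per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
)

questions = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
print(questions[:5])
```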
The Annotation Queue we created will serve as your workspace for the analysis. For our demo chatbot, we selected 40 traces reflecting different user questions, from simple definitions to complex comparisons:
2. Open Coding: Surface Failure Patterns
In the next step, we open our Annotation Queue and carefully review every trace and its associated tool use. The objective is to apply raw, descriptive labels without forcing them into predefined categories.
For each trace, assign two annotations:
- A binary score: Pass or Fail. This forces a clear judgment call.
- A free-text comment: Describe the first point of failure you observe. This process is called open coding, as we are not forcing any predefined categories on the data.
If you have traces with multiple errors, focusing on the first failure is efficient. A single upstream error, like incorrect document retrieval, often causes multiple downstream issues. Fixing the root cause resolves them all. Your comment should be a raw observation, not a premature diagnosis.
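Both annotations can be recorded directly in the Langfuse UI while working through the queue. If you prefer to script it, a minimal sketch against the public scores endpoint could look like this (the trace ID and comment are illustrative):

```python
import os
import requests

auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

# Attach a binary pass/fail judgment plus a free-text open-coding note to a trace.
requests.post(
    f"{host}/api/public/scores",
    auth=auth,
    json={
        "traceId": "trace-id-from-annotation-queue",  # illustrative ID
        "name": "pass_fail",
        "dataType": "BOOLEAN",
        "value": 0,  # 0 = fail, 1 = pass
        "comment": "Retrieved the self-hosting page instead of the scores API reference.",
    },
).raise_for_status()
```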
Here are some examples from our demo chat app:
3. Structure Failure Modes
After annotating all traces, the next step is to structure your free-text comments into a coherent taxonomy.
Export your comments from the Langfuse annotation queue (you can query your comments via the Langfuse API). You can use an LLM to perform an initial clustering of these notes into related themes. Then review and manually refine the LLM’s output to ensure the categories are distinct, comprehensive, and accurately reflect your application’s specific issues.
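A minimal export sketch, assuming the scores list endpoint and the `pass_fail` score name from the previous step (the endpoint path may differ across Langfuse versions):

```python
import os
import requests

auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

# Export the free-text comments attached to the pass/fail score during open coding.
resp = requests.get(
    f"{host}/api/public/scores",
    auth=auth,
    params={"name": "pass_fail", "limit": 100},
)
resp.raise_for_status()

comments = [s["comment"] for s in resp.json()["data"] if s.get("comment")]
print("\n".join(f"- {c}" for c in comments))  # paste into the clustering prompt below
```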
For our docs chatbot, we used the following prompt on our exported annotations:
You are given a list of open-ended annotations describing failures of an LLM-powered assistant that answers questions about Langfuse. Organize these into a small set of coherent failure categories, grouping similar mistakes together. For each category, provide a concise descriptive title and a one-line definition. Only cluster based on the issues in the annotations—do not invent new failure types.
This produced a clear taxonomy:
| Failure Mode | Definition |
|---|---|
| Hallucinations / Incorrect Information | The assistant gives factually wrong answers or shows a lack of knowledge about the domain. |
| Context Retrieval / RAG Issues | Failures related to retrieving or using the right documents. |
| Irrelevant or Off-Topic Responses | The assistant produces content unrelated to the user’s question. |
| Generic or Unhelpful Responses | Answers are too broad, vague, or do not directly address the user’s question. |
| Formatting / Presentation Issues | Problems with response delivery, such as missing code blocks or links. |
| Interaction Style / Missing Follow-ups | The assistant fails to ask clarifying questions or misses opportunities for guided interaction. |
4. Label and Quantify
With our failure-mode taxonomy in place, we can now label our dataset against these categories.
First, create a new Score configuration in Langfuse containing each failure mode as a boolean or categorical option. Then, re-annotate your dataset using this new, structured schema.
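As a sketch, a categorical score config covering these failure modes could be created via the score-configs endpoint roughly like this (the endpoint availability and exact payload shape depend on your Langfuse version; the category names are illustrative):

```python
import os
import requests

auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

failure_modes = [
    "hallucination_incorrect_info",
    "context_retrieval_rag_issue",
    "irrelevant_off_topic",
    "generic_unhelpful",
    "formatting_presentation",
    "interaction_style_missing_followups",
]

# One categorical score config whose categories are the failure modes from step 3.
requests.post(
    f"{host}/api/public/score-configs",
    auth=auth,
    json={
        "name": "failure_mode",
        "dataType": "CATEGORICAL",
        "categories": [{"label": m, "value": i} for i, m in enumerate(failure_modes)],
    },
).raise_for_status()
```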
This labeled dataset allows you to use Langfuse analytics to pivot and aggregate the data. You can now answer critical questions like, “What is our most frequent failure mode?” For our demo chatbot, the analysis revealed that Context Retrieval Issues were the most common problem.
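You can run the same aggregation outside the UI as well; here is a small sketch with pandas over the exported failure-mode scores (it assumes the endpoint and score name used above, and that `stringValue` holds the categorical label):

```python
import os
import requests
import pandas as pd

auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

resp = requests.get(
    f"{host}/api/public/scores",
    auth=auth,
    params={"name": "failure_mode", "limit": 100},
)
resp.raise_for_status()

df = pd.DataFrame(resp.json()["data"])
# Count how often each failure mode was labeled (stringValue assumed to hold the category label).
print(df["stringValue"].value_counts())
```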
Here are the results after labeling our dataset:
Common Pitfalls
- Generic Metrics: Avoid starting with off-the-shelf metrics like “conciseness” or “hallucinations.” Let your application’s actual failures define your evaluation criteria.
- One-and-Done Analysis: Error analysis is not a static task. As your application and user behavior evolve, so will its failure modes. Make this process a recurring part of your development cycle.
Next Steps
This error analysis produces a quantified, application-specific understanding of your primary issues. These insights provide a clear roadmap for targeted improvements, whether in your prompts, RAG pipeline, or model selection.
The structured failure modes you defined serve as the foundation for building automated evaluators, which can scale this analysis across your application. However, before setting up automated evaluators, ensure you first address the obvious issues encountered during the error analysis. You can typically go through multiple rounds of this process before reaching a plateau.
In the upcoming blog post, we will set up automated evaluators and use them to continuously improve our demo chatbot.