What does a good trace look like?
You see traces appearing in Langfuse, but how do you know whether you've set them up well? Here are a few things you can check and optimize.
What's the scope of one trace?
Langfuse's data model has three levels of grouping: observations (individual steps) are grouped into traces via a trace_id, and traces can be grouped into sessions via a session_id.
A trace represents one self-contained unit of work in your application. Good examples of a typical trace:
- One chatbot turn (user sends a message, your app retrieves context, calls the LLM, returns a response)
- One agent run (the agent receives a task, reasons, calls tools, and produces a result)
- One pipeline execution (a document comes in, gets chunked, embedded, and stored)
If multiple of these happen in sequence, e.g. a multi-turn conversation, or several agent runs that feed into a final report, that's where sessions come in. Each step is its own trace, and the session ties them together.
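As a purely illustrative sketch (plain dicts with made-up IDs, no SDK calls), the levels of grouping relate like this:

```python
# Illustrative records only; in practice the Langfuse SDK assigns and
# propagates these IDs for you. All IDs and names here are made up.
observations = [
    {"id": "obs-1", "trace_id": "trace-1", "name": "retrieve-context"},
    {"id": "obs-2", "trace_id": "trace-1", "name": "generate-response"},
    {"id": "obs-3", "trace_id": "trace-2", "name": "generate-response"},
]
traces = [
    {"id": "trace-1", "session_id": "session-1"},  # turn 1 of a conversation
    {"id": "trace-2", "session_id": "session-1"},  # turn 2, same conversation
]

# Observations group into traces via trace_id ...
steps_in_trace_1 = [o["name"] for o in observations if o["trace_id"] == "trace-1"]

# ... and traces group into sessions via session_id.
traces_in_session = [t["id"] for t in traces if t["session_id"] == "session-1"]

print(steps_in_trace_1)   # ['retrieve-context', 'generate-response']
print(traces_in_session)  # ['trace-1', 'trace-2']
```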
A trace shows up in the Langfuse UI as a trace tree and an agent graph.
Look at the trace tree
When you click on a trace, you see the trace tree. There are two things you can check:
Are the right steps showing up?
You should see your LLM calls, tool calls, and other important steps represented in the tree. They should have the correct observation type.
For example:
- An LLM call should show up as a `generation`. This is important because a `generation` can carry cost, token usage, and model information.
- A tool call should show up as a `tool`. You can then filter on tool call observations when you create LLM-as-a-judge evaluators.
Framework integrations typically set these types automatically. If you're instrumenting manually, you can set them via the `as_type` parameter (Python) or `asType` (JS/TS). See the observation types docs for a full list.
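The real SDK exposes its own decorator for this; as a runnable stand-in (not the Langfuse API), the sketch below shows the idea of tagging each step with a name and an observation type at the point where the function is defined:

```python
import functools

# Minimal stand-in for an @observe(as_type=...) style decorator.
# The actual Langfuse SDK decorator does far more (spans, IDs, I/O capture).
RECORDED = []

def observe(name, as_type="span"):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            RECORDED.append({"name": name, "type": as_type})
            return fn(*args, **kwargs)
        return inner
    return wrap

@observe(name="generate-response", as_type="generation")
def call_llm(prompt):
    return "stubbed model output"  # stand-in for a real LLM call

@observe(name="search-docs", as_type="tool")
def search(query):
    return ["doc-1"]  # stand-in for a real tool call

call_llm("hi")
search("retrieval")
print(RECORDED)
# [{'name': 'generate-response', 'type': 'generation'},
#  {'name': 'search-docs', 'type': 'tool'}]
```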
Is there noise you don't need?
Not every observation in the tree is useful for understanding what your application did. HTTP spans, database queries, and framework internals often add clutter without giving you meaningful insight. If you see observations like these polluting your trace tree, you can filter them out.
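How you actually drop such spans depends on your integration (for example, filtering in your instrumentation setup). The selection logic itself is simple; here is a sketch with made-up span names:

```python
# Hypothetical name patterns; adjust to whatever clutter you actually see.
NOISE_PATTERNS = ("HTTP ", "SELECT", "connection.")

def is_noise(observation_name: str) -> bool:
    """Heuristic: flag infrastructure spans we don't want in the trace tree."""
    return observation_name.startswith(NOISE_PATTERNS)

tree = [
    "classify-intent",
    "HTTP GET /v1/models",     # transport-level noise
    "SELECT * FROM documents", # database noise
    "generate-response",
]
kept = [name for name in tree if not is_noise(name)]
print(kept)  # ['classify-intent', 'generate-response']
```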
Choose good names
Observation and trace names are used in many places:
- When setting up LLM-as-a-judge evaluators, you target specific observations by name.
- In dashboards, you can filter and aggregate metrics by observation name.
- In the tracing table, names help you quickly identify what each step does.
Try to name the observations after what they do: classify-intent, generate-response, summarize-results. This makes it easier to understand what each step does when you're looking at the trace tree, and makes filtering on a specific step easier.
Try not to name observations after the AI model used (`gpt-4o`, `claude-sonnet`). That naming breaks as soon as you swap models, and the model is already a separate attribute on `generation` observations; use that instead.
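If you want to enforce this convention, a small lint check works; the pattern list below is made up for illustration:

```python
from typing import Optional

# Made-up model-identifier fragments; extend for the models you use.
MODEL_LIKE = ("gpt-", "claude-", "gemini-", "llama")

def name_warning(observation_name: str) -> Optional[str]:
    """Flag observation names that look like model identifiers."""
    lowered = observation_name.lower()
    if any(token in lowered for token in MODEL_LIKE):
        return f"'{observation_name}' names a model; describe the step instead"
    return None

print(name_warning("gpt-4o"))             # warning string
print(name_warning("generate-response"))  # None
```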
Choose meaningful input and output
In general, every observation should have an input and/or output. If an observation has neither, ask yourself whether it is actually useful or whether you can drop it.
For your most viewed observations, take some extra care to set them up. You will likely create pre-filtered views on your tracing and session screens. The observations you filter for here are the ones that will get looked at the most. For these, ask yourself: what do I need to see to quickly evaluate a trace/session at a glance?
Typical input/output for GENERATION observations:
- For a chatbot: the user message (input) and the assistant response (output).
- For a RAG pipeline: the user query and the generated answer.
- For a classification task: the text being classified and the predicted label.
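Concretely, these cases might look like the following (all values are invented examples, not a required schema):

```python
# Illustrative input/output payloads for generation observations.
generation_examples = {
    "chatbot": {
        "input": [{"role": "user", "content": "How do I reset my password?"}],
        "output": "Click 'Forgot password' on the login screen.",
    },
    "rag": {
        "input": "What did Q3 revenue look like?",
        "output": "Q3 revenue grew 12% quarter over quarter.",
    },
    "classification": {
        "input": "This product stopped working after two days.",
        "output": "complaint",
    },
}

# Every example sets both fields; an empty input or output is a smell.
assert all({"input", "output"} <= ex.keys() for ex in generation_examples.values())
```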
If your input and output fields are showing up empty unintentionally, see the FAQ entry "Why are the input and output of my trace empty?".
Useful attributes
Observations have a number of attributes that can be useful for your use cases. These will allow you to go even further with filtering, scoring, and making dashboards.
Add metadata for context
Metadata is a flexible key-value store on each observation. Some data that might be useful to save under metadata:
- Evaluation context: When configuring LLM-as-a-judge evaluators, you can reference metadata fields. This is useful for passing ground truth, expected behavior, or other context that the evaluator needs but that isn't part of the actual input/output.
- Filtering: You can filter by metadata keys in the Langfuse UI, which is helpful when you need to find traces with specific characteristics.
- Annotation context: When doing manual review, metadata gives annotators extra information to make better judgments.
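A metadata payload covering these three uses might look like this (keys and values are hypothetical; metadata is free-form):

```python
# Hypothetical metadata for one observation; the keys are entirely up to you.
metadata = {
    "ground_truth": "Paris",                 # evaluation context for a judge
    "expected_behavior": "answer in one sentence",
    "customer_tier": "enterprise",           # useful as a filter in the UI
    "annotator_note": "user previously reported billing issues",
}

# You would attach this at creation time, e.g. via the SDK's metadata
# parameter on the span or generation.
print(sorted(metadata))
```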
Track model, tokens, and cost on generations
If you want to understand your LLM usage costs, broken down by model, user, or feature, you need three things on your generation observations:
- Model name: Langfuse uses this to look up pricing in the model pricing table. If the model name doesn't match, Langfuse can't calculate cost automatically.
- Usage details: Input tokens, output tokens, and optionally cached tokens. This is what powers the token usage views in dashboards.
- Cost details (optional): If you want to override Langfuse's automatic pricing — for example, if you have a custom pricing agreement — you can pass cost explicitly.
Most integrations capture all of this automatically. If you're instrumenting manually, see the token and cost tracking docs.
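To see how the pieces fit together, here is the arithmetic that automatic cost calculation effectively performs, with made-up per-token prices (real prices come from the model pricing table, looked up by model name):

```python
# Made-up USD-per-token prices; Langfuse resolves real ones by model name.
PRICES = {"example-model": {"input": 0.000005, "output": 0.000015}}

usage_details = {"input": 1200, "output": 350}  # token counts from one generation

def estimate_cost(model: str, usage: dict) -> float:
    """Sum token counts times per-token prices for one generation."""
    price = PRICES[model]
    return usage["input"] * price["input"] + usage["output"] * price["output"]

cost = estimate_cost("example-model", usage_details)
print(round(cost, 6))  # 0.01125
```

This is also why a mismatched model name breaks cost tracking: the price lookup fails, so no cost can be computed, and passing cost details explicitly overrides this calculation entirely.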
You can see these attributes on the GENERATION observation in the Langfuse UI.
Use tags for business-level dimensions
Tags enable filtering and metric breakdowns across dimensions that matter to your business. Good tags answer questions like "how does latency differ between our web and API users?".
One property of tags is that they are immutable and must be set at observation creation time. This makes them great for things you know upfront (where the request came from, which feature it's part of), but not for things you learn later.
If you need to label traces based on something determined after the fact, like an LLM-as-a-judge evaluation result, use scores instead.
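The split can be pictured with an illustrative record (plain dict, invented values): tags are fixed at creation, scores arrive later.

```python
# Illustrative trace record; not the actual Langfuse data model.
trace = {
    "id": "trace-1",
    "tags": ["web", "checkout-assistant"],  # known upfront: source, feature
    "scores": [],                           # filled in after the fact
}

# Later, once an LLM-as-a-judge evaluation completes, attach a score:
trace["scores"].append({"name": "helpfulness", "value": 0.8})

print(trace["tags"], trace["scores"])
```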
Link prompts to traces
If you manage your prompts in Langfuse, you can link them to your generations. This lets you see which prompt version was used for a given trace, and track how metrics change across prompt versions. Useful when you're iterating on prompts and want to compare performance.
Set the environment
Set the environment attribute (production, staging, development) so that your test traces don't pollute production dashboards and evaluations.
Track users with user IDs
Setting the user ID connects traces to specific users, which unlocks per-user views in Langfuse. Useful if you want to answer questions like:
- Which users are costing us the most?
- How does output quality vary across users?
- What does a specific user's usage pattern look like?
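The first question is a simple aggregation once traces carry a user ID; here is a sketch over made-up trace records:

```python
from collections import defaultdict

# Made-up trace records carrying a user ID and a computed cost.
traces = [
    {"user_id": "user-a", "cost": 0.04},
    {"user_id": "user-b", "cost": 0.01},
    {"user_id": "user-a", "cost": 0.02},
]

# Sum cost per user, then find the most expensive user.
cost_per_user = defaultdict(float)
for t in traces:
    cost_per_user[t["user_id"]] += t["cost"]

most_expensive = max(cost_per_user, key=cost_per_user.get)
print(most_expensive)  # user-a
```

In practice you would not compute this by hand; the per-user views in the Langfuse UI do the equivalent aggregation for you.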
Group related traces with session IDs
If your application involves multiple traces that logically belong together, group them into a session. This gives you a session replay view where you can see the full interaction in sequence.
This makes sense when:
- You're building a chatbot (each user message creates a new trace, but the whole conversation is one session)
- You have multiple agents that each contribute to a final output (e.g., five agents that collaborate to produce a report)
- Your workflow spans multiple requests with human-in-the-loop steps in between
If your application is single-request/single-response with no continuity between calls, you probably don't need sessions.