Evaluating LLM Agents
This guide shows how we think about agent evaluations at Langfuse. Before we dive into the code, let’s establish a clear mental model for what we’re building and testing.
What is an LLM Agent?
An “agent” is more than just a single call to an LLM. It’s a system that operates in a continuous loop. This loop begins when the LLM receives an input, either from a human or as feedback from a previous step. Based on this input, the LLM decides on an “Action,” which often involves calling an external tool like a search API or a database query. This action interacts with an “Environment,” which then produces “Feedback” (like search results or data) that is fed back to the LLM.
This cycle of Call, Action, Environment, and Feedback continues until the agent decides to stop and generate a final answer. This entire sequence of events is what we call a “trace” or a “trajectory.”
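In code, the loop looks roughly like this. This is a minimal, framework-agnostic sketch; `call_llm` and `run_tool` are hypothetical stand-ins, not part of any specific library:

```python
# Illustrative agent loop. `call_llm` and `run_tool` are stubs for your model
# provider and your tools; they are not real APIs.
def call_llm(messages: list[dict]) -> dict:
    # Decide on the next action or a final answer based on the conversation so far.
    return {"type": "final", "content": "stub answer"}

def run_tool(name: str, args: dict) -> dict:
    # The "Environment": a search API, a database query, etc.
    return {"result": "stub"}

def agent_loop(user_input: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_input}]
    trajectory = []  # the recorded sequence of actions, i.e. the trace
    for _ in range(max_steps):
        decision = call_llm(messages)                                     # Call
        if decision["type"] == "final":  # the agent decides to stop
            return decision["content"], trajectory
        feedback = run_tool(decision["tool"], decision.get("args", {}))   # Action -> Environment
        trajectory.append({"tool": decision["tool"], "feedback": feedback})
        messages.append({"role": "tool", "content": str(feedback)})       # Feedback
    return "step limit reached", trajectory
```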

Why Evaluate Agents?
Evaluating these complex, multi-step trajectories is important because they can fail in several ways. We might not have given the agent clear enough instructions, or the LLM itself might fail to generalize its reasoning to new or unexpected user questions.
Common problems when working with agents
When working with agents, three problems show up again and again: understanding, specification, and generalization. You often lack understanding of what the agent actually does on real traffic—what tools it calls, where it gets stuck—because you’re not systematically inspecting traces or linking them to user feedback.
The task is frequently underspecified: prompts and examples don’t clearly encode what “good” behavior is, so the agent improvises in unpredictable ways. And even once you’ve tightened the spec, the agent may still struggle to generalize—performing well on a few handpicked examples but failing on slightly different real-world queries—unless you add systematic, dataset-based evaluations to check robustness at scale.

The 3 Phases of Evaluation
Evaluation works best as a phased process.
Phase 1: Early Development (Manual Tracing)
When you’re first building an agent, the most valuable thing you can do is simply look at the traces. Manual tracing gives you immediate insight into your agent’s reasoning.
Phase 2: First Users (Online Evaluation)
As you get your first users, you can implement user feedback mechanisms, like thumbs-up or thumbs-down buttons, to flag problematic traces for review (see the sketch after the three phases).
Phase 3: Scaling (Offline Evaluation)
The final phase, and the focus of this notebook, is creating an automated “offline evaluation dataset.” As you scale, you can’t manually review every trace. You need a “gold standard” dataset of inputs and their expected outputs or trajectories. This benchmark allows you to test your agent automatically, prevent regressions, and confidently make improvements.
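For Phase 2, user feedback can be attached to the corresponding trace as a score. A minimal sketch, assuming the Langfuse Python SDK v3 and that your application stores the trace ID when handling a request; `record_user_feedback` is a hypothetical helper name:

```python
from langfuse import get_client

langfuse = get_client()

def record_user_feedback(trace_id: str, thumbs_up: bool) -> None:
    # Attach a user-feedback score to the trace so you can filter for
    # negatively rated traces in the Langfuse UI.
    langfuse.create_score(
        trace_id=trace_id,
        name="user_feedback",
        value=1 if thumbs_up else 0,
        comment="thumbs up" if thumbs_up else "thumbs down",
    )

# e.g. called from the API endpoint that handles the thumbs-up/down click:
# record_user_feedback(trace_id="<trace-id-from-your-app>", thumbs_up=False)
```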

Our Focus: Three Offline Evaluation Strategies
In this notebook, we will focus on three practical, automated evaluation strategies that use a Langfuse dataset for experimentation.
1) Final Response (Black-Box): First is the “Final Response” evaluation. This method only cares about the user’s input and the agent’s final answer, completely ignoring the steps it took to get there. It’s flexible, but it doesn’t tell you why an agent failed.
2) Trajectory (Glass-Box): Second is the “Trajectory” evaluation. This method checks if the agent took the “correct path.” It compares the agent’s actual sequence of tool calls against the expected sequence from our benchmark dataset. This helps pinpoint exactly where in the reasoning process a failure occurred. A minimal deterministic version of this check is sketched after this list.
3) Single Step (White-Box): Third is the “Single Step” evaluation. This is the most granular test, acting like a unit test for your agent’s reasoning. Instead of running the whole agent, it tests each decision-making step in isolation to see if it produces the expected next action.
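The trajectory check from strategy 2, for instance, can be prototyped as a plain deterministic comparison before reaching for an LLM-as-a-judge. This is a sketch; like the evaluator prompt used later in this guide, it ignores the order of tool calls:

```python
def trajectory_matches(expected: list[str], actual: list[str]) -> bool:
    """Return True if the agent called exactly the expected tools, in any order."""
    return sorted(expected) == sorted(actual)

# Example
print(trajectory_matches(
    ["getLangfuseOverview", "searchLangfuseDocs"],
    ["searchLangfuseDocs", "getLangfuseOverview"],
))  # True
print(trajectory_matches(
    ["getLangfuseOverview"],
    ["getLangfuseOverview", "searchLangfuseDocs"],
))  # False
```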
Get Started
Below, we will define a sample agent, create a small benchmark dataset, and add LLM-as-a-judge evaluations in Langfuse.
Note: In this guide, we are using Pydantic AI agents, but the approach can be generalized to any other framework or tool.
Step 0: Install Packages
%pip install -q --upgrade "pydantic-ai[mcp]" langfuse openai nest_asyncio aiohttp
Step 1: Set Environment Variables
Get your Langfuse API keys from project settings.
import os
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # US region
os.environ["OPENAI_API_KEY"] = "sk-proj-..."Step 2: Enable Langfuse Tracing
Enable automatic tracing for Pydantic AI agents.
from langfuse import get_client
from pydantic_ai.agent import Agent
langfuse = get_client()
assert langfuse.auth_check(), "Langfuse auth failed - check your keys"
Agent.instrument_all()
print("✅ Pydantic AI instrumentation enabled")Step 3: Create Agent
Build an agent that searches Langfuse docs using the Langfuse Docs MCP Server.
from typing import Any
from pydantic_ai import Agent, RunContext
from pydantic_ai.mcp import MCPServerStreamableHTTP, CallToolFunc, ToolResult
LANGFUSE_MCP_URL = "https://langfuse.com/api/mcp"
async def run_agent(item, system_prompt="You are an expert on Langfuse. ", model="openai:gpt-4o-mini"):
    langfuse.update_current_trace(input=item.input)
    tool_call_history = []

    async def process_tool_call(
        ctx: RunContext[Any],
        call_tool: CallToolFunc,
        tool_name: str,
        args: dict[str, Any],
    ) -> ToolResult:
        tool_call_history.append({"tool_name": tool_name, "args": args})
        return await call_tool(tool_name, args)

    langfuse_docs_server = MCPServerStreamableHTTP(
        url=LANGFUSE_MCP_URL,
        process_tool_call=process_tool_call,
    )
    agent = Agent(
        model=model,
        system_prompt=system_prompt,
        toolsets=[langfuse_docs_server],
    )
    async with agent:
        result = await agent.run(item.input["question"])

    langfuse.update_current_trace(
        output=result.output,
        metadata={"tool_call_history": tool_call_history},
    )
    return result.output, tool_call_history
Step 4: Create Evaluation Dataset
Build a benchmark dataset with test cases. Each case includes:
- input: User question
- expected_output.response_facts: Key facts the response must contain
- expected_output.trajectory: Expected sequence of tool calls
- expected_output.search_term: Expected search query (if applicable)
test_cases = [
    {
        "input": {"question": "What is Langfuse?"},
        "expected_output": {
            "response_facts": [
                "Open Source LLM Engineering Platform",
                "Product modules: Tracing, Evaluation and Prompt Management"
            ],
            "trajectory": ["getLangfuseOverview"],
        }
    },
    {
        "input": {"question": "How to trace a python application with Langfuse?"},
        "expected_output": {
            "response_facts": [
                "Python SDK, you can use the observe() decorator",
                "Lots of integrations, LangChain, LlamaIndex, Pydantic AI, and many more."
            ],
            "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
            "search_term": "Python Tracing"
        }
    },
    {
        "input": {"question": "How to connect to the Langfuse Docs MCP server?"},
        "expected_output": {
            "response_facts": [
                "Connect via the MCP server endpoint: https://langfuse.com/api/mcp",
                "Transport protocol: `streamableHttp`"
            ],
            "trajectory": ["getLangfuseOverview"]
        }
    },
    {
        "input": {"question": "How long are traces retained in langfuse?"},
        "expected_output": {
            "response_facts": [
                "By default, traces are retained indefinitely",
                "You can set custom data retention policy in the project settings"
            ],
            "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
            "search_term": "Data retention"
        }
    }
]
DATASET_NAME = "pydantic-ai-mcp-agent-evaluation"
dataset = langfuse.create_dataset(name=DATASET_NAME)
for case in test_cases:
    langfuse.create_dataset_item(
        dataset_name=DATASET_NAME,
        input=case["input"],
        expected_output=case["expected_output"]
    )
Step 5: Set Up Evaluators
Create three evaluators in the Langfuse UI. Each tests a different aspect of agent behavior. The Langfuse documentation describes how to set them up.
1. Final Response Evaluation (Black Box)
Tests output quality. Works regardless of internal implementation.

Prompt template:
You are a teacher grading a student based on the factual correctness of their statements.
### Examples
#### Example 1:
- Response: "The sun is shining brightly."
- Facts to verify: ["The sun is up.", "It is a beautiful day."]
- Reasoning: The response includes both facts.
- Score: 1
#### Example 2:
- Response: "When I was in the kitchen, the dog was there"
- Facts to verify: ["The cat is on the table.", "The dog is in the kitchen."]
- Reasoning: The response mentions the dog but not the cat.
- Score: 0
### New Student Response
- Response: {{response}}
- Facts to verify: {{facts_to_verify}}
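If you want to prototype this judge locally before configuring it in the UI, here is a minimal sketch using the OpenAI Python SDK; the judge model name and the single-digit output format are assumptions, not part of the Langfuse setup:

```python
from openai import OpenAI

client = OpenAI()

def judge_response_facts(response: str, facts_to_verify: list[str]) -> int:
    """LLM-as-a-judge: 1 if the response contains all expected facts, else 0."""
    prompt = (
        "You are a teacher grading a student based on the factual correctness of their statements.\n"
        f"- Response: {response}\n"
        f"- Facts to verify: {facts_to_verify}\n"
        "Answer with a single digit: 1 if the response contains all facts, otherwise 0."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
    )
    answer = completion.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0
```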
2. Trajectory Evaluation (Glass Box)
Verifies the agent used the correct sequence of tools.

Prompt template:
You are comparing two lists of strings. Check whether the lists contain exactly the same items. Order does not matter.
## Examples
Expected: ["searchWeb", "visitWebsite"]
Output: ["searchWeb"]
Reasoning: Output missing "visitWebsite".
Score: 0
Expected: ["drawImage", "visitWebsite", "speak"]
Output: ["visitWebsite", "speak", "drawImage"]
Reasoning: Output matches expected items.
Score: 1
Expected: ["getNews"]
Output: ["getNews", "watchTv"]
Reasoning: Output contains unexpected "watchTv".
Score: 0
## This Exercise
Expected: {{expected}}
Output: {{output}}
3. Search Quality Evaluation
Validates search query quality when agents search documentation.

Prompt template:
You are grading whether a student searched for the right information. The search term should correspond vaguely with the expected term.
### Examples
Response: "How can I contact support?"
Expected search topics: Support
Reasoning: Response searches for support.
Score: 1
Response: "Deployment"
Expected search topics: Tracing
Reasoning: Response doesn't match expected topic.
Score: 0
Response: (empty)
Expected search topics: (empty)
Reasoning: No search expected, no search done.
Score: 1
### New Student Response
Response: {{search}}
Expected search topics: {{expected_search_topic}}
Create these evaluators in the Langfuse UI under Prompts → Create Evaluator.
Step 6: Run Experiments
Run the agent on your dataset. In the next step, you can compare different models and prompts to find the best configuration.
dataset = langfuse.get_dataset(DATASET_NAME)
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=run_agent
)
print(result.format())
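You can also complement the UI-based LLM-as-a-judge evaluators with programmatic checks. The sketch below assumes that `run_experiment` in the v3 Python SDK accepts an `evaluators` list of functions that receive `input`, `output`, and `expected_output` keyword arguments and return an `Evaluation` object; check the Langfuse experiments documentation for the exact signature:

```python
# Assumption: `evaluators` parameter and `Evaluation` type as described above.
from langfuse import Evaluation

def trajectory_evaluator(*, input, output, expected_output, **kwargs):
    # `output` is whatever run_agent returned: (final_answer, tool_call_history)
    _, tool_call_history = output
    actual = [call["tool_name"] for call in tool_call_history]
    expected = expected_output.get("trajectory", [])
    matched = sorted(actual) == sorted(expected)
    return Evaluation(name="trajectory_exact_match", value=1.0 if matched else 0.0)

result = dataset.run_experiment(
    name="Production Model Test (code evaluator)",
    description="Deterministic trajectory check alongside the LLM judges",
    task=run_agent,
    evaluators=[trajectory_evaluator],
)
print(result.format())
```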
Step 7: Compare Multiple Configurations
Test different prompts and models to find the best configuration.
from functools import partial
system_prompts = {
    "simple": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Cite sources when appropriate."
    ),
    "nudge_search": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Always cite sources when appropriate. "
        "When unsure, use getLangfuseOverview then search the docs. You can use these tools multiple times."
    )
}
models = ["openai:gpt-5-mini", "openai:gpt-5-nano"]

dataset = langfuse.get_dataset(DATASET_NAME)

for prompt_name, prompt_content in system_prompts.items():
    for test_model in models:
        task = partial(
            run_agent,
            system_prompt=prompt_content,
            model=test_model,
        )
        result = dataset.run_experiment(
            name=f"Test: {prompt_name} {test_model}",
            description="Comparing prompts and models",
            task=task
        )
        print(result.format())