AI Agent Observability, Tracing & Evaluation with Langfuse
Trace, monitor, evaluate, and test AI agents in production. Learn about agent observability strategies, evaluation techniques, and how to use Langfuse with LangGraph, OpenAI Agents, Pydantic AI, CrewAI, n8n, and more.
What are AI Agents?
An AI agent is a system that autonomously performs tasks by planning its execution and using available tools. AI agents leverage large language models (LLMs) to understand user inputs, respond step by step, and decide when to call external tools.
To solve tasks, agents use:
- planning to break the given task into step-by-step actions
- tools to extend their capabilities, such as RAG, external APIs, or code interpretation/execution
- memory to store and recall past interactions as additional context
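The planning, tool, and memory components above can be sketched in a few lines of framework-free Python. Everything here (`plan`, `TOOLS`, `Memory`, `run_agent`) is illustrative, not a real library API; in a real agent, the planner would be an LLM call.

```python
def plan(task: str) -> list[dict]:
    """Stub planner: a real agent would ask an LLM to produce these steps."""
    return [
        {"tool": "search", "input": task},
        {"tool": "summarize", "input": "search results"},
    ]

# Tool registry: the action module maps tool names to callables.
TOOLS = {
    "search": lambda q: f"results for '{q}'",
    "summarize": lambda text: f"summary of {text}",
}

class Memory:
    """Memory module: stores each step and its result for later recall."""
    def __init__(self):
        self.events = []

    def record(self, step: dict, result: str):
        self.events.append({"step": step, "result": result})

def run_agent(task: str) -> str:
    memory = Memory()
    result = ""
    for step in plan(task):            # planning module
        result = TOOLS[step["tool"]](step["input"])  # action module
        memory.record(step, result)    # memory module
    return result

print(run_agent("find recent LLM papers"))  # → summary of search results
```

Real frameworks add an LLM-driven loop that re-plans after each tool result; the control flow, however, stays recognizably plan, act, remember.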
What are Agents Used For?
Common use cases include:
- Customer Support: AI agents use RAG to automate responses, autonomously take action and efficiently handle inquiries with accurate information.
- Market Research: Agents collect and synthesize information from various sources, delivering accurate and concise summaries to users.
- Software Development: AI agents break coding tasks into smaller sub-tasks and then recombine them to create a complete solution.
Design Patterns of AI Agents
An AI agent usually consists of five parts: a language model with general-purpose capabilities that serves as the main brain or coordinator, and four sub-modules: a planning module that divides the task into smaller steps, an action module that enables the agent to use external tools, a memory module that stores and recalls past interactions, and a profile module that describes the agent's behavior.

In single-agent setups, one agent is responsible for solving the entire task autonomously. In multi-agent setups, multiple specialized agents collaborate, each handling different aspects of the task to achieve a common goal more efficiently. These agents are also often referred to as state-based or stateful agents as they route the task through different states.
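A stateful multi-agent setup can be sketched as a small state machine in which each state is handled by a specialized agent. The agent functions and state names below are hypothetical stand-ins for LLM-backed workers.

```python
# Each agent mutates the shared state and returns the name of the next
# state to route to (None terminates the run).

def researcher(state: dict) -> str:
    state["notes"] = f"notes on {state['task']}"
    return "writer"

def writer(state: dict) -> str:
    state["draft"] = f"draft based on {state['notes']}"
    return "reviewer"

def reviewer(state: dict):
    state["final"] = state["draft"] + " (approved)"
    return None  # terminal state

AGENTS = {"researcher": researcher, "writer": writer, "reviewer": reviewer}

def run_graph(task: str) -> dict:
    state = {"task": task}
    current = "researcher"
    while current is not None:
        current = AGENTS[current](state)
    return state

result = run_graph("agent observability")
print(result["final"])  # → draft based on notes on agent observability (approved)
```

Frameworks like LangGraph formalize exactly this pattern: nodes that read and write shared state, plus edges that decide where the task flows next.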

What is AI Agent Observability?
Observing agents means tracking and analyzing the performance, behavior, and interactions of AI agents. This includes real-time monitoring of multiple LLM calls, control flows, decision-making processes, and outputs to ensure agents operate efficiently and accurately.
Langfuse is an open-source LLM engineering platform that provides deep insights into metrics such as latency, cost, and error rates, enabling developers to debug, optimize, and enhance their AI systems. Using Langfuse observability, teams can identify and resolve issues, streamline workflows, and maintain high-quality outputs by evaluating agent responses in complex, multi-step AI agents.
Industry Trends in Agent Observability
As AI agents become more prevalent in production, the observability landscape is evolving rapidly. The industry is converging on OpenTelemetry (OTEL) as a standard for collecting agent telemetry data, preventing vendor lock-in and enabling interoperability across frameworks. Many agent frameworks — including Pydantic AI, smolagents, and Strands Agents — now emit traces via OpenTelemetry, which Langfuse natively supports.
There is also a shift from reactive log-based monitoring to proactive, structured tracing with typed observation data. Rather than parsing unstructured logs after failures, teams now instrument agents with rich semantic types (tool calls, retriever steps, guardrail checks) for real-time insight into agent behavior.
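The difference between unstructured logs and typed observations can be seen in a short sketch. The observation classes below are illustrative; observability platforms such as Langfuse define their own richer schemas (spans, generations, events).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ToolCall:
    name: str
    input: str
    output: str
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class GuardrailCheck:
    rule: str
    passed: bool

# A trace as a list of typed observations instead of free-form log lines.
trace = [
    ToolCall(name="web_search", input="langfuse docs", output="3 results"),
    GuardrailCheck(rule="no_pii", passed=True),
]

# Structured data can be queried directly, with no log-string parsing:
failed = [o for o in trace if isinstance(o, GuardrailCheck) and not o.passed]
print(len(failed))  # → 0
```

Because each observation carries a type and named fields, filtering for failed guardrails or slow tool calls becomes a query rather than a regex.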
Additionally, cost optimization is becoming critical as agent workloads scale. Agents that autonomously chain multiple LLM and API calls can incur unpredictable costs, making real-time cost tracking and per-trace cost attribution essential for production deployments.
Why AI Agent Observability is Important
Debugging and Edge Cases
Agents use multiple steps to solve complex tasks, and inaccurate intermediary results can cause failures of the entire system. Tracing these intermediate steps and testing your application on known edge cases is essential.
When deploying LLMs, some edge cases will always slip through in initial testing. A proper analytics set-up helps identify these cases, allowing you to add them to future test sets for more robust agent evaluations. With Datasets, Langfuse allows you to collect examples of inputs and expected outputs to benchmark new releases before deployment. Datasets can be incrementally updated with new edge cases found in production and integrated with existing CI/CD pipelines.
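The dataset-driven workflow above can be sketched as a simple CI check: pairs of inputs and expected outputs, collected from production edge cases, run against each release. The `agent` stub and scoring function here are illustrative, not the Langfuse Datasets API.

```python
# Hypothetical benchmark dataset grown from production edge cases.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(prompt: str) -> str:
    """Stand-in for the application under test."""
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_experiment(dataset: list[dict], app) -> float:
    """Return the fraction of dataset items the app answers correctly."""
    passed = sum(app(item["input"]) == item["expected"] for item in dataset)
    return passed / len(dataset)

score = run_experiment(DATASET, agent)
assert score == 1.0  # fail the CI build on regressions
```

In practice the comparison is rarely exact string equality; LLM-as-a-judge or semantic similarity scoring fills that role, but the release gate works the same way.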
Tradeoff of Accuracy and Costs
LLMs are stochastic by nature: as statistical processes, they can produce errors or hallucinations. Calling a language model multiple times and selecting the best or most common answer can increase accuracy. This can be a major advantage of using agentic workflows.
However, this comes with a cost. The tradeoff between accuracy and costs in LLM-based agents is crucial, as higher accuracy often leads to increased operational expenses. Often, the agent decides autonomously how many LLM calls or paid external API calls it needs to make to solve a task, potentially leading to high costs for single-task executions. Therefore, it is important to monitor model usage and costs in real-time.
Langfuse monitors both costs and accuracy, enabling you to optimize your application for production.
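The accuracy/cost tradeoff can be made concrete with a self-consistency sketch: sample the model several times, keep the most common answer, and pay for every call. `fake_llm` and the per-call price are illustrative stand-ins.

```python
from collections import Counter

def fake_llm(prompt: str, seed: int) -> str:
    """Stand-in for a stochastic model: one sample in three is wrong."""
    return "42" if seed % 3 else "41"

PRICE_PER_CALL = 0.002  # hypothetical cost per call in USD

def self_consistent_answer(prompt: str, samples: int):
    """Majority vote over several samples; returns (answer, total cost)."""
    answers = [fake_llm(prompt, seed) for seed in range(samples)]
    best, _ = Counter(answers).most_common(1)[0]
    return best, samples * PRICE_PER_CALL

answer, cost = self_consistent_answer("meaning of life?", samples=5)
print(answer, cost)
```

Five samples recover the right answer here despite a third of calls being wrong, but at five times the cost of a single call; this is exactly the kind of per-trace cost attribution worth monitoring in production.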
Understanding User Interactions
AI agent analytics allows you to capture how users interact with your LLM applications. This information is crucial for refining your AI application and tailoring responses to better meet user needs.
Langfuse Analytics derives insights from production data, helping you measure quality through user feedback and model-based scoring over time and across different versions. It also allows you to monitor cost and latency metrics in real-time, broken down by user, session, geography, and model version, enabling precise optimizations for your LLM application.
Tools to build AI Agents
You do not need any specific tools to build AI agents. However, there are several open-source frameworks that can help you build complex, stateful, multi-agent applications.
Application Frameworks
LangGraph
LangGraph (GitHub) is an open-source framework by the LangChain team for building complex, stateful, multi-agent applications. LangGraph includes built-in persistence to save and resume state, which enables error recovery and human-in-the-loop workflows.
LangGraph agents can be monitored with Langfuse to observe and debug the steps of an agent.
Llama Agents
Llama Agents (GitHub) is an open-source framework designed to simplify building, iterating on, and deploying multi-agent AI systems, turning your agents into production microservices.
Langfuse offers a simple integration for automatic capture of traces and metrics generated in LlamaIndex applications.

OpenAI Agents SDK
OpenAI Agents SDK provides a simple yet powerful framework for building and orchestrating AI agents. By instrumenting the SDK with Langfuse, you can capture detailed traces of agent execution, including planning, function calls, and multi-agent handoffs. This integration enables you to monitor performance metrics, trace issues in real time, and optimize your workflows effectively.
For a comprehensive guide on setting up this integration, please refer to our Trace the OpenAI Agents SDK with Langfuse notebook.
Hugging Face smolagents
Hugging Face smolagents is a minimalist framework for building AI agents. With the Langfuse integration, you can effortlessly capture and visualize telemetry data from your agents. By initializing the SmolagentsInstrumentor, your agent interactions are traced using OpenTelemetry and displayed in Langfuse, enabling you to debug and optimize decision-making processes.
For a comprehensive, step-by-step guide, see our integration notebook: Observability for smolagents with Langfuse.

Pydantic AI
Pydantic AI brings Pydantic’s type safety and ergonomic developer experience to agent development. You define your agent’s inputs, tool signatures, and outputs as Python types, and the framework handles validation plus OpenTelemetry instrumentation under the hood. The result is a FastAPI-style developer experience for building production-ready agents.
For a step-by-step guide, see our integration notebook: Trace Pydantic AI agents with Langfuse.

CrewAI
CrewAI is all about role-based collaboration among multiple agents. You assign each agent a distinct skillset or role, then let them cooperate to solve a problem. The framework offers a higher-level abstraction called a “Crew” that coordinates workflows, allowing agents to share context and build upon one another’s contributions. It is well-suited for tasks requiring multiple specialists working in parallel.
For setup instructions, see our integration guide: Trace CrewAI agents with Langfuse.

AutoGen
AutoGen, from Microsoft Research, frames agent interactions as asynchronous conversations among specialized agents. Each agent can be a ChatGPT-style assistant or a tool executor, and you orchestrate how they pass messages back and forth. This event-driven approach reduces blocking and is well-suited for longer tasks or scenarios requiring real-time concurrency.
For tracing setup, see: Trace AutoGen agents with Langfuse.

Strands Agents
Strands Agents SDK is a model-agnostic agent framework that runs anywhere and supports multiple model providers including Amazon Bedrock, Anthropic, OpenAI, and Ollama via LiteLLM. It emphasizes production readiness with first-class OpenTelemetry tracing, giving you end-to-end observability with a clean, declarative API for defining agent behavior.
For setup instructions, see: Trace Strands Agents with Langfuse.

Semantic Kernel
Semantic Kernel is Microsoft’s approach to orchestrating AI “skills” and combining them into workflows. It supports multiple programming languages (C#, Python, Java) and focuses on enterprise readiness, including security, compliance, and Azure integration. You can create a range of skills — some powered by AI, others by pure code — and compose them into multi-step plans.
For tracing setup, see: Trace Semantic Kernel with Langfuse.

No-code Agent Builders
For prototypes and development by non-developers, no-code builders can be a great starting point.
Flowise
Flowise (GitHub) is a no-code builder that lets you create customized LLM flows with a drag-and-drop editor. With the native Langfuse integration, you can use Flowise to quickly build complex LLM applications without code and then use Langfuse to analyze and improve them.

Example of a catalog chatbot created in Flowise to answer any questions related to shop products.
Langflow
Langflow (GitHub) is a UI for LangChain, designed with react-flow to provide an effortless way to experiment and prototype flows.
With the native integration, you can use Langflow to quickly create complex LLM applications without code and then use Langfuse to monitor and debug them.

Example of a chat agent with chain-of-thought reasoning built in Langflow by Cobus Greyling.
Dify
Dify (GitHub) is an open-source LLM app development platform. Using its Agent Builder and a variety of templates, you can easily build an AI agent and then grow it into a more complex system via Dify workflows.
With the native Langfuse integration, you can use Dify to quickly create complex LLM applications and then use Langfuse to monitor and improve them.

Agent Evaluation and Testing
Building an agent is only the first step. Agents can fail in nuanced ways — selecting the wrong tool, entering reasoning loops, or hallucinating in intermediate steps that produce a plausible-looking but incorrect final answer. To ship agents with confidence, you need a systematic approach to evaluation and testing.
Why Evaluate Agents?
When working with agents, three problems show up again and again: understanding, specification, and generalization. You often lack understanding of what the agent actually does on real traffic because you are not systematically inspecting traces. The task is frequently underspecified — prompts and examples don’t clearly encode what “good” behavior is, so the agent improvises in unpredictable ways. And even once you have tightened the spec, the agent may still struggle to generalize, performing well on handpicked examples but failing on slightly different real-world queries.
Three Evaluation Strategies
Langfuse supports three complementary strategies for evaluating agents:
- Final Response (Black-Box): This method only looks at the user’s input and the agent’s final answer, ignoring the intermediate steps. It is flexible and easy to set up, but does not tell you why an agent failed.
- Trajectory (Glass-Box): This strategy evaluates the full sequence of tool calls, reasoning steps, and decisions an agent made. You compare the actual trajectory against an expected one, catching issues like unnecessary tool calls, skipped steps, or inefficient reasoning paths.
- Single Step (White-Box): This zooms in on individual steps within the agent’s execution, evaluating whether each tool call returned the right result or each reasoning step was sound. It provides the most granular feedback for debugging specific failures.
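The trajectory strategy can be sketched as a comparison between the expected and actual tool-call sequences. The position-wise scoring rule below is one simple, illustrative choice; real evaluators often use more forgiving matching or an LLM judge.

```python
def trajectory_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of positions where the actual tool call matches the
    expected one, penalizing extra or missing calls via the denominator."""
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / max(len(expected), len(actual))

# Hypothetical travel-booking agent: it repeated the search step.
expected = ["search_flights", "check_visa", "book_flight"]
actual = ["search_flights", "search_flights", "check_visa", "book_flight"]

print(round(trajectory_score(expected, actual), 2))  # → 0.25
```

Note how strict position-wise matching heavily penalizes the single duplicated call, since every later step shifts out of place; that sensitivity is a deliberate tradeoff of this scoring rule.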
Three Phases of Evaluation
The evaluation process follows a natural progression as your application matures:
- Phase 1 — Manual Tracing: During early development, the most valuable activity is simply inspecting traces in Langfuse to understand your agent’s reasoning.
- Phase 2 — Online Evaluation: As you get your first users, implement user feedback mechanisms and automated LLM-as-a-Judge evaluators to flag problematic traces in real-time.
- Phase 3 — Offline Evaluation: At scale, create benchmark datasets of inputs and expected outputs, then run automated experiments to test your agent before each release, preventing regressions and enabling confident iteration.
Agent Evaluation Guides
To dive deeper, explore these hands-on guides:
- Agent Evaluation Guide — End-to-end walkthrough of all three evaluation strategies using Pydantic AI agents
- Evaluating OpenAI Agents — Online and offline evaluation for OpenAI Agents SDK
- LangGraph Agent Evaluation — Monitoring and evaluating LangGraph agents
- Synthetic Dataset Generation — Scale test coverage with LLM-generated data for agent evaluation
- Testing LLM Applications — Build a testing foundation with deterministic checks and LLM judges
Get Started
If you want to get started with building AI agents and monitoring them with Langfuse, here are the best places to begin:
- Build and trace an agent: Follow our end-to-end example of building a simple agent with LangGraph and tracking it with Langfuse.
- Compare agent frameworks: Read our AI Agent Comparison blog post for an in-depth guide on when to use which framework.
- Evaluate your agents: Start with the Agent Evaluation Guide to set up black-box, trajectory, and step-level evaluations.
- Explore all integrations: Browse the full list of supported integrations to find the right setup for your stack.