AI Agent Observability, Tracing & Evaluation with Langfuse
Trace, monitor, evaluate, and test AI agents in production. Learn about agent observability strategies, evaluation techniques, and how to use Langfuse with LangGraph, OpenAI Agents, Pydantic AI, CrewAI, n8n, and more.
What are AI Agents?
An AI agent is a system that autonomously performs tasks by planning its execution and using available tools. AI agents leverage large language models (LLMs) to understand user inputs, respond step by step, and decide when to call external tools.
To solve tasks, agents use:
- planning to break the given task into step-by-step actions
- tools to extend their capabilities, such as RAG, external APIs, or code interpretation/execution
- memory to store and recall past interactions as additional context
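The planning, tool, and memory components above can be sketched in a few lines of framework-free Python. Everything here (`plan`, `TOOLS`, `Memory`, `run_agent`) is illustrative, not a real library API; in a real agent, the planner would be an LLM call.

```python
def plan(task: str) -> list[dict]:
    """Stub planner: a real agent would ask an LLM to produce these steps."""
    return [
        {"tool": "search", "input": task},
        {"tool": "summarize", "input": "search results"},
    ]

# Tool registry: the action module maps tool names to callables.
TOOLS = {
    "search": lambda q: f"results for '{q}'",
    "summarize": lambda text: f"summary of {text}",
}

class Memory:
    """Memory module: stores each step and its result for later recall."""
    def __init__(self):
        self.events = []

    def record(self, step: dict, result: str):
        self.events.append({"step": step, "result": result})

def run_agent(task: str) -> str:
    memory = Memory()
    result = ""
    for step in plan(task):            # planning module
        result = TOOLS[step["tool"]](step["input"])  # action module
        memory.record(step, result)    # memory module
    return result

print(run_agent("find recent LLM papers"))  # → summary of search results
```

Real frameworks add an LLM-driven loop that re-plans after each tool result; the control flow, however, stays recognizably plan, act, remember.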
What are Agents Used For?
Common use cases include:
- Customer Support: AI agents use RAG to automate responses, autonomously take action and efficiently handle inquiries with accurate information.
- Market Research: Agents collect and synthesize information from various sources, delivering accurate and concise summaries to users.
- Software Development: AI agents break coding tasks into smaller sub-tasks and then recombine them to create a complete solution.
Design Patterns of AI Agents
An AI agent usually consists of five parts: a language model with general-purpose capabilities that serves as the main brain or coordinator, and four sub-modules: a planning module that divides the task into smaller steps, an action module that enables the agent to use external tools, a memory module that stores and recalls past interactions, and a profile module that describes the agent's behavior.

In single-agent setups, one agent is responsible for solving the entire task autonomously. In multi-agent setups, multiple specialized agents collaborate, each handling different aspects of the task to achieve a common goal more efficiently. These agents are also often referred to as state-based or stateful agents as they route the task through different states.
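A stateful multi-agent setup can be sketched as a small state machine in which each state is handled by a specialized agent. The agent functions and state names below are hypothetical stand-ins for LLM-backed workers.

```python
# Each agent mutates the shared state and returns the name of the next
# state to route to (None terminates the run).

def researcher(state: dict) -> str:
    state["notes"] = f"notes on {state['task']}"
    return "writer"

def writer(state: dict) -> str:
    state["draft"] = f"draft based on {state['notes']}"
    return "reviewer"

def reviewer(state: dict):
    state["final"] = state["draft"] + " (approved)"
    return None  # terminal state

AGENTS = {"researcher": researcher, "writer": writer, "reviewer": reviewer}

def run_graph(task: str) -> dict:
    state = {"task": task}
    current = "researcher"
    while current is not None:
        current = AGENTS[current](state)
    return state

result = run_graph("agent observability")
print(result["final"])  # → draft based on notes on agent observability (approved)
```

Frameworks like LangGraph formalize exactly this pattern: nodes that read and write shared state, plus edges that decide where the task flows next.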

What is AI Agent Observability?
Observing agents means tracking and analyzing the performance, behavior, and interactions of AI agents. This includes real-time monitoring of multiple LLM calls, control flows, decision-making processes, and outputs to ensure agents operate efficiently and accurately.
Langfuse is an open-source LLM engineering platform that provides deep insights into metrics such as latency, cost, and error rates, enabling developers to debug, optimize, and enhance their AI systems. Using Langfuse observability, teams can identify and resolve issues, streamline workflows, and maintain high-quality outputs by evaluating agent responses in complex, multi-step AI agents.
Industry Trends in Agent Observability
As AI agents become more prevalent in production, the observability landscape is evolving rapidly. The industry is converging on OpenTelemetry (OTEL) as a standard for collecting agent telemetry data, preventing vendor lock-in and enabling interoperability across frameworks. Many agent frameworks — including Pydantic AI, smolagents, and Strands Agents — now emit traces via OpenTelemetry, which Langfuse natively supports.
There is also a shift from reactive log-based monitoring to proactive, structured tracing with typed observation data. Rather than parsing unstructured logs after failures, teams now instrument agents with rich semantic types (tool calls, retriever steps, guardrail checks) for real-time insight into agent behavior.
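The difference between unstructured logs and typed observations can be seen in a short sketch. The observation classes below are illustrative; observability platforms such as Langfuse define their own richer schemas (spans, generations, events).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ToolCall:
    name: str
    input: str
    output: str
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class GuardrailCheck:
    rule: str
    passed: bool

# A trace as a list of typed observations instead of free-form log lines.
trace = [
    ToolCall(name="web_search", input="langfuse docs", output="3 results"),
    GuardrailCheck(rule="no_pii", passed=True),
]

# Structured data can be queried directly, with no log-string parsing:
failed = [o for o in trace if isinstance(o, GuardrailCheck) and not o.passed]
print(len(failed))  # → 0
```

Because each observation carries a type and named fields, filtering for failed guardrails or slow tool calls becomes a query rather than a regex.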
Additionally, cost optimization is becoming critical as agent workloads scale. Agents that autonomously chain multiple LLM and API calls can incur unpredictable costs, making real-time cost tracking and per-trace cost attribution essential for production deployments.
Why AI Agent Observability is Important
Debugging and Edge Cases
Agents use multiple steps to solve complex tasks, and inaccurate intermediary results can cause failures of the entire system. Tracing these intermediate steps and testing your application on known edge cases is essential.
When deploying LLMs, some edge cases will always slip through in initial testing. A proper analytics set-up helps identify these cases, allowing you to add them to future test sets for more robust agent evaluations. With Datasets, Langfuse allows you to collect examples of inputs and expected outputs to benchmark new releases before deployment. Datasets can be incrementally updated with new edge cases found in production and integrated with existing CI/CD pipelines.
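The dataset-driven workflow above can be sketched as a simple CI check: pairs of inputs and expected outputs, collected from production edge cases, run against each release. The `agent` stub and scoring function here are illustrative, not the Langfuse Datasets API.

```python
# Hypothetical benchmark dataset grown from production edge cases.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(prompt: str) -> str:
    """Stand-in for the application under test."""
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_experiment(dataset: list[dict], app) -> float:
    """Return the fraction of dataset items the app answers correctly."""
    passed = sum(app(item["input"]) == item["expected"] for item in dataset)
    return passed / len(dataset)

score = run_experiment(DATASET, agent)
assert score == 1.0  # fail the CI build on regressions
```

In practice the comparison is rarely exact string equality; LLM-as-a-judge or semantic similarity scoring fills that role, but the release gate works the same way.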
Tradeoff of Accuracy and Costs
LLMs are stochastic by nature: as statistical processes, they can produce errors or hallucinations. Calling a language model multiple times and selecting the best or most common answer can increase accuracy. This can be a major advantage of using agentic workflows.
However, this comes with a cost. The tradeoff between accuracy and costs in LLM-based agents is crucial, as higher accuracy often leads to increased operational expenses. Often, the agent decides autonomously how many LLM calls or paid external API calls it needs to make to solve a task, potentially leading to high costs for single-task executions. Therefore, it is important to monitor model usage and costs in real-time.
Langfuse monitors both costs and accuracy, enabling you to optimize your application for production.
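The accuracy/cost tradeoff can be made concrete with a self-consistency sketch: sample the model several times, keep the most common answer, and pay for every call. `fake_llm` and the per-call price are illustrative stand-ins.

```python
from collections import Counter

def fake_llm(prompt: str, seed: int) -> str:
    """Stand-in for a stochastic model: one sample in three is wrong."""
    return "42" if seed % 3 else "41"

PRICE_PER_CALL = 0.002  # hypothetical cost per call in USD

def self_consistent_answer(prompt: str, samples: int):
    """Majority vote over several samples; returns (answer, total cost)."""
    answers = [fake_llm(prompt, seed) for seed in range(samples)]
    best, _ = Counter(answers).most_common(1)[0]
    return best, samples * PRICE_PER_CALL

answer, cost = self_consistent_answer("meaning of life?", samples=5)
print(answer, cost)
```

Five samples recover the right answer here despite a third of calls being wrong, but at five times the cost of a single call; this is exactly the kind of per-trace cost attribution worth monitoring in production.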
Understanding User Interactions
AI agent analytics allows you to capture how users interact with your LLM applications. This information is crucial for refining your AI application and tailoring responses to better meet user needs.
Langfuse Analytics derives insights from production data, helping you measure quality through user feedback and model-based scoring over time and across different versions. It also allows you to monitor cost and latency metrics in real-time, broken down by user, session, geography, and model version, enabling precise optimizations for your LLM application.
Tools to build AI Agents
You do not need any specific tools to build AI agents. However, there are several open-source frameworks that can help you build complex, stateful, multi-agent applications.
Application Frameworks
LangGraph
LangGraph (GitHub) is an open-source framework by the LangChain team for building complex, stateful, multi-agent applications. LangGraph includes built-in persistence to save and resume state, which enables error recovery and human-in-the-loop workflows.
LangGraph agents can be monitored with Langfuse to observe and debug the steps of an agent.
Llama Agents
Llama Agents (GitHub) is an open-source framework designed to simplify building, iterating on, and deploying multi-agent AI systems, turning your agents into production microservices.
Langfuse offers a simple integration for automatic capture of traces and metrics generated in LlamaIndex applications.

OpenAI Agents SDK
OpenAI Agents SDK provides a simple yet powerful framework for building and orchestrating AI agents. By instrumenting the SDK with Langfuse, you can capture detailed traces of agent execution, including planning, function calls, and multi-agent handoffs. This integration enables you to monitor performance metrics, trace issues in real time, and optimize your workflows effectively.
For a comprehensive guide on setting up this integration, please refer to our Trace the OpenAI Agents SDK with Langfuse notebook.
Hugging Face smolagents
Hugging Face smolagents is a minimalist framework for building AI agents. With the Langfuse integration, you can effortlessly capture and visualize telemetry data from your agents. By initializing the SmolagentsInstrumentor, your agent interactions are traced using OpenTelemetry and displayed in Langfuse, enabling you to debug and optimize decision-making processes.
For a comprehensive, step-by-step guide, see our integration notebook: Observability for smolagents with Langfuse.

Pydantic AI
Pydantic AI brings Pydantic’s type safety and ergonomic developer experience to agent development. You define your agent’s inputs, tool signatures, and outputs as Python types, and the framework handles validation plus OpenTelemetry instrumentation under the hood. The result is a FastAPI-style developer experience for building production-ready agents.
For a step-by-step guide, see our integration notebook: Trace Pydantic AI agents with Langfuse.

CrewAI
CrewAI is all about role-based collaboration among multiple agents. You assign each agent a distinct skillset or role, then let them cooperate to solve a problem. The framework offers a higher-level abstraction called a “Crew” that coordinates workflows, allowing agents to share context and build upon one another’s contributions. It is well-suited for tasks requiring multiple specialists working in parallel.
For setup instructions, see our integration guide: Trace CrewAI agents with Langfuse.

AutoGen
AutoGen, from Microsoft Research, frames agent interactions as asynchronous conversations among specialized agents. Each agent can be a ChatGPT-style assistant or a tool executor, and you orchestrate how they pass messages back and forth. This event-driven approach reduces blocking and is well-suited for longer tasks or scenarios requiring real-time concurrency.
For tracing setup, see: Trace AutoGen agents with Langfuse.

Strands Agents
Strands Agents SDK is a model-agnostic agent framework that runs anywhere and supports multiple model providers including Amazon Bedrock, Anthropic, OpenAI, and Ollama via LiteLLM. It emphasizes production readiness with first-class OpenTelemetry tracing, giving you end-to-end observability with a clean, declarative API for defining agent behavior.
For setup instructions, see: Trace Strands Agents with Langfuse.

Semantic Kernel
Semantic Kernel is Microsoft’s approach to orchestrating AI “skills” and combining them into workflows. It supports multiple programming languages (C#, Python, Java) and focuses on enterprise readiness, including security, compliance, and Azure integration. You can create a range of skills — some powered by AI, others by pure code — and compose them into multi-step plans.
For tracing setup, see: Trace Semantic Kernel with Langfuse.

No-code Agent Builders
For prototypes and development by non-developers, no-code builders can be a great starting point.
Flowise
Flowise (GitHub) is a no-code builder that lets you create customized LLM flows with a drag-and-drop editor. With the native Langfuse integration, you can use Flowise to quickly build complex LLM applications without code and then use Langfuse to analyze and improve them.

Example of a catalog chatbot created in Flowise to answer any questions related to shop products.
Langflow
Langflow (GitHub) is a UI for LangChain, designed with react-flow to provide an effortless way to experiment and prototype flows.
With the native integration, you can use Langflow to quickly create complex LLM applications without code and then use Langfuse to monitor and debug them.

Example of a chat agent with chain-of-thought reasoning built in Langflow by Cobus Greyling.
Dify
Dify (GitHub) is an open-source LLM app development platform. Using its Agent Builder and a variety of templates, you can easily build an AI agent and then grow it into a more complex system via Dify workflows.
With the native Langfuse integration, you can use Dify to quickly create complex LLM applications and then use Langfuse to monitor and improve them.

Agent Evaluation and Testing
Building an agent is only the first step. Agents can fail in nuanced ways — selecting the wrong tool, entering reasoning loops, or hallucinating in intermediate steps that produce a plausible-looking but incorrect final answer. To ship agents with confidence, you need a systematic approach to evaluation and testing.
Why Evaluate Agents?
When working with agents, three problems show up again and again: understanding, specification, and generalization. You often lack understanding of what the agent actually does on real traffic because you are not systematically inspecting traces. The task is frequently underspecified — prompts and examples don’t clearly encode what “good” behavior is, so the agent improvises in unpredictable ways. And even once you have tightened the spec, the agent may still struggle to generalize, performing well on handpicked examples but failing on slightly different real-world queries.
Three Evaluation Strategies
Langfuse supports three complementary strategies for evaluating agents:
- Final Response (Black-Box): This method only looks at the user’s input and the agent’s final answer, ignoring the intermediate steps. It is flexible and easy to set up, but does not tell you why an agent failed.
- Trajectory (Glass-Box): This strategy evaluates the full sequence of tool calls, reasoning steps, and decisions an agent made. You compare the actual trajectory against an expected one, catching issues like unnecessary tool calls, skipped steps, or inefficient reasoning paths.
- Single Step (White-Box): This zooms in on individual steps within the agent’s execution, evaluating whether each tool call returned the right result or each reasoning step was sound. It provides the most granular feedback for debugging specific failures.
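The trajectory strategy can be sketched as a comparison between the expected and actual tool-call sequences. The position-wise scoring rule below is one simple, illustrative choice; real evaluators often use more forgiving matching or an LLM judge.

```python
def trajectory_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of positions where the actual tool call matches the
    expected one, penalizing extra or missing calls via the denominator."""
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / max(len(expected), len(actual))

# Hypothetical travel-booking agent: it repeated the search step.
expected = ["search_flights", "check_visa", "book_flight"]
actual = ["search_flights", "search_flights", "check_visa", "book_flight"]

print(round(trajectory_score(expected, actual), 2))  # → 0.25
```

Note how strict position-wise matching heavily penalizes the single duplicated call, since every later step shifts out of place; that sensitivity is a deliberate tradeoff of this scoring rule.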
Three Phases of Evaluation
The evaluation process follows a natural progression as your application matures:
- Phase 1 — Manual Tracing: During early development, the most valuable activity is simply inspecting traces in Langfuse to understand your agent’s reasoning.
- Phase 2 — Online Evaluation: As you get your first users, implement user feedback mechanisms and automated LLM-as-a-Judge evaluators to flag problematic traces in real-time.
- Phase 3 — Offline Evaluation: At scale, create benchmark datasets of inputs and expected outputs, then run automated experiments to test your agent before each release, preventing regressions and enabling confident iteration.
Agent Evaluation Guides
To dive deeper, explore these hands-on guides:
- Agent Evaluation Guide — End-to-end walkthrough of all three evaluation strategies using Pydantic AI agents
- Evaluating OpenAI Agents — Online and offline evaluation for OpenAI Agents SDK
- LangGraph Agent Evaluation — Monitoring and evaluating LangGraph agents
- Synthetic Dataset Generation — Scale test coverage with LLM-generated data for agent evaluation
- Testing LLM Applications — Build a testing foundation with deterministic checks and LLM judges
Get Started
If you want to get started with building AI agents and monitoring them with Langfuse, here are the best places to begin:
- Build and trace an agent: Follow our end-to-end example of building a simple agent with LangGraph and tracking it with Langfuse.
- Compare agent frameworks: Read our AI Agent Comparison blog post for an in-depth guide on when to use which framework.
- Evaluate your agents: Start with the Agent Evaluation Guide to set up black-box, trajectory, and step-level evaluations.
- Explore all integrations: Browse the full list of supported integrations to find the right setup for your stack.