LLM-as-a-Judge Execution Tracing & Enhanced Observability


Every LLM-as-a-Judge evaluator execution now creates a trace, allowing you to inspect the exact prompts, responses, and token usage for each evaluation.
We’re excited to announce a major enhancement to Langfuse’s LLM-as-a-Judge evaluations: full tracing of evaluator executions. Every time an LLM-as-a-Judge evaluator runs, Langfuse now creates a detailed trace that captures the complete LLM interaction, giving you full visibility into how each evaluation score was produced.
What’s New
Every LLM-as-a-Judge evaluator execution going forward is linked to a Langfuse trace of the underlying LLM call. This means you can (see the API sketch after this list):
- Debug evaluation prompts: See exactly what prompt was sent to the judge LLM
- Inspect model responses: View the complete response including score and reasoning
- Monitor token usage: Track costs and performance for each evaluator execution
- Trace evaluation history: Navigate from any score back to its source LLM interaction
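As a concrete example, here is a minimal sketch of pulling an execution trace through the Langfuse public API and printing the judge’s prompt, response, and token usage. The trace ID is a placeholder, and the observation field names (`input`, `output`, `usage`) follow the public API reference; double-check them against your Langfuse version.

```python
# Minimal sketch: fetch an LLM-as-a-Judge execution trace via the Langfuse
# public API and inspect the judge's prompt, response, and token usage.
# TRACE_ID is a hypothetical placeholder; auth uses your project's key pair.
import os
import requests

LANGFUSE_HOST = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

TRACE_ID = "replace-with-evaluator-trace-id"  # placeholder, not a real ID

resp = requests.get(f"{LANGFUSE_HOST}/api/public/traces/{TRACE_ID}", auth=auth)
resp.raise_for_status()
trace = resp.json()

# A single-trace fetch includes its observations; the generation observation
# holds the judge LLM call (field names assumed per the public API reference).
for obs in trace.get("observations", []):
    if obs.get("type") == "GENERATION":
        print("Prompt sent to judge:", obs.get("input"))
        print("Judge response:", obs.get("output"))
        print("Token usage:", obs.get("usage"))
```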
How to Access Execution Traces
There are four ways to navigate to an evaluator execution trace:
- Score tooltip in trace view: For LLM-as-a-Judge scores, hover over any score badge and click “View execution trace”
- Tracing table: Filter the environment to `langfuse-llm-as-a-judge` to view all evaluator execution traces (a programmatic equivalent is sketched after this list)
- Scores table: Enable the “Execution Trace” column in the scores table to see all evaluator executions
- Evaluator logs table: View execution trace IDs in the evaluator logs for detailed execution history
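If you prefer the API over the UI filter, the sketch below lists evaluator execution traces by environment. It assumes the traces list endpoint accepts an `environment` query parameter mirroring the UI filter; consult the API reference if your Langfuse version differs.

```python
# Minimal sketch: list evaluator execution traces programmatically by
# filtering on the reserved environment. The `environment` query parameter
# is an assumption mirroring the UI filter; verify against the API reference.
import os
import requests

LANGFUSE_HOST = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

resp = requests.get(
    f"{LANGFUSE_HOST}/api/public/traces",
    auth=auth,
    params={"environment": "langfuse-llm-as-a-judge", "limit": 50},
)
resp.raise_for_status()

# Paginated list responses wrap results in a `data` array.
for trace in resp.json()["data"]:
    print(trace["id"], trace.get("name"), trace.get("timestamp"))
```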
Why This Matters
Previously, debugging failed evaluations or understanding why a judge gave a particular score required guesswork. Now, with full tracing:
- Trust your evaluations: Verify that the judge received the correct input and made sound judgments
- Optimize costs: Identify expensive evaluation patterns and trim wasteful prompts (see the cost-aggregation sketch after this list)
- Debug faster: Instantly see what went wrong when an evaluation fails
- Keep an audit trail: Retain a complete history of every evaluation decision for compliance and analysis
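As a starting point for cost analysis, this sketch aggregates cost per evaluator across execution traces. It assumes list items expose `name` and `totalCost` fields and the standard Langfuse pagination `meta` object; treat those field names as assumptions to verify against the API reference.

```python
# Minimal sketch: aggregate cost per evaluator across execution traces to
# spot expensive evaluation patterns. Field names (`name`, `totalCost`) and
# the pagination `meta` shape are assumptions per the public API reference.
import os
from collections import defaultdict

import requests

LANGFUSE_HOST = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

cost_by_evaluator = defaultdict(float)
page = 1
while True:
    resp = requests.get(
        f"{LANGFUSE_HOST}/api/public/traces",
        auth=auth,
        params={"environment": "langfuse-llm-as-a-judge", "page": page, "limit": 100},
    )
    resp.raise_for_status()
    body = resp.json()
    for trace in body["data"]:
        cost_by_evaluator[trace.get("name", "unknown")] += trace.get("totalCost") or 0.0
    if page >= body["meta"]["totalPages"]:
        break
    page += 1

# Print evaluators from most to least expensive.
for name, cost in sorted(cost_by_evaluator.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ${cost:.4f}")
```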
Getting Started
This feature is automatically enabled for all LLM-as-a-Judge executions going forward.