
LLM-as-a-Judge

Use an LLM to automatically score your application outputs. With this evaluation method, the LLM is presented with an observation, trace, or experiment item and asked to score and reason about the output. It then produces a score along with a comment containing its chain-of-thought reasoning.
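
In essence, the judge pattern works like the rough sketch below: a rubric prompt is filled with the item's input and output, the judge model returns a structured verdict, and the score plus reasoning comment are stored. `call_judge_model` and the rubric wording are hypothetical placeholders for illustration, not Langfuse APIs.

```python
import json

# Hypothetical placeholder for a chat-completion call to the judge model.
# Not a Langfuse API; swap in your LLM client of choice.
def call_judge_model(prompt: str) -> str:
    # Canned response so the sketch runs end to end.
    return '{"reasoning": "The answer addresses the question directly.", "score": 0.9}'

RUBRIC_PROMPT = """Rate the helpfulness of the response on a scale from 0 to 1.
Think step by step, then answer as JSON: {{"reasoning": "...", "score": 0.0}}

Question: {input}
Response: {output}"""

def judge(item_input: str, item_output: str) -> dict:
    prompt = RUBRIC_PROMPT.format(input=item_input, output=item_output)
    raw = call_judge_model(prompt)   # judge LLM produces the verdict
    return json.loads(raw)           # e.g. {"reasoning": "...", "score": 0.9}

print(judge("What is the capital of France?", "Paris."))
```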

Why use LLM-as-a-judge?

  • Scalable: Judge thousands of outputs quickly versus human annotators.
  • Human‑like: Captures nuance (e.g. helpfulness, toxicity, relevance) better than simple metrics, especially when rubric‑guided.
  • Repeatable: With a fixed rubric, you can rerun the same prompts to get consistent scores.

Set up step-by-step

Create a new LLM-as-a-Judge evaluator

Navigate to the Evaluators page and click on the + Set up Evaluator button.


Set the default model

Next, define the default model used for the evaluations. This step requires an LLM Connection to be set up. Please see LLM Connections for more information.

It’s crucial that the chosen default model supports structured output, as our system relies on it to correctly interpret the evaluation results from the LLM judge.
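
As an illustration of what “structured output” means here (this is not Langfuse’s internal schema), the judge model must be able to return machine-parseable fields such as a score and a reasoning comment rather than free-form text:

```python
from pydantic import BaseModel

# Illustrative shape only, not Langfuse's actual response schema.
class JudgeVerdict(BaseModel):
    reasoning: str  # chain-of-thought comment explaining the verdict
    score: float    # numeric score, e.g. between 0 and 1
```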

Pick an Evaluator


Next, select an evaluator. You can either use a Langfuse-managed evaluator from the catalog or create a custom evaluator with your own evaluation prompt.

Langfuse ships a growing catalog of evaluators built and maintained by us and partners like Ragas. Each evaluator captures best-practice evaluation prompts for a specific quality dimension—e.g. Hallucination, Context-Relevance, Toxicity, Helpfulness.

  • Ready to use: no prompt writing required.
  • Continuously expanded: new OSS and partner-maintained evaluators and additional evaluator types (e.g. regex-based) are added over time.

Choose which Data to Evaluate

With your evaluator and model selected, you now specify which data to run the evaluations on. You can choose between scoring live tracing data or offline experiments.

Evaluating live production traffic allows you to monitor the performance of your LLM application in real time.

Live Observations (Recommended)

Run evaluators on individual observations such as LLM calls, tool invocations, or agent steps. This provides:

  • Granular control: Target specific observations in your trace
  • Performant system: Optimized architecture for high-volume evaluation
  • Flexible filtering: Apply a combination of trace and observation filters

SDK Requirements

| Requirement | Python | JS/TS |
| --- | --- | --- |
| Minimum SDK version | v3+ (OTel-based) | v4+ (OTel-based) |
| Migration guide | Python v2 → v3 | JS/TS v3 → v4 |

Filtering by trace attributes: To filter observations by trace-level attributes (userId, sessionId, version, tags, metadata, trace_name), you must use propagate_attributes() in your instrumentation code. Without this, trace attributes will not be available on observations.
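
A rough Python sketch of the idea, assuming propagate_attributes is exposed as a context manager on the v3+ client and accepts trace-level attributes as keyword arguments (check the SDK reference for the exact import path and signature):

```python
from langfuse import get_client  # Python SDK v3+ (OTel-based)

langfuse = get_client()

# Assumption: propagate_attributes copies these trace-level attributes onto
# every observation created inside the block, so observation-level
# evaluators can filter on userId, sessionId, tags, etc.
with langfuse.propagate_attributes(
    user_id="user-123",        # example values, adjust to your application
    session_id="session-456",
    tags=["production"],
):
    with langfuse.start_as_current_span(name="handle-request"):
        ...  # LLM calls, tool invocations, agent steps happen here
```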

How it works:

  1. Select “Live Observations” as your evaluation target
  2. Narrow down the evaluation to the specific subset of data you’re interested in (observation type, trace name, trace tags, userId, sessionId, metadata, etc.)
  3. To manage costs and evaluation throughput, you can configure the evaluator to run on a percentage (e.g., 5%) of the matched observations.

Map Variables & Preview Evaluation Prompt

The evaluation prompt contains variables such as {{input}} and {{output}}. You now need to teach Langfuse which properties of your observation, trace, or experiment item hold the actual data that should populate these variables. For instance, you might map your system’s logged observation input to the prompt’s {{input}} variable, and the LLM response (observation output) to the prompt’s {{output}} variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

  • Prompt Preview: As you configure the mapping, Langfuse shows a live preview of the evaluation prompt populated with actual data. This preview uses historical data from the last 24 hours that matched your filters. You can navigate through several examples to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
  • JSONPath: If the data is nested (e.g., within a JSON object), you can use a JSONPath expression (like $.choices[0].message.content) to locate it precisely; see the example below.
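
For example, with an OpenAI-style chat completion stored as the observation output, the JSONPath above resolves to the nested message content (data shown is illustrative):

```python
# Illustrative observation output (OpenAI-style chat completion)
observation_output = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "Paris is the capital of France.",
            }
        }
    ]
}

# $.choices[0].message.content selects the nested string, equivalent to:
selected = observation_output["choices"][0]["message"]["content"]
print(selected)  # -> Paris is the capital of France.
```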

Trigger the evaluation

To see your evaluator in action, you need to either send traces (fastest) or trigger an experiment run (takes longer to set up) via the UI or SDK. Make sure the target data configured in the evaluator settings matches how you want to trigger the evaluation.
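
For instance, a minimal Python SDK (v3+) sketch that logs one generation the evaluator can pick up; the names, model, and payload are illustrative:

```python
from langfuse import get_client  # Python SDK v3+ (OTel-based)

langfuse = get_client()

# Log a single generation; if it matches the evaluator's target data and
# filters, the LLM-as-a-Judge evaluator will score it.
with langfuse.start_as_current_generation(
    name="qa-generation",
    model="gpt-4o",
    input={"question": "What is the capital of France?"},
) as generation:
    generation.update(output="Paris is the capital of France.")

langfuse.flush()  # make sure the trace is sent before the script exits
```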

✨ Done! You have successfully set up an evaluator which will run on your data.

Need custom logic? Use the SDK instead—see Custom Scores or an external pipeline example.
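
For reference, a minimal sketch of attaching a custom score via the Python SDK (v3+), assuming create_score is available on the client; the identifiers and values are illustrative:

```python
from langfuse import get_client  # Python SDK v3+

langfuse = get_client()

# Attach a score computed by your own logic to an existing trace.
langfuse.create_score(
    trace_id="abc123",          # illustrative trace ID
    name="correctness",
    value=1.0,
    comment="Matched the reference answer exactly.",
)
```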

Debug LLM-as-a-Judge Executions

Every LLM-as-a-Judge evaluator execution creates a full trace, giving you complete visibility into the evaluation process. This allows you to debug prompt issues, inspect model responses, monitor token usage, and trace evaluation history.

You can view the LLM-as-a-Judge execution traces by filtering the tracing table for the environment langfuse-llm-as-a-judge:

Tracing table filtered to langfuse-llm-as-a-judge environment

LLM-as-a-Judge Execution Status

  • Completed: Evaluation finished successfully.
  • Error: Evaluation failed (click execution trace ID for details).
  • Delayed: Evaluation hit the LLM provider’s rate limits and is being retried with exponential backoff.
  • Pending: Evaluation is queued and waiting to run.

Advanced Topics

Migrating from Trace-Level to Observation-Level Evaluators

If you have existing evaluators running on traces and want to upgrade to running on observations for better performance and reliability, check out our comprehensive Evaluator Migration Guide.

