LLM-as-a-Judge

LLM-as-a-judge is a technique to evaluate the quality of LLM applications by using an LLM as a judge. The LLM is given a trace or a dataset entry and asked to score and reason about the output. The resulting scores include chain-of-thought reasoning as a comment.

Why use LLM-as-a-judge?

Scalable & cost‑effective: Judge thousands of outputs quickly and cheaply versus human panels.
Human‑like judgments: Captures nuance (helpfulness, safety, coherence) better than simple metrics, especially when rubric‑guided.
Repeatable comparisons: With a fixed rubric, you can rerun the same prompts to get consistent scores and short rationales.

Set up step-by-step

Create a new LLM-as-a-Judge evaluator

Navigate to the Evaluators page and click on the + Set up Evaluator button.

Set the default model

Next, you’ll define the default model used for conducting the evaluations. The default is used by every managed evaluator; custom templates may override it.

Evaluator default model

This step requires an LLM Connection to be set up. Please see LLM Connections for more information.

Setup: This default model needs to be set up once, though it can be changed at any point if needed.
Change: Existing evaluators keep evaluating with the new model—historic results stay preserved.
Additional Provider Options: You can pass additional parameters when invoking a model. Read this documentation about additional provider options for more information.
Structured Output Support: It’s crucial that the chosen default model supports structured output. This is essential for our system to correctly interpret the evaluation results from the LLM judge.

Pick an Evaluator

Now we select an evaluator. There are two main ways:

Langfuse ships a growing catalog of evaluators built and maintained by us and partners like Ragas. Each evaluator captures best-practice evaluation prompts for a specific quality dimension—e.g. Hallucination, Context-Relevance, Toxicity, Helpfulness.

Ready to use: no prompt writing required.
Continuously expanded: by adding OSS partner-maintained evaluators and more evaluator types in the future (e.g. regex-based).

Choose which Data to Evaluate

With your evaluator and model selected, you now specify which data to run the evaluations on. You can choose between running on production tracing data or Datasets during Dataset Runs.

Evaluating live production traffic allows you to monitor the performance of your LLM application in real-time.

Scope: Choose whether to run on new traces only and/or existing traces once (for backfilling). When in doubt, we recommend running on new traces.
Filter: Narrow down the evaluation to a specific subset of data you’re interested in. You can filter by trace name, tags, userId and may more. Combine filters freely.
Preview: Langfuse shows a sample of traces from the last 24 hours that match your current filters, allowing you to sanity-check your selection.
Sampling: To manage costs and evaluation throughput, you can configure the evaluator to run on a percentage (e.g., 5%) of the matched traces.

Production tracing data

Map Variables & preview Evaluation Prompt

You now need to teach Langfuse which properties of your trace or dataset item represent the actual data to populate these variables for a sensible evaluation. For instance, you might map your system’s logged trace input to the prompt’s {{input}} variable, and the LLM response ie trace output to the prompt’s {{output}} variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

Prompt Preview: As you configure the mapping, Langfuse shows a live preview of the evaluation prompt populated with actual data. This preview uses historical traces from the last 24 hours that matched your filters (from Step 3). You can navigate through several example traces to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
JSONPath: If the data is nested (e.g., within a JSON object), you can use a JSONPath expression (like $.choices[0].message.content) to precisely locate it.

Trigger the evaluation

To see your evaluator in action, you need to either send traces (fastest) or trigger an experiment run (takes longer to setup) via the UI or SDK. Make sure to set the correct target data in the evaluator settings according to how you want to trigger the evaluation.

✨ Done! You have successfully set up an evaluator which will run on your data.

Need custom logic? Use the SDK instead—see Custom Scores or an external pipeline example.

Monitor & Iterate

As our system evaluates your data it writes the results as scores. You can then:

View Logs: Check detailed logs for each evaluation, including status, execution trace IDs, and retry errors. Click the execution trace ID to view the complete LLM interaction for that specific evaluation run.
Use Dashboards: Aggregate scores over time, filter by version or environment, and track the performance of your LLM application.
Take Actions: Pause, resume, or delete an evaluator.

View evaluator data

Trace Evaluator Executions

Every LLM-as-a-Judge evaluator execution now creates a full trace, giving you complete visibility into the evaluation process. This allows you to:

Debug prompt issues: See exactly what prompt was sent to the judge LLM
Inspect model responses: View the complete response including reasoning and structured outputs
Monitor token usage: Track costs and performance for each evaluator execution
Trace evaluation history: Navigate from any score back to its source LLM interaction

How to access execution traces

There are four ways to navigate to an evaluator execution trace:

Score tooltip in trace view: For LLM-as-a-Judge scores, hover over any score badge in the trace detail view and click “View execution trace”

Score tooltip with execution trace link

Tracing table: Filter the environment to langfuse-llm-as-a-judge to view all evaluator execution traces

Tracing table filtered to langfuse-llm-as-a-judge environment

Scores table: Enable the “Execution Trace” column in the scores table to see execution trace IDs for all evaluator executions

Scores table with execution trace column

Evaluator logs table: View execution trace IDs in the evaluator logs for detailed execution history and status

Evaluator logs with execution traces

The execution trace shows the exact prompt sent to the judge, the model’s response, token usage, latency, and any errors that occurred during execution.

Understanding execution statuses

Completed: Evaluation finished successfully
Error: Evaluation failed (click execution trace ID for details)
Delayed: Evaluation hit rate limits by the LLM provider and is being retried with exponential backoff
Pending: Evaluation is queued and waiting to run

FAQ

GitHub Discussions

Overview Human Annotations

Was this page helpful?

Support