Langfuse Evaluator Library

Introducing the Langfuse Evaluator Library with prebuilt evaluators. Plus, enjoy a revamped UX with trace and variable previews for easier LLM evaluation.
On Day 6 of our Launch Week #3, we’re introducing the Langfuse Evaluator Library and major improvements to the evaluator UX.
Evaluation is core to monitoring and continuously improving LLM applications. There are many ways to evaluate LLM applications and agents, and you can flexibly record these evals as scores in Langfuse.
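As a minimal sketch of what "recording evals as scores" looks like, the snippet below attaches an eval result to a trace via the Python SDK. It assumes the v2-style `langfuse.score()` API and that your Langfuse credentials are set as environment variables; the trace ID, score name, and value are placeholders.

```python
# Sketch: attach an eval result to an existing trace as a score (Python SDK, v2-style API).
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set in the environment.
from langfuse import Langfuse

langfuse = Langfuse()

langfuse.score(
    trace_id="your-trace-id",   # placeholder: the trace the eval refers to
    name="hallucination",       # placeholder: the dimension your eval measures
    value=0.0,                  # placeholder: numeric result of the eval (0 = no hallucination here)
    comment="No unsupported claims; response is grounded in the retrieved context.",
)

langfuse.flush()  # make sure the score is sent before the script exits
```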
Langfuse Evaluator Library
The Langfuse LLM-as-a-Judge runner is all about making it easy to:
- manage evaluation templates,
- use your own models,
- define when these evals should run by filtering and sampling your production and development data, and
- interactively work with the results to improve your application.
Today, we introduce a larger library of prebuilt evaluators in partnership with RAGAS to measure context relevance, SQL semantic equivalence, hallucinations, and other key dimensions. While you can bring your own evaluation templates, the expanded library makes it easier to get started.
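For context, an evaluation template is a judge prompt with mustache-style `{{...}}` placeholders that are filled from trace data; the variable previews described below show what these placeholders resolve to. The sketch that follows is a purely illustrative hallucination-check template written as a Python string, not one of the prebuilt RAGAS templates, and the placeholder names are assumptions.

```python
# Illustrative LLM-as-a-judge template (hypothetical; not a prebuilt Langfuse/RAGAS template).
# The {{...}} placeholders are mapped to fields of the evaluated trace.
HALLUCINATION_TEMPLATE = """\
You are judging whether a response is grounded in the provided context.

Context: {{context}}
Question: {{input}}
Response: {{output}}

Return a score of 1 if every claim in the response is supported by the context,
0 otherwise, followed by a one-sentence reasoning.
"""

print(HALLUCINATION_TEMPLATE)
```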
Revamped Evaluator UX
Langfuse LLM-as-a-Judge is flexible (see above), but that flexibility can make it complex to configure. To make it easier to get started, we've introduced three core UX changes:
1. Standard Eval Model
You can now configure a "standard eval model" that applies to all evals unless you override it for a specific evaluator.
2. Preview of traces that match filter conditions
Langfuse allows you to filter traces by various conditions. We now show a preview of historical traces that match the filter conditions.
3. Preview of inserted variables
When you insert variables into your eval template, we now show a preview of the values they will resolve to.
Getting Started
To get started, check out the LLM-as-a-Judge Docs or the walkthrough above.
Do you have any feedback? Please let us know via GitHub!