LLM Evaluation: Model-Based, Labeling & User Feedback
Evaluation is a critical aspect of developing and deploying LLM applications. Teams typically combine multiple evaluation methods to score the performance of their AI application, depending on the use case and the stage of the development process.
Why are LLM Evals Important?
LLM evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing the user experience and trust in your AI application. It helps detect hallucinations and measure performance across diverse tasks. A structured evaluation in production is vital for continuously improving your application.
Plot evaluation results in the Langfuse Dashboard.
Langfuse provides a flexible scoring system to capture all your evaluations in one place and make them actionable.
Data Model
Langfuse uses the `score` object to store evaluation metrics. It is designed to be flexible enough to represent any evaluation metric. Learn more about the data model.
This data model is used across all evaluation methods, whether scores are created in the UI or via the API, and scores can be accessed programmatically through the SDKs or API for custom workflows.
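For illustration, a score can be ingested programmatically. The following is a minimal sketch using the Python SDK, assuming a Langfuse client configured via environment variables; the trace ID is a placeholder, and the exact method name (`score` vs. `create_score`) may differ between SDK versions:

```python
from langfuse import Langfuse

# Credentials (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST) are read from the environment.
langfuse = Langfuse()

# Attach a numeric evaluation score to an existing trace.
langfuse.score(
    trace_id="replace-with-your-trace-id",  # placeholder: ID of the trace being evaluated
    name="accuracy",                        # name of the evaluation metric
    value=0.95,                             # numeric, categorical, or boolean values are supported
    comment="Factually correct answer",     # optional context shown alongside the score
)
```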
Common Evaluation Methods
1. Model-based Evaluation (LLM-as-a-Judge)
Model-based evaluations (LLM-as-a-judge) are a powerful tool to automatically assess LLM applications integrated with Langfuse. With this approach, an LLM scores a particular session, trace, or LLM call in Langfuse based on factors such as accuracy, toxicity, or hallucinations.
There are two ways to run model-based evaluations in Langfuse:
- Via the Langfuse UI (beta)
- Via external evaluation pipelines using the API/SDKs
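To illustrate the second option, below is a minimal sketch of an external evaluation pipeline built with the Python SDK. It assumes a hypothetical `run_judge_model` helper that prompts an LLM judge and returns a score plus reasoning; the fetch and scoring method names may differ between SDK versions:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # credentials read from environment variables

# Fetch recently ingested traces to evaluate (method name may vary by SDK version).
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    # run_judge_model() is a hypothetical helper that asks an LLM judge
    # to rate the trace output for hallucinations on a 0-1 scale.
    verdict = run_judge_model(input=trace.input, output=trace.output)

    # Write the judge's verdict back to Langfuse as a score on the trace.
    langfuse.score(
        trace_id=trace.id,
        name="hallucination",
        value=verdict.score,        # e.g. 0.0 (hallucinated) to 1.0 (fully grounded)
        comment=verdict.reasoning,  # judge's explanation, useful for manual review
    )
```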
2. Manual Annotation / Data Labeling (in UI)
With manual annotations, you can annotate a subset of traces and observations by hand. This allows you to collaborate with your team and add scores via the Langfuse UI. Annotations can be used to establish a baseline for your evaluation metrics and to compare them with automated evaluations.
3. User Feedback
Capturing user feedback in your AI application can be a valuable evaluation metric. You can add explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output, human-in-the-loop) user feedback to your LLM traces in Langfuse.
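As a sketch of how explicit feedback could be wired up, the hypothetical handler below would be called by your application backend when a user clicks thumbs up or down, recording the feedback as a score on the corresponding trace (SDK method names may differ between versions):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # credentials read from environment variables

def handle_user_feedback(trace_id: str, thumbs_up: bool) -> None:
    """Hypothetical handler invoked when the user rates a model response in your UI."""
    langfuse.score(
        trace_id=trace_id,            # trace of the LLM interaction being rated
        name="user-feedback",
        value=1 if thumbs_up else 0,  # explicit feedback mapped to a binary score
        comment="thumbs up" if thumbs_up else "thumbs down",
    )
```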
4. Custom Evaluation via SDKs/API
Langfuse gives you full flexibility to ingest custom evaluation scores via the Langfuse SDKs or API. The scoring workflow allows you to run custom quality checks (e.g., a valid structured output format) on the output of your workflows at runtime, or to run custom external evaluation workflows.
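For example, a runtime check for valid structured output could be scored as follows. This is a minimal sketch assuming the Python SDK; `score_json_validity` is a hypothetical helper you would call after each model response:

```python
import json

from langfuse import Langfuse

langfuse = Langfuse()  # credentials read from environment variables

def score_json_validity(trace_id: str, model_output: str) -> None:
    """Hypothetical runtime quality check: does the model output parse as JSON?"""
    try:
        json.loads(model_output)
        valid = True
    except json.JSONDecodeError:
        valid = False

    # Record the check result as a boolean score on the trace.
    langfuse.score(
        trace_id=trace_id,
        name="valid-json",
        value=1 if valid else 0,
        comment="output parsed as JSON" if valid else "output failed to parse as JSON",
    )
```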