Evaluation Overview

Evals give you a repeatable check of your LLM application's behavior. You replace guesswork with data, and catch regressions before you ship a change.

Score Analytics dashboard in Langfuse showing evaluation scores trended over time across multiple evaluators.

Evaluation runs across most of the AI engineering loop: you score live traces in production, turn interesting examples into datasets, run experiments to compare changes, and judge the results with manual or automated evaluators. It happens both online, on live production traces, and offline, before you ship a change.
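
As a rough illustration of the offline half of this loop, the sketch below runs two variants of an application over a small dataset and compares their mean scores. It is plain Python and does not use the Langfuse SDK; `DATASET`, `variant_a`, `variant_b`, `exact_match`, and `run_experiment` are hypothetical names introduced only for this example.

```python
import operator
from statistics import mean
from typing import Callable

# Hypothetical dataset: each item pairs an input with the expected output.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
    {"input": "10 - 7", "expected": "3"},
]

def variant_a(text: str) -> str:
    """Stand-in for the current application: it only handles addition."""
    left, op, right = text.split()
    return str(int(left) + int(right)) if op == "+" else "unknown"

def variant_b(text: str) -> str:
    """Stand-in for a proposed change: it handles +, -, and *."""
    left, op, right = text.split()
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    return str(ops[op](int(left), int(right)))

def exact_match(output: str, expected: str) -> float:
    """Evaluator: 1.0 when the output equals the expected value, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_experiment(
    task: Callable[[str], str],
    evaluator: Callable[[str, str], float],
    dataset: list[dict],
) -> float:
    """Run the task on every dataset item and return the mean evaluator score."""
    return mean(evaluator(task(item["input"]), item["expected"]) for item in dataset)

score_a = run_experiment(variant_a, exact_match, DATASET)
score_b = run_experiment(variant_b, exact_match, DATASET)
print(f"variant_a: {score_a:.2f}, variant_b: {score_b:.2f}")
```

Keeping the dataset and evaluator fixed while only the task changes is what makes the two scores comparable.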

Getting Started

Start with the Core Concepts page. It explains how evaluators, scores, datasets, and experiments fit together in Langfuse, which makes the rest of the docs much easier to navigate.

Once you have that context, use the table below to find the right feature page:

| If you want to... | Use this Langfuse feature |
| --- | --- |
| Review and rate traces manually | Annotation Queues, Scores via UI |
| Leave open-ended notes on traces | Text scores, Annotation Queues |
| Track recurring failure categories | Score configs, scores |
| Build a reusable set of test cases | Datasets |
| Compare prompt, model, or code changes side by side | Experiments via UI, Experiments via SDK |
| Block deploys on regressions | CI/CD experiments |
| Run deterministic checks | Code Evaluators |
| Automatically score live production traces | LLM-as-a-Judge, Scores via API/SDK |
| See how scores trend over time | Score Analytics, custom dashboards |
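
To make "deterministic checks" and "block deploys on regressions" more concrete, here is a minimal sketch of a rule-based check combined with a pass/fail threshold. It is written in plain Python and does not reflect the interface of Langfuse Code Evaluators or CI/CD experiments; `has_required_json_keys`, `gate`, the sample outputs, and the 0.8 threshold are all assumptions made for illustration.

```python
import json

def has_required_json_keys(output: str, required_keys: set[str]) -> float:
    """Deterministic check: 1.0 if the output is a JSON object containing
    every required key, otherwise 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if required_keys.issubset(parsed) else 0.0

def gate(scores: list[float], threshold: float) -> bool:
    """Return True when the mean score meets or exceeds the threshold."""
    return sum(scores) / len(scores) >= threshold

# Hypothetical model outputs collected from a test run.
outputs = [
    '{"answer": "Paris", "confidence": 0.9}',
    '{"answer": "Berlin"}',
    "not json",
]

scores = [has_required_json_keys(o, {"answer", "confidence"}) for o in outputs]
passed = gate(scores, threshold=0.8)
print(f"scores={scores}, passed={passed}")
```

In a CI pipeline, a script along these lines could exit with a nonzero status when `gate` returns `False`, so that a drop in scores stops the deploy.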

Already know what you're looking for? Browse Evaluation Methods and Experiments in the sidebar.
