Scores
Scores are Langfuse's universal data object for storing evaluation results. Whenever you want to assign a quality judgment to an LLM output, whether from a human annotation, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.
Every score has a name (like "correctness" or "helpfulness"), a value, and a data type. Scores also support an optional comment for additional context.
Scores can be attached to traces, observations, sessions, or dataset runs. Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.
Once you have scores, they show up in score analytics, can be visualized in custom dashboards, and can be queried via the API.
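To make the anatomy above concrete, here is an illustrative sketch of a score object. The field names mirror the docs (name, value, data type, optional comment, and an attachment target), but this dataclass is not part of the Langfuse SDK, it just models the concept.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Illustrative model of a score; NOT a Langfuse SDK class.
@dataclass
class Score:
    name: str                             # e.g. "correctness" or "helpfulness"
    value: Union[float, str]              # numeric, categorical, boolean (0/1), or text
    data_type: str                        # NUMERIC | CATEGORICAL | BOOLEAN | TEXT
    trace_id: Optional[str] = None        # attach to a trace...
    observation_id: Optional[str] = None  # ...or to a single observation
    comment: Optional[str] = None         # optional context for the value

# A numeric score attached to a trace:
s = Score(name="correctness", value=0.9, data_type="NUMERIC", trace_id="trace-abc")
```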
When to Use Scores
Scores become useful when you want to go beyond observing what your application does and start measuring how well it does it. Common use cases:
- Collecting user feedback: Capture thumbs up/down or star ratings from your users and attach them to traces. See the user feedback guide.
- Monitoring production quality: Set up automated evaluators (like LLM-as-a-Judge) to continuously score live traces for things like hallucination, relevance, or tone.
- Running guardrails: Score whether outputs pass safety checks like PII detection, format validation, or content policy compliance.
- Comparing changes with experiments: When you change a prompt, model, or pipeline, run an experiment to score the new version against a dataset.
Score Types
Langfuse supports four score data types:
| Type | Value | Use when |
|---|---|---|
| NUMERIC | Float (e.g. 0.9) | Continuous judgments like accuracy, relevance, or similarity scores |
| CATEGORICAL | String from predefined categories (e.g. "correct", "partially correct") | Discrete classifications where the set of possible values is known upfront |
| BOOLEAN | 0 or 1 | Pass/fail checks like hallucination detection or format validation |
| TEXT | Free-form string (1-500 characters) | Open-ended annotations like reviewer notes or qualitative feedback. Often used for open coding before formalizing into quantifiable scores via axial coding. |
Text scores are designed for qualitative, open-ended scoring. Because free-form text cannot be meaningfully aggregated or compared, text scores are not supported in experiments, LLM-as-a-Judge, or score analytics.
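The constraints on the four data types can be sketched as a small validation helper. This function is hypothetical (not part of the Langfuse SDK); the value rules it encodes come from the table above.

```python
# Hypothetical helper illustrating the value constraints of the four
# Langfuse score data types; the function itself is not SDK code.
def validate_score(data_type, value, categories=None):
    """Return True if `value` is a valid value for the given data type."""
    if data_type == "NUMERIC":
        # continuous float values, e.g. 0.9
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "CATEGORICAL":
        # value must come from the predefined set of category strings
        return categories is not None and value in categories
    if data_type == "BOOLEAN":
        # booleans are stored as 0 or 1
        return value in (0, 1)
    if data_type == "TEXT":
        # free-form string, 1-500 characters
        return isinstance(value, str) and 1 <= len(value) <= 500
    return False
```

For example, `validate_score("BOOLEAN", 2)` fails, while `validate_score("CATEGORICAL", "correct", categories=["correct", "partially correct"])` passes.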
How to Create Scores
There are four ways to add scores:
- LLM-as-a-Judge: Set up automated evaluators that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
- Scores via UI: Team members manually score traces, observations, or sessions directly in the Langfuse UI. Requires a score config to be set up first.
- Annotation Queues: Set up structured review workflows where reviewers work through batches of traces.
- Scores via API/SDK: Programmatically add scores from your application code. This is the way to go for user feedback (thumbs up/down, star ratings), guardrail results, or custom evaluation pipelines.
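As a sketch of the API/SDK path, the snippet below assembles a score payload for recording a thumbs-up as a boolean score. The helper function and the example trace ID are hypothetical, and the exact endpoint and field names should be confirmed against the current Langfuse API reference before use.

```python
import json

# Hypothetical payload builder; verify field names against the
# Langfuse API reference for the scores endpoint.
def build_score_payload(trace_id, name, value, comment=None, data_type="NUMERIC"):
    payload = {
        "traceId": trace_id,
        "name": name,
        "value": value,
        "dataType": data_type,
    }
    if comment:
        payload["comment"] = comment
    return payload

# A thumbs-up stored as a boolean score (0 = down, 1 = up).
payload = build_score_payload("trace-123", "user-feedback", 1, data_type="BOOLEAN")
print(json.dumps(payload))

# The actual HTTP call would POST this payload to the Langfuse scores
# endpoint with your project's API credentials, e.g. (illustrative):
# requests.post(f"{host}/api/public/scores", auth=(public_key, secret_key), json=payload)
```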
Should I Use Scores or Tags?
| | Scores | Tags |
|---|---|---|
| Purpose | Measure how good something is | Describe what something is |
| Data | Numeric, categorical, boolean, or text value | Simple string label |
| When added | Can be added at any time, including long after the trace was created | Set during tracing and cannot be changed afterwards |
| Used for | Quality measurement, analytics, experiments | Filtering, segmentation, organizing |
As a rule of thumb: if you already know the category at tracing time (e.g. which feature or API endpoint triggered the trace), use a tag. If you need to classify or evaluate traces later, use a score.
Score Comments
Every score supports an optional comment field. Use it to capture reasoning (e.g. why an LLM judge assigned a particular score), reviewer notes, or context that helps others understand the score value. Comments are shown alongside scores in the Langfuse UI.
To capture standalone qualitative feedback, use a TEXT score rather than a comment; comments are best for adding reasoning or context to an existing score.
Concepts
Learn the fundamental concepts behind LLM evaluation in Langfuse - Scores, Evaluation Methods, Datasets, and Experiments.
Analytics
Analyze and compare evaluation scores to validate reliability, uncover insights, and track quality trends in your LLM application. Visualize score distributions, measure agreement between evaluation methods, and monitor scores over time.