Scores
Scores are Langfuse's universal data object for storing evaluation results. Whenever you want to assign a quality judgment to an LLM output, whether from a human annotation, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.
Every score has a name (like "correctness" or "helpfulness"), a value, and a data type. Scores also support an optional comment for additional context.
Scores can be attached to traces, observations, sessions, or dataset runs. Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.
Once you have scores, they show up in score analytics, can be visualized in custom dashboards, and can be queried via the API.
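To make the anatomy above concrete, here is an illustrative sketch of a score object. The field names mirror the docs (name, value, data type, optional comment, and an attachment target), but this dataclass is not part of the Langfuse SDK, it just models the concept.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Illustrative model of a score; NOT a Langfuse SDK class.
@dataclass
class Score:
    name: str                             # e.g. "correctness" or "helpfulness"
    value: Union[float, str]              # numeric, categorical, boolean (0/1), or text
    data_type: str                        # NUMERIC | CATEGORICAL | BOOLEAN | TEXT
    trace_id: Optional[str] = None        # attach to a trace...
    observation_id: Optional[str] = None  # ...or to a single observation
    comment: Optional[str] = None         # optional context for the value

# A numeric score attached to a trace:
s = Score(name="correctness", value=0.9, data_type="NUMERIC", trace_id="trace-abc")
```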
When to Use Scores
Scores become useful when you want to go beyond observing what your application does and start measuring how well it does it. Common use cases:
- Collecting user feedback: Capture thumbs up/down or star ratings from your users and attach them to traces. See the user feedback guide.
- Monitoring production quality: Set up automated evaluators (like LLM-as-a-Judge) to continuously score live traces for things like hallucination, relevance, or tone.
- Running guardrails: Score whether outputs pass safety checks like PII detection, format validation, or content policy compliance.
- Comparing changes with experiments: When you change a prompt, model, or pipeline, run an experiment to score the new version against a dataset.
Score Types
Langfuse supports four score data types:
| Type | Value | Use when |
|---|---|---|
| NUMERIC | Float (e.g. 0.9) | Continuous judgments like accuracy, relevance, or similarity scores |
| CATEGORICAL | String from predefined categories (e.g. "correct", "partially correct") | Discrete classifications where the set of possible values is known upfront |
| BOOLEAN | 0 or 1 | Pass/fail checks like hallucination detection or format validation |
| TEXT | Free-form string (1-500 characters) | Open-ended annotations like reviewer notes or qualitative feedback. Often used for open coding before formalizing into quantifiable scores via axial coding. |
Text scores are designed for qualitative, open-ended scoring. Because free-form text cannot be meaningfully aggregated or compared, text scores are not supported in experiments, LLM-as-a-Judge, or score analytics.
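The constraints on the four data types can be sketched as a small validation helper. This function is hypothetical (not part of the Langfuse SDK); the value rules it encodes come from the table above.

```python
# Hypothetical helper illustrating the value constraints of the four
# Langfuse score data types; the function itself is not SDK code.
def validate_score(data_type, value, categories=None):
    """Return True if `value` is a valid value for the given data type."""
    if data_type == "NUMERIC":
        # continuous float values, e.g. 0.9
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "CATEGORICAL":
        # value must come from the predefined set of category strings
        return categories is not None and value in categories
    if data_type == "BOOLEAN":
        # booleans are stored as 0 or 1
        return value in (0, 1)
    if data_type == "TEXT":
        # free-form string, 1-500 characters
        return isinstance(value, str) and 1 <= len(value) <= 500
    return False
```

For example, `validate_score("BOOLEAN", 2)` fails, while `validate_score("CATEGORICAL", "correct", categories=["correct", "partially correct"])` passes.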
How to Create Scores
There are four ways to add scores:
- LLM-as-a-Judge: Set up automated evaluators that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
- Scores via UI: Team members manually score traces, observations, or sessions directly in the Langfuse UI. Requires a score config to be set up first.
- Annotation Queues: Set up structured review workflows where reviewers work through batches of traces.
- Scores via API/SDK: Programmatically add scores from your application code. This is the way to go for user feedback (thumbs up/down, star ratings), guardrail results, or custom evaluation pipelines.
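As a sketch of the API/SDK path, the snippet below assembles a score payload for recording a thumbs-up as a boolean score. The helper function and the example trace ID are hypothetical, and the exact endpoint and field names should be confirmed against the current Langfuse API reference before use.

```python
import json

# Hypothetical payload builder; verify field names against the
# Langfuse API reference for the scores endpoint.
def build_score_payload(trace_id, name, value, comment=None, data_type="NUMERIC"):
    payload = {
        "traceId": trace_id,
        "name": name,
        "value": value,
        "dataType": data_type,
    }
    if comment:
        payload["comment"] = comment
    return payload

# A thumbs-up stored as a boolean score (0 = down, 1 = up).
payload = build_score_payload("trace-123", "user-feedback", 1, data_type="BOOLEAN")
print(json.dumps(payload))

# The actual HTTP call would POST this payload to the Langfuse scores
# endpoint with your project's API credentials, e.g. (illustrative):
# requests.post(f"{host}/api/public/scores", auth=(public_key, secret_key), json=payload)
```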
Should I Use Scores or Tags?
| | Scores | Tags |
|---|---|---|
| Purpose | Measure how good something is | Describe what something is |
| Data | Numeric, categorical, boolean, or text value | Simple string label |
| When added | Can be added at any time, including long after the trace was created | Set during tracing and cannot be changed afterwards |
| Used for | Quality measurement, analytics, experiments | Filtering, segmentation, organizing |
As a rule of thumb: if you already know the category at tracing time (e.g. which feature or API endpoint triggered the trace), use a tag. If you need to classify or evaluate traces later, use a score.
Score Comments
Every score supports an optional comment field. Use it to capture reasoning (e.g. why an LLM judge assigned a particular score), reviewer notes, or context that helps others understand the score value. Comments are shown alongside scores in the Langfuse UI.
To capture standalone qualitative feedback, use a TEXT score rather than a comment; comments are best for adding reasoning or context to an existing score.
Concepts
Learn the fundamental concepts behind LLM evaluation in Langfuse - Scores, Evaluation Methods, Datasets, and Experiments.
Analytics
Analyze and compare evaluation scores to validate reliability, uncover insights, and track quality trends in your LLM application. Visualize score distributions, measure agreement between evaluation methods, and monitor scores over time.