Evaluation Concepts

  • Scores are a flexible data object that can be used to store any evaluation metric and link it to other objects in Langfuse.
  • Evaluation Methods are functions or tools to assign scores to other objects.
  • Datasets are a collection of inputs and, optionally, expected outputs that can be used during Experiments.
  • Experiments loop over your dataset, trigger your application on each item and optionally apply evaluation methods to the results.

Scores

Scores serve as objects for storing evaluation metrics in Langfuse. Here are their core properties:

  • Scores reference a Trace, Observation, Session, or DatasetRun
  • Each Score references exactly one of the above objects.
  • Scores are either numeric, categorical, or boolean.
  • Scores can optionally be linked to a ScoreConfig to ensure they comply with a specific schema.

Common Use

| Level | Description |
| --- | --- |
| Trace | Used for evaluation of a single interaction (most common). |
| Observation | Used for evaluation of a single observation below the trace level. |
| Session | Used for comprehensive evaluation of outputs across multiple interactions. |
| Dataset Run | Used for performance scores of a Dataset Run. See Dataset Runs for context. |

Score object

| Attribute | Type | Description |
| --- | --- | --- |
| name | string | Name of the score, e.g. user_feedback, hallucination_eval |
| value | number | Optional: Numeric value of the score. Always defined for numeric and boolean scores. Optional for categorical scores. |
| stringValue | string | Optional: String equivalent of the score’s numeric value for boolean and categorical data types. Automatically set for categorical scores based on the config if the configId is provided. |
| traceId | string | Optional: Id of the trace the score relates to |
| observationId | string | Optional: Observation (e.g. LLM call) the score relates to |
| sessionId | string | Optional: Id of the session the score relates to |
| datasetRunId | string | Optional: Id of the dataset run the score relates to |
| comment | string | Optional: Evaluation comment, commonly used for user feedback, eval reasoning output or internal notes |
| id | string | Unique identifier of the score. Auto-generated by SDKs. Optionally can also be used as an idempotency key to update scores. |
| source | string | Automatically set based on the source of the score. Can be either API, EVAL, or ANNOTATION |
| dataType | string | Automatically set based on the config data type when the configId is provided. Otherwise can be defined manually as NUMERIC, CATEGORICAL or BOOLEAN |
| configId | string | Optional: Score config id to ensure that the score follows a specific schema. Can be defined in the Langfuse UI or via API. When provided the score’s dataType is automatically set based on the config |
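
For example, a user-feedback score attached to a trace could be recorded via the Python SDK roughly as follows. This is a minimal sketch: the trace id is a placeholder, and the method is named `score` here as in SDK v2 (newer SDK versions expose an equivalent `create_score` method).

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

# Attach a numeric user-feedback score to an existing trace.
langfuse.score(
    trace_id="abc-123",               # placeholder id of the trace being evaluated
    name="user_feedback",
    value=1,                          # e.g. 1 = thumbs up, 0 = thumbs down
    comment="User clicked thumbs up",
)
```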

Score Configs

Score configs are used to ensure that your scores follow a specific schema. Using score configs allows you to standardize your scoring schema across your team and ensure that scores are consistent and comparable for future analysis.

You can define a score config in the Langfuse UI or via our API (see the how-to guide). Configs are immutable but can be archived (and restored at any time).

A score config includes:

  • Score name
  • Data type: NUMERIC, CATEGORICAL, BOOLEAN
  • Constraints on the score value (min/max for numeric scores, custom categories for categorical scores)

Score Config object

| Attribute | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier of the score config. |
| name | string | Name of the score config, e.g. user_feedback, hallucination_eval |
| dataType | string | Can be either NUMERIC, CATEGORICAL or BOOLEAN |
| isArchived | boolean | Whether the score config is archived. Defaults to false |
| minValue | number | Optional: Sets minimum value for numerical scores. If not set, the minimum value defaults to -∞ |
| maxValue | number | Optional: Sets maximum value for numerical scores. If not set, the maximum value defaults to +∞ |
| categories | list | Optional: Defines categories for categorical scores. List of objects with label/value pairs |
| description | string | Optional: Provides further description of the score configuration |
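
For illustration, a categorical score config matching the attributes above might look like the following. This is only a sketch of the object shape with made-up values, not an API call:

```python
# Example shape of a categorical score config (illustrative values only).
hallucination_config = {
    "id": "cfg_123",                 # assigned by Langfuse on creation
    "name": "hallucination_eval",
    "dataType": "CATEGORICAL",
    "isArchived": False,
    "categories": [                  # label/value pairs for categorical scores
        {"label": "none", "value": 0},
        {"label": "minor", "value": 1},
        {"label": "major", "value": 2},
    ],
    "description": "Degree of hallucination detected in the model output",
}
```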

Evaluation Methods

Evaluation methods let you assign evaluation scores to traces, observations, sessions, or dataset runs.

You can use the following evaluation methods to add scores:

  • Custom scores via the SDKs or API, e.g. user feedback or your own evaluation pipelines
  • Human annotation in the Langfuse UI
  • Managed LLM-as-a-Judge evaluators
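
As a minimal programmatic example, a custom evaluation method can be as simple as a function that computes a value and stores it as a score on the trace it evaluated. The helper `exact_match` and the ids below are illustrative, and `langfuse` is the client initialized in the earlier example:

```python
def exact_match(output: str, expected: str) -> float:
    """Toy evaluation method: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# Store the result as a numeric score on the evaluated trace.
langfuse.score(
    trace_id="abc-123",                  # placeholder trace id
    name="exact_match",
    value=exact_match("Paris", "Paris"),
)
```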

Experiments

Experiments are used to loop your LLM application through Datasets (local or hosted on Langfuse) and optionally apply Evaluation Methods to the results. This lets you systematically evaluate your application and compare the performance of different inputs, prompts, models, or other parameters side-by-side under controlled conditions.

Langfuse supports Experiments via SDK and Experiments via UI. Experiments via UI rely on Datasets, Prompts, and optionally LLM-as-a-Judge Evaluators all being hosted on the Langfuse platform, and can thus be triggered and executed directly on the platform. Experiments via SDK are fully flexible and can be triggered from any external system.
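
A sketch of an Experiment via SDK, assuming the Python SDK v2 dataset API (`get_dataset`, `item.observe`) and reusing the illustrative `exact_match` evaluator from above; `my_app` and the dataset name stand in for your own application and dataset:

```python
dataset = langfuse.get_dataset("qa-eval-set")    # hypothetical dataset name

for item in dataset.items:
    # Creates a trace for this item and links it to the dataset run "prompt-v2".
    with item.observe(run_name="prompt-v2") as trace_id:
        output = my_app(item.input)              # your application under test

        # Optionally apply an evaluation method to the result.
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=exact_match(output, item.expected_output),
        )
```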

Learn more about the Experiments Data Model.
