Marlies Mayerhofer
May 7, 2025
Dataset Run Level Scores

Score dataset runs to assess the overall quality of each run
Langfuse now supports dataset-experiment-run-level scores, enabling comprehensive evaluation of experiment runs.
What’s New
- Run-Level Scoring: Create and manage scores at the run level for holistic evaluation of experiment runs
- Experiment Metrics Support: Easily ingest overall experiment metrics such as precision, recall, and F1-scores
- Flexible API Design: Updated APIs to accommodate both trace-level and dataset-experiment-run-level scoring needs
- UI Enhancements: Visual indicators and aggregates for run scores throughout the interface
API Updates
We have extended our new v2 API and will continue to support the v1 API for the foreseeable future. The POST and DELETE APIs support both trace-level and dataset-experiment-run-level scores across v1 and v2.
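As a minimal sketch of score ingestion, the snippet below builds a run-level score payload and prepares (but does not send) the POST request. The endpoint path, field names (`datasetRunId`, `dataType`), and auth header are assumptions based on this changelog, not verified API documentation.

```python
# Sketch: ingesting a dataset-run-level score via the Langfuse public API.
# Endpoint path and payload field names are assumptions for illustration.
import json
from urllib import request


def build_run_score_payload(dataset_run_id: str, name: str, value: float) -> dict:
    """Build a score payload targeting a dataset run instead of a trace."""
    return {
        "datasetRunId": dataset_run_id,  # run-level target (assumed field name)
        "name": name,                    # e.g. "f1_score"
        "value": value,
        "dataType": "NUMERIC",           # assumed numeric score type
    }


def prepare_post(base_url: str, api_key: str, payload: dict) -> request.Request:
    """Prepare the POST request; the endpoint path is an assumption."""
    return request.Request(
        f"{base_url}/api/public/scores",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = build_run_score_payload("run_123", "f1_score", 0.87)
```

The same payload shape would carry a trace-level score by swapping `datasetRunId` for `traceId`.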
For GET APIs:
- V1 API: supports only trace-level scores and therefore requires `traceId` to remain backwards compatible
- V2 API: exactly one of `traceId`, `sessionId`, or `datasetRunId` is now required when creating scores
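The v2 "exactly one target" rule can be expressed as a small validation helper. This is an illustrative sketch, not Langfuse SDK code; only the field names come from this changelog.

```python
# Sketch of the v2 rule: a score must reference exactly one of
# traceId, sessionId, or datasetRunId. Illustrative, not SDK code.
from typing import Optional


def validate_score_target(trace_id: Optional[str] = None,
                          session_id: Optional[str] = None,
                          dataset_run_id: Optional[str] = None) -> str:
    """Return the name of the single target that is set; raise otherwise."""
    targets = {
        "traceId": trace_id,
        "sessionId": session_id,
        "datasetRunId": dataset_run_id,
    }
    set_targets = [name for name, value in targets.items() if value is not None]
    if len(set_targets) != 1:
        raise ValueError(
            f"Exactly one of {list(targets)} is required, got {set_targets or 'none'}"
        )
    return set_targets[0]
```

A request with both `traceId` and `datasetRunId`, or with neither, would be rejected under this rule.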
Why Run-Level Scores Matter
Run-level scores are particularly valuable for applications where you need to evaluate experiment performance across multiple custom test cases to find an overall metric or passing score. This enables more accurate evaluation of:
- Overall system performance across dataset runs
- Aggregate performance metrics like precision, recall, and F1-scores
- Comparative analysis between different model versions or parameters
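To make the aggregate-metric case concrete, here is a standalone sketch that folds per-item pass/fail outcomes from a dataset run into run-level precision, recall, and F1; the resulting single number is what you would attach to the run as a score. The function and its inputs are hypothetical, not part of the Langfuse API.

```python
# Sketch: aggregating item-level outcomes of a dataset run into one
# run-level metric (precision, recall, F1). Illustrative only.

def run_level_f1(predictions: list[bool], labels: list[bool]) -> dict:
    """Compute run-level precision/recall/F1 from per-item booleans."""
    tp = sum(p and l for p, l in zip(predictions, labels))   # true positives
    fp = sum(p and not l for p, l in zip(predictions, labels))  # false positives
    fn = sum(l and not p for p, l in zip(predictions, labels))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = run_level_f1(
    predictions=[True, True, False, True],
    labels=[True, False, False, True],
)
```

Each of the three values could then be ingested as its own run-level score, making runs directly comparable across model versions.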