How to Retrieve Experiment Scores?
Terminology Note: “Experiment” and “dataset run” are used interchangeably throughout Langfuse. We are moving toward deprecating the term “dataset run” in favor of “experiment”, but both terms currently refer to the same concept.
Langfuse supports two types of experiment scores:
- Experiment-level scores: Overall metrics for the entire experiment run (e.g., precision, recall, F1 score). These scores are immutable and represent aggregate performance. Learn more about run-level scores.
- Experiment-item-level scores: Scores for individual items within an experiment (e.g., per-generated-output evaluations).
Via API/SDK
Experiment-Level Scores
Support coming soon: fetching experiment-level scores via the Langfuse SDK or the scores API with a datasetRunId parameter is planned but not yet available.
See the Scores Data Model for details on score properties.
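Once support lands, the call will presumably be a standard scores-API query filtered by run. Below is a minimal sketch against the public REST API; the datasetRunId query parameter is the planned filter and is not yet supported, and the host and credentials are placeholders for your own project:

import os
import requests

# Hypothetical: datasetRunId is the planned, not-yet-available filter
resp = requests.get(
    "https://cloud.langfuse.com/api/public/scores",
    params={"datasetRunId": "your-dataset-run-id"},
    auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
)
resp.raise_for_status()
for score in resp.json()["data"]:
    print(score["name"], score["value"])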
Experiment-Item-Level Scores
Current Workaround: The method below is a workaround for retrieving experiment-item-level scores. Note:
- We recommend the Experiment Runner SDK, which provides direct access to all scores in context.
- We may add a dedicated API route for experiment scores/metrics in the near future.
To retrieve experiment-item-level scores programmatically:
Step 1: Fetch the experiment run
Get the experiment run details including all trace IDs:
from langfuse import Langfuse
from urllib.parse import quote
langfuse = Langfuse()
dataset_name = "your-dataset-name"
run_name = "your-run-name"
# URL encode names if they contain special characters
encoded_dataset_name = quote(dataset_name, safe="")
encoded_run_name = quote(run_name, safe="")
# Fetch experiment run
run = langfuse.get_dataset_run(
    dataset_name=encoded_dataset_name,
    run_name=encoded_run_name
)
# Extract trace IDs
trace_ids = [item["trace_id"] for item in run["dataset_run_items"]]

Step 2: Fetch scores for each trace
Use the trace IDs to retrieve scores for each experiment item:
# Fetch trace details including scores
for trace_id in trace_ids:
    trace = langfuse.get_trace(trace_id)
    scores = trace["scores"]
    print(f"Trace {trace_id}: {scores}")
Recommended: Use Experiment Runner SDK
For a better developer experience, use the Experiment Runner SDK, which provides built-in access to all experiment scores and results:
from langfuse import get_client
langfuse = get_client()
# Run experiment with automatic score collection
result = langfuse.run_experiment(
name="my-experiment",
data=my_dataset,
task=my_task,
evaluators=[my_evaluator] # optional
)
# Access all scores directly
print(result.format()) # includes all scores in formatted output
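In the example above, my_evaluator stands in for any callable that scores a single item. Here is a minimal sketch of one, assuming evaluators receive keyword arguments such as input, output, and expected_output and return a name/value pair (see the Experiment Runner SDK reference for the exact signature):

def my_evaluator(*, input, output, expected_output=None, **kwargs):
    # Hypothetical exact-match metric; replace with your own logic
    matches = expected_output is not None and output == expected_output
    return {"name": "exact_match", "value": 1.0 if matches else 0.0}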