FAQ

How to Retrieve Experiment Scores?

Terminology Note: “Experiment” and “dataset run” are used interchangeably throughout Langfuse. We are moving toward deprecating the term “dataset run” in favor of “experiment”, but both terms currently refer to the same concept.

Langfuse supports two types of experiment scores:

  1. Experiment-level scores: Overall metrics for the entire experiment run (e.g., precision, recall, F1 score). These scores are immutable and represent aggregate performance. Learn more about run-level scores.
  2. Experiment-item-level scores: Scores for individual items within an experiment (e.g., per-generated-output evaluations). The sketch below illustrates how the two types differ.
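
As a rough illustration of the difference, here is a minimal Python sketch. The payloads are simplified, and datasetRunId / traceId are shown only as the fields that link a score to a run or a trace (see the Scores Data Model for the full set of score properties):

# Illustrative only: simplified score shapes, not exact API responses.

# An experiment-level score is attached to the run itself and summarizes it.
experiment_level_score = {
    "name": "f1_score",
    "value": 0.87,
    "datasetRunId": "run-id",  # links the score to the experiment run
}

# An experiment-item-level score is attached to the trace of a single item.
item_level_score = {
    "name": "correctness",
    "value": 1.0,
    "traceId": "trace-id",  # links the score to one experiment item's trace
}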

Via API/SDK

Experiment-Level Scores

Support coming soon: fetching experiment-level scores via the Langfuse SDK or the scores API using the datasetRunId parameter is planned. See the Scores Data Model for details on score properties.

Experiment-Item-Level Scores

⚠️ Current Workaround: The method below is a workaround for retrieving experiment-item-level scores. We recommend using the Experiment Runner SDK instead, which provides direct access to all scores in context. A dedicated API route for experiment scores/metrics may be added in the near future.

To retrieve experiment-item-level scores programmatically:

Step 1: Fetch the experiment run

Get the experiment run details including all trace IDs:

from langfuse import Langfuse
from urllib.parse import quote
 
langfuse = Langfuse()
 
dataset_name = "your-dataset-name"
run_name = "your-run-name"
 
# URL encode names if they contain special characters
encoded_dataset_name = quote(dataset_name, safe="")
encoded_run_name = quote(run_name, safe="")
 
# Fetch experiment run
run = langfuse.get_dataset_run(
    dataset_name=encoded_dataset_name,
    run_name=encoded_run_name
)
 
# Extract trace IDs
trace_ids = [item["trace_id"] for item in run["dataset_run_items"]]
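
If you later want to attribute scores back to the underlying dataset items (not just traces), you can keep the item-to-trace mapping from the same response. A minimal sketch, assuming each run item exposes a dataset_item_id key alongside trace_id in the same dict style used above:

# Optional: map dataset items to traces so per-trace scores can be attributed
# back to the originating dataset item in Step 2.
# Assumes a dataset_item_id key on each run item (mirrors the trace_id access above).
item_to_trace = {
    item["dataset_item_id"]: item["trace_id"]
    for item in run["dataset_run_items"]
}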

Step 2: Fetch scores for each trace

Use the trace IDs to retrieve scores for each experiment item:

# Fetch trace details including scores
for trace_id in trace_ids:
    trace = langfuse.get_trace(trace_id)
    scores = trace["scores"]
 
    print(f"Trace {trace_id}: {scores}")
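
To turn these per-item scores into a simple experiment-level summary, you can aggregate them by score name in plain Python. A minimal sketch, assuming each score is a dict exposing name and value keys in the same style as above (only numeric scores are aggregated):

from collections import defaultdict

# Collect numeric score values per score name across all experiment items,
# then print a simple average as an ad-hoc run-level summary.
values_by_name = defaultdict(list)
for trace_id in trace_ids:
    trace = langfuse.get_trace(trace_id)
    for score in trace["scores"]:
        # Skip categorical/boolean scores; only numeric values are averaged here.
        if isinstance(score.get("value"), (int, float)):
            values_by_name[score["name"]].append(score["value"])

for name, values in values_by_name.items():
    print(f"{name}: mean={sum(values) / len(values):.3f} (n={len(values)})")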

Via Experiment Runner SDK (Recommended)

For a better developer experience, use the Experiment Runner SDK, which provides built-in access to all experiment scores and results:

from langfuse import get_client
 
langfuse = get_client()
 
# Run experiment with automatic score collection
result = langfuse.run_experiment(
    name="my-experiment",
    data=my_dataset,
    task=my_task,
    evaluators=[my_evaluator]  # optional
)
 
# Access all scores directly
print(result.format())  # includes all scores in formatted output
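
Beyond the formatted output, the result object also exposes the per-item evaluations programmatically. A minimal sketch of iterating them; the attribute names used here (item_results, evaluations, name, value) are assumptions about the result shape, so check the Experiment Runner SDK reference for the exact structure:

# Illustrative only: the attribute names below are assumptions about the result
# object; consult the Experiment Runner SDK reference for the exact shape.
for item_result in result.item_results:
    for evaluation in item_result.evaluations:
        print(f"{evaluation.name}: {evaluation.value}")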