# Evaluations
The Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.
This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation documentation for more information on how to evaluate your application in Langfuse.
## Create Scores
- `span_or_generation_obj.score()`: Scores the specific observation object.
- `span_or_generation_obj.score_trace()`: Scores the entire trace to which the object belongs.
```python
from langfuse import get_client

langfuse = get_client()

with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")

    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")

    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")
```
Score Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | Name of the score (e.g., “relevance”, “accuracy”). Required. |
| `value` | `Union[float, str]` | Score value. Float for `NUMERIC`/`BOOLEAN`, string for `CATEGORICAL`. Required. |
| `trace_id` | `str` | ID of the trace to associate with (for `create_score`). Required. |
| `observation_id` | `Optional[str]` | ID of the specific observation to score (for `create_score`). |
| `session_id` | `Optional[str]` | ID of the specific session to score (for `create_score`). |
| `score_id` | `Optional[str]` | Custom ID for the score (auto-generated if `None`). |
| `data_type` | `Optional[ScoreDataType]` | `"NUMERIC"`, `"BOOLEAN"`, or `"CATEGORICAL"`. Inferred from the value type and the score config on the server if not provided. |
| `comment` | `Optional[str]` | Optional comment or explanation for the score. |
| `config_id` | `Optional[str]` | Optional ID of a pre-defined score configuration in Langfuse. |
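When you are not inside an active span or generation context, the same parameters can be used to attach a score by ID. A minimal sketch using `langfuse.create_score()` (the trace and observation IDs below are placeholders):

```python
from langfuse import get_client

langfuse = get_client()

# Attach a score to an existing trace, and optionally to one of its observations, by ID.
langfuse.create_score(
    name="relevance",
    value=0.9,
    trace_id="<trace-id>",              # placeholder: ID of the trace to score
    observation_id="<observation-id>",  # optional: omit to score the trace only
    data_type="NUMERIC",
    comment="Answer addressed the question directly.",
)
```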
See Scoring for more details.
## Dataset Runs
Langfuse Datasets let you manage collections of inputs and their expected outputs, which form the basis for evaluating and testing your LLM application.
### Create a Dataset
- **Creating:** You can programmatically create new datasets with `langfuse.create_dataset(...)` and add items to them using `langfuse.create_dataset_item(...)`.
- **Fetching:** Retrieve a dataset and its items using `langfuse.get_dataset(name: str)`. This returns a `DatasetClient` instance, which contains a list of `DatasetItemClient` objects (accessible via `dataset.items`). Each `DatasetItemClient` holds the `input`, `expected_output`, and `metadata` for an individual data point.
```python
from langfuse import get_client

langfuse = get_client()

# Fetch an existing dataset
dataset = langfuse.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")

# Briefly: Creating a dataset and an item
new_dataset = langfuse.create_dataset(name="new-summarization-tasks")
langfuse.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)
```
### Execute a Dataset Run
After fetching your dataset, you can execute a run against it. This will create a new trace for each item in the dataset. Please refer to the Remote Dataset Run documentation for more details.
The most powerful way to use datasets is by linking your application’s executions (traces) to specific dataset items when performing an evaluation run. The `DatasetItemClient.run()` method provides a context manager to streamline this process.
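In outline, a run loop looks like this (a minimal sketch; the dataset name, run name, and `my_app` function are placeholders for your own setup):

```python
from langfuse import get_client

langfuse = get_client()

dataset = langfuse.get_dataset(name="my-eval-dataset")  # placeholder dataset name

for item in dataset.items:
    # Each iteration opens a new trace linked to this dataset item and run
    with item.run(run_name="my-eval-run") as root_span:
        output = my_app(item.input)  # placeholder for your instrumented application
        root_span.update_trace(input=item.input, output=output)
```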
**How `item.run()` works:**

When you use `with item.run(run_name="your_eval_run_name") as root_span:`:
- **Trace Creation:** A new Langfuse trace is initiated specifically for processing this dataset item within the context of the named run.
- **Trace Naming & Metadata:**
  - The trace is automatically named (e.g., “Dataset run: your_eval_run_name”).
  - Essential metadata is added to this trace, including `dataset_item_id` (the ID of `item`), `run_name`, and `dataset_id`.
- **DatasetRunItem Linking:** The SDK makes an API call to Langfuse to create a `DatasetRunItem`. This backend object formally links:
  - The `dataset_item_id`
  - The `trace_id` of the newly created trace
  - The provided `run_name`
  - Any `run_metadata` or `run_description` you pass to `item.run()`.

  This linkage is what populates the “Runs” tab for your dataset in the Langfuse UI, allowing you to see all traces associated with a particular evaluation run.
- **Contextual Span:** The context manager yields `root_span`, which is a `LangfuseSpan` object representing the root span of this new trace.
- **Automatic Nesting:** Any Langfuse observations (spans or generations) created inside the `with` block automatically become children of `root_span` and thus part of the trace linked to this dataset item and run.
Example:
```python
from langfuse import get_client

langfuse = get_client()

dataset_name = "qna-eval"
current_run_name = "qna_model_v3_run_05_20"  # Identifies this specific evaluation run

# Assume 'my_qna_app' is your instrumented application function
def my_qna_app(question: str, context: str, item_id: str, run_name: str):
    with langfuse.start_as_current_generation(
        name="qna-llm-call",
        input={"question": question, "context": context},
        metadata={"item_id": item_id, "run": run_name},  # Example metadata for the generation
        model="gpt-4o"
    ) as generation:
        # Simulate LLM call
        answer = f"Answer to '{question}' using context."  # Replace with actual LLM call
        generation.update(output={"answer": answer})

        # Update the trace with the input and output
        generation.update_trace(
            input={"question": question, "context": context},
            output={"answer": answer},
        )

        return answer

dataset = langfuse.get_dataset(name=dataset_name)  # Fetch your pre-populated dataset

for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")

    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata={"model_provider": "OpenAI", "temperature_setting": 0.7},
        run_description="Evaluation run for Q&A model v3 on May 20th"
    ) as root_span:  # root_span is the root span of the new trace for this item and run.
        # All subsequent langfuse operations within this block are part of this trace.

        # Call your application logic
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=current_run_name
        )

        print(f"  Item {item.id} processed. Trace ID: {root_span.trace_id}")

        # Optionally, score the result against the expected output
        if item.expected_output and generated_answer == item.expected_output.get("answer"):
            root_span.score_trace(name="exact_match", value=1.0)
        else:
            root_span.score_trace(name="exact_match", value=0.0)

print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")
```
By using `item.run()`, you ensure each dataset item’s processing is neatly encapsulated in its own trace, and these traces are aggregated under the specified `run_name` in the Langfuse UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
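If the evaluation runs as a short-lived script, it can help to flush the SDK’s event buffer before the process exits so that all traces and scores reach Langfuse. A minimal sketch using `langfuse.flush()`:

```python
from langfuse import get_client

langfuse = get_client()

# ... execute the dataset run as shown above ...

# Block until all buffered traces and scores have been sent to Langfuse
langfuse.flush()
```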