Evaluations

The Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.

This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation documentation for more information on how to evaluate your application in Langfuse.

Create Scores

  • span_or_generation_obj.score(): Scores the specific observation object.
  • span_or_generation_obj.score_trace(): Scores the entire trace to which the object belongs.
from langfuse import get_client
 
langfuse = get_client()
 
with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")
    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")
    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")
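
The score() call also accepts the comment parameter and BOOLEAN scores described in the parameter table below. A minimal sketch, with an illustrative score name and comment:

from langfuse import get_client

langfuse = get_client()

with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    # BOOLEAN scores take a float value of 0 or 1; "comment" attaches a rationale
    gen.score(
        name="is_factual",  # illustrative score name
        value=1.0,
        data_type="BOOLEAN",
        comment="All statements are supported by the source text.",
    )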

Score Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name of the score (e.g., “relevance”, “accuracy”). Required. |
| value | Union[float, str] | Score value. Float for NUMERIC/BOOLEAN, string for CATEGORICAL. Required. |
| trace_id | str | ID of the trace to associate with (for create_score). Required. |
| observation_id | Optional[str] | ID of the specific observation to score (for create_score). |
| session_id | Optional[str] | ID of the specific session to score (for create_score). |
| score_id | Optional[str] | Custom ID for the score (auto-generated if None). |
| data_type | Optional[ScoreDataType] | "NUMERIC", "BOOLEAN", or "CATEGORICAL". Inferred from the value type and the server-side score config if not provided. |
| comment | Optional[str] | Optional comment or explanation for the score. |
| config_id | Optional[str] | Optional ID of a pre-defined score configuration in Langfuse. |
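
The table also covers langfuse.create_score(), which attaches a score to an existing trace or observation by ID rather than through an in-context object. A minimal sketch, assuming you already have the trace ID at hand (the ID shown here is a placeholder):

from langfuse import get_client

langfuse = get_client()

# Score an already-recorded trace by its ID (placeholder ID shown)
langfuse.create_score(
    name="correctness",
    value=0.9,
    trace_id="abcdef1234567890abcdef1234567890",
    data_type="NUMERIC",
    comment="Graded offline against the reference answer.",
)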

See Scoring for more details.

Dataset Runs

Langfuse Datasets let you manage collections of inputs and their expected outputs for evaluating and testing your LLM application.

Create a Dataset

  • Creating: You can programmatically create new datasets with langfuse.create_dataset(...) and add items to them using langfuse.create_dataset_item(...).
  • Fetching: Retrieve a dataset and its items using langfuse.get_dataset(name: str). This returns a DatasetClient instance, which contains a list of DatasetItemClient objects (accessible via dataset.items). Each DatasetItemClient holds the input, expected_output, and metadata for an individual data point.
from langfuse import get_client
 
langfuse = get_client()
 
# Fetch an existing dataset
dataset = langfuse.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")
 
# Create a new dataset and add an item to it
new_dataset = langfuse.create_dataset(name="new-summarization-tasks")
langfuse.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)

Execute a Dataset Run

After fetching your dataset, you can execute a run against it. This will create a new trace for each item in the dataset. Please refer to the Remote Dataset Run documentation for more details.

The most powerful way to use datasets is by linking your application’s executions (traces) to specific dataset items when performing an evaluation run. The DatasetItemClient.run() method provides a context manager to streamline this process.

How item.run() works:

When you use with item.run(run_name="your_eval_run_name") as root_span, the SDK does the following:

  1. Trace Creation: A new Langfuse trace is initiated specifically for processing this dataset item within the context of the named run.
  2. Trace Naming & Metadata:
    • The trace is automatically named (e.g., “Dataset run: your_eval_run_name”).
    • Essential metadata is added to this trace, including dataset_item_id (the ID of the item), run_name, and dataset_id.
  3. DatasetRunItem Linking: The SDK makes an API call to Langfuse to create a DatasetRunItem. This backend object formally links:
    • The dataset_item_id
    • The trace_id of the newly created trace
    • The provided run_name
    • Any run_metadata or run_description you pass to item.run().
    This linkage is what populates the “Runs” tab for your dataset in the Langfuse UI, allowing you to see all traces associated with a particular evaluation run.
  4. Contextual Span: The context manager yields root_span, which is a LangfuseSpan object representing the root span of this new trace.
  5. Automatic Nesting: Any Langfuse observations (spans or generations) created inside the with block will automatically become children of root_span and thus part of the trace linked to this dataset item and run.

Example:

from langfuse import get_client
 
langfuse = get_client()
dataset_name = "qna-eval"
current_run_name = "qna_model_v3_run_05_20" # Identifies this specific evaluation run
 
# Assume 'my_qna_app' is your instrumented application function
def my_qna_app(question: str, context: str, item_id: str, run_name: str):
    with langfuse.start_as_current_generation(
        name="qna-llm-call",
        input={"question": question, "context": context},
        metadata={"item_id": item_id, "run": run_name}, # Example metadata for the generation
        model="gpt-4o"
    ) as generation:
        # Simulate LLM call
        answer = f"Answer to '{question}' using context." # Replace with actual LLM call
        generation.update(output={"answer": answer})
 
        # Update the trace with the input and output
        generation.update_trace(
            input={"question": question, "context": context},
            output={"answer": answer},
        )
 
        return answer
 
dataset = langfuse.get_dataset(name=dataset_name) # Fetch your pre-populated dataset
 
for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")
 
    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata={"model_provider": "OpenAI", "temperature_setting": 0.7},
        run_description="Evaluation run for Q&A model v3 on May 20th"
    ) as root_span: # root_span is the root span of the new trace for this item and run.
        # All subsequent langfuse operations within this block are part of this trace.
 
        # Call your application logic
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=current_run_name
        )
 
        print(f"  Item {item.id} processed. Trace ID: {root_span.trace_id}")
 
        # Optionally, score the result against the expected output
        if item.expected_output and generated_answer == item.expected_output.get("answer"):
            root_span.score_trace(name="exact_match", value=1.0)
        else:
            root_span.score_trace(name="exact_match", value=0.0)
 
print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")

By using item.run(), you ensure each dataset item’s processing is neatly encapsulated in its own trace, and these traces are aggregated under the specified run_name in the Langfuse UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
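
Note that the SDK queues events and sends them to Langfuse asynchronously in the background. In short-lived evaluation scripts, it is a good idea to flush the client before exiting so that all traces, scores, and dataset run items are delivered; a minimal sketch:

from langfuse import get_client

langfuse = get_client()

# ... run your dataset evaluation loop as shown above ...

# Block until all buffered events (traces, scores, dataset run items)
# have been sent to Langfuse before the process exits
langfuse.flush()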
