Evaluations

The Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.

This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation documentation for more information on how to evaluate your application in Langfuse.

Create Scores

  • span_or_generation_obj.score(): Scores the specific observation object.
  • span_or_generation_obj.score_trace(): Scores the entire trace to which the object belongs.
from langfuse import get_client
 
langfuse = get_client()
 
with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")
    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")
    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")
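
The score() call also accepts the comment parameter and BOOLEAN scores described in the parameter table below. A minimal sketch, with an illustrative score name and comment:

from langfuse import get_client

langfuse = get_client()

with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    # BOOLEAN scores take a float value of 0 or 1; "comment" attaches a rationale
    gen.score(
        name="is_factual",  # illustrative score name
        value=1.0,
        data_type="BOOLEAN",
        comment="All statements are supported by the source text.",
    )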

Score Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name of the score (e.g., “relevance”, “accuracy”). Required. |
| value | Union[float, str] | Score value. Float for NUMERIC/BOOLEAN, string for CATEGORICAL. Required. |
| trace_id | str | ID of the trace to associate with (for create_score). Required. |
| observation_id | Optional[str] | ID of the specific observation to score (for create_score). |
| session_id | Optional[str] | ID of the specific session to score (for create_score). |
| score_id | Optional[str] | Custom ID for the score (auto-generated if None). |
| data_type | Optional[ScoreDataType] | "NUMERIC", "BOOLEAN", or "CATEGORICAL". Inferred from the value type and the server-side score config if not provided. |
| comment | Optional[str] | Optional comment or explanation for the score. |
| config_id | Optional[str] | Optional ID of a pre-defined score configuration in Langfuse. |
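
The table also covers langfuse.create_score(), which attaches a score to an existing trace or observation by ID rather than through an in-context object. A minimal sketch, assuming you already have the trace ID at hand (the ID shown here is a placeholder):

from langfuse import get_client

langfuse = get_client()

# Score an already-recorded trace by its ID (placeholder ID shown)
langfuse.create_score(
    name="correctness",
    value=0.9,
    trace_id="abcdef1234567890abcdef1234567890",
    data_type="NUMERIC",
    comment="Graded offline against the reference answer.",
)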

See Scoring for more details.

Dataset Runs

Langfuse Datasets let you manage collections of inputs and their expected outputs for evaluating and testing your LLM application.

Create a Dataset

  • Creating: You can programmatically create new datasets with langfuse.create_dataset(...) and add items to them using langfuse.create_dataset_item(...).
  • Fetching: Retrieve a dataset and its items using langfuse.get_dataset(name: str). This returns a DatasetClient instance, which contains a list of DatasetItemClient objects (accessible via dataset.items). Each DatasetItemClient holds the input, expected_output, and metadata for an individual data point.
from langfuse import get_client
 
langfuse = get_client()
 
# Fetch an existing dataset
dataset = langfuse.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")
 
# Create a new dataset and add an item to it
new_dataset = langfuse.create_dataset(name="new-summarization-tasks")
langfuse.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)

Execute a Dataset Run

After fetching your dataset, you can execute a run against it. This will create a new trace for each item in the dataset. Please refer to the Remote Dataset Run documentation for more details.

The most powerful way to use datasets is by linking your application’s executions (traces) to specific dataset items when performing an evaluation run. The DatasetItemClient.run() method provides a context manager to streamline this process.

How item.run() works:

When you use with item.run(run_name="your_eval_run_name") as root_span, the SDK does the following:

  1. Trace Creation: A new Langfuse trace is initiated specifically for processing this dataset item within the context of the named run.
  2. Trace Naming & Metadata:
    • The trace is automatically named (e.g., “Dataset run: your_eval_run_name”).
    • Essential metadata is added to this trace, including dataset_item_id (the ID of the item), run_name, and dataset_id.
  3. DatasetRunItem Linking: The SDK makes an API call to Langfuse to create a DatasetRunItem. This backend object formally links:
    • The dataset_item_id
    • The trace_id of the newly created trace
    • The provided run_name
    • Any run_metadata or run_description you pass to item.run().
    This linkage is what populates the “Runs” tab for your dataset in the Langfuse UI, allowing you to see all traces associated with a particular evaluation run.
  4. Contextual Span: The context manager yields root_span, which is a LangfuseSpan object representing the root span of this new trace.
  5. Automatic Nesting: Any Langfuse observations (spans or generations) created inside the with block will automatically become children of root_span and thus part of the trace linked to this dataset item and run.

Example:

from langfuse import get_client
 
langfuse = get_client()
dataset_name = "qna-eval"
current_run_name = "qna_model_v3_run_05_20" # Identifies this specific evaluation run
 
# Assume 'my_qna_app' is your instrumented application function
def my_qna_app(question: str, context: str, item_id: str, run_name: str):
    with langfuse.start_as_current_generation(
        name="qna-llm-call",
        input={"question": question, "context": context},
        metadata={"item_id": item_id, "run": run_name}, # Example metadata for the generation
        model="gpt-4o"
    ) as generation:
        # Simulate LLM call
        answer = f"Answer to '{question}' using context." # Replace with actual LLM call
        generation.update(output={"answer": answer})
 
        # Update the trace with the input and output
        generation.update_trace(
            input={"question": question, "context": context},
            output={"answer": answer},
        )
 
        return answer
 
dataset = langfuse.get_dataset(name=dataset_name) # Fetch your pre-populated dataset
 
for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")
 
    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata={"model_provider": "OpenAI", "temperature_setting": 0.7},
        run_description="Evaluation run for Q&A model v3 on May 20th"
    ) as root_span: # root_span is the root span of the new trace for this item and run.
        # All subsequent langfuse operations within this block are part of this trace.
 
        # Call your application logic
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=current_run_name
        )
 
        print(f"  Item {item.id} processed. Trace ID: {root_span.trace_id}")
 
        # Optionally, score the result against the expected output
        if item.expected_output and generated_answer == item.expected_output.get("answer"):
            root_span.score_trace(name="exact_match", value=1.0)
        else:
            root_span.score_trace(name="exact_match", value=0.0)
 
print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")

By using item.run(), you ensure each dataset item’s processing is neatly encapsulated in its own trace, and these traces are aggregated under the specified run_name in the Langfuse UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
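
Note that the SDK queues events and sends them to Langfuse asynchronously in the background. In short-lived evaluation scripts, it is a good idea to flush the client before exiting so that all traces, scores, and dataset run items are delivered; a minimal sketch:

from langfuse import get_client

langfuse = get_client()

# ... run your dataset evaluation loop as shown above ...

# Block until all buffered events (traces, scores, dataset run items)
# have been sent to Langfuse before the process exits
langfuse.flush()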
