Experiments via SDK

Experiments via SDK let you programmatically run your application or prompts against each item of a dataset and optionally apply Evaluation Methods to the results. You can use a dataset hosted on Langfuse or a local dataset as the foundation for your experiment.

See also the JS/TS SDK reference and the Python SDK reference for more details on running experiments via the SDK.

Why use Experiments via SDK?

  • Full flexibility to use your own application logic
  • Use custom scoring functions to evaluate the outputs of a single item and the full run
  • Run multiple experiments on the same dataset in parallel
  • Easy to integrate with your existing evaluation infrastructure

Experiment runner SDK

Both the Python and JS/TS SDKs provide a high-level abstraction for running an experiment on a dataset. The dataset can be either local or hosted on Langfuse. Using the Experiment runner is the recommended way to run an experiment on a dataset with our SDKs.

The experiment runner automatically handles:

  • Concurrent execution of tasks with configurable limits
  • Automatic tracing of all executions for observability
  • Flexible evaluation with both item-level and run-level evaluators
  • Error isolation so individual failures don’t stop the experiment
  • Dataset integration for easy comparison and tracking

The experiment runner supports both datasets hosted on Langfuse and local datasets. If you use a Langfuse-hosted dataset for your experiment, the SDK automatically creates a dataset run that you can inspect and compare in the Langfuse UI. For local datasets, only traces and scores (if evaluators are used) are tracked in Langfuse.

Basic Usage

Start with the simplest possible experiment to test your task function on local data. If you already have a dataset in Langfuse, see Usage with Langfuse Datasets below.

from langfuse import get_client
from langfuse.openai import OpenAI
 
# Initialize client
langfuse = get_client()
 
# Define your task function
def my_task(*, item, **kwargs):
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": question}]
    )
 
    return response.choices[0].message.content
 
 
# Run experiment on local data
local_data = [
    {"input": "What is the capital of France?"},
    {"input": "What is the capital of Germany?"},
]
 
result = langfuse.run_experiment(
    name="Geography Quiz",
    description="Testing basic functionality",
    data=local_data,
    task=my_task,
)
 
# Use format method to display results
print(result.format())

When running experiments on local data, only traces are created in Langfuse; no dataset runs are generated. Each task execution creates an individual trace for observability and debugging.

Usage with Langfuse Datasets

Run experiments directly on datasets stored in Langfuse for automatic tracing and comparison.

# Get dataset from Langfuse
dataset = langfuse.get_dataset("my-evaluation-dataset")
 
# Run experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task # see above for the task definition
)
 
# Use format method to display results
print(result.format())

When using Langfuse datasets, dataset runs are automatically created in Langfuse and are available for comparison in the UI. This enables tracking experiment performance over time and comparing different approaches on the same dataset.

Advanced Features

Enhance your experiments with evaluators and advanced configuration options.

Evaluators

Evaluators assess the quality of task outputs at the item level. They receive the input, metadata, output, and expected output for each item and return evaluation metrics that are reported as scores on the traces in Langfuse.

from langfuse import Evaluation
 
# Define evaluation functions
def accuracy_evaluator(*, input, output, expected_output, metadata, **kwargs):
    if expected_output and expected_output.lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0, comment="Correct answer found")
 
    return Evaluation(name="accuracy", value=0.0, comment="Incorrect answer")
 
def length_evaluator(*, input, output, **kwargs):
    return Evaluation(name="response_length", value=len(output), comment=f"Response has {len(output)} characters")
 
# Use multiple evaluators
result = langfuse.run_experiment(
    name="Multi-metric Evaluation",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator, length_evaluator]
)
 
print(result.format())

Run-level Evaluators

Run-level evaluators assess the full experiment results and compute aggregate metrics. When run on Langfuse datasets, these scores are attached to the full dataset run for tracking overall experiment performance.

from langfuse import Evaluation
 
def average_accuracy(*, item_results, **kwargs):
    """Calculate average accuracy across all items"""
    accuracies = [
        eval.value for result in item_results
        for eval in result.evaluations
        if eval.name == "accuracy"
    ]
 
    if not accuracies:
        return Evaluation(name="avg_accuracy", value=None)
 
    avg = sum(accuracies) / len(accuracies)
 
    return Evaluation(name="avg_accuracy", value=avg, comment=f"Average accuracy: {avg:.2%}")
 
result = langfuse.run_experiment(
    name="Comprehensive Analysis",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator],
    run_evaluators=[average_accuracy]
)
 
print(result.format())

Async Tasks and Evaluators

Both task functions and evaluators can be asynchronous.

import asyncio
from langfuse.openai import AsyncOpenAI
 
async def async_llm_task(*, item, **kwargs):
    """Async task using OpenAI"""
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": item["input"]}]
    )
 
    return response.choices[0].message.content
 
# Works seamlessly with async functions
result = langfuse.run_experiment(
    name="Async Experiment",
    data=test_data,
    task=async_llm_task,
    max_concurrency=5  # Control concurrent API calls
)
 
print(result.format())
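
An evaluator can be asynchronous as well. Below is a minimal sketch of an async, LLM-graded evaluator that uses the same Evaluation interface as above; the grading prompt, the model, and the "relevance" score name are illustrative assumptions rather than part of the SDK.

from langfuse import Evaluation
from langfuse.openai import AsyncOpenAI
 
async def async_relevance_evaluator(*, input, output, **kwargs):
    """Illustrative async evaluator: asks an LLM to grade relevance as 0 or 1"""
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                "Reply with 1 if the response answers the question, otherwise 0.\n"
                f"Question: {input}\nResponse: {output}"
            ),
        }],
    )
    grade = response.choices[0].message.content.strip()
 
    return Evaluation(
        name="relevance",
        value=1.0 if grade == "1" else 0.0,
        comment="LLM-graded relevance",
    )
 
# Async evaluators are passed exactly like synchronous ones, e.g.
# evaluators=[async_relevance_evaluator]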

Configuration Options

Customize experiment behavior with various configuration options.

result = langfuse.run_experiment(
    name="Configurable Experiment",
    run_name="Custom Run Name", # will be dataset run name if dataset is used
    description="Experiment with custom configuration",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator],
    run_evaluators=[average_accuracy],
    max_concurrency=10,  # Max concurrent executions
    metadata={  # Attached to all traces
        "model": "gpt-4",
        "temperature": 0.7,
        "version": "v1.2.0"
    }
)
 
print(result.format())

Testing in CI Environments

Integrate the experiment runner with testing frameworks like Pytest and Vitest to run automated evaluations in your CI pipeline. Use evaluators to create assertions that can fail tests based on evaluation results.

# test_geography_experiment.py
import pytest
from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI
 
# Test data for European capitals
test_data = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Germany?", "expected": "Berlin"},
    {"input": "What is the capital of Spain?", "expected": "Madrid"},
]
 
def geography_task(*, item, **kwargs):
    """Task function that answers geography questions"""
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content
 
def accuracy_evaluator(*, input, output, expected_output, **kwargs):
    """Evaluator that checks if the expected answer is in the output"""
    if expected_output and expected_output.lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0)
 
    return Evaluation(name="accuracy", value=0.0)
 
def average_accuracy_evaluator(*, item_results, **kwargs):
    """Run evaluator that calculates average accuracy across all items"""
    accuracies = [
        eval.value for result in item_results
        for eval in result.evaluations if eval.name == "accuracy"
    ]
 
    if not accuracies:
        return Evaluation(name="avg_accuracy", value=None)
 
    avg = sum(accuracies) / len(accuracies)
 
    return Evaluation(name="avg_accuracy", value=avg, comment=f"Average accuracy: {avg:.2%}")
 
@pytest.fixture
def langfuse_client():
    """Initialize Langfuse client for testing"""
    return get_client()
 
def test_geography_accuracy_passes(langfuse_client):
    """Test that passes when accuracy is above threshold"""
    result = langfuse_client.run_experiment(
        name="Geography Test - Should Pass",
        data=test_data,
        task=geography_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )
 
    # Access the run evaluator result directly
    avg_accuracy = next(
        eval.value for eval in result.run_evaluations
        if eval.name == "avg_accuracy"
    )
 
    # Assert minimum accuracy threshold
    assert avg_accuracy >= 0.8, f"Average accuracy {avg_accuracy:.2f} below threshold 0.8"
 
def test_geography_accuracy_fails(langfuse_client):
    """Example test that demonstrates failure conditions"""
    # Use a weaker model or harder questions to demonstrate test failure
    def failing_task(*, item, **kwargs):
        # Simulate a task that gives wrong answers
        return "I don't know"
 
    result = langfuse_client.run_experiment(
        name="Geography Test - Should Fail",
        data=test_data,
        task=failing_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )
 
    # Access the run evaluator result directly
    avg_accuracy = next(
        eval.value for eval in result.run_evaluations
        if eval.name == "avg_accuracy"
    )
 
    # This test will fail because the task gives wrong answers
    with pytest.raises(AssertionError):
        assert avg_accuracy >= 0.8, f"Expected test to fail with low accuracy: {avg_accuracy:.2f}"

These examples show how to use the experiment runner’s evaluation results to create meaningful test assertions in your CI pipeline. Tests can fail when accuracy drops below acceptable thresholds, ensuring model quality standards are maintained automatically.

Autoevals Integration

Access pre-built evaluation functions through the autoevals library integration.

The Python SDK supports AutoEvals evaluators through direct integration:

from langfuse.experiment import create_evaluator_from_autoevals
from autoevals.llm import Factuality
 
evaluator = create_evaluator_from_autoevals(Factuality())
 
result = langfuse.run_experiment(
    name="Autoevals Integration Test",
    data=test_data,
    task=my_task,
    evaluators=[evaluator]
)
 
print(result.format())

Low-level SDK methods

If you need more control over the dataset run, you can use the low-level SDK methods to loop through the dataset items and execute your application logic.

Load the dataset

Use the Python or JS/TS SDK to load the dataset.

from langfuse import get_client
 
dataset = get_client().get_dataset("<dataset_name>")

Instrument your application

First, we create our application runner helper function. This function will be called for every dataset item in the next step. If you already use Langfuse for production observability, you do not need to change your application code.

ℹ️ For a dataset run, it is important that your application creates Langfuse traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.

Assume you already have a Langfuse-instrumented LLM-app:

app.py
from langfuse import get_client, observe
from langfuse.openai import OpenAI
 
@observe
def my_llm_function(question: str):
    response = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}]
    )
    output = response.choices[0].message.content
 
    # Update trace input / output
    get_client().update_current_trace(input=question, output=output)
 
    return output

See Python SDK docs for more details.

Run experiment on dataset

When running an experiment on a dataset, the application to be tested is executed for each item in the dataset. The resulting execution trace is then linked to the corresponding dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.

You may then execute that LLM-app for each dataset item to create a dataset run:

execute_dataset.py
from langfuse import get_client
from .app import my_llm_function
 
# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")
 
# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)
 
        # Optionally: Add scores computed in your experiment runner, e.g. json equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )
 
# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()

See Python SDK docs for details on the new OpenTelemetry-based SDK.

If you want to learn more about how adding evaluation scores from code works, please refer to the custom scores documentation.

Optionally: Run Evals in Langfuse

In the code above, we show how to add scores to the dataset run from your experiment code.

Alternatively, you can run evals in Langfuse. This is useful if you want to use the LLM-as-a-judge feature to evaluate the outputs of the dataset runs. We have recorded a 10 min walkthrough on how this works end-to-end.

Compare dataset runs

After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.

Optional: Trigger SDK Experiment from UI

When setting up Experiments via SDK, it can be useful to allow triggering the experiment runs from the Langfuse UI.

You need to set up a webhook to receive the trigger request from Langfuse.

  • Navigate to Your Project > Datasets
  • Click on the dataset you want to set up a remote experiment trigger for


Open the setup page

Click on Start Experiment to open the setup page

Then click the setup button below Custom Experiment.

Configure the webhook

Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered. Specify a default config that will be sent to your webhook. Users can modify this when triggering experiments.

Trigger experiments

Once configured, team members can trigger remote experiments via the Run button under the Custom Experiment option. Langfuse will send the dataset metadata (ID and name) along with any custom configuration to your webhook.

Typical workflow: Your webhook receives the request, fetches the dataset from Langfuse, runs your application against the dataset items, evaluates the results, and ingests the scores back into Langfuse as a new Experiment run.
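
To illustrate this workflow, below is a minimal sketch of such a webhook service built with FastAPI. The endpoint path, the payload field names (datasetName, config), and the placeholder task are assumptions for illustration; adapt them to the actual request body Langfuse sends to your endpoint and to your own application logic.

webhook_server.py
from fastapi import FastAPI, Request
from langfuse import get_client
 
app = FastAPI()
langfuse = get_client()
 
def my_task(*, item, **kwargs):
    # Placeholder task; replace with your application logic.
    # Items fetched from a Langfuse dataset expose their input via item.input.
    return f"Processed: {item.input}"
 
@app.post("/langfuse-experiment-trigger")
async def handle_experiment_trigger(request: Request):
    payload = await request.json()
 
    # Field names are assumptions; inspect the payload Langfuse actually sends
    dataset_name = payload.get("datasetName")
    config = payload.get("config", {})
 
    # Fetch the dataset from Langfuse and run the experiment against it
    dataset = langfuse.get_dataset(dataset_name)
    result = dataset.run_experiment(
        name=config.get("run_name", "Remote-triggered experiment"),
        description="Triggered via webhook from the Langfuse UI",
        task=my_task,
    )
 
    # Ensure all traces and scores are sent to Langfuse
    langfuse.flush()
 
    return {"status": "completed", "summary": result.format()}

The dataset run created by dataset.run_experiment then shows up in the Langfuse UI like any other experiment run and can be compared in the dataset runs table.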
