Experiments via SDK

Experiments via SDK let you programmatically run your application or prompts against each item of a dataset and optionally apply Evaluation Methods to the results. You can use a dataset hosted on Langfuse or a local dataset as the foundation for your experiment.

See also the JS/TS SDK reference and the Python SDK reference for more details on running experiments via the SDK.

Why use Experiments via SDK?

  • Full flexibility to use your own application logic
  • Use custom scoring functions to evaluate the outputs of a single item and the full run
  • Run multiple experiments on the same dataset in parallel
  • Easy to integrate with your existing evaluation infrastructure

Experiment runner SDK

Both the Python and JS/TS SDKs provide a high-level abstraction for running an experiment on a dataset. The dataset can be either local or hosted on Langfuse. Using the Experiment runner is the recommended way to run an experiment on a dataset with our SDKs.

The experiment runner automatically handles:

  • Concurrent execution of tasks with configurable limits
  • Automatic tracing of all executions for observability
  • Flexible evaluation with both item-level and run-level evaluators
  • Error isolation so individual failures don’t stop the experiment
  • Dataset integration for easy comparison and tracking

The experiment runner supports both datasets hosted on Langfuse and local datasets. If you use a Langfuse-hosted dataset for your experiment, the SDK automatically creates a dataset run that you can inspect and compare in the Langfuse UI. For local datasets, only traces and scores (if evaluators are used) are tracked in Langfuse.

Basic Usage

Start with the simplest possible experiment to test your task function on local data. If you already have a dataset in Langfuse, see Usage with Langfuse Datasets below.

from langfuse import get_client
from langfuse.openai import OpenAI
 
# Initialize client
langfuse = get_client()
 
# Define your task function
def my_task(*, item, **kwargs):
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": question}]
    )
 
    return response.choices[0].message.content
 
 
# Run experiment on local data
local_data = [
    {"input": "What is the capital of France?"},
    {"input": "What is the capital of Germany?"},
]
 
result = langfuse.run_experiment(
    name="Geography Quiz",
    description="Testing basic functionality",
    data=local_data,
    task=my_task,
)
 
# Use format method to display results
print(result.format())

When running experiments on local data, only traces are created in Langfuse; no dataset runs are generated. Each task execution creates an individual trace for observability and debugging.

Usage with Langfuse Datasets

Run experiments directly on datasets stored in Langfuse for automatic tracing and comparison.

# Get dataset from Langfuse
dataset = langfuse.get_dataset("my-evaluation-dataset")
 
# Run experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task # see above for the task definition
)
 
# Use format method to display results
print(result.format())

When using Langfuse datasets, dataset runs are automatically created in Langfuse and are available for comparison in the UI. This enables tracking experiment performance over time and comparing different approaches on the same dataset.

Advanced Features

Enhance your experiments with evaluators and advanced configuration options.

Evaluators

Evaluators assess the quality of task outputs at the item level. They receive the input, metadata, output, and expected output for each item and return evaluation metrics that are reported as scores on the traces in Langfuse.

from langfuse import Evaluation
 
# Define evaluation functions
def accuracy_evaluator(*, input, output, expected_output, metadata, **kwargs):
    if expected_output and expected_output.lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0, comment="Correct answer found")
 
    return Evaluation(name="accuracy", value=0.0, comment="Incorrect answer")
 
def length_evaluator(*, input, output, **kwargs):
    return Evaluation(name="response_length", value=len(output), comment=f"Response has {len(output)} characters")
 
# Use multiple evaluators
result = langfuse.run_experiment(
    name="Multi-metric Evaluation",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator, length_evaluator]
)
 
print(result.format())

Run-level Evaluators

Run-level evaluators assess the full experiment results and compute aggregate metrics. When run on Langfuse datasets, these scores are attached to the full dataset run for tracking overall experiment performance.

from langfuse import Evaluation
 
def average_accuracy(*, item_results, **kwargs):
    """Calculate average accuracy across all items"""
    accuracies = [
        eval.value for result in item_results
        for eval in result.evaluations
        if eval.name == "accuracy"
    ]
 
    if not accuracies:
        return Evaluation(name="avg_accuracy", value=None)
 
    avg = sum(accuracies) / len(accuracies)
 
    return Evaluation(name="avg_accuracy", value=avg, comment=f"Average accuracy: {avg:.2%}")
 
result = langfuse.run_experiment(
    name="Comprehensive Analysis",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator],
    run_evaluators=[average_accuracy]
)
 
print(result.format())

Async Tasks and Evaluators

Both task functions and evaluators can be asynchronous.

import asyncio
from langfuse.openai import AsyncOpenAI
 
async def async_llm_task(*, item, **kwargs):
    """Async task using OpenAI"""
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": item["input"]}]
    )
 
    return response.choices[0].message.content
 
# Works seamlessly with async functions
result = langfuse.run_experiment(
    name="Async Experiment",
    data=test_data,
    task=async_llm_task,
    max_concurrency=5  # Control concurrent API calls
)
 
print(result.format())
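
An evaluator can be asynchronous as well. Below is a minimal sketch of an async, LLM-graded evaluator that uses the same Evaluation interface as above; the grading prompt, the model, and the "relevance" score name are illustrative assumptions rather than part of the SDK.

from langfuse import Evaluation
from langfuse.openai import AsyncOpenAI
 
async def async_relevance_evaluator(*, input, output, **kwargs):
    """Illustrative async evaluator: asks an LLM to grade relevance as 0 or 1"""
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                "Reply with 1 if the response answers the question, otherwise 0.\n"
                f"Question: {input}\nResponse: {output}"
            ),
        }],
    )
    grade = response.choices[0].message.content.strip()
 
    return Evaluation(
        name="relevance",
        value=1.0 if grade == "1" else 0.0,
        comment="LLM-graded relevance",
    )
 
# Async evaluators are passed exactly like synchronous ones, e.g.
# evaluators=[async_relevance_evaluator]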

Configuration Options

Customize experiment behavior with various configuration options.

result = langfuse.run_experiment(
    name="Configurable Experiment",
    run_name="Custom Run Name", # will be dataset run name if dataset is used
    description="Experiment with custom configuration",
    data=test_data,
    task=my_task,
    evaluators=[accuracy_evaluator],
    run_evaluators=[average_accuracy],
    max_concurrency=10,  # Max concurrent executions
    metadata={  # Attached to all traces
        "model": "gpt-4",
        "temperature": 0.7,
        "version": "v1.2.0"
    }
)
 
print(result.format())

Testing in CI Environments

Integrate the experiment runner with testing frameworks like Pytest and Vitest to run automated evaluations in your CI pipeline. Use evaluators to create assertions that can fail tests based on evaluation results.

# test_geography_experiment.py
import pytest
from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI
 
# Test data for European capitals
test_data = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Germany?", "expected": "Berlin"},
    {"input": "What is the capital of Spain?", "expected": "Madrid"},
]
 
def geography_task(*, item, **kwargs):
    """Task function that answers geography questions"""
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content
 
def accuracy_evaluator(*, input, output, expected_output, **kwargs):
    """Evaluator that checks if the expected answer is in the output"""
    if expected_output and expected_output.lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0)
 
    return Evaluation(name="accuracy", value=0.0)
 
def average_accuracy_evaluator(*, item_results, **kwargs):
    """Run evaluator that calculates average accuracy across all items"""
    accuracies = [
        eval.value for result in item_results
        for eval in result.evaluations if eval.name == "accuracy"
    ]
 
    if not accuracies:
        return Evaluation(name="avg_accuracy", value=None)
 
    avg = sum(accuracies) / len(accuracies)
 
    return Evaluation(name="avg_accuracy", value=avg, comment=f"Average accuracy: {avg:.2%}")
 
@pytest.fixture
def langfuse_client():
    """Initialize Langfuse client for testing"""
    return get_client()
 
def test_geography_accuracy_passes(langfuse_client):
    """Test that passes when accuracy is above threshold"""
    result = langfuse_client.run_experiment(
        name="Geography Test - Should Pass",
        data=test_data,
        task=geography_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )
 
    # Access the run evaluator result directly
    avg_accuracy = next(
        eval.value for eval in result.run_evaluations
        if eval.name == "avg_accuracy"
    )
 
    # Assert minimum accuracy threshold
    assert avg_accuracy >= 0.8, f"Average accuracy {avg_accuracy:.2f} below threshold 0.8"
 
def test_geography_accuracy_fails(langfuse_client):
    """Example test that demonstrates failure conditions"""
    # Use a weaker model or harder questions to demonstrate test failure
    def failing_task(*, item, **kwargs):
        # Simulate a task that gives wrong answers
        return "I don't know"
 
    result = langfuse_client.run_experiment(
        name="Geography Test - Should Fail",
        data=test_data,
        task=failing_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )
 
    # Access the run evaluator result directly
    avg_accuracy = next(
        eval.value for eval in result.run_evaluations
        if eval.name == "avg_accuracy"
    )
 
    # This test will fail because the task gives wrong answers
    with pytest.raises(AssertionError):
        assert avg_accuracy >= 0.8, f"Expected test to fail with low accuracy: {avg_accuracy:.2f}"

These examples show how to use the experiment runner’s evaluation results to create meaningful test assertions in your CI pipeline. Tests can fail when accuracy drops below acceptable thresholds, ensuring model quality standards are maintained automatically.

Autoevals Integration

Access pre-built evaluation functions through the autoevals library integration.

The Python SDK supports AutoEvals evaluators through direct integration:

from langfuse.experiment import create_evaluator_from_autoevals
from autoevals.llm import Factuality
 
evaluator = create_evaluator_from_autoevals(Factuality())
 
result = langfuse.run_experiment(
    name="Autoevals Integration Test",
    data=test_data,
    task=my_task,
    evaluators=[evaluator]
)
 
print(result.format())

Low-level SDK methods

If you need more control over the dataset run, you can use the low-level SDK methods to loop through the dataset items and execute your application logic.

Load the dataset

Use the Python or JS/TS SDK to load the dataset.

from langfuse import get_client
 
dataset = get_client().get_dataset("<dataset_name>")

Instrument your application

First, we create our application runner helper function. This function will be called for every dataset item in the next step. If you already use Langfuse for production observability, you do not need to change your application code.

ℹ️ For a dataset run, it is important that your application creates Langfuse traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.

Assume you already have a Langfuse-instrumented LLM-app:

app.py
from langfuse import get_client, observe
from langfuse.openai import OpenAI
 
@observe
def my_llm_function(question: str):
    response = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}]
    )
    output = response.choices[0].message.content
 
    # Update trace input / output
    get_client().update_current_trace(input=question, output=output)
 
    return output

See Python SDK docs for more details.

Run experiment on dataset

When running an experiment on a dataset, the application to be tested is executed for each item in the dataset. The resulting execution trace is then linked to the corresponding dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.

You may then execute that LLM-app for each dataset item to create a dataset run:

execute_dataset.py
from langfuse import get_client
from .app import my_llm_function
 
# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")
 
# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)
 
        # Optionally: Add scores computed in your experiment runner, e.g. json equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )
 
# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()

See Python SDK docs for details on the new OpenTelemetry-based SDK.

If you want to learn more about how adding evaluation scores from code works, please refer to the custom scores documentation.

Optionally: Run Evals in Langfuse

In the code above, we show how to add scores to the dataset run from your experiment code.

Alternatively, you can run evals in Langfuse. This is useful if you want to use the LLM-as-a-judge feature to evaluate the outputs of the dataset runs. We have recorded a 10 min walkthrough on how this works end-to-end.

Compare dataset runs

After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.

Optional: Trigger SDK Experiment from UI

When setting up Experiments via SDK, it can be useful to allow triggering the experiment runs from the Langfuse UI.

You need to set up a webhook to receive the trigger request from Langfuse.

  • Navigate to Your Project > Datasets
  • Click on the dataset you want to set up a remote experiment trigger for


Open the setup page

Click on Start Experiment to open the setup page

Then click the setup button below Custom Experiment.

Configure the webhook

Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered. Specify a default config that will be sent to your webhook. Users can modify this when triggering experiments.

Trigger experiments

Once configured, team members can trigger remote experiments via the Run button under the Custom Experiment option. Langfuse will send the dataset metadata (ID and name) along with any custom configuration to your webhook.

Typical workflow: Your webhook receives the request, fetches the dataset from Langfuse, runs your application against the dataset items, evaluates the results, and ingests the scores back into Langfuse as a new Experiment run.
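
To illustrate this workflow, below is a minimal sketch of such a webhook service built with FastAPI. The endpoint path, the payload field names (datasetName, config), and the placeholder task are assumptions for illustration; adapt them to the actual request body Langfuse sends to your endpoint and to your own application logic.

webhook_server.py
from fastapi import FastAPI, Request
from langfuse import get_client
 
app = FastAPI()
langfuse = get_client()
 
def my_task(*, item, **kwargs):
    # Placeholder task; replace with your application logic.
    # Items fetched from a Langfuse dataset expose their input via item.input.
    return f"Processed: {item.input}"
 
@app.post("/langfuse-experiment-trigger")
async def handle_experiment_trigger(request: Request):
    payload = await request.json()
 
    # Field names are assumptions; inspect the payload Langfuse actually sends
    dataset_name = payload.get("datasetName")
    config = payload.get("config", {})
 
    # Fetch the dataset from Langfuse and run the experiment against it
    dataset = langfuse.get_dataset(dataset_name)
    result = dataset.run_experiment(
        name=config.get("run_name", "Remote-triggered experiment"),
        description="Triggered via webhook from the Langfuse UI",
        task=my_task,
    )
 
    # Ensure all traces and scores are sent to Langfuse
    langfuse.flush()
 
    return {"status": "completed", "summary": result.format()}

The dataset run created by dataset.run_experiment then shows up in the Langfuse UI like any other experiment run and can be compared in the dataset runs table.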
