Remote Dataset Runs
Once you have created a dataset, you can use it to test how your application performs on different inputs. Remote Dataset Runs programmatically loop your application or prompts through a dataset and optionally apply Evaluation Methods to the results.
They are called “Remote Dataset Runs” because they can make use of “remote” or external logic and code.
Optionally, you can also trigger Remote Dataset Runs from the Langfuse UI, which calls them via a webhook.
Why use Remote Dataset Runs?
- Full flexibility to use your own application logic
- Use custom scoring functions to evaluate the outputs
- Run multiple experiments on the same dataset in parallel
- Easy to integrate with your existing evaluation infrastructure
Setup & Run via SDK
Sequence Diagram
Instrument your application
First, we create the application function that will be called for every dataset item in the next step. If you already use Langfuse for production observability, you do not need to change your application code.
For a dataset run, it is important that your application creates Langfuse traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.
Assume you already have a Langfuse-instrumented LLM-app:
from langfuse import get_client, observe
from langfuse.openai import OpenAI

@observe
def my_llm_function(question: str):
    response = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}]
    )
    output = response.choices[0].message.content

    # Update trace input / output
    get_client().update_current_trace(input=question, output=output)

    return output
See Python SDK v3 docs for more details.
Run experiment on dataset
When running an experiment on a dataset, the application under test is executed for each item in the dataset. The resulting trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.
You may then execute that LLM-app for each dataset item to create a dataset run:
from langfuse import get_client

from .app import my_llm_function

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)

        # Optionally: add scores computed in your experiment runner, e.g. a JSON equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the Langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
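The my_eval_fn referenced above is any scoring function you define in your own code; Langfuse only receives the resulting score value. As a minimal sketch, an exact-match check against the expected output could look like this (the function name and signature simply mirror the call in the snippet above):

def my_eval_fn(input: str, output: str, expected_output: str) -> float:
    # Minimal exact-match check: returns 1.0 if the output equals the
    # expected output (ignoring surrounding whitespace), otherwise 0.0.
    # Replace with whatever comparison fits your use case (JSON equality, similarity, etc.).
    if output is None or expected_output is None:
        return 0.0
    return 1.0 if output.strip() == expected_output.strip() else 0.0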
See Python SDK v3 docs for details on the new OpenTelemetry-based SDK.
To learn more about how adding evaluation scores from code works, please refer to the scores documentation.
Optional: Run Evals in Langfuse
In the code above, we show how to add scores to the dataset run from your experiment code.
Alternatively, you can run evals in Langfuse. This is useful if you want to use the LLM-as-a-judge feature to evaluate the outputs of the dataset runs. We have recorded a 10-minute walkthrough of how this works end-to-end.
Compare dataset runs
After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.
Optional: Trigger Remote Dataset Runs via UI
When setting up Remote Dataset Runs via the SDK, it can be useful to expose a trigger in the Langfuse UI so that experiment runs can be started directly from there.
You need to set up a webhook to receive the trigger request from Langfuse.
Navigate to the dataset
- Navigate to Your Project > Datasets
- Click on the dataset you want to set up a remote experiment trigger for
Open the setup page
Click on Start Experiment to open the setup page. Then click on ⚡ below Custom Experiment.
Configure the webhook
Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered. Specify a default config that will be sent to your webhook. Users can modify this when triggering experiments.
Trigger experiments
Once configured, team members can trigger remote experiments via the Run button under the Custom Experiment option. Langfuse will send the dataset metadata (ID and name) along with any custom configuration to your webhook.
Typical workflow: Your webhook receives the request, fetches the dataset from Langfuse, runs your application against the dataset items, evaluates the results, and ingests the scores back into Langfuse as a new Dataset Run.
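As an illustration of that workflow, a webhook receiver could look like the following minimal sketch. It assumes a FastAPI service and reuses my_llm_function and my_eval_fn from the snippets above; the endpoint path and payload field names (datasetName, config, runName) are illustrative assumptions, as the exact payload shape depends on your configuration.

from fastapi import FastAPI, Request
from langfuse import get_client

from .app import my_llm_function, my_eval_fn  # helpers defined in the snippets above

app = FastAPI()

@app.post("/run-experiment")
async def run_experiment(request: Request):
    # Payload field names are illustrative; Langfuse sends the dataset metadata
    # (ID and name) plus the custom config entered in the UI.
    payload = await request.json()
    dataset_name = payload["datasetName"]
    config = payload.get("config", {})

    # Fetch the dataset and run the application against each item,
    # linking traces and scores to a new Dataset Run.
    dataset = get_client().get_dataset(dataset_name)
    for item in dataset.items:
        with item.run(run_name=config.get("runName", "webhook-run")) as root_span:
            output = my_llm_function(item.input)
            root_span.score_trace(
                name="exact_match",
                value=my_eval_fn(item.input, output, item.expected_output),
            )

    get_client().flush()
    return {"status": "completed"}

In practice you would usually acknowledge the webhook quickly and run the experiment in a background job or queue, since a full dataset run can take longer than a typical webhook timeout.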