Dataset Runs via the SDK
Once you have created a dataset, you can use it to test how your application performs on different inputs.
Why use dataset runs via the SDK?
- Full flexibility to use your own application logic
- Use custom scoring functions to evaluate the outputs
- Run multiple experiments on the same dataset in parallel
Get started
Load the dataset
Use the Python or JS/TS SDK to load the dataset.
from langfuse import get_client
dataset = get_client().get_dataset("<dataset_name>")
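Each item of the loaded dataset exposes the fields used in the steps below; a minimal sketch for inspecting them:
# Each dataset item carries the data you defined when creating the dataset
for item in dataset.items:
    print(item.input)            # input passed to your application
    print(item.expected_output)  # optional ground truth for evaluation
    print(item.metadata)         # optional item-level metadata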
Instrument your application
First, we create an application runner helper function. This function will be called for every dataset item in the next step.
For a dataset run, it is important that your application creates Langfuse traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.
Assume you already have a Langfuse-instrumented LLM-app:
from langfuse import get_client, observe
from langfuse.openai import OpenAI
@observe
def my_llm_function(question: str):
    response = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}]
    )
    output = response.choices[0].message.content

    # Update trace input / output
    get_client().update_current_trace(input=question, output=output)

    return output
See Python SDK v3 docs for more details.
Loop over dataset items
When running an experiment on a dataset, the application under test is executed for each item in the dataset. The resulting trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.
You may then execute that LLM-app for each dataset item to create a dataset run:
from langfuse import get_client
from .app import my_llm_function

for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)

        # Optionally: Add scores
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
See Python SDK v3 docs for details on the new OpenTelemetry-based SDK.
Optionally: Add scores
Optionally, the output of the application can be evaluated to make it easier to compare different runs. More details on scores/evals here. One option is to use any evaluation function and directly add a score while looping over the dataset items, as shown above; a minimal sketch of such an evaluation function follows.
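For illustration, the my_eval_fn referenced in the loop could be any plain Python function that returns a numeric (or categorical) score. The exact-match check below is a hypothetical example, not part of the Langfuse SDK:
def my_eval_fn(input, output, expected_output):
    # Hypothetical exact-match evaluation: 1.0 if the output equals the expected output, else 0.0
    if expected_output is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0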
Compare dataset runs
After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.
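For example, executing the loop above once per configuration, each with its own run name, produces multiple runs on the same dataset that can be compared side-by-side. The run names and model values below are illustrative assumptions:
from langfuse import get_client

# Hypothetical: two experiments on the same dataset, one per model
for run_name, model in [("experiment-gpt-4o", "gpt-4o"), ("experiment-llama3", "llama3")]:
    for item in dataset.items:
        with item.run(run_name=run_name, run_metadata={"model": model}) as root_span:
            # Assumes your application can be configured to use the given model
            output = my_llm_function(item.input)
            root_span.score_trace(
                name="<example_eval>",
                value=my_eval_fn(item.input, output, item.expected_output),
            )

get_client().flush()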