Dataset Runs via the SDK
Once you have created a dataset, you can use it to test how your application performs on different inputs.
Why use dataset runs via the SDK?
- Full flexibility to use your own application logic
- Use custom scoring functions to evaluate the outputs
- Run multiple experiments on the same dataset in parallel
Get started
Load the dataset
Use the Python or JS/TS SDK to load the dataset.
from langfuse import get_client
dataset = get_client().get_dataset("<dataset_name>")
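Each item of the loaded dataset exposes the fields used in the steps below; a minimal sketch for inspecting them:
# Each dataset item carries the data you defined when creating the dataset
for item in dataset.items:
    print(item.input)            # input passed to your application
    print(item.expected_output)  # optional ground truth for evaluation
    print(item.metadata)         # optional item-level metadata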
Instrument your application
First, we create an application runner helper function. This function will be called for every dataset item in the next step.
For a dataset run, it is important that your application creates Langfuse traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.
Assume you already have a Langfuse-instrumented LLM-app:
from langfuse import get_client, observe
from langfuse.openai import OpenAI
@observe
def my_llm_function(question: str):
    response = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}]
    )
    output = response.choices[0].message.content

    # Update trace input / output
    get_client().update_current_trace(input=question, output=output)

    return output
See Python SDK v3 docs for more details.
Loop over dataset items
When running an experiment on a dataset, the application under test is executed for each item in the dataset. The resulting trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.
You may then execute that LLM-app for each dataset item to create a dataset run:
from langfuse import get_client
from .app import my_llm_function

for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)

        # Optionally: Add scores
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
See Python SDK v3 docs for details on the new OpenTelemetry-based SDK.
Optionally: Add scores
Optionally, the output of the application can be evaluated to make it easier to compare different runs. More details on scores/evals here. One option is to use any evaluation function and directly add a score while looping over the dataset items, as shown above; a minimal sketch of such an evaluation function follows.
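For illustration, the my_eval_fn referenced in the loop could be any plain Python function that returns a numeric (or categorical) score. The exact-match check below is a hypothetical example, not part of the Langfuse SDK:
def my_eval_fn(input, output, expected_output):
    # Hypothetical exact-match evaluation: 1.0 if the output equals the expected output, else 0.0
    if expected_output is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0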
Compare dataset runs
After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.
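For example, executing the loop above once per configuration, each with its own run name, produces multiple runs on the same dataset that can be compared side-by-side. The run names and model values below are illustrative assumptions:
from langfuse import get_client

# Hypothetical: two experiments on the same dataset, one per model
for run_name, model in [("experiment-gpt-4o", "gpt-4o"), ("experiment-llama3", "llama3")]:
    for item in dataset.items:
        with item.run(run_name=run_name, run_metadata={"model": model}) as root_span:
            # Assumes your application can be configured to use the given model
            output = my_llm_function(item.input)
            root_span.score_trace(
                name="<example_eval>",
                value=my_eval_fn(item.input, output, item.expected_output),
            )

get_client().flush()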