06 Experiments
Workshop source
Workshop material is maintained in the public langfuse/langfuse-workshop repository. Use the repository for the runnable app, checkpoint branches, and local setup.
Starting point
git checkout checkpoint/06-experimentsYour dataset is seeded in Langfuse. scripts/run-dataset.ts is already in the repo.
Why experiments
A trace tells you about one turn. An experiment tells you about behavior across the dataset. Every experiment run does the same three things:
- Pulls each item from the dataset.
- Runs the item's input through the agent โ same
runSupportConversation(...)the web app uses, so the trace shape is the same as production. - Scores the actual output against the expected output with one or more evaluators.
Different evaluators answer different questions. For a broader tour of evaluator types and when to pick which, see the Langfuse Academy lesson on evaluate. For this workshop we use two that give a quick first read on answer quality:
- Keyword match (deterministic) โ did the answer cover the steps we expected? Fast, cheap, no model call. Done via SDK.
- Correctness (LLM-as-a-judge) โ is the answer actually correct? More expressive, especially when the wording can vary but the underlying answer has to match the ideal.
In this chapter you'll run both. The keyword match is already wired into the script. Evaluations like the keyword match can be run outside of Langfuse and written back as scores via API. The LLM-as-a-judge correctness evaluator you set yourself in the Langfuse UI.
Goal
By the end of this chapter:
- You can run the full dataset against the agent on demand.
- Every item gets a
keyword_overlapscore (deterministic) and acorrectnessscore (LLM-as-a-judge). - The two scores plus the per-item traces are visible in Langfuse and ready to compare against future runs.
Step 1 โ Understand the run script
Open scripts/run-dataset.ts. The file is annotated with numbered comments (// --- 1. Boot the OpenTelemetry SDK ..., // --- 3. The deterministic evaluator ..., etc.) so you can read it section by section. At a high level:
- Loads the hosted dataset from Langfuse by
DATASET_NAME. - For each item, calls the same
runSupportConversation(...)the web app uses. - Uses
dataset.runExperiment(...)to roll all per-item traces into a single run row. - Attaches a
keyword_overlapscore per item by comparingexpectedKeywordsagainst the agent's answer.
The traces produced are the same shape as production traces โ same dad-it-support-chat-turn root, same OpenAI generation, same tool spans. We don't touch the script in this chapter.
dataset.runExperiment(...) โ the moving parts
The whole run is one call to runExperiment. The shape boils down to:
await dataset.runExperiment({
// 1. Identity
name: "Dad IT Support Agent experiment",
runName, // unique label for this run; shows up in the Runs tab
description: "...",
metadata: { model: env.openaiModel },
maxConcurrency: 1, // run items one at a time
// 2. task โ runs the agent on one dataset item.
// Whatever you return becomes that item's `output` and is what the
// evaluators below score.
task: async (item) => {
const response = await runSupportConversation({ /* item.input */ });
return response.answer;
},
// 3. evaluators โ one entry per score. Each runs after task returns
// and gets { input, output, expectedOutput, metadata }. Return
// { name, value, comment } to attach the score to the item's trace.
evaluators: [
async ({ output, expectedOutput }) => ({
name: "keyword_overlap",
value: keywordOverlap(output as string, (expectedOutput as any).expectedKeywords),
comment: "..."
})
]
});Three things to understand:
taskis your application logic โ we call straight intorunSupportConversation(...), which means every trace this script produces looks identical to a production trace.evaluatorsis a list. Add more entries to attach more scores. Each evaluator is independent and runs against the same output, so combining a deterministic check likekeyword_overlapwith an LLM-as-a-judge check like the Correctness evaluator we set up below is just two list entries.runNameis what groups every per-item trace into one row in the Langfuse Runs view. Pick a name that changes per run (we include the timestamp) so two runs don't collide.
Step 2 โ Set up the correctness evaluator in Langfuse
Langfuse ships a Correctness LLM-as-a-judge template that compares an actual answer to an ideal answer and returns a score. We wire it up against the experiment runs so every item gets both a deterministic keyword score and a model-judged correctness score.
Fresh project check: Correctness is an LLM-as-a-judge evaluator too. If Langfuse says No default model set, configure Project Settings โ LLM Connections with your OpenAI key, then return to Evaluators โ Set up evaluator and save a default evaluator model such as
openai / gpt-4.1. Keep the API key in the Langfuse secret field only; do not paste it into workshop transcripts or shared notes.
- In Langfuse, open Evaluators โ New evaluator and pick the Correctness template.
- Target the runs from this dataset:
- Scope: Dataset runs
- Dataset:
dad-it-support-workshop
- Map the template variables to the trace's input/output and the dataset item's expected output:
query(the user's query) โ$.input.messages[-1].contentgeneration(what the agent answered) โ Output (the experiment run records the agent's answer here)ground_truth(the ideal answer) โ Expected Output from the dataset item
- Use the default judge model you configured during setup, or pick another structured-output-capable judge model, and save.
- Enable the evaluator.
![]()
Step 3 โ Run the dataset
npm run dataset:runWatch progress per item in the console; finishes with a summary line.
The script attaches keyword_overlap itself. The Correctness evaluator you set up in Step 2 runs asynchronously in Langfuse over the new run rows shortly after.
What to inspect in Langfuse
- The new Run under your dataset โ one row per item with two scores:
keyword_overlapandcorrectness, plus a trace link. - Item-level traces โ identical shape to production traces.
- The dataset's chart view โ per-run averages for both scores, ready for side-by-side comparison after future changes.
![]()
How to verify you are done
- One run row appears under the dataset.
- Every item has a trace and both scores attached.
- Trace shape matches a normal production trace.
Wrap-up
The two scoring approaches give you two angles on the same run: keyword match for "did we cover the right steps?" and correctness for "is the answer actually right?" Real evaluation programs typically combine deterministic and judge-based checks like this.
The Langfuse skill (/langfuse) knows the recommended evaluator shapes and how to wire them into runExperiment โ this walkthrough exists so you see what the skill is doing under the hood. Learn more about experiments in the Langfuse Academy lesson.
End state
This is the starting point for 07-evaluation.