05 Dataset
Workshop source
Workshop material is maintained in the public langfuse/langfuse-workshop repository. Use the repository for the runnable app, checkpoint branches, and local setup.
Starting point
git checkout checkpoint/05-dataset
You have a traced, attributed, monitored app. data/seed-dataset.json and scripts/seed-dataset.ts are already in the repo at this checkpoint.
Make sure .env has:
DATASET_NAME=dad-it-support-workshop
Why build datasets
A dataset is your representation of what the system will face in production: the inputs you expect and, for each, what a good answer looks like. With a clear set of expectations written down, you can rerun the agent against them after every change and know whether you helped or hurt. A good dataset is the foundation for shipping confidently and iterating without regressing.
Learn more in the Langfuse Academy lesson on datasets.
Goal
Seed a first dataset that captures the kinds of requests we expect Specs to handle. To get there:
- Understand the item shape: every dataset item has the same three fields, and we want ours to match the agent's actual input.
- Seed the hosted dataset so it lives in Langfuse and is ready for experiments in the next step.
Step 1: Read the item shape
Dataset items in Langfuse follow a consistent shape with three fields, one required and two optional:
| Field | Required | Purpose |
|---|---|---|
| input | yes | What you'd feed the agent. For us, the same { messages: [...] } shape /api/chat accepts. |
| expectedOutput | optional | What a good answer would look like. Free-form; used by evaluators to compare actual vs expected. |
| metadata | optional | Tags or other fields for filtering and grouping (category, difficulty, etc.). |
For us, the conceptual shape of one item is:
{
  "input": "How do I turn Bluetooth on on my iPhone?",
  "expectedOutput": {
    "idealAnswer": "Open Settings, tap Bluetooth, and turn the Bluetooth switch on.",
    "expectedKeywords": ["Settings", "Bluetooth", "switch", "on"]
  },
  "metadata": { "category": "iphone-bluetooth", "difficulty": "easy" }
}
The two fields inside expectedOutput answer two different evaluator questions:
- idealAnswer is the human-readable reference reply. It's what the LLM-as-a-judge correctness evaluator (chapter 06) compares the agent's actual answer against to decide whether the meaning matches.
- expectedKeywords is a small list of strings the answer must contain to be considered "covered the steps." It is a deterministic check (no model call): fast, cheap, and great for catching regressions where the agent paraphrases away the actual menu names.
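To make the expectedKeywords check concrete, here is a minimal sketch of a deterministic keyword evaluator. The function name checkKeywords and the case-insensitive substring rule are assumptions for illustration; the workshop's actual evaluator may match differently.

```typescript
// Result of checking one answer against its expected keywords.
interface KeywordCheck {
  passed: boolean;
  missing: string[];
}

// Assumption: a keyword counts as present if it appears anywhere in the
// answer, ignoring case. No model call is involved.
function checkKeywords(answer: string, expectedKeywords: string[]): KeywordCheck {
  const haystack = answer.toLowerCase();
  const missing = expectedKeywords.filter(
    (keyword) => !haystack.includes(keyword.toLowerCase()),
  );
  return { passed: missing.length === 0, missing };
}

const keywords = ["Settings", "Bluetooth", "switch", "on"];

// The reference answer from the example item contains every keyword.
const good = checkKeywords(
  "Open Settings, tap Bluetooth, and turn the Bluetooth switch on.",
  keywords,
);

// A paraphrase that drops the real menu names fails the check.
const paraphrased = checkKeywords(
  "Go to your phone's options and enable wireless pairing.",
  keywords,
);

console.log(good.passed, paraphrased.missing);
```

Plain substring matching is deliberately crude: a short keyword such as "on" also matches inside "phone" or "options", which is why it is not reported as missing for the paraphrased answer. A stricter variant could match on word boundaries instead.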
metadata lets us slice runs by category or difficulty later when comparing experiment runs side by side.
If you look at the actual JSON in data/seed-dataset.json, the input is the full { messages: [...] } shape that /api/chat accepts, plus an id field for the dataset row. We've simplified the example above to show what an item is; the on-disk format is what the experiment script (step 06) can feed straight into runSupportConversation(...) without rewriting inputs.
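The relationship between the simplified example and the on-disk format can be sketched as a pair of types and a small conversion. This is an illustration only: the role/content message shape, the type names, and the id value "iphone-bluetooth-on" are assumptions, not the contents of data/seed-dataset.json.

```typescript
// Assumed message shape; the real /api/chat schema may carry more fields.
interface ChatMessage {
  role: "user" | "assistant" | "system";
  content: string;
}

interface ExpectedOutput {
  idealAnswer: string;
  expectedKeywords: string[];
}

// The simplified item: input is a bare question string.
interface SimplifiedItem {
  input: string;
  expectedOutput?: ExpectedOutput;
  metadata?: Record<string, string>;
}

// The on-disk item: an id for the row plus a messages-shaped input.
interface SeedItem {
  id: string;
  input: { messages: ChatMessage[] };
  expectedOutput?: ExpectedOutput;
  metadata?: Record<string, string>;
}

// Wrap the question as a single user message so the input matches what
// the agent receives for a real chat turn.
function toSeedItem(id: string, item: SimplifiedItem): SeedItem {
  const message: ChatMessage = { role: "user", content: item.input };
  return { ...item, id, input: { messages: [message] } };
}

const seedItem = toSeedItem("iphone-bluetooth-on", {
  input: "How do I turn Bluetooth on on my iPhone?",
  expectedOutput: {
    idealAnswer: "Open Settings, tap Bluetooth, and turn the Bluetooth switch on.",
    expectedKeywords: ["Settings", "Bluetooth", "switch", "on"],
  },
  metadata: { category: "iphone-bluetooth", difficulty: "easy" },
});

console.log(JSON.stringify(seedItem, null, 2));
```

Storing the input in the agent's own shape means an experiment can pass item.input through unchanged, so the dataset exercises the same code path as production traffic.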
Step 2: Seed the dataset
You have several options for getting items into a Langfuse dataset:
- Add items manually in the UI (Datasets → New item).
- Upload a CSV / JSON file via the UI.
- Turn live production traces into dataset items directly from the Trace view. This is the most powerful path once you have monitoring catching interesting traces.
- Programmatic seeding via the SDK / CLI, best for an initial bulk load like ours.
For this workshop we use the programmatic path because we already have a curated JSON file:
npm run dataset:seed
Open Langfuse → Datasets. The list view should show the new dad-it-support-workshop dataset with 14 items and 0 experiment runs (so far).
Click into the dataset and switch to the Items tab. You should see every seeded item with input, expected output, and metadata columns.
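For orientation, the programmatic seeding path in Step 2 can be sketched as a pure function that turns seed items into one create request per item. This is not the workshop's scripts/seed-dataset.ts. The names CreateItemRequest and buildSeedRequests, the item ids, and the "out-of-scope" category are hypothetical, and the sketch stops short of the network call that would hand each request to the Langfuse SDK or API.

```typescript
interface SeedItem {
  id: string;
  input: { messages: { role: string; content: string }[] };
  expectedOutput?: { idealAnswer: string; expectedKeywords: string[] };
  metadata?: Record<string, string>;
}

// Hypothetical payload: the seed item plus the dataset it belongs to.
interface CreateItemRequest extends SeedItem {
  datasetName: string;
}

// Build one request per item, failing fast on problems that would
// otherwise surface halfway through an upload.
function buildSeedRequests(datasetName: string, items: SeedItem[]): CreateItemRequest[] {
  if (datasetName.trim() === "") {
    throw new Error("DATASET_NAME is empty");
  }
  const seen = new Set<string>();
  return items.map((item) => {
    if (seen.has(item.id)) {
      throw new Error(`Duplicate item id: ${item.id}`);
    }
    seen.add(item.id);
    return { ...item, datasetName };
  });
}

const first: SeedItem = {
  id: "bluetooth-on",
  input: { messages: [{ role: "user", content: "How do I turn Bluetooth on on my iPhone?" }] },
  metadata: { category: "iphone-bluetooth", difficulty: "easy" },
};
const second: SeedItem = {
  id: "file-taxes",
  input: { messages: [{ role: "user", content: "Can you file my taxes?" }] },
  metadata: { category: "out-of-scope" },
};

const requests = buildSeedRequests("dad-it-support-workshop", [first, second]);
console.log(requests.map((r) => `${r.datasetName}/${r.id}`));
```

Keeping a stable id on every item is what would let a rerun of the seed step update existing rows instead of duplicating them, assuming the backend treats the id as the row key.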
What the starter dataset covers
- iPhone Bluetooth basics and edge cases
- iPhone Wi-Fi reconnect + "I can't see the network"
- Photo capture + WhatsApp share
- Apple Maps directions + the live-location limit
- Messages basics
- Out-of-scope (file my taxes, book my train)
- Limitation cases (passwords, live location)
If you add items later, prefer ones that match a real signal you saw in monitoring rather than items invented from scratch.
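One way to see whether new items fill a gap or pile onto an already well-covered area is to count items per metadata category. The sketch below assumes each item carries a category in its metadata; apart from "iphone-bluetooth", the ids and category names are made up for illustration.

```typescript
interface ItemWithMetadata {
  id: string;
  metadata?: { category?: string; difficulty?: string };
}

// Tally items per category; items without one land in "uncategorized".
function countByCategory(items: ItemWithMetadata[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const category = item.metadata?.category ?? "uncategorized";
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return counts;
}

const coverage = countByCategory([
  { id: "bt-on", metadata: { category: "iphone-bluetooth", difficulty: "easy" } },
  { id: "bt-pairing", metadata: { category: "iphone-bluetooth", difficulty: "hard" } },
  { id: "taxes", metadata: { category: "out-of-scope" } },
  { id: "untagged" },
]);

console.log(coverage);
```

A category with a single item is a weak signal in an experiment, so the tally is a quick prompt for where a monitored production trace would add the most value.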
How to verify you are done
- The dataset shows up in Langfuse with all items.
- Item inputs look like the messages array a real chat turn would have.
- You can articulate the failure modes the dataset covers.
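The input-shape check in the list above can also be run locally against the seed items before uploading. This is a minimal sketch: the function names are hypothetical, and the rule that every message needs a string content field is an assumption about the chat schema.

```typescript
// True when input looks like { messages: [{ content: string, ... }, ...] }.
function hasMessagesShape(input: unknown): boolean {
  if (typeof input !== "object" || input === null) return false;
  const messages = (input as { messages?: unknown }).messages;
  if (!Array.isArray(messages) || messages.length === 0) return false;
  return messages.every(
    (m) =>
      typeof m === "object" &&
      m !== null &&
      typeof (m as { content?: unknown }).content === "string",
  );
}

// Return the ids of items whose input would not work as a chat turn.
function findInvalidInputs(items: { id: string; input: unknown }[]): string[] {
  return items.filter((item) => !hasMessagesShape(item.input)).map((item) => item.id);
}

const invalidIds = findInvalidInputs([
  { id: "ok", input: { messages: [{ role: "user", content: "My Wi-Fi is gone." }] } },
  { id: "bare-string", input: "My Wi-Fi is gone." },
  { id: "empty-messages", input: { messages: [] } },
]);

console.log(invalidIds);
```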
Wrap-up
Datasets are how you write down what your system is expected to handle. A good one gives you confidence to ship and to iterate without regressing. You can seed datasets via the Langfuse CLI or skill, build them from production traces in the UI, or maintain them in code like we did. The right approach depends on where your best examples come from.
Next we use this dataset to run experiments against the agent.
End state
This is the starting point for 06-experiments.