Datasets & Experiments
Via Langfuse Datasets you can create test sets and benchmarks to evaluate the performance of your LLM application.
- Continuous improvement: Create datasets from production edge cases to improve your application
- Pre-deployment testing: Benchmark new releases before deploying to production
- Structured testing: Run experiments on collections of inputs and expected outputs
- Flexible evaluation: Add custom evaluation metrics or use llm-as-a-judge
- Integrates well: Works with popular frameworks like LangChain and LlamaIndex
Collaboratively manage datasets via UI, API, or SDKs.
Follow the Get Started guide for step-by-step instructions on how to create your first dataset and run your first experiment.
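As a minimal sketch of managing a dataset via the SDK (assuming the Langfuse Python SDK; the dataset name, input, and metadata values below are illustrative placeholders):

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
langfuse = Langfuse()

# Create a dataset (the name is an arbitrary example)
langfuse.create_dataset(name="capital-cities")

# Add an item with an input, an expected output, and optional metadata
langfuse.create_dataset_item(
    dataset_name="capital-cities",
    input={"country": "France"},
    expected_output="Paris",
    metadata={"source": "manual"},
)
```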
How to build a workflow around datasets
This is a high-level example workflow of using datasets to continuously improve an LLM application:
1. Create dataset items with inputs and expected outputs through:
   - Manual creation or import of test cases
   - Synthetic generation of questions/responses
   - Production app traces with issues that need attention
2. Make changes to your application that you want to test
3. Run your application (or parts of it) on all dataset items (see the sketch after this list)
4. Evaluate results:
   - Compare against baseline/expected outputs if available
   - Use custom evaluation metrics
   - Leverage LLM-based evaluation
5. Review aggregated results across the full dataset to:
   - Identify improvements
   - Catch regressions
   - Make data-driven decisions about releases
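A rough sketch of steps 3 and 4, assuming the Langfuse Python SDK (v2-style API); my_app is a placeholder standing in for your application logic, and the dataset and run names are examples:

```python
from langfuse import Langfuse

langfuse = Langfuse()
dataset = langfuse.get_dataset("capital-cities")  # example dataset name

for item in dataset.items:
    # item.observe() creates a trace and links it to the dataset run "my-experiment-v1"
    with item.observe(run_name="my-experiment-v1") as trace_id:
        output = my_app(item.input)  # placeholder: run your application on the item input

        # Attach an evaluation score to the linked trace; scores on these traces
        # roll up into the run-level metrics shown for the dataset run
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )

langfuse.flush()  # ensure all events are sent before the script exits
```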
Data model
- Dataset is a collection of DatasetItems
- DatasetItem contains input, expected_output, and metadata
- DatasetRun is an experiment run on a Dataset; it is identified by a unique name
- DatasetRunItem links a DatasetItem to a Trace created during an experiment
- Evaluation metrics of a DatasetRun are based on Scores associated with the Traces linked to the run
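To make these relationships concrete, here is an illustrative sketch using plain Python dataclasses; it mirrors the entities described above but is not the actual Langfuse schema or SDK:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetItem:
    input: Any
    expected_output: Optional[Any] = None
    metadata: Optional[dict] = None


@dataclass
class Dataset:
    name: str
    items: list[DatasetItem] = field(default_factory=list)


@dataclass
class DatasetRunItem:
    dataset_item: DatasetItem
    trace_id: str  # trace produced while running the app on this item
    scores: list[float] = field(default_factory=list)  # scores on the linked trace


@dataclass
class DatasetRun:
    name: str  # unique per dataset
    run_items: list[DatasetRunItem] = field(default_factory=list)

    def average_score(self) -> float:
        """Run-level metric aggregated from scores on the linked traces."""
        all_scores = [s for ri in self.run_items for s in ri.scores]
        return sum(all_scores) / len(all_scores) if all_scores else 0.0
```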