
Datasets & Experiments

With Langfuse Datasets, you can create test sets and benchmarks to evaluate the performance of your LLM application.

  • Continuous improvement: Create datasets from production edge cases to improve your application
  • Pre-deployment testing: Benchmark new releases before deploying to production
  • Structured testing: Run experiments on collections of inputs and expected outputs
  • Flexible evaluation: Add custom evaluation metrics or use LLM-as-a-judge
  • Integrates well: Works with popular frameworks like LangChain and LlamaIndex

Manage datasets collaboratively via the UI, API, or SDKs.

Follow the Get Started guide for step-by-step instructions on creating your first dataset and running your first experiment.

How to build a workflow around datasets

Here is a high-level example of a workflow that uses datasets to continuously improve an LLM application:

  1. Create dataset items with inputs and expected outputs through:

    • Manual creation or import of test cases
    • Synthetic generation of questions/responses
    • Production app traces with issues that need attention
  2. Make changes to your application that you want to test

  3. Run your application (or parts of it) on all dataset items

  4. Evaluate results:

    • Compare against baseline/expected outputs if available
    • Use custom evaluation metrics
    • Leverage LLM-based evaluation
  5. Review aggregated results across the full dataset to:

    • Identify improvements
    • Catch regressions
    • Make data-driven decisions about releases
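Steps 3 to 5 can be sketched as a small loop in plain Python. This is a minimal, self-contained sketch and not the Langfuse SDK: `my_app`, `exact_match`, `run_experiment`, the in-memory dataset, and the baseline score are all hypothetical, and exact match stands in for whatever custom or LLM-based evaluator you actually use.

```python
from statistics import mean
from typing import Any, Callable

# Hypothetical in-memory dataset items (step 1); in Langfuse these would live in a Dataset.
DATASET: list[dict[str, Any]] = [
    {"input": "What is 2 + 2?", "expected_output": "4"},
    {"input": "Capital of France?", "expected_output": "Paris"},
    {"input": "Opposite of hot?", "expected_output": "cold"},
]


def my_app(question: str) -> str:
    """Hypothetical application under test (step 2); replace with your LLM call."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(question, "unknown")


def exact_match(output: str, expected: str) -> float:
    """A simple custom metric (step 4): 1.0 if the normalized strings match, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def run_experiment(
    app: Callable[[str], str],
    items: list[dict[str, Any]],
    evaluator: Callable[[str, str], float],
) -> dict[str, Any]:
    """Run the app on every item (step 3), score each output (step 4),
    and aggregate the scores across the full dataset (step 5)."""
    scores = [evaluator(app(item["input"]), item["expected_output"]) for item in items]
    return {"scores": scores, "mean_score": mean(scores)}


baseline_mean = 0.5  # hypothetical mean score of the currently deployed release
result = run_experiment(my_app, DATASET, exact_match)
print(f"mean score: {result['mean_score']:.2f}")  # prints: mean score: 0.67
if result["mean_score"] < baseline_mean:
    print("regression against baseline")
else:
    print("no regression against baseline")
```

Comparing only the mean keeps the example short; in practice you would usually also inspect per-item scores, since an unchanged average can hide items that newly fail.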

[Figure: Datasets process diagram]

Data model

  • Dataset is a collection of DatasetItems
    • DatasetItem contains input, expected_output, and metadata
  • DatasetRun is an experiment run on a Dataset; each run is identified by a unique name
    • DatasetRunItem links a DatasetItem to a Trace created during an experiment
    • Evaluation metrics of a DatasetRun are based on the Scores attached to the Traces linked to the run
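The relationships in this data model can be made concrete with a few Python dataclasses. This is a minimal sketch for illustration only: the class names follow the list above, but the fields, the `metric` and `add_run` helpers, and the choice of a plain average as the run metric are assumptions, not the Langfuse SDK's API.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Optional


@dataclass
class DatasetItem:
    id: str
    input: Any
    expected_output: Any = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Score:
    name: str
    value: float


@dataclass
class Trace:
    id: str
    output: Any
    scores: list[Score] = field(default_factory=list)


@dataclass
class DatasetRunItem:
    # Links one DatasetItem to the Trace created for it during the experiment.
    dataset_item_id: str
    trace: Trace


@dataclass
class DatasetRun:
    name: str
    run_items: list[DatasetRunItem] = field(default_factory=list)

    def metric(self, score_name: str) -> Optional[float]:
        """Average one named score over all Traces linked to this run (None if absent)."""
        values = [
            score.value
            for run_item in self.run_items
            for score in run_item.trace.scores
            if score.name == score_name
        ]
        return mean(values) if values else None


@dataclass
class Dataset:
    name: str
    items: list[DatasetItem] = field(default_factory=list)
    runs: dict[str, DatasetRun] = field(default_factory=dict)

    def add_run(self, run: DatasetRun) -> DatasetRun:
        # Run names act as identifiers, so this sketch rejects duplicates within a dataset.
        if run.name in self.runs:
            raise ValueError(f"run {run.name!r} already exists on dataset {self.name!r}")
        self.runs[run.name] = run
        return run


# Usage: one run over a two-item dataset, scored with a hypothetical "accuracy" score.
dataset = Dataset("qa-benchmark", [
    DatasetItem("item-1", "What is 2 + 2?", "4"),
    DatasetItem("item-2", "Capital of France?", "Paris"),
])
run = dataset.add_run(DatasetRun("prompt-v2"))
run.run_items.append(DatasetRunItem("item-1", Trace("trace-1", "4", [Score("accuracy", 1.0)])))
run.run_items.append(DatasetRunItem("item-2", Trace("trace-2", "Lyon", [Score("accuracy", 0.0)])))
print(run.metric("accuracy"))  # prints: 0.5
```

Storing run items as links to traces, rather than copying outputs into the run, mirrors the description above: scores live on the traces, and the run-level metric is derived from them.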
