Dataset Run Comparison View
After running experiments on datasets, you can now compare results side-by-side, view metrics, and peek into the details of each dataset item across runs.
Introduction
What is a dataset in Langfuse? Datasets in Langfuse allow you to create test sets and benchmarks to evaluate the performance of your LLM application. A dataset is a collection of dataset items, where each item contains inputs, expected outputs, and metadata. You can create datasets from production edge cases, synthetic test cases, or manual test cases. This enables continuous improvement through structured testing, pre-deployment benchmarking, and flexible evaluation using custom metrics or LLM-as-a-judge approaches.
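To make this concrete, here is a minimal sketch of creating a dataset and adding items with the Langfuse Python SDK. The dataset name, items, and metadata are illustrative placeholders, not part of this release:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

# Create a dataset to hold the test set (name is a placeholder)
langfuse.create_dataset(name="capital-cities")

# Each item has an input, an expected output, and optional metadata
test_items = [
    {"input": {"country": "Italy"}, "expected_output": "Rome"},
    {"input": {"country": "Japan"}, "expected_output": "Tokyo"},
]

for item in test_items:
    langfuse.create_dataset_item(
        dataset_name="capital-cities",
        input=item["input"],
        expected_output=item["expected_output"],
        metadata={"source": "manual test case"},
    )
```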
What is a dataset experiment run? A dataset experiment run lets you test changes to your application - like trying different models, prompts, or parameters - by running each version against your test dataset and comparing the results through traces in Langfuse to evaluate which changes work best.
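As a rough sketch of what an experiment run looks like in code, following the Python SDK pattern from the docs (the run name, `my_llm_app`, and the `exact_match` score are illustrative assumptions):

```python
from langfuse import Langfuse

langfuse = Langfuse()
dataset = langfuse.get_dataset("capital-cities")

def my_llm_app(question: dict) -> str:
    """Stand-in for the application under test (one model/prompt variant)."""
    return "Rome"  # placeholder output

for item in dataset.items:
    # item.observe() creates a trace and links it to this experiment run
    with item.observe(run_name="gpt-4o-prompt-v2") as trace_id:
        output = my_llm_app(item.input)
        # Attach a score so it shows up as a metric in the run comparison
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
```

Each `run_name` then appears as a selectable run, so two runs with different models or prompts can be compared side-by-side.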
What’s new?
And the fun continues: Day 1 of Launch Week 2 is here.
Langfuse Datasets now enables intuitive comparison of dataset experiment runs for technical and non-technical users. The view features an overview of each item in the dataset and a summary of each selected experiment run. The latter includes metrics on latency, cost, and scores, as well as the application's output for each dataset item.
How to use the comparison view?
Setup dataset and run experiments
- Follow the getting started guide to set up a dataset, populate it with items, and run experiments.
- Alternatively, execute an end-to-end example (Python notebook).
Open comparison view
- Select multiple dataset runs
- Open the Actions menu and select Compare
Learn more
For a conceptual introduction to datasets and offline experiments, see the dataset documentation.