
Experiments Overview

Experiments let you systematically test your LLM application against a dataset so you can evaluate and compare its performance.

Each experiment is based on a Dataset containing inputs and, optionally, expected outputs. This Dataset can be either local or hosted on Langfuse. For each input, the experiment runs a task function: your LLM application when using Experiments via SDK, or a prompt sent to a model when using Experiments via UI.

The results can be assessed and scored using various Evaluation Methods.
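To make this flow concrete, the sketch below loops over a small local dataset, runs a task function for each input, and applies a simple exact-match scorer to each output. The `run_llm_app` task, the `exact_match` evaluator, and the `run_experiment` helper are hypothetical placeholders for illustration; they are not the Langfuse SDK's API.

```python
# Minimal sketch of the experiment loop: dataset -> task -> score.
# All names here are illustrative placeholders, not Langfuse SDK calls.
from typing import Callable

dataset = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
]

def run_llm_app(user_input: str) -> str:
    """Placeholder task function; call your LLM application here."""
    return "Paris" if "France" in user_input else "4"

def exact_match(output: str, expected: str) -> float:
    """Simple evaluation method: 1.0 if the output matches the expected output exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_experiment(items: list[dict], task: Callable[[str], str]) -> None:
    """Run the task on every dataset item and score the result."""
    for item in items:
        output = task(item["input"])
        score = exact_match(output, item["expected_output"])
        print(f"input={item['input']!r} score={score}")

run_experiment(dataset, run_llm_app)
```

In a real setup, the scoring step would use one of the Evaluation Methods described above (e.g., LLM-as-a-judge or custom scores) instead of a hard-coded exact match.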

Running experiments

The matrix below shows the different experiment configurations based on where your data is hosted and where the experiment execution takes place:

|  | Langfuse Execution | Local/CI Execution |
| --- | --- | --- |
| Langfuse Dataset | Experiments via UI | Experiments via SDK |
| Local Dataset | Not supported | Experiments via SDK |
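For the Langfuse Dataset + Local/CI Execution combination, a run typically fetches the hosted dataset with the SDK and executes the task locally. The sketch below follows the v2 Python SDK's dataset-run pattern (`get_dataset`, `item.observe`, `langfuse.score`); exact method names differ in newer SDK versions, and `my_llm_app` and the dataset name are placeholders.

```python
# Sketch of a local/CI experiment run over a Langfuse-hosted dataset,
# assuming the v2 Python SDK's dataset-run pattern; adapt to your SDK version.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

def my_llm_app(question: str) -> str:
    """Placeholder for your actual LLM application."""
    return "stub answer"

# Fetch the Langfuse-hosted dataset, then execute the task locally.
dataset = langfuse.get_dataset("capital-cities")  # hypothetical dataset name

for item in dataset.items:
    # item.observe links the resulting trace to this experiment run.
    with item.observe(run_name="experiment-local-run") as trace_id:
        output = my_llm_app(item.input)
        # Attach a score so results appear in the experiment comparison view.
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
```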

While optional, we recommend managing the underlying Datasets in Langfuse, as this enables (1) in-UI comparison tables across different experiments on the same data and (2) iterative improvement of the dataset based on production/staging traces.
