Dataset Runs via the UI

You can execute Dataset Runs (also called Prompt Experiments) in the Langfuse UI to test different prompt versions from Prompt Management or different language models, and compare the results side-by-side.

Optionally, you can use LLM-as-a-Judge Evaluators to automatically score the responses against the expected outputs and analyze the results at an aggregate level.

Introduction to Dataset Runs

Why run Dataset Runs via the UI?

  • Quickly test and iterate on different prompt versions and models
  • Structure your prompt testing by running it against a dataset of inputs and expected outputs
  • Optionally use LLM-as-a-Judge Evaluators to score the responses against the expected outputs from the dataset
  • Prevent regressions by running a Dataset Run whenever you change a prompt

Prerequisites

Create a usable prompt

Create a prompt that you want to test and evaluate. How to create a prompt?

A prompt is usable when its variables match the input keys of the dataset items that will be used for the Dataset Run. See the example below.

Example: Prompt Variables & Dataset Item Keys Mapping


Prompt:

You are a Langfuse expert. Answer based on:
{{documentation}}
 
Question: {{question}}


Dataset Item:

{
  "documentation": "Langfuse is an LLM Engineering Platform",
  "question": "What is Langfuse?"
}

In this example:

  • The prompt variable {{documentation}} maps to the JSON key "documentation"
  • The prompt variable {{question}} maps to the JSON key "question"
  • Both keys must exist in the dataset item’s input JSON for the experiment to run successfully
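If you manage prompts via the SDK instead of the UI, a prompt with matching variables could be created roughly like this. This is a minimal sketch in Python, assuming a configured Langfuse client; the prompt name qa-prompt is hypothetical.

from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST

# The {{documentation}} and {{question}} variables must match the
# dataset item input keys shown above.
langfuse.create_prompt(
    name="qa-prompt",  # hypothetical prompt name
    type="text",
    prompt=(
        "You are a Langfuse expert. Answer based on:\n"
        "{{documentation}}\n\n"
        "Question: {{question}}"
    ),
    labels=["production"],  # label used to select the prompt version in the run
)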

Example: Chat Message Placeholder Mapping

In addition to variables, you can also map placeholders in chat prompts to dataset item keys. This is useful when a dataset item contains, for example, a chat message history. Your chat prompt needs to contain a named placeholder. Note that variables within placeholder messages are not resolved.

Chat Prompt: contains a placeholder named message_history

Dataset Item:

{
  "message_history": [
    {
      "role": "user",
      "content": "What is Langfuse?"
    },
    {
      "role": "assistant",
      "content": "Langfuse is a tool for tracking and analyzing the performance of language models."
    }
  ],
  "question": "What is Langfuse?"
}

In this example:

  • The chat prompt placeholder message_history maps to the JSON key "message_history".
  • The prompt variable {{question}} maps to the JSON key "question"; it is a regular variable, not part of a placeholder message.
  • Both keys must exist in the dataset item’s input JSON for the experiment to run successfully
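For chat prompts, the placeholder is part of the prompt's message list. Below is a minimal sketch in Python; the exact placeholder message format and the prompt name qa-chat-prompt are assumptions for illustration.

from langfuse import Langfuse

langfuse = Langfuse()

langfuse.create_prompt(
    name="qa-chat-prompt",  # hypothetical prompt name
    type="chat",
    prompt=[
        {"role": "system", "content": "You are a Langfuse expert."},
        # Placeholder message (assumed format): filled from the dataset item key "message_history"
        {"type": "placeholder", "name": "message_history"},
        # Regular variable: resolved from the dataset item key "question"
        {"role": "user", "content": "{{question}}"},
    ],
    labels=["production"],
)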

Create a usable dataset

Create a dataset with the inputs and expected outputs you want to use for your prompt experiments. How to create a dataset?

A dataset is usable when: [1] the dataset items have JSON objects as input and [2] these objects have keys that match the prompt variables of the prompt(s) you will use. See the mapping example under “Create a usable prompt” above.

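If you create datasets programmatically, matching items could be added roughly like this. This is a minimal sketch in Python, assuming a configured Langfuse client; the dataset name qa-eval is hypothetical.

from langfuse import Langfuse

langfuse = Langfuse()

langfuse.create_dataset(name="qa-eval")  # hypothetical dataset name

# Input keys must match the prompt variables; expected_output is what
# an LLM-as-a-Judge evaluator can score the response against.
langfuse.create_dataset_item(
    dataset_name="qa-eval",
    input={
        "documentation": "Langfuse is an LLM Engineering Platform",
        "question": "What is Langfuse?",
    },
    expected_output="Langfuse is an open-source LLM Engineering Platform.",
)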

Configure LLM connection

As your prompt will be executed for each dataset item, you need to configure an LLM connection in the project settings. How to configure an LLM connection?

Optional: Set up LLM-as-a-judge

You can set up an LLM-as-a-judge evaluator to score the responses based on the expected outputs. Make sure to set the target of the LLM-as-a-Judge to “Experiment runs” and filter for the dataset you want to use. How to set up LLM-as-a-judge?

Trigger a Dataset Run via UI

Dataset Runs are currently started from the detail page of a dataset.

  • Navigate to Your Project > Datasets
  • Click on the dataset you want to start a Dataset Run for

Open the setup page

Click on Start Experiment to open the setup page

Click on Create below Prompt Experiment

Configure the Dataset Run

  1. Set a Dataset Run name
  2. Select the prompt you want to use
  3. Set up or select the LLM connection you want to use
  4. Select the dataset you want to use
  5. Optionally select the evaluator you want to use
  6. Click on Create to trigger the Dataset Run

This will trigger the Dataset Run and you will be redirected to the Dataset Runs page. The run might take a few seconds to a few minutes to complete, depending on prompt complexity and dataset size.

Compare runs

After each experiment run, you can check the aggregated score in the Dataset Runs table and compare results side-by-side.
