Prompt Experiments
- Hobby: Public Beta
- Pro: Public Beta
- Team: Public Beta
- Self Hosted: Not Available
Prompt Experiments lets you test a prompt version from Prompt Management on a Dataset of inputs and expected outputs. This way, you can verify that a change yields the expected outputs and does not cause regressions. You can analyze the results of different prompt experiments side-by-side.
Optionally, you can use LLM-as-a-Judge Evaluators to automatically evaluate the responses based on the expected outputs to further analyze the results on an aggregate level.
This is a no-code feature within Langfuse. You can run more complex experiments via the Langfuse SDKs/API. Follow this guide to get started.
Key benefits
- Feedback loop: Quickly iterate on prompts by running experiments and directly comparing evaluation results side-by-side.
- Regression prevention: When making prompt changes, run an experiment to make sure that the new version does not cause bad outputs.
Availability
Prompt Experiments is currently in public beta on Langfuse Cloud. It will be released for self-hosted users in Langfuse v3 (Pro plan) as it depends on parts of the new v3 infrastructure.
Setup
If you already have a dataset and a prompt, you can skip the following steps.
In Prompt Experiments, the items of a dataset are mapped to the variables of the prompt. In the following example, the variables (documentation and question) are mapped to the input of the dataset item, which is a JSON object. The expected output contains a reference answer for the given dataset item.
Configure LLM connection
Prompt Experiments runs LLM calls within Langfuse. Thus, you need to configure an LLM connection in the project settings.
Supported LLM providers
- OpenAI, or OpenAI-compatible providers (e.g. LiteLLM, Google Vertex AI)
- Anthropic
- Azure OpenAI
- AWS Bedrock
Create a dataset
Create a dataset with the inputs and expected outputs that you want to test your prompt on.
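from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment
langfuse = Langfuse()

langfuse.create_dataset(
    name="<dataset_name>",
    # optional description
    description="My first dataset",
    # optional metadata
    metadata={
        "author": "Alice",
        "date": "2022-01-01",
        "type": "benchmark"
    }
)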
See low-level SDK docs for details on how to initialize the Python client.
Create dataset items with test cases
Dataset items include the input variables that should be inserted into the prompt.
Example Dataset Item with variables
{
  "question": "What is Langfuse?",
  "documentation": "Langfuse - the LLM Engineering Platform"
}
Example Dataset Item expected output
Langfuse is the LLM Engineering Platform.
langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    # any python object or value, optional
    input={
        "text": "hello world"
    },
    # any python object or value, optional
    expected_output={
        "text": "hello world"
    },
    # metadata, optional
    metadata={
        "model": "llama3",
    }
)
See low-level SDK docs for details on how to initialize the Python client.
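For a prompt experiment, the keys of the item input should match the prompt variables. The following is a minimal sketch of how the arguments for such an item could be assembled, using the question and documentation example from above. The helper build_dataset_item and the dataset name are hypothetical and only serve as illustration; the keyword names (dataset_name, input, expected_output) are the ones used in the create_dataset_item call shown above.

```python
def build_dataset_item(dataset_name, question, documentation, reference_answer):
    """Assemble keyword arguments for a dataset item (hypothetical helper).

    The keys of "input" are chosen to match the prompt variables
    {{question}} and {{documentation}}.
    """
    return {
        "dataset_name": dataset_name,
        "input": {
            "question": question,
            "documentation": documentation,
        },
        "expected_output": reference_answer,
    }


item_kwargs = build_dataset_item(
    dataset_name="my-dataset",  # placeholder name, replace with your own
    question="What is Langfuse?",
    documentation="Langfuse - the LLM Engineering Platform",
    reference_answer="Langfuse is the LLM Engineering Platform.",
)
print(item_kwargs)
```

Assuming an initialized client named langfuse, the result could then be passed on as langfuse.create_dataset_item(**item_kwargs).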
Create a prompt with variables
Use {{variables}} to insert the dataset variables into the prompt during experiments.
Example Prompt
You are a Langfuse expert. Please answer questions based on the following documentation:
DOCUMENTATION
{{documentation}}
{{question}}
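To make the substitution concrete, here is a small sketch of how the example prompt could be filled with the input of the example dataset item. It is an illustration only, not the way Langfuse renders prompts internally; the function render_prompt and the regular expression are assumptions made for this example.

```python
import re

# Matches {{name}} placeholders, allowing optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, variables):
    """Replace each {{name}} placeholder with the matching value."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value for prompt variable '{name}'")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "You are a Langfuse expert. Please answer questions based on the "
    "following documentation:\n"
    "DOCUMENTATION\n"
    "{{documentation}}\n"
    "{{question}}"
)
item_input = {
    "question": "What is Langfuse?",
    "documentation": "Langfuse - the LLM Engineering Platform",
}

rendered = render_prompt(template, item_input)
print(rendered)
```

In this sketch, a placeholder without a matching key raises a KeyError, so mismatches surface immediately instead of leaving an unfilled placeholder in the prompt.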
Run a prompt experiment
Now that we have set up a prompt version and a dataset, we can run a prompt experiment in Langfuse for each prompt version that we want to test.
When viewing the prompt details or a dataset, use the button shown there to run a prompt experiment.
Select the prompt version, dataset, and model configuration that you want to test. Before running the experiment, you will see whether the prompt variables match the dataset variables.
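The variable check can also be approximated locally before opening the UI. The sketch below compares the placeholders in a prompt with the keys of a dataset item input. It is a hedged illustration of the idea, not the check Langfuse performs; compare_variables and its return format are hypothetical.

```python
import re

# Matches {{name}} placeholders, allowing optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compare_variables(template, item_input):
    """Report prompt variables without a dataset key, and dataset keys the prompt ignores."""
    prompt_vars = set(PLACEHOLDER.findall(template))
    item_keys = set(item_input)
    return {
        "missing": sorted(prompt_vars - item_keys),
        "unused": sorted(item_keys - prompt_vars),
    }


template = "DOCUMENTATION\n{{documentation}}\n{{question}}"

# Item input whose keys match the prompt variables.
matching = compare_variables(
    template,
    {"question": "What is Langfuse?", "documentation": "Langfuse - the LLM Engineering Platform"},
)

# Item input with an unrelated key, as in the generic SDK example.
mismatched = compare_variables(template, {"text": "hello world"})

print(matching)
print(mismatched)
```

An empty "missing" list means every prompt variable can be filled from the item input; entries under "unused" are harmless but may point to a naming mismatch.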