March 5, 2024

Langfuse adds >20 evals with UpTrain integration

Langfuse significantly expands its range of open source evaluators through an integration with UpTrain, an open source project and fellow YC startup


Langfuse adds >20 open source evaluations to its roster through an integration with UpTrain, an open source project and fellow Y Combinator startup. We share an office building with UpTrain in SF, so this one came about over a cup of coffee. If you want to dive straight in, head over to the cookbook.

UpTrain (GitHub)

UpTrain is an open-source project to evaluate and improve generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code and embedding use cases), performs root cause analysis on failure cases and provides guidance on how to resolve them.

Evaluations in Langfuse

Langfuse allows users to score the quality of their application. This can be done through human input via user feedback in the front-end or manual scoring in the Langfuse UI.

To scale evaluation to a large number of traces, Langfuse supports model-based evaluations. Users can integrate evaluation libraries such as UpTrain, Ragas or LangChain Evals and use their pre-configured evals to score completions. Users can also compute custom scores and ingest them via the Langfuse SDKs.
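As a rough illustration of what a custom score could look like, here is a minimal sketch of a hand-written evaluator. Everything in it is hypothetical: `ScoreRecord`, `keyword_coverage` and `score_trace` are illustrative names, not part of the Langfuse SDK, and sending the resulting record to Langfuse is deliberately left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoreRecord:
    """Hypothetical container for one custom score; not a Langfuse SDK type."""
    trace_id: str
    name: str
    value: float


def keyword_coverage(response: str, required_keywords: Sequence[str]) -> float:
    """Return the fraction of required keywords that appear in the response."""
    if not required_keywords:
        return 1.0  # nothing to check, so treat the response as fully covered
    text = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def score_trace(trace_id: str, response: str, required_keywords: Sequence[str]) -> ScoreRecord:
    """Compute the custom score and attach it to the trace it belongs to."""
    return ScoreRecord(
        trace_id=trace_id,
        name="keyword_coverage",
        value=keyword_coverage(response, required_keywords),
    )


# Made-up sample data: two of the three keywords appear in the response.
record = score_trace("trace-123", "Langfuse traces LLM calls.", ["langfuse", "traces", "cost"])
print(record)
```

The record carries a name, a numeric value and the ID of the trace it describes, which is the information a score needs in order to be attached to existing data.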

Sneak peek: We are currently working on an evaluation service to automatically score all incoming observations by running custom evaluation templates. Ping us if you want to be among the first to try it: [email protected]

How to Trace with Langfuse and Evaluate with UpTrain

You can easily evaluate existing traces in Langfuse by loading them into a notebook, running UpTrain evaluators on them and writing the scores back to your data in Langfuse:

Log your query-response pairs with Langfuse

Use one of our many integrations (Python, JS, Langchain, LlamaIndex, LiteLLM, …) to trace your LLM app. See the full list of integrations, or follow the quickstart to get started.

Retrieve a subset of traces to evaluate

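from langfuse import Langfuse

langfuse = Langfuse()  # credentials are read from environment variables

# paginated response (the list call is assumed; only its name argument survives here)
traces = langfuse.client.trace.list(
    name="qa-traces"  # select the traces you want to evaluate
).data
evaluation_batch = {
    "question": [],
    "context": [],
    "response": [],
    "trace_id": [],
}
for t in traces:
    # get the observations for the trace
    observations = [langfuse.client.observations.get(o) for o in t.observations]
    # extract data deeply nested in the Langfuse trace
    for o in observations:
        if o.name == 'retrieval':  # assumes the comparison is on the observation name
            question = o.input['question']
            context = o.output['context']
            answer = o.output['response']
    # collect one row per trace (assumed step, needed for the transform below)
    evaluation_batch['question'].append(question)
    evaluation_batch['context'].append(context)
    evaluation_batch['response'].append(answer)
    evaluation_batch['trace_id'].append(t.id)
# transform for UpTrain
data = [dict(zip(evaluation_batch,t)) for t in zip(*evaluation_batch.values())]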
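The final line of the snippet above turns the column-oriented `evaluation_batch` into a list with one dictionary per trace. Here is a small stand-alone illustration of that transform. The sample values are made up for the example.

```python
# Column-oriented batch: one list per field, aligned by position.
# The values are made-up sample data.
evaluation_batch = {
    "question": ["What is the capital of France?", "What is 2 + 2?"],
    "context": ["Paris is the capital of France.", "Basic arithmetic."],
    "response": ["Paris", "4"],
    "trace_id": ["trace-1", "trace-2"],
}

# zip(*values) walks all columns in parallel and yields one tuple per trace;
# zipping each tuple with the dict's keys turns it back into a labelled row.
data = [dict(zip(evaluation_batch, row)) for row in zip(*evaluation_batch.values())]

for row in data:
    print(row)
```

Note that `zip` stops at the shortest column, so if one field is missing for a trace the batch is silently shortened and rows can become misaligned. Appending all four fields together for each trace avoids this.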

Run UpTrain evals

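# `checks` is an assumption, inferred from the metric names logged in the next step
res = eval_llm.evaluate(
    data=data, checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)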

Log the scores back to Langfuse

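# df is assumed to be a DataFrame of the UpTrain results with a trace_id column added
for _, row in df.iterrows():
    for metric_name in ["context_relevance", "factual_accuracy", "response_completeness"]:
        langfuse.score(name=metric_name, value=row["score_" + metric_name], trace_id=row["trace_id"])  # assumed completion of the truncated call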
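The loop above walks over a table of results and handles three metrics per row. As a hedged sketch of the bookkeeping involved, the example below pairs result rows with trace IDs and emits one score per metric. It rests on two assumptions: that each result row stores its values under `score_<metric_name>` keys, and that `send_score` (a hypothetical stand-in) is replaced by whatever SDK call actually writes a score.

```python
from typing import Any, Callable, Dict, List, Optional

METRICS = ["context_relevance", "factual_accuracy", "response_completeness"]


def log_scores(
    results: List[Dict[str, Optional[float]]],
    trace_ids: List[str],
    send_score: Callable[..., Any],
) -> int:
    """Emit one score per (trace, metric) pair and return how many were sent."""
    sent = 0
    for row, trace_id in zip(results, trace_ids):
        for metric_name in METRICS:
            # Assumption: scores live under "score_<metric_name>" in each row.
            value = row.get("score_" + metric_name)
            if value is None:
                continue  # no score was produced for this metric, skip it
            send_score(name=metric_name, value=value, trace_id=trace_id)
            sent += 1
    return sent


# Stand-in sink that records the calls instead of contacting a server.
collected: List[Dict[str, Any]] = []


def fake_send(**kwargs: Any) -> None:
    collected.append(kwargs)


# Made-up results for two traces; the second one lacks a completeness score.
results = [
    {"score_context_relevance": 1.0, "score_factual_accuracy": 0.5, "score_response_completeness": 1.0},
    {"score_context_relevance": 0.0, "score_factual_accuracy": 1.0, "score_response_completeness": None},
]
count = log_scores(results, ["trace-1", "trace-2"], fake_send)
print(count, "scores sent")
```

Skipping missing values keeps a failed evaluation for a single metric from blocking the remaining scores for that trace.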

And you're done! In just a few lines of code, you have added powerful eval capabilities that you can apply to all of your data stored in Langfuse.

20+ pre-configured evaluations available

UpTrain provides 20+ pre-configured OSS evals (list). Common use cases include:

  • Accessing scores for Response Completeness, Relevance & Validity, etc.
  • Computing the quality of retrieval and the degree of context utilization
  • Checking whether the response can be verified from the context
  • Detecting and preventing prompt injection and jailbreak attempts
  • Examining whether the user is frustrated while interacting with a chatbot

Get Started

Run the end-to-end cookbook on your Langfuse traces or learn more about model-based evals in Langfuse.
