This is a Jupyter notebook

Automated Evaluations with Cleanlab

Cleanlab’s Trustworthy Language Model (TLM) enables Langfuse users to quickly identify low-quality and hallucinated responses in any LLM trace.

What is TLM?

TLM is an automated evaluation tool that adds reliability and explainability to every LLM output. TLM automatically finds the poor-quality and incorrect LLM responses lurking within your production logs and traces. This helps you run better evals with significantly less manual review and annotation work spent finding these bad responses yourself. TLM also enables smart routing of LLM-automated responses and decisions based on the trustworthiness score of every LLM output.

TLM provides users with:

  • Trustworthiness scores and explanations for every LLM response
  • Higher accuracy: rigorous benchmarks show TLM consistently produces more accurate results than other LLMs like GPT-4/4o and Claude.
  • Scalable API: designed to handle large datasets, TLM is suitable for most enterprise applications, including data extraction, tagging/labeling, Q&A (RAG), and more.
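
To make the scoring and routing idea concrete, here is a minimal sketch that scores a single prompt/response pair and branches on the result. The 0.7 threshold is an arbitrary illustrative value, not a TLM recommendation; the call itself mirrors the get_trustworthiness_score() usage shown later in this guide.

# Minimal sketch: score one prompt/response pair and route on the trust score.
# Assumes CLEANLAB_TLM_API_KEY is set; the 0.7 threshold is arbitrary.
from cleanlab_tlm import TLM

tlm = TLM()
result = tlm.get_trustworthiness_score(
    "You are a trivia master.\nWhat is the capital of France?",
    "The capital of France is Paris.",
)

if result["trustworthiness_score"] < 0.7:
    print("Low trust score -- route this response to human review.")
else:
    print("High trust score -- safe to use the LLM response automatically.")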

Getting Started

This guide will walk you through the process of evaluating LLM responses captured in Langfuse with Cleanlab’s Trustworthy Language Models (TLM).

Install dependencies & Set environment variables

%pip install -q langfuse openai cleanlab-tlm --upgrade
import os
import pandas as pd
from getpass import getpass
import dotenv
dotenv.load_dotenv()

API Keys

This guide requires four API keys: a Langfuse public key, a Langfuse secret key, an OpenAI API key, and a Cleanlab TLM API key. If you don’t have a Cleanlab TLM API key, you can sign up for a free trial here.

# Get keys for your project from the project settings page: https://cloud.langfuse.com
 
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..." 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region
 
os.environ["OPENAI_API_KEY"] = "<openai_api_key>"
 
os.environ["CLEANLAB_TLM_API_KEY"] = "<cleanlab_tlm_api_key>"

Prepare trace dataset and load into Langfuse

For demonstration purposes, we’ll generate a few traces and track them in Langfuse. Typically, you would already have traces captured in Langfuse and could skip ahead to “Download trace dataset from Langfuse”.

NOTE: TLM requires the entire input that was provided to the LLM, including any system prompts, context, or other information originally used to generate the response. Notice below that we store the system prompt in the trace metadata, since by default the trace input does not include it.

from langfuse.decorators import langfuse_context, observe
from openai import OpenAI
 
openai = OpenAI()
# Let's use some tricky trivia questions to generate some traces
trivia_questions = [    
    "What is the 3rd month of the year in alphabetical order?",
    "What is the capital of France?",
    "How many seconds are in 100 years?",
    "Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?",
    "When was the Declaration of Independence signed?"
]
 
@observe()
def generate_answers(trivia_question):
    system_prompt = "You are a trivia master."
 
    # Update the trace with the question    
    langfuse_context.update_current_trace(
        name=f"Answering question: '{trivia_question}'",
        tags=["TLM_eval_pipeline"],
        metadata={"system_prompt": system_prompt}
    )
 
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": trivia_question},
        ],
    )
    
    answer = response.choices[0].message.content
    return answer
 
 
# Generate answers
answers = []
for i in range(len(trivia_questions)):
    answer = generate_answers(trivia_questions[i])
    answers.append(answer)  
    print(f"Question {i+1}: {trivia_questions[i]}")
    print(f"Answer {i+1}:\n{answer}\n")
 
print(f"Generated {len(answers)} answers and tracked them in Langfuse.")
Question 1: What is the 3rd month of the year in alphabetical order?
Answer 1:
March

Question 2: What is the capital of France?
Answer 2:
The capital of France is Paris.

Question 3: How many seconds are in 100 years?
Answer 3:
There are 31,536,000 seconds in a year (60 seconds x 60 minutes x 24 hours x 365 days). Therefore, in 100 years, there would be 3,153,600,000 seconds.

Question 4: Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?
Answer 4:
Let's call the amount Charlie paid x. 
Alice paid twice as much as Bob, so she paid 2*(3x) = 6x.
Bob paid three times as much as Charlie, so he paid 3x.

We know the total bill was $72:
x + 6x + 3x = 72
10x = 72
x = 7.2 

Therefore, Charlie paid $7.20, Bob paid $21.60, and Alice paid $43.20.

Question 5: When was the Declaration of Independence signed?
Answer 5:
The Declaration of Independence was signed on July 4, 1776.

Generated 5 answers and tracked them in Langfuse.

Remember, the goal of this tutorial is to show you how to build an external evaluation pipeline. Such pipelines typically run in your CI/CD environment or in a separately orchestrated container service. Whichever environment you choose, three key steps always apply (previewed in the sketch after this list):

  1. Fetch Your Traces: Get your application traces to your evaluation environment
  2. Run Your Evaluations: Apply any evaluation logic you prefer
  3. Save Your Results: Attach your evaluations back to the Langfuse traces they were calculated from.
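
As a condensed preview, the sketch below strings those three steps together using the same APIs covered in the rest of this notebook (fetch_traces(), TLM.get_trustworthiness_score(), and langfuse.score()). It assumes the environment variables and the "TLM_eval_pipeline" tag from the cells above; each step is explained in detail in the sections that follow.

from datetime import datetime, timedelta

from cleanlab_tlm import TLM
from langfuse import Langfuse

langfuse = Langfuse()
tlm = TLM(options={"log": ["explanation"]})

# 1. Fetch: traces tagged by our demo app from the last 24 hours
traces = langfuse.fetch_traces(
    tags="TLM_eval_pipeline",
    from_timestamp=datetime.now() - timedelta(hours=24),
).data

# 2. Evaluate: rebuild each original prompt (system prompt + user question) and score it
prompts = [t.metadata["system_prompt"] + "\n" + t.input["args"][0] for t in traces]
responses = [t.output for t in traces]
evaluations = tlm.get_trustworthiness_score(prompts, responses)

# 3. Save: attach each score (and its explanation) back to the trace it came from
for trace, evaluation in zip(traces, evaluations):
    langfuse.score(
        trace_id=trace.id,
        name="trust_score",
        value=evaluation["trustworthiness_score"],
        comment=evaluation["log"]["explanation"],
    )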

For the rest of the notebook, we’ll have one goal:


🎯 Goal: Evaluate all traces run in the past 24 hours


Download trace dataset from Langfuse

Fetching traces from Langfuse is straightforward: set up the Langfuse client and use one of its fetch methods. We’ll fetch the traces, evaluate them, and then add our scores back into Langfuse.

The fetch_traces() function accepts arguments to filter traces by tags, timestamps, and more. You can find other methods for querying traces in our docs.

from langfuse import Langfuse
from datetime import datetime, timedelta
 
langfuse = Langfuse()
now = datetime.now()
one_day_ago = now - timedelta(hours=24)
 
traces = langfuse.fetch_traces(
    tags="TLM_eval_pipeline",
    from_timestamp=one_day_ago,
    to_timestamp=now,
).data
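
The handful of demo traces easily fit in a single response, but a real project can hold more traces than one call returns. Below is a hedged sketch of paginating through the results; it assumes fetch_traces() accepts page and limit parameters (as in the Langfuse Python SDK v2) and reuses langfuse, one_day_ago, and now from the cell above.

# Hedged sketch: page through all matching traces in batches of 50.
# Assumes fetch_traces() supports `page` and `limit`; adjust for your SDK version.
all_traces = []
page = 1
while True:
    batch = langfuse.fetch_traces(
        tags="TLM_eval_pipeline",
        from_timestamp=one_day_ago,
        to_timestamp=now,
        page=page,
        limit=50,
    ).data
    all_traces.extend(batch)
    if len(batch) < 50:  # last (possibly empty) page reached
        break
    page += 1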
 

Generate evaluations with TLM

Langfuse can handle numerical, boolean and categorical (string) scores. Wrapping your custom evaluation logic in a function is often a good practice.
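
Purely as an illustration of those score types, here are a few toy evaluators (hypothetical, unrelated to TLM), each wrapped in its own function:

# Toy evaluators, one per score type Langfuse supports; illustrative only.
def response_length(response: str) -> int:
    """Numerical score: number of characters in the response."""
    return len(response)

def is_nonempty(response: str) -> bool:
    """Boolean score: did the model return anything at all?"""
    return bool(response and response.strip())

def verbosity_label(response: str) -> str:
    """Categorical (string) score: rough verbosity bucket."""
    return "terse" if len(response) < 50 else "verbose"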

Instead of running TLM individually on each trace, we’ll pass all of the prompt/response pairs to TLM as lists in a single batched call. This is more efficient and returns scores and explanations for all of the traces at once. Then, using each trace.id, we can attach the scores and explanations back to the correct traces in Langfuse.

from cleanlab_tlm import TLM
 
tlm = TLM(options={"log": ["explanation"]})
# This helper extracts the prompt and response from each trace and returns two lists: prompts and responses.
def get_prompt_response_pairs(traces):
    prompts = []
    responses = []
    for trace in traces:
        prompts.append(trace.metadata["system_prompt"] + "\n" + trace.input["args"][0])
        responses.append(trace.output)
    return prompts, responses
 
trace_ids = [trace.id for trace in traces]
prompts, responses = get_prompt_response_pairs(traces)

Now, let’s use TLM to generate a trustworthiness score and explanation for each trace.

IMPORTANT: Always include any system prompts, context, or other information that was originally provided to the LLM to generate the response. Construct the prompt passed to get_trustworthiness_score() so that it matches the original prompt as closely as possible. This is why we included the system prompt in the trace metadata.
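
In this demo the user question lives in trace.input["args"] and the system prompt in the metadata. If your traces instead store a full chat history in trace.input, you can flatten it into a single prompt string; a hypothetical helper, assuming OpenAI-style role/content message dicts:

# Hypothetical helper: flatten an OpenAI-style messages list into one prompt string
# for TLM. Adjust the field names to however your traces actually store inputs.
def messages_to_prompt(messages):
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)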

# Evaluate each of the prompt, response pairs using TLM
evaluations = tlm.get_trustworthiness_score(prompts, responses)
 
# Extract the trustworthiness scores and explanations from the evaluations
trust_scores = [entry["trustworthiness_score"] for entry in evaluations]
explanations = [entry["log"]["explanation"] for entry in evaluations]
 
# Create a DataFrame with the evaluation results
trace_evaluations = pd.DataFrame({
    'trace_id': trace_ids,
    'prompt': prompts,
    'response': responses, 
    'trust_score': trust_scores,
    'explanation': explanations
})
trace_evaluations
Querying TLM... 100%|██████████|
trace_id prompt response trust_score explanation
0 2f0d41b2-9b89-4ba6-8b3f-7dadac8a8fae You are a trivia master.\nWhen was the Declara... The Declaration of Independence was signed on ... 0.389889 The proposed response states that the Declarat...
1 f8e91744-3fcb-4ef5-b6c6-7cbcf0773144 You are a trivia master.\nAlice, Bob, and Char... Let's denote the amount Charlie paid as C. \n\... 0.669774 This response is untrustworthy due to lack of ...
2 f9b42125-4e5e-4533-bfbb-36c30490bd1d You are a trivia master.\nHow many seconds are... There are 3,153,600,000 seconds in 100 years. 0.499818 To calculate the number of seconds in 100 year...
3 71b131b9-e706-41c7-9bfd-b77719783f29 You are a trivia master.\nWhat is the capital ... The capital of France is Paris. 0.987433 Did not find a reason to doubt trustworthiness.
4 da0ee9fa-01cf-42ce-9e3e-e8d127ca105b You are a trivia master.\nWhat is the 3rd mont... March. 0.114874 To determine the 3rd month of the year in alph...

Awesome! Now we have a DataFrame mapping trace IDs to their trust scores and explanations. For demonstration purposes, we’ve also included the prompt and response for each trace so we can inspect the least trustworthy one.

sorted_df = trace_evaluations.sort_values(by="trust_score", ascending=True).head()
sorted_df
trace_id prompt response trust_score explanation
4 da0ee9fa-01cf-42ce-9e3e-e8d127ca105b You are a trivia master.\nWhat is the 3rd mont... March. 0.114874 To determine the 3rd month of the year in alph...
0 2f0d41b2-9b89-4ba6-8b3f-7dadac8a8fae You are a trivia master.\nWhen was the Declara... The Declaration of Independence was signed on ... 0.389889 The proposed response states that the Declarat...
2 f9b42125-4e5e-4533-bfbb-36c30490bd1d You are a trivia master.\nHow many seconds are... There are 3,153,600,000 seconds in 100 years. 0.499818 To calculate the number of seconds in 100 year...
1 f8e91744-3fcb-4ef5-b6c6-7cbcf0773144 You are a trivia master.\nAlice, Bob, and Char... Let's denote the amount Charlie paid as C. \n\... 0.669774 This response is untrustworthy due to lack of ...
3 71b131b9-e706-41c7-9bfd-b77719783f29 You are a trivia master.\nWhat is the capital ... The capital of France is Paris. 0.987433 Did not find a reason to doubt trustworthiness.
# Let's look at the least trustworthy trace.
print("Prompt: ", sorted_df.iloc[0]["prompt"], "\n")
print("OpenAI Response: ", sorted_df.iloc[0]["response"], "\n")
print("TLM Trust Score: ", sorted_df.iloc[0]["trust_score"], "\n")
print("TLM Explanation: ", sorted_df.iloc[0]["explanation"])
Prompt:  You are a trivia master.
What is the 3rd month of the year in alphabetical order? 

OpenAI Response:  March. 

TLM Trust Score:  0.11487442493072615 

TLM Explanation:  To determine the 3rd month of the year in alphabetical order, we first list the months: January, February, March, April, May, June, July, August, September, October, November, December. When we arrange these months alphabetically, we get: April, August, December, February, January, July, June, March, May, November, October, September. In this alphabetical list, March is the 8th month, not the 3rd. The 3rd month in alphabetical order is actually December. Therefore, the proposed response is incorrect. 
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
December.

Awesome! TLM was able to identify multiple traces that contained incorrect answers from OpenAI.

Let’s upload the trust_score and explanation columns to Langfuse.

Upload evaluations to Langfuse

for idx, row in trace_evaluations.iterrows():
    trace_id = row["trace_id"]
    trust_score = row["trust_score"]
    explanation = row["explanation"]
    
    # Add the trustworthiness score to the trace with the explanation as a comment
    langfuse.score(
        trace_id=trace_id,
        name="trust_score",
        value=trust_score,
        comment=explanation,
    )
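
The Langfuse SDK queues score events and sends them asynchronously in the background. In a short-lived script or CI job, flush the client before the process exits so no scores are lost:

# Make sure all queued score events are delivered before the script exits
langfuse.flush()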

You should now see the TLM trustworthiness score and explanation in the Langfuse UI!

Image of Langfuse platform showing Cleanlab's TLM trust score

If you click on a trace, you can also see the trust score and provided explanation.

Image of Langfuse platform showing Cleanlab's TLM trust score and explanation
