How to evaluate sessions/conversations?

This guide explains how to evaluate entire sessions (such as conversations, threads, etc.), rather than just individual traces. The information here applies to all evaluation methods supported by Langfuse.

Scores in Langfuse can be assigned to traces, observations, or sessions (see the data model).

Session-level scores

You can add a custom score that references the session ID.
You can annotate the session using annotation queues.
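
As a rough illustration of the first option, the sketch below assembles a score payload that references a session ID instead of a trace ID. The helper name build_session_score and the field names (sessionId, name, value, comment) are assumptions made for this example; check the Langfuse API reference for the exact schema before sending anything.

```python
from typing import Any, Dict, Optional


def build_session_score(
    session_id: str,
    name: str,
    value: float,
    comment: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a score payload that points at a session, not a trace.

    Hypothetical helper: the field names below are assumptions for
    illustration and should be verified against the API reference.
    """
    if not session_id:
        raise ValueError("session_id is required for a session-level score")

    payload: Dict[str, Any] = {
        "sessionId": session_id,  # reference the session instead of a trace
        "name": name,
        "value": value,
    }
    if comment is not None:
        payload["comment"] = comment
    return payload


# Example: one score for the whole conversation.
score = build_session_score("session-123", "user_satisfaction", 0.8)
```

The resulting dictionary would then be submitted through whichever client you use for custom scores. Keeping the payload construction separate makes it easy to check that the session ID is always present.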

Trace-level LLM-as-a-Judge evaluators

LLM-as-a-Judge evaluators can be applied to traces. They cannot be applied directly to sessions, as Langfuse does not inherently know when a session has concluded.

To evaluate a session, use a trace-level LLM-as-a-Judge evaluator and make sure the trace contains the complete conversation history. This is typically the case when the conversation history (i.e., short-term memory) is passed as input to an LLM call within the trace. When setting up the evaluator, you can select this input from the corresponding observation.

You can either:

Evaluate the session on every trace.
Apply a special tag to the final trace of the session (e.g., conversation_end) and configure the evaluator to run only on traces with this tag. Because the evaluator then runs once per session rather than once per trace, this approach can help reduce evaluation costs and make metrics more stable.

In both cases, the score will be assigned to the trace.
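
The difference between the two options can be sketched in plain Python. This is a minimal simulation, not Langfuse code: the trace dictionaries and the helpers tags_for_turn and traces_to_evaluate are hypothetical and only mimic how a tag filter would narrow the set of traces an evaluator runs on.

```python
from typing import Any, Dict, List

END_TAG = "conversation_end"  # tag name taken from the example above


def tags_for_turn(base_tags: List[str], is_final_turn: bool) -> List[str]:
    """Return the tags for a trace, adding END_TAG on the final turn."""
    tags = list(base_tags)
    if is_final_turn and END_TAG not in tags:
        tags.append(END_TAG)
    return tags


def traces_to_evaluate(
    traces: List[Dict[str, Any]], only_final: bool
) -> List[Dict[str, Any]]:
    """Mimic an evaluator filter: all traces, or only tagged ones."""
    if not only_final:
        return list(traces)
    return [t for t in traces if END_TAG in t["tags"]]


# A three-turn session; only the last trace carries the end tag.
session = [
    {"id": f"trace-{i}", "tags": tags_for_turn(["chat"], is_final_turn=(i == 3))}
    for i in range(1, 4)
]

print(len(traces_to_evaluate(session, only_final=False)))  # 3
print(len(traces_to_evaluate(session, only_final=True)))   # 1
```

In this sketch, evaluating every trace triggers three judge calls for the session, while filtering on the tag triggers one. In practice, your application has to decide when a turn is final, since that signal is not available automatically.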
