Code evaluators

Code evaluators require observations ingested with an OpenTelemetry-based SDK: Python SDK v3+ or JS/TS SDK v4+. If needed, see the Python v2 → v3 migration guide or the JS/TS v3 → v4 migration guide.

Code evaluators run custom Python or TypeScript logic in Langfuse and return one or more scores. Use them for deterministic, objective checks where code is more reliable than a model-based judgment.

Common examples include exact match checks, regex validation, JSON parseability, schema validation, keyword checks, tool-call checks, and custom business rules.

Use LLM-as-a-Judge instead when the evaluation needs semantic judgment, rubric-based reasoning, or subjective assessment such as helpfulness, tone, or answer quality.

How to use code evaluators?

Code evaluators can run on two types of data: Observations (individual operations from live production traffic) or Experiments (controlled test datasets). Your choice depends on whether you're testing in development or monitoring production.

Decision tree

Which data needs deterministic evaluation?

  • Live production data (monitor real-time traffic) → Observations: individual operations such as LLM calls, retrievals, and tool calls
  • Offline experiment data (test in a controlled environment) → Experiments: controlled test cases with datasets

Production pattern: Teams typically use Experiments during development to validate deterministic checks, then deploy Observation-level evaluators in production for scalable monitoring.

Understanding each evaluation target

Observations

Run evaluators on individual observations within your traces, such as LLM calls, retrieval operations, embedding generations, or tool calls.

Why target observations

  • Operation-level precision: Filter by observation type to evaluate only the operations that matter, not complete traces.
  • Deterministic production monitoring: Check JSON validity, schema compliance, exact matches, or business rules on live traffic.
  • Compositional evaluation: Run different code evaluators on different operations within one trace.
  • Combined filtering: Stack observation filters with trace filters such as userId, sessionId, tags, version, and metadata.

Data flow

At ingest time, each observation is evaluated against your filter criteria. Matching observations are added to an evaluation queue. Evaluation jobs are processed asynchronously, and scores are attached to the specific observation.

Example use cases

  • Validate that final LLM responses are parseable JSON
  • Check whether a tool call includes required arguments
  • Enforce custom business rules for selected model calls
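
As a rough illustration of the first use case, the sketch below scores whether an observation output parses as JSON. The dataclasses are local stand-ins that mirror the function contract documented on this page, so the snippet can run outside Langfuse. Treating already-structured output (a dict or list) as valid is an assumption about how outputs may arrive, not documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


# Local stand-ins mirroring the function contract on this page,
# so the sketch can run outside Langfuse.
@dataclass
class ObservationContext:
    input: Any = None
    output: Any = None
    metadata: Any = None


@dataclass
class EvaluationContext:
    observation: ObservationContext
    experiment: Any = None


@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None


@dataclass
class EvaluationResult:
    scores: List[Score]


def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    output = ctx.observation.output
    # Assumption: output may be a raw string or already-parsed data.
    if isinstance(output, (dict, list)):
        valid, reason = True, "Output is already structured data."
    elif isinstance(output, str):
        try:
            json.loads(output)
            valid, reason = True, "Output parses as JSON."
        except json.JSONDecodeError as exc:
            valid, reason = False, f"Invalid JSON: {exc.msg}"
    else:
        valid, reason = False, "Output is missing or not a string."

    return EvaluationResult(
        scores=[
            Score(name="Valid JSON", value=valid, data_type="BOOLEAN", comment=reason)
        ]
    )
```

The parser's error message goes into the score comment, so a failing observation can be diagnosed without opening the execution trace.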

Experiments

Run evaluators on controlled test datasets to compare model versions, prompt variations, or system configurations in a reproducible environment.

Why target experiments

  • You need deterministic pass/fail checks for development workflows
  • You want to compare multiple prompt versions or model configurations
  • You have datasets with expected outputs or metadata that your evaluator should inspect

Data flow

Each experiment run generates traces and observations that can be scored by your selected evaluators. The evaluator receives observation data plus experiment item context, such as expected output and item metadata.

  1. Create a dataset with test inputs and, optionally, expected outputs.
  2. Run an experiment via UI or SDK. See Experiments via UI or Experiments via SDK.
  3. Select code evaluators to score the generated observations.
  4. Compare results across experiment runs to make data-driven decisions.

Example use case

  • Compare two prompt versions on a dataset of support questions and check whether each response contains the required JSON fields
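
One way to express this use case, assuming each dataset item lists its required fields under a hypothetical item metadata key named required_fields. That key is an illustration only and not part of Langfuse; the dataclasses are local stand-ins for the function contract on this page.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


# Local stand-ins mirroring the function contract on this page.
@dataclass
class ObservationContext:
    input: Any = None
    output: Any = None
    metadata: Any = None


@dataclass
class ExperimentContext:
    item_expected_output: Any = None
    item_metadata: Any = None


@dataclass
class EvaluationContext:
    observation: ObservationContext
    experiment: Optional[ExperimentContext] = None


@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None


@dataclass
class EvaluationResult:
    scores: List[Score]


def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    # Hypothetical convention: item metadata looks like
    # {"required_fields": ["ticket_id", "category"]}.
    item_metadata = ctx.experiment.item_metadata if ctx.experiment else None
    required = []
    if isinstance(item_metadata, dict):
        required = list(item_metadata.get("required_fields", []))

    output = ctx.observation.output
    if isinstance(output, str):
        try:
            output = json.loads(output)
        except json.JSONDecodeError:
            output = None  # unparseable output is treated as having no fields

    present = output if isinstance(output, dict) else {}
    missing = [field for field in required if field not in present]
    passed = bool(required) and not missing

    if not required:
        comment = "No required_fields found in the experiment item metadata."
    elif missing:
        comment = "Missing fields: " + ", ".join(missing)
    else:
        comment = "All required fields are present."

    return EvaluationResult(
        scores=[
            Score(name="Required fields", value=passed, data_type="BOOLEAN", comment=comment)
        ]
    )
```

Failing when no required fields are configured is a design choice in this sketch: it makes a misconfigured dataset item visible instead of silently passing.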

Set up step-by-step

Create a code evaluator

Go to the Evaluators page and create a new code evaluator.

Write the evaluator code

Choose Python or TypeScript and implement the evaluate function. Keep the code deterministic and within the runtime constraints.

Configure where it runs

Select observations or experiments as the target and configure filters, sampling, and mappings as needed.

Test the evaluator

Run the evaluator against sample observations before enabling it. Use the preview to confirm that the observation fields and experiment fields passed to ctx match what your code expects.
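
Alongside the built-in preview, a quick local dry run can catch shape mistakes early. The helper below, validate_result, is a hypothetical function and not part of Langfuse. It checks a result against the score rules documented on this page: at least one score, a supported data type, and a value whose Python type fits that data type. Rejecting a bool for NUMERIC is a deliberately strict assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, List


# Minimal local stand-ins for the result types in the function contract.
@dataclass
class Score:
    name: str
    value: Any
    data_type: str


@dataclass
class EvaluationResult:
    scores: List[Score]


# Python types accepted per data type (assumption based on this page).
EXPECTED_TYPES = {
    "NUMERIC": (int, float),
    "BOOLEAN": (bool,),
    "CATEGORICAL": (str,),
    "TEXT": (str,),
}


def validate_result(result: EvaluationResult) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems = []
    if not result.scores:
        problems.append("evaluate must return at least one score")
    for score in result.scores:
        expected = EXPECTED_TYPES.get(score.data_type)
        if expected is None:
            problems.append(f"{score.name}: unsupported data_type {score.data_type!r}")
        elif score.data_type == "NUMERIC" and isinstance(score.value, bool):
            # bool is a subclass of int in Python, so check it explicitly.
            problems.append(f"{score.name}: NUMERIC value should not be a bool")
        elif not isinstance(score.value, expected):
            problems.append(
                f"{score.name}: {score.data_type} value has type {type(score.value).__name__}"
            )
    return problems
```

Calling validate_result(evaluate(sample_ctx)) on a handful of hand-built contexts gives a fast feedback loop before the evaluator is enabled.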

Trigger the evaluation

To see your evaluator in action, either send traces (fastest) or trigger an experiment run via the UI or SDK (takes longer to set up). Make sure the target data in the evaluator settings matches how you trigger the evaluation: observations for live traces, experiments for experiment runs.

Function contract

Each evaluator exposes an evaluate function. Langfuse passes an EvaluationContext and expects an EvaluationResult with one or more scores.

Python

from dataclasses import dataclass
from typing import Any


@dataclass
class ObservationContext:
    input: Any = None
    output: Any = None
    metadata: Any = None


@dataclass
class ExperimentContext:
    item_expected_output: Any = None
    item_metadata: Any = None


@dataclass
class EvaluationContext:
    observation: ObservationContext
    experiment: ExperimentContext | None = None


@dataclass
class Score:
    name: str
    value: int | float | str | bool
    data_type: str
    comment: str | None = None
    config_id: str | None = None
    metadata: dict[str, Any] | None = None


@dataclass
class EvaluationResult:
    scores: list[Score]


def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    output_present = ctx.observation.output is not None

    return EvaluationResult(
        scores=[
            Score(
                name="Output present",
                value=output_present,
                data_type="BOOLEAN",
                comment=(
                    "Observation output is present."
                    if output_present
                    else "Observation output is missing."
                ),
                metadata={"rule": "output_present"},
            )
        ]
    )

TypeScript

type EvaluationContext = {
  observation: {
    input: any;
    output: any;
    metadata: any;
  };
  experiment:
    | {
        itemExpectedOutput: any;
        itemMetadata: any;
      }
    | undefined;
};

type ScoreBase = {
  name: string;
  comment?: string;
  configId?: string | null;
  metadata?: Record<string, unknown>;
};

type NumericScore = ScoreBase & {
  dataType: "NUMERIC";
  value: number;
};

type BooleanScore = ScoreBase & {
  dataType: "BOOLEAN";
  value: boolean;
};

type CategoricalScore = ScoreBase & {
  dataType: "CATEGORICAL";
  value: string;
};

type TextScore = ScoreBase & {
  dataType: "TEXT";
  value: string;
};

type Score = NumericScore | BooleanScore | CategoricalScore | TextScore;

type EvaluationResult = {
  scores: Score[];
};

function evaluate({
  observation: { input, output, metadata },
  experiment,
}: EvaluationContext): EvaluationResult {
  const itemExpectedOutput = experiment?.itemExpectedOutput;
  const itemMetadata = experiment?.itemMetadata;
  const outputPresent = output != null;

  return {
    scores: [
      {
        name: "Output present",
        value: outputPresent,
        dataType: "BOOLEAN",
        comment: outputPresent
          ? "Observation output is present."
          : "Observation output is missing.",
        metadata: {
          rule: "output_present",
          hasInput: input != null,
          hasObservationMetadata: metadata != null,
          hasExpectedOutput: itemExpectedOutput != null,
          hasExperimentMetadata: itemMetadata != null,
        },
      },
    ],
  };
}

Context fields

| Field | Description |
| --- | --- |
| ctx.observation.input | The input recorded on the observation selected by the evaluator target. |
| ctx.observation.output | The output recorded on the observation selected by the evaluator target. |
| ctx.observation.metadata | The metadata recorded on the observation. |
| ctx.experiment | Present only when the evaluator runs on an experiment. |
| ctx.experiment.item_expected_output (Python) / ctx.experiment.itemExpectedOutput (TypeScript) | Expected output from the experiment item. |
| ctx.experiment.item_metadata (Python) / ctx.experiment.itemMetadata (TypeScript) | Metadata from the experiment item. |

Score fields

| Field | Description |
| --- | --- |
| name | Required score name. |
| value | Required score value. |
| data_type / dataType | Required score data type. Supported values are NUMERIC, CATEGORICAL, BOOLEAN, and TEXT. |
| comment | Optional reasoning or explanation stored with the score. |
| config_id / configId | Optional score config ID. When provided, the score must satisfy the referenced score config. |
| metadata | Optional metadata stored with the score. |
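
Since one evaluator can return several scores, the fields above can be combined in a single result. The sketch below returns a NUMERIC and a CATEGORICAL score from one call. The keyword list and the length thresholds are hypothetical business rules chosen for illustration, and the dataclasses are local stand-ins for the function contract.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


# Local stand-ins mirroring the function contract on this page.
@dataclass
class ObservationContext:
    input: Any = None
    output: Any = None
    metadata: Any = None


@dataclass
class EvaluationContext:
    observation: ObservationContext
    experiment: Any = None


@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


@dataclass
class EvaluationResult:
    scores: List[Score]


REQUIRED_KEYWORDS = ("refund", "order")  # hypothetical business rule


def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    output = ctx.observation.output
    text = output if isinstance(output, str) else ""
    lowered = text.lower()

    # NUMERIC: fraction of required keywords found in the output.
    found = [kw for kw in REQUIRED_KEYWORDS if kw in lowered]
    coverage = len(found) / len(REQUIRED_KEYWORDS)

    # CATEGORICAL: coarse length bucket (thresholds are arbitrary).
    if len(text) < 200:
        bucket = "short"
    elif len(text) < 1000:
        bucket = "medium"
    else:
        bucket = "long"

    return EvaluationResult(
        scores=[
            Score(
                name="Keyword coverage",
                value=coverage,
                data_type="NUMERIC",
                comment=f"Found {len(found)} of {len(REQUIRED_KEYWORDS)} keywords.",
                metadata={"found": found},
            ),
            Score(name="Response length", value=bucket, data_type="CATEGORICAL"),
        ]
    )
```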

Example: Exact match

This example returns a boolean score that passes when the observation output exactly matches the experiment item's expected output.

Python

def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    """Evaluates one observation and returns one or more Langfuse scores."""
    expected_output = (
        ctx.experiment.item_expected_output if ctx.experiment is not None else None
    )
    matches_expected_output = (
        expected_output is not None and ctx.observation.output == expected_output
    )

    return EvaluationResult(
        scores=[
            Score(
                name="Exact match",
                value=matches_expected_output,
                data_type="BOOLEAN",
                comment=(
                    "Output exactly matches the expected output."
                    if matches_expected_output
                    else "Output does not match the expected output."
                ),
            )
        ]
    )

TypeScript

/**
 * Evaluates one observation and returns one or more Langfuse scores.
 */
function evaluate({
  observation: { input, output, metadata },
  experiment,
}: EvaluationContext): EvaluationResult {
  const itemExpectedOutput = experiment?.itemExpectedOutput;
  const itemMetadata = experiment?.itemMetadata;
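  // Note: === compares strings and other primitives by value, but objects
  // and arrays by reference, so two separately parsed objects never match.
  const matchesExpectedOutput =
    itemExpectedOutput != null && output === itemExpectedOutput;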

  return {
    scores: [
      {
        name: "Exact match",
        value: matchesExpectedOutput,
        dataType: "BOOLEAN",
        comment: matchesExpectedOutput
          ? "Output exactly matches the expected output."
          : "Output does not match the expected output.",
        metadata: {
          hasInput: input != null,
          hasObservationMetadata: metadata != null,
          hasExperimentMetadata: itemMetadata != null,
        },
      },
    ],
  };
}

Debug code evaluator executions

Every code evaluator execution creates a trace, giving you complete visibility into the evaluation process. This lets you inspect the selected inputs and outputs, experiment context, runtime latency, returned scores, logs, and errors.

You can show code evaluator execution traces by filtering the tracing table for the environment langfuse-code-eval.

Code evaluator execution status

  • Completed: Evaluation finished successfully and returned valid scores.
  • Error: Evaluation failed (click the execution trace ID for inputs, outputs, latency, logs, and error details).
  • Pending: Evaluation is queued and waiting to run.

Use the evaluator test run before enabling a new evaluator. It is the fastest way to validate the selected observation data, experiment context, score names, score values, and score data types.

Runtime constraints

Code evaluators are intended for compact, deterministic checks that can run quickly and safely for many observations.

Need a specific third-party library or network access for code evaluators? Please share your use case in GitHub Discussions. Your feedback helps us understand where broader runtime support would be useful.

| Constraint | Limit / guidance |
| --- | --- |
| Languages | Write evaluators in Python or TypeScript. On self-hosted deployments, Python requires the aws-lambda dispatcher; insecure-local supports TypeScript/JavaScript only. |
| TypeScript syntax | Use erasable TypeScript syntax. Type annotations and interfaces are fine; avoid enums, namespaces, decorators, and parameter properties. |
| Dependencies | Use the language standard library. Third-party packages are not available in the evaluator runtime. |
| Network access | Evaluators run without network egress. Keep all required data in the observation or experiment context. |
| Runtime limit | Evaluators must complete within 2 seconds. |
| Result shape | Return at least one score from evaluate. |
| Source size | Keep evaluator source code under 256 KB. |
| Input size | Keep the dispatch payload, including source code and selected variables, under 5.5 MB. |
| Result size | Keep evaluator results under 256 KB. |
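
To get a rough sense of headroom against the runtime and result-size limits before deploying, an evaluator can be timed and measured locally. The check_limits helper below is hypothetical and not part of Langfuse. It assumes 1 KB means 1024 bytes and that the JSON-encoded UTF-8 size approximates how the result is measured; neither detail is confirmed here, and local timings will differ from the hosted runtime.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List, Optional


# Minimal local stand-ins for the result types in the function contract.
@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None


@dataclass
class EvaluationResult:
    scores: List[Score]


RUNTIME_LIMIT_SECONDS = 2.0
RESULT_LIMIT_BYTES = 256 * 1024  # assumption: 1 KB = 1024 bytes


def check_limits(
    evaluate: Callable[[Any], EvaluationResult], ctx: Any
) -> Dict[str, Any]:
    """Run evaluate once and compare timing and result size to the limits."""
    started = time.perf_counter()
    result = evaluate(ctx)
    elapsed = time.perf_counter() - started

    # Assumption: JSON-encoded size is a reasonable proxy for result size.
    result_bytes = len(json.dumps(asdict(result)).encode("utf-8"))

    return {
        "elapsed_seconds": elapsed,
        "within_runtime_limit": elapsed < RUNTIME_LIMIT_SECONDS,
        "result_bytes": result_bytes,
        "within_result_limit": result_bytes < RESULT_LIMIT_BYTES,
    }
```

Running this against the largest observation you expect in production is more informative than a small sample, since parsing cost and comment length tend to grow with the input.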

FAQ

How do I debug timeout errors?

Timeouts usually mean the evaluator is doing too much work for the 2-second runtime limit or is trying to access the network. Network requests are blocked by the runtime and can surface as timeout errors.

To debug this, run the evaluator on a small sample observation, remove network calls, avoid large loops or expensive parsing, and reduce the amount of input, output, metadata, or experiment context selected for the evaluator.

Can I use third-party packages?

No. Code evaluators currently support standard libraries only. If your evaluation requires a third-party package, run that logic in your own infrastructure and ingest the result with Scores via API/SDK.

Why does the experiment context sometimes not exist?

ctx.experiment is only present when the evaluator runs on an experiment. For live observation evaluators, write your code so it handles ctx.experiment being None in Python or undefined in TypeScript.
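
A minimal sketch of that guard in Python: the format check always runs, and the expected-output comparison is added only when experiment context exists. The ORD-123456 order ID format is a hypothetical example, and the dataclasses are local stand-ins for the function contract.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional


# Local stand-ins mirroring the function contract on this page.
@dataclass
class ObservationContext:
    input: Any = None
    output: Any = None
    metadata: Any = None


@dataclass
class ExperimentContext:
    item_expected_output: Any = None
    item_metadata: Any = None


@dataclass
class EvaluationContext:
    observation: ObservationContext
    experiment: Optional[ExperimentContext] = None


@dataclass
class Score:
    name: str
    value: Any
    data_type: str


@dataclass
class EvaluationResult:
    scores: List[Score]


ORDER_ID = re.compile(r"ORD-\d{6}")  # hypothetical format, e.g. ORD-123456


def evaluate(ctx: EvaluationContext) -> EvaluationResult:
    output = ctx.observation.output

    # Runs for both targets: needs only the observation.
    well_formed = isinstance(output, str) and ORDER_ID.fullmatch(output.strip()) is not None
    scores = [Score(name="Order ID format", value=well_formed, data_type="BOOLEAN")]

    # Runs only on experiments: ctx.experiment is None for live observations.
    if ctx.experiment is not None and ctx.experiment.item_expected_output is not None:
        matches = output == ctx.experiment.item_expected_output
        scores.append(Score(name="Exact match", value=matches, data_type="BOOLEAN"))

    return EvaluationResult(scores=scores)
```

Written this way, the same evaluator source can be attached to either target without raising an attribute error on live traffic.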

Can I create code evaluators via API or SDK?

Not yet. Create and manage code evaluators in the Langfuse UI. The public evaluator API currently remains scoped to LLM-as-a-Judge evaluators while the code evaluator contract is in Fast Preview.

If you want to run deterministic evaluation logic in your own application or CI pipeline, use Scores via API/SDK to ingest the resulting scores into Langfuse.

Why can't I find code evaluator execution traces?

Code evaluator executions use the internal environment langfuse-code-eval. Internal environments are hidden from the default tracing view, so filter the tracing table by environment = langfuse-code-eval or open the execution trace from the related score or evaluator log.

How do I configure code evaluators on self-hosted Langfuse?

For self-hosted deployments, configure the code evaluator dispatcher and execution worker in Code evaluators.

The only SDK requirement is OpenTelemetry-based ingestion: Python SDK v3+ or JS/TS SDK v4+.

GitHub Discussions

If you run into issues with one of the runtime constraints, or if a constraint blocks an important evaluation use case, please contribute details in GitHub Discussions.

