May 28, 2026 | Launch Week 5 🚀

Code evaluators

Tobias Wochinger

Run deterministic Python or TypeScript checks on observations and experiments in Langfuse.

You can now create code evaluators in Langfuse to score observations and experiments with deterministic Python or TypeScript logic. Use them for exact checks such as JSON parseability, schema validation, exact match, required tool arguments, or custom business rules.
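As a rough illustration of what such exact checks look like, here is a minimal Python sketch using only the standard library. The helper names (`is_valid_json`, `exact_match`, `has_required_args`) are hypothetical and not part of any Langfuse API; they only show the kind of deterministic logic a code evaluator might wrap.

```python
import json


def is_valid_json(text) -> bool:
    """Return True if `text` parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        # TypeError covers non-string inputs such as None.
        return False


def exact_match(output: str, expected: str) -> bool:
    """Compare two strings after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def has_required_args(tool_args: dict, required: list) -> bool:
    """Check that every required tool argument is present and not None."""
    return all(tool_args.get(name) is not None for name in required)
```

Each check returns the same result for the same input, which is the property that makes it a better fit for code than for a judge model.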

Run them on live production observations to monitor specific operations, or attach them to experiments to compare prompt and model variants against controlled datasets. Each evaluator returns native Langfuse scores, so results work with trace views, experiment comparisons, filters, dashboards, and Score Analytics.

Code evaluators complement LLM-as-a-Judge: use code for objective checks where deterministic logic is more reliable, and use a judge model for semantic quality, tone, helpfulness, or rubric-based reasoning.

How it works

Write an evaluate function in Python or TypeScript in the Langfuse UI
Target live observations or experiment observations
Configure filters, sampling, and context fields
Test the evaluator on sample data before enabling it
Debug executions through evaluator traces in the langfuse-code-eval environment
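The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual evaluator contract: the `evaluate(observation)` signature, the `output` field, and the returned `name` / `value` / `comment` keys are all hypothetical, so check the setup guide for the real shape. The loop at the end mimics, locally, the idea of testing an evaluator on sample data before enabling it.

```python
import json

# Assumption: the evaluator receives one observation as a dict and
# returns one score as a dict. The real contract may differ.
REQUIRED_KEYS = {"answer", "sources"}


def evaluate(observation: dict) -> dict:
    """Score whether the observation output is a JSON object with the required keys."""
    raw = observation.get("output")
    try:
        parsed = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError:
        return {"name": "schema_valid", "value": False,
                "comment": "output is not valid JSON"}
    if not isinstance(parsed, dict):
        return {"name": "schema_valid", "value": False,
                "comment": "output is not a JSON object"}
    missing = sorted(REQUIRED_KEYS - set(parsed))
    return {"name": "schema_valid", "value": not missing,
            "comment": f"missing keys: {missing}" if missing else "ok"}


# Hypothetical sample data for a dry run.
samples = [
    {"output": '{"answer": "42", "sources": []}'},
    {"output": '{"answer": "42"}'},
    {"output": "not json"},
]
results = [evaluate(sample) for sample in samples]
for result in results:
    print(result["value"], result["comment"])
```

Returning a short comment alongside the value makes failed checks easier to interpret when you inspect individual executions.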

Code evaluators are designed for compact checks that run quickly at scale. Evaluator code can use the language's standard library, runs without network egress, and returns one or more numeric, categorical, boolean, or text scores.
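To make the four score types concrete, here is a small sketch of one evaluator that returns several scores at once. The function name, the `output` field, and the list-of-dicts return shape are assumptions made for illustration; the actual format is described in the setup guide.

```python
# Assumption: an evaluator may return a list of score dicts, one per score.
def evaluate_length(observation: dict) -> list:
    """Return numeric, categorical, boolean, and text scores for one output."""
    output = str(observation.get("output") or "")
    words = output.split()
    if len(words) < 20:
        bucket = "short"
    elif len(words) < 200:
        bucket = "medium"
    else:
        bucket = "long"
    return [
        {"name": "word_count", "value": len(words)},                 # numeric
        {"name": "length_bucket", "value": bucket},                  # categorical
        {"name": "non_empty", "value": bool(words)},                 # boolean
        {"name": "first_word", "value": words[0] if words else ""},  # text
    ]


scores = evaluate_length({"output": "Paris is the capital of France."})
```

The sketch uses only built-in string operations and needs no network access, in line with the constraints described above.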

Get started

Read the setup guide to create your first evaluator, choose the right target, and see Python and TypeScript examples for the evaluator contract. Code evaluators are available across Langfuse environments, including self-hosted deployments.

