Code evaluators

Run deterministic Python or TypeScript checks on observations and experiments in Langfuse.
You can now create code evaluators in Langfuse to score observations and experiments with deterministic Python or TypeScript logic. Use them for exact checks such as JSON parseability, schema validation, exact match, required tool arguments, or custom business rules.
Run them on live production observations to monitor specific operations, or attach them to experiments to compare prompt and model variants against controlled datasets. Each evaluator returns native Langfuse scores, so results work with trace views, experiment comparisons, filters, dashboards, and Score Analytics.
Code evaluators complement LLM-as-a-Judge: use code for objective checks where deterministic logic is more reliable, and use a judge model for semantic quality, tone, helpfulness, or rubric-based reasoning.
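As a rough illustration of the objective checks that suit code better than a judge model, the sketch below implements JSON parseability and a required-tool-arguments check in plain Python. The function names and the tool-call shape are assumptions made for this example, not part of the Langfuse evaluator contract.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def missing_tool_args(tool_call: dict, required: list[str]) -> list[str]:
    """Return the required argument names that are absent from a tool call.

    Assumes a hypothetical tool-call shape: {"name": ..., "arguments": {...}}.
    """
    arguments = tool_call.get("arguments") or {}
    return [name for name in required if name not in arguments]
```

Both checks give the same answer for the same input every time, which is the property that makes them a better fit for code than for a judge model.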
How it works
- Write an `evaluate` function in Python or TypeScript in the Langfuse UI
- Target live observations or experiment observations
- Configure filters, sampling, and context fields
- Test the evaluator on sample data before enabling it
- Debug executions through evaluator traces in the `langfuse-code-eval` environment
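The steps above can be sketched locally as an exact-match `evaluate` function and a dry run on hand-written samples. The argument shape (`output`, `expected_output`) and the returned dictionary are assumptions for illustration only; the setup guide defines the actual evaluator contract.

```python
def evaluate(observation: dict) -> dict:
    """Exact-match check on one observation.

    The field names and the returned score shape are hypothetical.
    """
    output = str(observation.get("output", "")).strip()
    expected = str(observation.get("expected_output", "")).strip()
    return {"name": "exact_match", "value": output == expected}


# Dry run on sample data before enabling the evaluator.
samples = [
    {"output": "Paris", "expected_output": "Paris"},
    {"output": "paris ", "expected_output": "Paris"},  # differs in case
]
results = [evaluate(sample) for sample in samples]
```

The second sample fails because the comparison is case-sensitive. A dry run like this is a cheap way to catch that kind of strictness before the evaluator runs on production traffic.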
Code evaluators are designed for compact checks that run quickly at scale. They support standard library code, run without network egress, and return one or more numeric, categorical, boolean, or text scores.
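A minimal sketch of one evaluator returning several score types at once, using only the standard library. The score dictionaries, the score names, and the 200-character threshold are assumptions chosen for this example, not the documented return format.

```python
import json


def evaluate(observation: dict) -> list[dict]:
    """Return boolean, numeric, categorical, and text scores for one observation."""
    output = str(observation.get("output", ""))

    try:
        parsed = json.loads(output)
        parse_error = ""
    except json.JSONDecodeError as exc:
        parsed = None
        parse_error = exc.msg  # short parser message, kept as a text score

    return [
        {"name": "json_object", "value": isinstance(parsed, dict)},  # boolean
        {"name": "output_chars", "value": float(len(output))},  # numeric
        {"name": "length_bucket", "value": "short" if len(output) < 200 else "long"},  # categorical
        {"name": "parse_error", "value": parse_error},  # text
    ]
```

Nothing here touches the network or a third-party package, which keeps the check within the constraints described above.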
Get started
Read the setup guide to create your first evaluator, choose the right target, and see Python and TypeScript examples for the evaluator contract. Code evaluators are available across Langfuse environments, including self-hosted deployments.