May 31, 2026

Langfuse May Update

Code Evaluators, full-text search, Langfuse MCP, Experiments in CI/CD and more

Marc Klingen

May was Launch Week month. From May 25–29, we shipped one feature a day, live from ClickHouse OpenHouse. Five drops aimed at the same problem: moving AI applications from prototype to production without the usual guesswork.

Here's everything that landed.

Code Evaluators

Not every check needs an LLM. You can now write a Python or TypeScript evaluate function directly in the Langfuse UI, attach it to live observations or a dataset experiment, and the result lands as a native Langfuse score. JSON parseability, schema validation, exact match, required tool arguments, custom business rules. Deterministic, reproducible, no token cost.

Code evaluators sit alongside LLM-as-a-Judge: code wins for objective checks, the judge wins for semantic quality, and together they give a more complete picture than either alone.
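As a rough illustration of the kind of objective check that fits here, the sketch below tests whether an output is parseable JSON and contains a set of required keys. The `evaluate` signature, the `output` parameter, `REQUIRED_KEYS`, and the returned dictionary shape are all assumptions made for this example; the contract Langfuse actually expects is described in the docs.

```python
import json

# Hypothetical required fields for this example; not defined by Langfuse.
REQUIRED_KEYS = {"answer", "sources"}


def evaluate(output: str) -> dict:
    """Score 1.0 when `output` is a JSON object with all required keys, else 0.0.

    The signature and return shape are assumptions, not the Langfuse contract.
    """
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return {"value": 0.0, "comment": "output is not valid JSON"}

    if not isinstance(parsed, dict):
        return {"value": 0.0, "comment": "output is not a JSON object"}

    missing = sorted(REQUIRED_KEYS - set(parsed))
    if missing:
        return {"value": 0.0, "comment": f"missing keys: {', '.join(missing)}"}

    return {"value": 1.0, "comment": "valid JSON with all required keys"}
```

The same input always produces the same score, which is what makes a check like this reproducible and free of token cost.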

→ Changelog · Docs · Self-hosting config

New full-text search

Pulling one trace out of hundreds of thousands used to mean scroll-and-hope. The new search is built on top of ClickHouse's new FTS engine: large input/output searches that took 18 seconds and scanned 494 GB now return in under half a second and read less than a gigabyte.
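Put as ratios, those figures work out as follows. This is plain arithmetic on the numbers quoted, treating "under half a second" as 0.5 s and "less than a gigabyte" as 1 GB, so both results are lower bounds rather than measured values.

```python
# Figures quoted in the text; the "after" values are upper bounds,
# so the resulting ratios are lower bounds.
before_seconds, after_seconds = 18.0, 0.5
before_gb, after_gb = 494.0, 1.0

speedup = before_seconds / after_seconds  # 18 / 0.5 = 36
scan_reduction = before_gb / after_gb     # 494 / 1 = 494

print(f"at least {speedup:.0f}x faster, at least {scan_reduction:.0f}x less data read")
```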

There's a new matches operator on Observations API v2 so agents and scripts get the same token-based search programmatically.

→ Changelog · Observations API v2 · ClickHouse FTS GA

Langfuse MCP

The hosted Langfuse MCP server used to cover prompt management only. It now covers most of Langfuse: 15 tool categories spanning observations, metrics, scores, score configs, datasets and their items and runs, comments, annotation queues, models, media, and health.

Any agent can now investigate a production issue, pull the relevant observation, query metrics, drop a comment for the team, create a score, or stage a dataset item.

Use the CLI when your agent has a sandbox and the MCP server when it doesn't. To keep the agent read-only, allow-list only the lookup tools.
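The read-only idea can be sketched as a simple filter over tool names. The tool names and the prefix convention below are hypothetical and chosen only for illustration; the actual names are listed in the MCP reference, and how an allow-list is configured depends on the agent client you use.

```python
# Hypothetical tool names for illustration; see the MCP reference for real ones.
AVAILABLE_TOOLS = [
    "get_observation",
    "list_scores",
    "query_metrics",
    "create_score",
    "create_comment",
    "create_dataset_item",
]

# Assumed convention: lookup tools start with a read-style verb.
READ_ONLY_PREFIXES = ("get_", "list_", "query_")


def read_only_allow_list(tools: list[str]) -> list[str]:
    """Keep only the tools whose name starts with a read-style prefix."""
    return [name for name in tools if name.startswith(READ_ONLY_PREFIXES)]


print(read_only_allow_list(AVAILABLE_TOOLS))
```

An explicit allow-list fails closed: a write tool added later stays unavailable until someone deliberately adds it.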

→ Changelog · MCP reference

Experiments in CI/CD

Run your Langfuse experiments inside GitHub Actions. We released a new action that tests every PR against a Langfuse dataset, fails the workflow when scores drop below the threshold you set, and posts the result back to the PR as a comment. Every run is tracked in Langfuse so you can dig into regressions later.
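The gating logic is easy to reason about in isolation. The sketch below is not the action's implementation; it assumes the gate compares the mean of per-item scores against a single threshold, and the function name `gate` and the sample scores are made up for the example.

```python
from statistics import mean


def gate(scores: list[float], threshold: float) -> tuple[bool, str]:
    """Return (passed, summary) for one experiment run.

    Assumption: the run passes when the mean score meets the threshold.
    """
    if not scores:
        return False, "no scores produced"
    average = mean(scores)
    passed = average >= threshold
    verdict = "pass" if passed else "fail"
    return passed, f"{verdict}: mean score {average:.2f} vs threshold {threshold:.2f}"


# Hypothetical per-item scores from a single run.
passed, summary = gate([0.9, 0.8, 0.6], threshold=0.8)
print(summary)
```

In a CI step, exiting with a non-zero status when `passed` is false is what fails the workflow, and the summary string is the kind of text that could go into a PR comment.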

→ Changelog · Docs · GitHub Action

Agent Skill

A playbook your AI coding agent can pick up. Drop it into Claude Code, Cursor, or Codex and the agent knows how to instrument an app, query traces, manage prompts, and set up evaluators.

It also ships with an LLM-as-a-Judge calibration skill that produces a full analysis (accuracy, F1, precision, recall, and cost), graphed in the Experiments view.
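For readers who want the definitions behind those numbers, here is a minimal sketch of how agreement between a judge and human labels is usually computed for a binary verdict. It uses the standard formulas and is not taken from the skill itself; the function name and the sample labels are hypothetical.

```python
def calibration_metrics(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts against human labels (True is the positive class)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("at least one labelled example is required")

    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # both say positive
    fp = sum(j and not h for h, j in pairs)      # judge positive, human negative
    fn = sum(h and not j for h, j in pairs)      # judge negative, human positive
    tn = sum(not h and not j for h, j in pairs)  # both say negative

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: the judge misses one of the two human positives.
metrics = calibration_metrics(
    human=[True, True, False, False],
    judge=[True, False, False, False],
)
print(metrics)
```

Here the judge never raises a false alarm (precision 1.0) but catches only half of the positives (recall 0.5), which is the kind of gap a calibration pass is meant to surface.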

→ Changelog · Docs · Skills on GitHub

Other ships

Self-service Enterprise SSO setup. Organization admins on Langfuse Cloud can now verify domains and configure Enterprise SSO directly in settings. → Changelog
Langfuse Academy. An open explanation of the AI engineering lifecycle: tracing, monitoring, datasets, experiments, evaluation, and how the pieces fit together. → Academy
Sign in with ClickHouse Cloud. Use your ClickHouse Cloud account to sign in to Langfuse Cloud, or link it to an existing Langfuse account. → Changelog
Trace context on /api/public/v2/observations. Fetch a trace's tags, release, and trace name directly on each observation row. → Changelog
Column selection and gzip for blob storage exports. Pick which field groups land in each row and enable gzip compression in scheduled S3, GCS, and Azure exports to shrink files and drop fields you don't need. → Changelog
Enriched observations by default. New Cloud projects use enriched observations for blob storage, PostHog, and Mixpanel exports. The legacy traces/observations sources stay available on existing projects and self-hosted deployments. → Changelog
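On the consuming side, a gzip-compressed export can be read with the standard library alone. The sketch below assumes the file contains one JSON object per line; the actual export format, file name, and field names depend on your export configuration and are hypothetical here. It writes a small stand-in file first so it runs without any download.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_gzip_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line from a gzip-compressed text file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Stand-in for a downloaded export; the field names are hypothetical.
rows = [{"id": "obs-1", "name": "chat"}, {"id": "obs-2", "name": "retrieval"}]

with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "observations.jsonl.gz"
    with gzip.open(export_path, mode="wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    loaded = read_gzip_jsonl(export_path)

print(len(loaded), "rows read")
```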

Fixes

Preserve trace URL filters when opening shared links in a new tab (#13665)
Align prompt variable handling in the UI with the SDK/compiler (#13680)
Use correct units for dashboard charts (#13338)
Render latency metrics in scaled units in custom dashboard widgets (#13242)
Stop retrying eval context-overflow errors (#13930)
Handle Bedrock reasoning content in LLM completions (#13527)
Recognize OpenInference cache-read/cache-write token counts via OTel (#13572)
Saved views: don't override filters when query params are provided (#13865)
Playground: make the tools list scrollable when more than 4 are attached (#13439)
Include today in the Prompts table observation count window (#13415)
Use the prompt's model config for experiments (#13565)
Prevent image flicker on trace UI when image validation fails (#13440)
Parse AI SDK tool calls that arrive as stringified JSON (#13550)
and many more!

That's a wrap on May. If you missed a drop, the Launch Week page has the demos for all five.
