March 31, 2026
Langfuse March Update
Agent Skill, Langfuse CLI, boolean and categorical LLM-as-a-Judge scores, Kiro integration, and more
Over the past months, we have been shipping a lot to make Langfuse much easier to use for your coding agents. If you are not using Langfuse through your coding agent yet, we strongly recommend giving it a spin!
Agent Skill
The Langfuse Agent Skill helps coding agents use Langfuse effectively. It follows the open Agent Skills standard and works with Claude Code, Cursor, Codex, and others.
Install it in one line:
```bash
npx skills add langfuse/skills --skill "langfuse"
```

Or just ask your coding agent to install it from github.com/langfuse/skills.
Once installed, your agent can query traces, create datasets, update prompts, migrate hardcoded prompts to Langfuse Prompt Management, and set up observability — all without leaving your editor. Even if you are already successfully using Langfuse, the Skill can help you improve your workflows and instrumentation.
Langfuse CLI
The Skill uses the Langfuse CLI under the hood. The CLI wraps the entire Langfuse API and is auto-generated from our OpenAPI spec, so it's always in sync. Every endpoint becomes a CLI command: traces, prompts, datasets, scores, sessions, metrics, and more.
Built for agents, but useful for humans too. Script your workflows, automate batch-scoring, or sync prompts across environments in CI/CD.
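For example, here is a minimal Python sketch of the kind of scripting the CLI enables, going straight at the public API it wraps. It assumes the standard LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables and the paginated GET /api/public/traces endpoint; treat it as an illustration rather than a definitive recipe.

```python
# A sketch of scripting the Langfuse API directly, as the CLI does.
# Assumptions: the standard LANGFUSE_* env vars are set and `requests`
# is installed; GET /api/public/traces is paginated via page/limit.
import os

import requests

host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
# Basic auth: public key as username, secret key as password.
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

resp = requests.get(
    f"{host}/api/public/traces",
    auth=auth,
    params={"page": 1, "limit": 10},  # first page of traces
)
resp.raise_for_status()

for trace in resp.json()["data"]:
    print(trace["id"], trace.get("name"))
```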
→ npm
Further reading
Here are some pointers for what to do with the Skill and CLI:
- Getting started with all Langfuse features. With the Skill, it is incredibly easy to start using more of the Langfuse platform. Just tell your agent which feature you would like to try, and it can propose useful first use cases and start implementing them.
- Automatic prompt improvement. Annotate a few traces in Langfuse, then let an agent fetch your feedback, analyze patterns, and propose prompt changes. A fast loop from rough to robust. → Read the guide
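As a flavor of the fetch step in that loop, here is a hedged sketch, assuming annotation scores can be listed via GET /api/public/v2/scores with a source filter (endpoint and field names may differ on your Langfuse version):

```python
# A sketch of the "fetch your feedback" step. Assumptions: annotation
# scores are listed at GET /api/public/v2/scores and can be filtered
# by source=ANNOTATION; field names may vary by Langfuse version.
import os

import requests

host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

resp = requests.get(
    f"{host}/api/public/v2/scores",
    auth=auth,
    params={"source": "ANNOTATION", "limit": 50},
)
resp.raise_for_status()

for score in resp.json()["data"]:
    # `comment` carries the annotator's free-text feedback, if any.
    print(score["traceId"], score["name"], score["value"], score.get("comment"))
```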
We have also learned a lot about building efficient Skills:
- Evaluating skill quality. We used Langfuse datasets, tracing, and the Claude Agent SDK to systematically test and improve the Skill itself. Small details matter: a single comment saying "optional" instead of "mandatory" caused consistent agent failures. → Blog post
- Optimizing skills with Autoresearch. We ran Karpathy's autoresearch on our prompt migration skill. Score went from 0.35 to 0.82. Not all changes were keepers, but the process surfaced failure modes we'd never have found manually. → Blog post
Fixes & improvements
- Feat: Boolean Scores in LLM-as-a-Judge
- Feat: Categorical Scores in LLM-as-a-Judge (a score-API sketch follows this list)
- Feat: Delete entire prompt folders
- Feat: Add to dataset batch action from events table
- Feat: Show evaluation prompt on hover in evals
- Feat: Support Japanese characters in prompt variables
- Feat: Position in trace filter
- UI: Chart loading and failure hints
- UI: Tooltip support for filter facets
- UX: Prevent focus loss when typing unit name in price editor
- UX: Mutual exclusion between temperature and top_p for Anthropic models
- UX: Data shown in JSON beta viewer for sessions
- API: Performance controls for `GET /api/public/traces`
- API: Typed `ObservationsV2Response` data field
- Fix: Usage/details summing now computes correctly
- Integration: Kiro — AI-powered IDE by AWS
- and many more!
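On the first two items above: the same boolean and categorical data types are also available when creating scores through the public API. A minimal sketch, assuming POST /api/public/scores accepts a dataType field (the trace ID is a placeholder):

```python
# A sketch of creating a boolean score directly. Assumptions:
# POST /api/public/scores accepts a dataType field with BOOLEAN /
# CATEGORICAL / NUMERIC; the trace ID below is a placeholder.
import os

import requests

host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

resp = requests.post(
    f"{host}/api/public/scores",
    auth=auth,
    json={
        "traceId": "replace-with-a-real-trace-id",  # placeholder
        "name": "hallucination",
        "value": 1,  # boolean scores use 0/1
        "dataType": "BOOLEAN",
    },
)
resp.raise_for_status()
```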
Upcoming events
- AI Engineer Europe, London — April 8, 2026
- AI Demo Night in SF — April 9, 2026
- Co-hosting a Hackathon with OpenAI in Berlin — April 15, 2026
- Google Cloud Next in Vegas — April 22–24, 2026
- ClickHouse Open House in SF — May 26–28, 2026