04 Monitoring
Workshop source
Workshop material is maintained in the public langfuse/langfuse-workshop repository. Use the repository for the runnable app, checkpoint branches, and local setup.
Starting point
git checkout checkpoint/04-monitoring
You have a traced app with optional Langfuse-managed prompts. Every chat turn lands in Langfuse as a nested trace.
Why monitor your AI app
In production, an AI app produces a lot of traces. Most of them are fine. The interesting ones (the answers that drift, the requests the agent shouldn't be handling at all, the patterns that change over time) are what you want to find. Monitoring is how you catch those signals without reading every trace by hand.
For the bigger picture, see the Langfuse Academy lesson on monitoring.
Goal
The goal of monitoring is to surface what is worth knowing about your AI application. For Specs, we start with two events worth catching:
- User disagreement: Dad pushes back ("No, that menu isn't there"). Either the agent gave the wrong steps or the app is showing its limits.
- Out-of-scope requests: Dad tries to use Specs for something it isn't built for ("Can you file my taxes?"). Useful both for spotting product expansion ideas and for confirming the agent refuses gracefully.
Monitoring also has a quality-tracking dimension (the average score on some metric over time). We recommend starting with signal detection: tracking aggregate quality is most useful once you and your team have a clear opinion about what quality even means in your context, and the fastest way to form that opinion is to look at the surprising traces.
You don't need to change any code in this step. The trace shape from 02-tracing already has everything a judge-based monitor needs: the agent observation has the full conversation and final answer, and each OpenAI generation has the system prompt plus the same message array.
Step 1: Wire the first two monitors (Langfuse UI)
Langfuse ships published templates for User Disagreement and Out-of-Scope Request. Both are LLM-as-a-judge evaluators that read variables from observations. The two templates need slightly different targets:
- Out-of-Scope Request needs the system prompt, so target the final OpenAI generation.
- User Disagreement needs the conversation history, so target the root dad-it-support-chat-turn agent observation.
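The two targets can be pictured as filters over the observations in a trace. The sketch below is illustrative only: the `Observation` shape, the `toolCallCount` field, the sample observation names, and both helper functions are hypothetical stand-ins, not the Langfuse data model or API. Only the observation types, the `dad-it-support-chat-turn` name, and the "Tool Call count = 0" filter come from this workshop.

```typescript
// Hypothetical, simplified observation shape (not the Langfuse schema).
type Observation = {
  type: "agent" | "generation" | "tool";
  name: string;
  toolCallCount: number; // assumed field: tool calls requested by this step
};

// Out-of-Scope Request: generations that requested no tool calls, i.e. the
// final answer rather than an intermediate tool decision.
function outOfScopeTargets(observations: Observation[]): Observation[] {
  return observations.filter(
    (o) => o.type === "generation" && o.toolCallCount === 0,
  );
}

// User Disagreement: the root agent observation for the chat turn.
function disagreementTargets(observations: Observation[]): Observation[] {
  return observations.filter(
    (o) => o.type === "agent" && o.name === "dad-it-support-chat-turn",
  );
}

// Made-up trace: agent root, a tool-deciding generation, the tool call,
// and the final generation that writes the answer.
const exampleTrace: Observation[] = [
  { type: "agent", name: "dad-it-support-chat-turn", toolCallCount: 0 },
  { type: "generation", name: "openai-tool-decision", toolCallCount: 1 },
  { type: "tool", name: "lookup-steps", toolCallCount: 0 },
  { type: "generation", name: "openai-final-answer", toolCallCount: 0 },
];

console.log(outOfScopeTargets(exampleTrace).map((o) => o.name));
console.log(disagreementTargets(exampleTrace).map((o) => o.name));
```

In this made-up trace each filter selects exactly one observation, which is the property you want from an evaluator target: one judge call per chat turn.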
Fresh project check: if Langfuse shows "No default model set" before the template picker, configure Project Settings → LLM Connections with your OpenAI key, then return to Evaluators → Set up evaluator and save a default evaluator model such as openai / gpt-4.1. Keep the API key in the Langfuse secret field only; do not paste it into workshop transcripts or shared notes.
For Out-of-Scope Request:
- In Langfuse, open Evaluators → New evaluator and pick the Out-of-Scope Request template from the published library.
- Target the final OpenAI generation:
  - Observation type: generation
  - Tool Call count = 0 (to exclude tool decisions)
- Map the template's variables from the generation's Input:
  | Template variable | Object field | JsonPath |
  | --- | --- | --- |
  | {{system_prompt}} | Input | $.messages[0].content |
  | {{last_user_message}} | Input | $.messages[-1:].content |
  The [-1:] slice reads the final message in the generation input, so the mapping keeps working as the conversation grows. If your trace has a different message shape, inspect the generation input and adjust the JsonPath.
- Use the default judge model you configured during setup, or pick another structured-output-capable judge model, and save.
- Enable the evaluator.
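To see what the two JsonPath expressions in the mapping select, the sketch below resolves them by hand against a made-up generation input. It uses neither a JsonPath library nor Langfuse's own resolver; the `ChatMessage` type, the sample messages (including the system prompt text), and both helper functions are assumptions for illustration.

```typescript
// Assumed OpenAI-style chat message shape; inspect your own generation input.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type GenerationInput = { messages: ChatMessage[] };

// Hand-written equivalent of $.messages[0].content
function systemPrompt(input: GenerationInput): string | undefined {
  return input.messages[0]?.content;
}

// Hand-written equivalent of $.messages[-1:].content
// A JsonPath slice yields a one-element list; this helper unwraps it.
// slice(-1) keeps only the last element, however long the array is.
function lastMessageContent(input: GenerationInput): string | undefined {
  return input.messages.slice(-1)[0]?.content;
}

// Made-up first turn: system prompt plus one user message.
const firstTurn: GenerationInput = {
  messages: [
    { role: "system", content: "You are Specs, an IT helper for Dad." },
    { role: "user", content: "Can you file my taxes?" },
  ],
};

// Made-up later turn: the array has grown, the paths stay the same.
const laterTurn: GenerationInput = {
  messages: [
    { role: "system", content: "You are Specs, an IT helper for Dad." },
    { role: "user", content: "How do I turn Bluetooth on?" },
    { role: "assistant", content: "Open Settings, then tap Bluetooth." },
    { role: "user", content: "No, that menu isn't there" },
  ],
};

console.log(systemPrompt(laterTurn));
console.log(lastMessageContent(firstTurn));
console.log(lastMessageContent(laterTurn));
```

A fixed index such as `$.messages[1].content` would keep returning the first user message on later turns, which is why the mapping uses a slice from the end.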
For User Disagreement:
- In Langfuse, open Evaluators → New evaluator and pick the User Disagreement template from the published library.
- Target the root agent observation:
  - Observation type: agent
  - Observation name: dad-it-support-chat-turn
- Map the template's variables from the agent observation's Input:
  | Template variable | Object field | JsonPath |
  | --- | --- | --- |
  | {{conversation_history}} | Input | $.messages |
  | {{last_user_message}} | Input | $.messages[-1:].content |
  The agent input is the chat request from the browser, so the last message is Dad's latest message for that turn.
- Use the default judge model you configured during setup, or pick another structured-output-capable judge model, and save.
- Enable the evaluator.
💡 Custom evaluators. The shipped templates are a fast on-ramp, but you don't have to use them. Evaluators → New evaluator → Custom lets you write your own prompt and define your own variables. The mapping flow is the same: point each variable at the right JsonPath on the right observation, and you're done.
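Conceptually, an evaluator prompt is a template whose `{{variable}}` placeholders are filled from the mapped fields. The sketch below shows that substitution in plain TypeScript. It is not Langfuse's rendering code: `renderTemplate`, the prompt wording, and the `assistant_answer` variable are hypothetical.

```typescript
// Replace each {{name}} placeholder with its mapped value.
// Placeholders without a mapping are left untouched so the gap stays visible.
function renderTemplate(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (placeholder: string, name: string) =>
      Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name]
        : placeholder,
  );
}

// Hypothetical custom judge prompt with two variables.
const judgePrompt =
  "User message: {{last_user_message}}\n" +
  "Assistant answer: {{assistant_answer}}";

// Only one variable is mapped here, to show what a missing mapping looks like.
const rendered = renderTemplate(judgePrompt, {
  last_user_message: "Can you file my taxes?",
});

console.log(rendered);
```

Leaving unmapped placeholders in the output is a design choice for this sketch: it makes a forgotten mapping easy to spot when you read the prompt the judge received.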
Verify
npm run dev
Send three turns: one that should score clean and two that should each trip a monitor:
- In-scope: "How do I turn Bluetooth on?" (should score clean on both monitors)
- Out-of-scope: "Can you file my taxes?"
- Disagreement: ask a normal question, then reply with "No, that menu isn't there"
In Langfuse, wait for the evaluator to run (refresh after a few seconds), then sort traces by the evaluator score. The out-of-scope and disagreement traces should bubble to the top.
When the out-of-scope monitor fires, you can confirm the chatbot already rejected the request gracefully, which is exactly what we asked it to do. But those traces are also the most interesting ones to read end-to-end: a steady stream of out-of-scope hits is often the earliest signal that there's additional scope worth handling. "Can you file my taxes?" is silly, but "Help me move photos to my new iPad" might be a real feature request hiding in monitor output.
User disagreement is a much higher-signal event. When a user pushes back on an answer the agent just gave, something almost certainly went wrong: a wrong tool result, missing context, or an instruction that doesn't match the iPhone they're on. These are the traces you want to read first, and they're prime candidates to turn into dataset items for 05-dataset.
Wrap-up
Good monitors are how you separate signal from noise. Production means a lot of traces, and the most important question is "which ones should I look at?" Monitors answer that.
Once you have signal-detection monitors in place, the next step over time is average-metric tracking: picking quality metrics and watching them drift. The right way to choose those metrics is error analysis: look at a sample of the surprising traces you're now catching, group them by failure mode, and turn the failure modes into evaluators. The monitoring lesson on the Academy goes deeper on this.
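The grouping step of error analysis can be as simple as a tally. The sketch below assumes you have read a sample of flagged traces and hand-labeled each with a failure mode; the `ReviewedTrace` shape, the trace IDs, and the labels are all made up for illustration.

```typescript
// Hypothetical record produced by manually reviewing a flagged trace.
type ReviewedTrace = { traceId: string; failureMode: string };

// Count traces per failure mode, most frequent first.
function countByFailureMode(traces: ReviewedTrace[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const trace of traces) {
    counts.set(trace.failureMode, (counts.get(trace.failureMode) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Made-up review notes for six flagged traces.
const reviewed: ReviewedTrace[] = [
  { traceId: "t1", failureMode: "wrong tool result" },
  { traceId: "t2", failureMode: "missing context" },
  { traceId: "t3", failureMode: "wrong tool result" },
  { traceId: "t4", failureMode: "instruction does not match device" },
  { traceId: "t5", failureMode: "wrong tool result" },
  { traceId: "t6", failureMode: "missing context" },
];

for (const [mode, count] of countByFailureMode(reviewed)) {
  console.log(`${count}x ${mode}`);
}
```

The failure modes at the top of the tally are the natural first candidates to turn into evaluators.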
The traces you catch with these monitors are also the best source for the next step, 05-dataset, because they're real examples of behavior you want to lock in or fix.
End state
This is the starting point for 05-dataset.