Classifying User Intent with Categorical LLM-as-a-Judge
A guide to setting up a categorical LLM-as-a-judge evaluator that classifies user intent, with a walkthrough of how we applied it to our demo application.
We recently shipped categorical LLM-as-a-judge scores, a highly requested feature. Instead of only returning numeric scores, evaluators can now return category labels. To put this to use right away, we set it up on our own support chatbot demo application to classify the user's intent into a fixed set of categories.
New to LLM-as-a-judge? The docs page covers the concept in detail.
What you can do with intent labels
Once every trace has an intent label attached to it, a few things become possible:
- Filter traces by intent. Want to look at only self-hosting questions, or only implementation questions? Filter on the score value and review just those traces.
- Build dashboards showing question distribution. See at a glance how many users are asking conceptual questions vs. implementation questions vs. pricing questions, and track how the distribution changes over time.
- Correlate intent with other scores using Score Analytics. For example, if you collect user feedback scores, you can check whether self-hosting questions consistently receive lower feedback scores than conceptual questions.
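The correlation idea above can be sketched in plain Python, independent of any Langfuse API. This is a minimal, hypothetical example: the `records` shape (one dict per trace with `intent` and `feedback` keys) stands in for score data you might export from your project, not an actual export format.

```python
from collections import defaultdict
from statistics import mean

def feedback_by_intent(records):
    """Group feedback scores by intent label and average them.

    `records` is a list of dicts with hypothetical keys "intent" and
    "feedback" -- stand-ins for scores exported from your traces.
    """
    grouped = defaultdict(list)
    for r in records:
        grouped[r["intent"]].append(r["feedback"])
    return {intent: mean(vals) for intent, vals in grouped.items()}

# Hypothetical exported data: one row per trace, feedback as 0/1 thumbs.
rows = [
    {"intent": "self-hosting", "feedback": 0.0},
    {"intent": "self-hosting", "feedback": 1.0},
    {"intent": "conceptual-question", "feedback": 1.0},
    {"intent": "conceptual-question", "feedback": 1.0},
]
print(feedback_by_intent(rows))
# {'self-hosting': 0.5, 'conceptual-question': 1.0}
```

In this toy data, self-hosting questions average markedly lower feedback than conceptual questions, which is exactly the kind of signal the intent label makes visible.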
The evaluator prompt
We defined six intent categories that map to distinct user needs:
| Category | Description |
|---|---|
| `conceptual-question` | The user wants to understand what Langfuse is, how a feature or concept works, or wants a high-level explanation. |
| `implementation-question` | The user wants to build, integrate, or write code with Langfuse. This includes getting-started questions, framework-specific integration, debugging errors, and API usage. |
| `self-hosting` | The user wants to deploy or run Langfuse on their own infrastructure. Includes Docker, Kubernetes, local setup, and infrastructure questions. |
| `pricing-and-comparison` | The user asks about cost, pricing plans, or compares Langfuse to an alternative. |
| `ui-feedback` | The user gives feedback, reports a UI issue, or makes a product suggestion about the Langfuse application itself. |
| `irrelevant-to-langfuse` | Greetings with no question, test messages, gibberish, or requests completely unrelated to Langfuse. |
Read the full evaluator prompt here.
You are a user intent classifier for a Langfuse support chatbot. You will be given the user's message. Classify it into exactly one of the following categories.
Categories
- conceptual-question: The user wants to understand what Langfuse is, how a feature or concept works, or wants a high-level explanation. The user is not trying to build or set up anything yet. Examples: "What is Langfuse?", "How does tracing work?", "Explain this to a product manager."
- implementation-question: The user wants to build, integrate, or write code with Langfuse. This includes getting-started questions, framework-specific integration, requests for code examples, debugging errors, and API usage. If the user is trying to do something with Langfuse in their application, it belongs here.
- self-hosting: The user wants to deploy or run Langfuse on their own infrastructure. Includes Docker, Kubernetes, local setup, resource provisioning, and infrastructure questions. If the question is about running Langfuse itself rather than using it in an application, it belongs here.
- pricing-and-comparison: The user asks about cost, pricing plans, or compares Langfuse to an alternative (e.g. "How does Langfuse compare to X?"). A general "What can Langfuse do?" question is conceptual, not a comparison.
- ui-feedback: The user gives feedback, reports a UI issue, or makes a product suggestion about the Langfuse application itself. The user is commenting on their experience, not asking a question.
- irrelevant-to-langfuse: Greetings with no question, test messages, gibberish, or requests completely unrelated to Langfuse (e.g. shopping lists, weather, song lyrics).
Rules
- Classify based on the user's primary goal. A user asking "what are traces?" is conceptual; a user asking "how do I set up tracing in Python?" is implementation.
- When a message touches multiple categories, choose the one that best matches what the user is ultimately trying to achieve.
- If the message is in a non-English language, classify based on the translated intent; language does not affect the category.
Input: {{input}}
Output: Respond with only the category name, nothing else.
A few things worth noting about this prompt:
- Each category has a clear boundary. For example, `conceptual-question` vs. `implementation-question` comes down to whether the user is trying to understand something or build something.
- The rules section resolves ambiguity. When a message touches multiple categories, the judge picks the one matching the user's primary goal.
- The categories are actionable. Each one maps to something we can act on, whether that's improving docs, routing feedback, or filtering out noise.
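The prompt instructs the judge to respond with only the category name, but in practice it is worth validating the raw completion before relying on it. Here is a minimal sketch in plain Python; the normalization rules and the fallback label are our own assumptions, not Langfuse behavior.

```python
# The six categories from the evaluator prompt above.
CATEGORIES = {
    "conceptual-question",
    "implementation-question",
    "self-hosting",
    "pricing-and-comparison",
    "ui-feedback",
    "irrelevant-to-langfuse",
}

def normalize_label(raw: str, fallback: str = "irrelevant-to-langfuse") -> str:
    """Map a judge completion onto a known category.

    Strips whitespace, trailing punctuation, and casing drift; anything
    that still does not match falls back to a default label (our own
    convention for this sketch).
    """
    label = raw.strip().strip(".\"'`").lower()
    return label if label in CATEGORIES else fallback

print(normalize_label("  Self-Hosting.\n"))       # self-hosting
print(normalize_label("I think this is pricing"))  # irrelevant-to-langfuse
```

Requiring an exact match against a fixed set is what makes the categorical score type robust: a completion that drifts from the expected labels is caught instead of silently creating a new category.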
You can see this evaluator live in our public demo project.
Configuring the evaluator in Langfuse
After entering your prompt for the evaluator, select "Categorical" as the score type.
Then define the category values and enter the score reasoning and category selection prompts.
Make sure the category names you enter here match the category names used in the evaluator prompt exactly.
There is also a checkbox to allow multiple matches. In our case, we left this unchecked since each user message should have exactly one intent.
After saving, you can configure the remaining settings as you normally would for any LLM-as-a-judge evaluator: filtering on the observations you want to run on, mapping the variables to the observation input, etc. See the LLM-as-a-judge documentation for guidance on these steps.
Intent scores in action
With the evaluator running, every new support chatbot trace gets an intent label automatically. Now we can set up a dashboard widget that shows us the distribution of user intent over conversations.
To create this widget, select "Scores Categorical" as the view, set the metric to "Count", filter by score name `user-intent`, and break down by "String Value". We chose a pie chart to visualize the distribution.
The resulting widget shows the distribution of user intent across conversations.
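The counting behind a widget like this is easy to reproduce offline, e.g. when post-processing an export. A sketch over a hypothetical list of intent labels:

```python
from collections import Counter

# Hypothetical intent labels, one per conversation.
labels = [
    "implementation-question",
    "implementation-question",
    "self-hosting",
    "conceptual-question",
    "implementation-question",
]

distribution = Counter(labels)
print(distribution.most_common())
# [('implementation-question', 3), ('self-hosting', 1), ('conceptual-question', 1)]
```

Each `(label, count)` pair corresponds to one slice of the pie chart.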
The dashboard is publicly accessible if you want to explore the data yourself.
Get started
Categorical LLM-as-a-judge scores are available on all Langfuse plans. To set up your own evaluator:
- Read the LLM-as-a-judge documentation for the full setup guide.
- Explore the public demo project to see this evaluator in action.
- Check out the changelog entry for the feature announcement.