Customer support chatbot
This is an example to illustrate the concepts of the Langfuse Academy.
Context
A SaaS company wants to automate its customer support chat. Today a support team answers every conversation, and years of resolved tickets sit in the ticketing system. The bot would speak to customers in the company's name, so a bad answer does damage beyond one conversation: customers lose trust in support as a whole.
That risk rules out shipping straight to customers. Instead, the rollout happens in three phases, each one earning the confidence for the next:
| Phase | What's live | Gate to the next phase |
|---|---|---|
| 1. Offline | Nothing; the agent is still in development | Drafts good enough that the support team wants to use them |
| 2. Internal | The agent drafts replies, the support team edits and sends | The difference between drafted and sent replies stays small across ticket categories, and the qualitative feedback from the support team is good |
| 3. Customer-facing | The bot answers customers directly | None. This is the end goal, and the agent keeps being improved continuously in this phase |
This staged approach limits the risk at each step while still producing production signals early.
Phase 1: creating the initial agent version
For the support team to actually use the internal agent, the initial version must already be useful to them. To get there, we can lean on the historical data from resolved tickets that we already have; no synthetic data generation is needed.
We can use AI to find the patterns that exist in the historical data from the support ticketing system, and use those patterns to group the tickets into a few datasets by recurring behavior:
Splitting the data into multiple datasets versus keeping it in one is a trade-off. Grouping lets you measure performance per use case and run only a subset when needed. One big dataset can make sense too, especially early on when the optimal split isn't clear yet.
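To make the grouping trade-off concrete, here is a minimal sketch in plain Python. The `DatasetItem` schema, the helper names, and the `billing` category are assumptions made for illustration; they are not taken from the ticketing system or from any SDK.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetItem:
    """One resolved ticket turned into a test case (hypothetical schema)."""
    ticket_id: str
    category: str          # assumed label, e.g. produced by an AI clustering step
    customer_message: str  # input the agent will receive
    sent_reply: str        # what the support team actually answered


def group_into_datasets(items):
    """Build one dataset (a list of items) per category."""
    datasets = defaultdict(list)
    for item in items:
        datasets[item.category].append(item)
    return dict(datasets)


def select_datasets(datasets, categories=None):
    """Return every dataset, or only the named subset for a targeted run."""
    if categories is None:
        return datasets
    return {name: datasets[name] for name in categories if name in datasets}


# Placeholder tickets; real items would come from the ticketing system.
tickets = [
    DatasetItem("t1", "billing", "Why was I charged twice?", "<resolved reply>"),
    DatasetItem("t2", "troubleshooting", "The export fails.", "<resolved reply>"),
    DatasetItem("t3", "billing", "Can I get an invoice?", "<resolved reply>"),
]

datasets = group_into_datasets(tickets)
subset = select_datasets(datasets, ["troubleshooting"])
```

Keeping everything in one big dataset corresponds to skipping `group_into_datasets` and filtering on `category` only when a per-use-case view is needed.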
We now iterate on the prompt, running each version against these datasets, until the drafts reach the bar where the support team would actually find them useful. Each run is graded with:
Notice that we are not judging the tone of voice automatically here. This is on purpose: a few manual checks are enough to make sure the drafts sound acceptable, and we will use phase 2 to get the tone of voice perfectly aligned.
When the evaluator scores are good enough and the manual review looks good, the agent moves into the support ticketing system.
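The iterate-and-gate loop described above could be sketched roughly as follows. Everything here is an assumption for illustration: the two toy "prompt versions" are plain functions standing in for the agent, `mentions_expected_phrase` is a placeholder evaluator rather than one of the graders used in this example, and the `0.8` threshold is arbitrary. The manual review step is not modeled.

```python
from statistics import mean


def run_experiment(agent, dataset, evaluators):
    """Run one agent version over a dataset; return the mean score per evaluator."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    scores = {name: [] for name in evaluators}
    for item in dataset:
        draft = agent(item["input"])
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(draft, item))
    return {name: mean(values) for name, values in scores.items()}


def passes_gate(results, threshold):
    """The automated part of the gate: every evaluator must reach the threshold."""
    return all(score >= threshold for score in results.values())


# Placeholder evaluator: 1.0 if the draft mentions the phrase we expect.
def mentions_expected_phrase(draft, item):
    return 1.0 if item["expected_phrase"] in draft.lower() else 0.0


# Toy stand-ins for two prompt versions of the agent.
def agent_v1(message):
    return "Thanks for reaching out."


def agent_v2(message):
    if "export" in message.lower():
        return "You can export your data from the settings page."
    return "Please reset your password from the login page."


dataset = [
    {"input": "How do I export my data?", "expected_phrase": "export"},
    {"input": "I can't log in.", "expected_phrase": "password"},
]
evaluators = {"mentions_expected_phrase": mentions_expected_phrase}

results_v1 = run_experiment(agent_v1, dataset, evaluators)
results_v2 = run_experiment(agent_v2, dataset, evaluators)
ready_v1 = passes_gate(results_v1, threshold=0.8)
ready_v2 = passes_gate(results_v2, threshold=0.8)
```

In this toy run only the second version clears the automated gate; in practice the manual review of a few drafts would still follow.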
Before moving on to the next section, this is what a trace could look like:
Phase 2: internal rollout
In this phase, the agent drafts a reply for every incoming ticket, and a support team member can use it and edit it before sending. We can use the difference between the agent's draft and what the team member actually sends as a very reliable user signal.
Another, more explicit user signal is a thumbs up or thumbs down button on the draft, where the team can also leave written feedback.
Both signals are captured as scores on every draft trace:
Traces with a bad score can then be looked at, improved, and added to the datasets if they weren't covered before. One way to do this is using the error analysis process.
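As an illustration of the two signals from this phase, the sketch below turns a draft and the reply that was actually sent into score records. It uses `difflib.SequenceMatcher` from the Python standard library as one possible way to measure the difference; the score names `edit_ratio` and `thumbs` and the record layout are assumptions, and attaching the records to a trace depends on the tracing SDK and is not shown.

```python
from difflib import SequenceMatcher


def edit_ratio(draft, sent):
    """0.0 if the draft was sent unchanged, approaching 1.0 if fully rewritten.

    Character-level comparison; a word-level diff is an equally valid choice.
    """
    return 1.0 - SequenceMatcher(None, draft, sent).ratio()


def draft_scores(draft, sent, thumbs_up=None, comment=None):
    """Build score records for one draft (hypothetical record layout)."""
    scores = [{"name": "edit_ratio", "value": edit_ratio(draft, sent)}]
    if thumbs_up is not None:  # the button is optional, so it may be missing
        scores.append(
            {"name": "thumbs", "value": 1 if thumbs_up else 0, "comment": comment}
        )
    return scores


unchanged = draft_scores("Please restart the app.", "Please restart the app.")
edited = draft_scores(
    "Please restart the app.",
    "Please update the app, then restart it.",
    thumbs_up=False,
    comment="Missed the required update.",
)
```

The implicit signal is always present, while the explicit one only appears when a team member clicks the button, which is why the sketch treats `thumbs_up` as optional.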
One iteration through the loop then looks like this: monitoring shows a spike of corrected drafts in the troubleshooting category. The traces reveal the agent keeps citing an outdated export dialog, because the help-article index is stale. The index gets refreshed, the troubleshooting dataset is rerun to confirm the fix, and the share of corrected drafts drops.
The share of edited drafts should go down over time. Once drafts go out mostly unchanged across all ticket categories, the agent is ready to face customers.
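The quantitative half of this gate could be tracked with something like the sketch below. The record layout, the `edit_ratio` field, both thresholds, and the sample numbers are assumptions chosen for illustration; the qualitative feedback from the support team is not modeled.

```python
from collections import defaultdict


def edited_share_by_category(drafts, edit_threshold=0.1):
    """Share of drafts per category whose edit ratio exceeds the threshold."""
    counts = defaultdict(lambda: [0, 0])  # category -> [edited, total]
    for draft in drafts:
        counts[draft["category"]][1] += 1
        if draft["edit_ratio"] > edit_threshold:
            counts[draft["category"]][0] += 1
    return {category: edited / total for category, (edited, total) in counts.items()}


def cleared_categories(shares, max_edited_share=0.2):
    """Categories where drafts go out mostly unchanged (assumed cut-off)."""
    return {category for category, share in shares.items() if share <= max_edited_share}


# Made-up sample records, not measured data.
drafts = [
    {"category": "billing", "edit_ratio": 0.0},
    {"category": "billing", "edit_ratio": 0.05},
    {"category": "billing", "edit_ratio": 0.0},
    {"category": "billing", "edit_ratio": 0.02},
    {"category": "troubleshooting", "edit_ratio": 0.4},
    {"category": "troubleshooting", "edit_ratio": 0.0},
    {"category": "troubleshooting", "edit_ratio": 0.6},
    {"category": "troubleshooting", "edit_ratio": 0.05},
]

shares = edited_share_by_category(drafts)
cleared = cleared_categories(shares)
```

Computing the share per category rather than overall keeps a weak category from hiding behind a strong one.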
Phase 3: customer-facing rollout
The bot now answers customers directly, and the signal that drove phase 2 disappears: nobody edits the reply before the customer sees it. Instead, we introduce a couple of new implicit user signals that will help us learn and improve over time.
Instead of going full-auto immediately, you could also automate only the request categories that cleared the phase 2 gate, while the rest keeps going through the support team for a little longer.
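A partial rollout like this comes down to a routing decision per ticket. The following is a minimal sketch; the function name, the return labels, and the example categories are hypothetical.

```python
def route(ticket_category, automated_categories):
    """Send a ticket to the bot only if its category has been cleared.

    Unknown or uncleared categories fall back to the support team.
    """
    return "bot" if ticket_category in automated_categories else "support_team"


# Assumed outcome of the previous phase: only one category is cleared so far.
automated_categories = {"billing"}

billing_route = route("billing", automated_categories)
troubleshooting_route = route("troubleshooting", automated_categories)
unknown_route = route("something-new", automated_categories)
```

Defaulting to the support team means a new, unseen category never reaches customers unreviewed.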
Customers rarely rate their support chat, so the main focus is on gathering implicit user signals:
With no human in the loop anymore, we monitor every reply for reputation-damaging behavior. These evaluators don't block anything: they flag sent replies so the team can follow up with the customer quickly.
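The non-blocking behavior can be sketched as "send first, evaluate afterwards". The evaluators below are keyword placeholders invented for this illustration; real monitors would likely be more capable, for example model-based judges. The `send` and `notify` callables are also assumptions standing in for the chat system and the team's alerting channel.

```python
def send_and_monitor(reply, send, evaluators, notify):
    """Send the reply, then flag it for follow-up if any evaluator fires."""
    send(reply)  # the reply always goes out; evaluators never block it
    flags = [name for name, check in evaluators.items() if check(reply)]
    if flags:
        notify(reply, flags)  # lets the team follow up with the customer quickly
    return flags


# Placeholder evaluators (hypothetical): each returns True when it wants a flag.
evaluators = {
    "promises_refund": lambda reply: "refund" in reply.lower(),
    "mentions_competitor": lambda reply: "competitor" in reply.lower(),
}

sent, alerts = [], []
flags_ok = send_and_monitor(
    "You can change this under Settings.",
    sent.append,
    evaluators,
    lambda reply, flags: alerts.append((reply, flags)),
)
flags_bad = send_and_monitor(
    "We will refund your full subscription.",
    sent.append,
    evaluators,
    lambda reply, flags: alerts.append((reply, flags)),
)
```

Both replies end up in `sent`; only the second one produces an alert.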
With these signals, the team has a solid setup to continuously improve the agent: bad conversations get surfaced, get fixed, and become dataset items that every future version is systematically tested against. The team can also use this setup to safely try out a newer, cheaper, or faster model, and make an informed decision on whether to deploy it.
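Trying out a different model against the same datasets could end in a comparison like the one sketched here. The per-dataset scores are made-up placeholders, not results, and the `tolerance` value and function name are assumptions.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Datasets where the candidate scores worse than the baseline.

    Both arguments map dataset name -> mean evaluator score for one model.
    A drop within `tolerance` is treated as noise.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }


# Placeholder numbers for illustration only.
baseline_scores = {"billing": 0.92, "troubleshooting": 0.90}
candidate_scores = {"billing": 0.93, "troubleshooting": 0.80}

regressions = find_regressions(baseline_scores, candidate_scores)
safe_to_deploy = not regressions
```

A per-dataset view like this shows where a cheaper or faster model falls short, which supports deploying it for some categories only or not at all.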
Conclusion
Customer-facing automations often don't launch because the risk is high. This example outlines a pragmatic way to launch them anyway: roll out in stages and embrace continuous learning and improvement.
Check out the other examples or the academy to learn more.