Customer support chatbot

This is an example to illustrate the concepts of the Langfuse Academy.

Context

A SaaS company wants to automate its customer support chat. Today a support team answers every conversation, and years of resolved tickets sit in the ticketing system. The bot would speak to customers in the company's name, so a bad answer does damage beyond one conversation: customers lose trust in support as a whole.

That risk rules out shipping straight to customers. Instead, the rollout happens in three phases, each one earning the confidence for the next:

Phase	What's live	Gate to the next phase
1. Offline	Nothing, the agent is in development	Drafts good enough that the support team wants to use them
2. Internal	The agent drafts replies, the support team edits and sends	The difference between drafted and sent replies stays small across ticket categories, and the qualitative feedback from the support team is good
3. Customer-facing	The bot answers customers directly	Nothing, this is the end goal: the agent keeps being improved continuously in this phase

This staged approach limits the risk while still giving us production signals early.

Phase 1: creating the initial agent version

For the support team to actually use the internal agent, the initial version already has to be useful to them. To get there, we can lean on the historical data from resolved tickets that we already have; no synthetic data generation is needed.

Deploy
Trace: not live yet
Monitor: not live yet
Build datasets: from resolved tickets
Experiment: prompt and retrieval variants
Evaluate: compare against the human reply

We can use AI to find the patterns in the historical data from the support ticketing system, and then group the tickets into a few datasets by recurring behavior:

Datasets

Splitting into multiple datasets vs keeping it in one is a trade-off. Grouping allows you to measure performance per use case and lets you run a subset when needed. One big dataset can make sense too, especially early on when the optimal split isn't clear yet.
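As a rough illustration of the grouping step, the sketch below turns resolved tickets into one dataset per category. It assumes each ticket has already been assigned a category (for example by the AI pass described above). The field names `category`, `customer_message` and `agent_reply` are hypothetical, and the tickets themselves are made up for the example.

```python
from collections import defaultdict


def build_datasets(tickets):
    """Group resolved tickets into one dataset per category.

    Each dataset item pairs the customer's message (input) with the
    reply the support team actually sent (expected output).
    """
    datasets = defaultdict(list)
    for ticket in tickets:
        datasets[ticket["category"]].append(
            {
                "input": ticket["customer_message"],
                "expected_output": ticket["agent_reply"],
            }
        )
    return dict(datasets)


# Made-up tickets, only to show the shape of the data.
tickets = [
    {
        "category": "account-changes",
        "customer_message": "I can't add my colleague to our workspace.",
        "agent_reply": "All seats on your plan are in use.",
    },
    {
        "category": "troubleshooting",
        "customer_message": "The export dialog never opens.",
        "agent_reply": "Please update to the latest version and try again.",
    },
    {
        "category": "account-changes",
        "customer_message": "How do I change the workspace owner?",
        "agent_reply": "You can do this under Settings, Members.",
    },
]

datasets = build_datasets(tickets)
for name, items in datasets.items():
    print(name, len(items))
```

Keeping everything in one big dataset would simply mean skipping the grouping and storing the category as a field on each item, so the split can still be made later.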

We now iterate on the prompt, running each version against these datasets, until the drafts reach the bar where the support team would actually find them useful. Each run is graded with:

Evaluators

manual review: Vibe check on a sample of drafts

Notice that we are not judging the tone of voice automatically here. This is on purpose: a few manual checks are enough to make sure the drafts sound acceptable, and we will use phase 2 to get the tone of voice perfectly aligned.

When the evaluator scores are good enough and the manual review looks good, the agent moves into the support ticketing system.
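The iteration loop in this phase can be sketched as a small experiment runner: every agent version is run over the same dataset and reduced to one mean score, so versions can be compared. Everything in this sketch is a stand-in: `draft_v1` and `draft_v2` replace the real agent, and `mentions_same_fix` is a deliberately crude, hypothetical evaluator that only illustrates the interface.

```python
from statistics import mean


def run_experiment(dataset, generate_draft, evaluate):
    """Run one agent version over a dataset and return its mean score."""
    scores = []
    for item in dataset:
        draft = generate_draft(item["input"])
        scores.append(evaluate(draft, item["expected_output"]))
    return mean(scores)


# Stand-ins for two prompt versions of the agent (hypothetical).
def draft_v1(ticket_text):
    return "Please contact support."


def draft_v2(ticket_text):
    return "You can add seats under Billing."


# Hypothetical evaluator: does the draft point to the same place
# in the product as the human reply did?
def mentions_same_fix(draft, human_reply):
    both_mention_billing = "billing" in draft.lower() and "billing" in human_reply.lower()
    return 1.0 if both_mention_billing else 0.0


dataset = [
    {"input": "Seat limit reached.", "expected_output": "Add seats under Billing."},
]

versions = {"v1": draft_v1, "v2": draft_v2}
results = {
    name: run_experiment(dataset, generate, mentions_same_fix)
    for name, generate in versions.items()
}
print(results)
```

In practice the evaluator would be whatever the team trusts for its use case. The point of the sketch is only that each version is scored on identical items.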

Before moving on to the next section, this is what a trace could look like:

Trace: draft-support-reply · dataset item: account-changes · 3.1s
  input: "I can't add my colleague to our workspace, the invite says 'seat limit reached'."
  output: "Hi Sam, all 10 seats on your plan are in use. You can free one up by deactivating a former member under Settings → Members, or add seats under Billing."

  Retriever: find-similar-tickets · 0.5s
    input: the ticket text
    output: 3 resolved tickets: T-3107, T-2954, T-2381 (seat limit, adding members)

  Retriever: search-help-articles · 0.4s
    input: "seat limit, add seats"
    output: 'Managing seats' · 'Billing settings'

  Tool: fetch-account-details · 0.3s
    input: workspace W-2209
    output: plan: Team · seats: 10 of 10 in use

  Generation: draft-reply · gpt-4.1 · 1.4k tok · $0.01 · 1.9s
    input: the ticket text · account details · 2 help articles · 3 resolved tickets
    output: "Hi Sam, all 10 seats on your plan are in use. You can free one up by deactivating a former member under Settings → Members, or add seats under Billing."

Phase 2: internal rollout

In this phase, the agent drafts a reply for every incoming ticket, and a support team member can use it and edit it before sending. We can use the difference between the agent's draft and what the team member actually sends as a very reliable user signal.

Another, more explicit user signal is a thumbs up or thumbs down button on the draft, where the team can also leave written feedback.
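The implicit signal, the difference between the draft and the sent reply, could be reduced to a categorical value per draft. The sketch below is one possible way to do that with word-level similarity from Python's standard library. The label names (`unchanged`, `light_edit`, `heavy_edit`) and the 0.8 threshold are assumptions chosen for illustration, not values from this example.

```python
from difflib import SequenceMatcher


def classify_edit(draft, sent, light_threshold=0.8):
    """Classify how much the support team changed a draft before sending.

    Compares the two texts word by word. The labels and the threshold
    are illustrative assumptions.
    """
    if draft.strip() == sent.strip():
        return "unchanged"
    ratio = SequenceMatcher(None, draft.split(), sent.split()).ratio()
    return "light_edit" if ratio >= light_threshold else "heavy_edit"


draft = "Hi Sam, all 10 seats on your plan are in use."
print(classify_edit(draft, draft))
print(classify_edit(draft, "Hi Sam, all 10 seats on your plan are taken."))
print(classify_edit(draft, "Thanks for reaching out, we are looking into it."))
```

A word-level ratio is a blunt instrument: a one-word change can flip the meaning of a reply. It is best read as a trigger for looking at a trace, not as a verdict on the draft.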

Deploy
Trace: every draft is traced
Monitor: edit_type on every trace
Build datasets: heavily edited drafts become new items
Experiment: fixes for recurring edit patterns
Evaluate: rerun the datasets before shipping

Both signals are captured as scores on every draft trace:

Evaluators

Traces with a bad score can then be looked at, improved, and added to the datasets if they weren't covered before. One way to do this is using the error analysis process.

One iteration through the loop then looks like this: monitoring shows a spike of corrected drafts in the troubleshooting category. The traces reveal the agent keeps citing an outdated export dialog, because the help-article index is stale. The index gets refreshed, the troubleshooting dataset is rerun to confirm the fix, and the share of corrected drafts drops.

The share of edited drafts should go down over time. Once drafts go out mostly unchanged across all ticket categories, the agent is ready to face customers.
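The gate to phase 3 can be made concrete as a per-category check. The sketch below assumes each trace carries a `category` and an `edit_type` score, that the value `unchanged` marks drafts sent as-is, and that "mostly unchanged" means at most 20% edited. The value name and the threshold are assumptions, and the traces are made up.

```python
from collections import Counter


def edited_share_by_category(traces):
    """Return the share of drafts that were edited, per ticket category."""
    total = Counter()
    edited = Counter()
    for trace in traces:
        total[trace["category"]] += 1
        if trace["edit_type"] != "unchanged":
            edited[trace["category"]] += 1
    return {category: edited[category] / total[category] for category in total}


def ready_for_customers(shares, max_edited_share=0.2):
    """The gate passes only if every category is at or below the threshold."""
    return all(share <= max_edited_share for share in shares.values())


# Made-up traces: 1 of 5 edited in one category, 1 of 2 in the other.
traces = (
    [{"category": "account-changes", "edit_type": "unchanged"}] * 4
    + [{"category": "account-changes", "edit_type": "heavy_edit"}]
    + [{"category": "troubleshooting", "edit_type": "unchanged"}]
    + [{"category": "troubleshooting", "edit_type": "heavy_edit"}]
)

shares = edited_share_by_category(traces)
print(shares)
print(ready_for_customers(shares))
```

Checking each category separately matters here: a good overall average can hide one category where most drafts still get rewritten.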

Phase 3: customer-facing rollout

The bot now answers customers directly, and the signal that drove phase 2 disappears: nobody edits the reply before the customer sees it. Instead, we introduce a couple of new implicit user signals that will help us learn and improve over time.

Instead of going full-auto immediately, you could also automate only the request categories that cleared the phase 2 gate, while the rest keeps going through the support team for a little longer.
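A partial rollout like this comes down to a routing decision per ticket. A minimal sketch, assuming tickets are categorized before routing; the set of automated categories and the route names are hypothetical:

```python
# Categories that cleared the phase 2 gate (hypothetical selection).
AUTOMATED_CATEGORIES = {"account-changes"}


def route(ticket_category):
    """Send cleared categories to the bot, everything else to the team."""
    if ticket_category in AUTOMATED_CATEGORIES:
        return "bot"
    return "support_team"


print(route("account-changes"))
print(route("troubleshooting"))
```

New categories can then be added to the set one at a time as they clear the gate.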

Deploy
Trace: every customer conversation
Monitor: user signals and risk monitoring
Build datasets: bad conversations become new items
Experiment: fixes for what monitoring surfaces
Evaluate: rerun the datasets before shipping

Customers rarely rate their support chat, so the main focus is on gathering implicit user signals:

Evaluators

With no human in the loop anymore, we monitor every reply for reputation-damaging behavior. These evaluators don't block anything: they flag sent replies so the team can follow up with the customer quickly.
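The "flag, don't block" behavior can be sketched as follows: the reply is sent first, and the checks run afterwards to produce flags for the team. The check names and their keyword logic are hypothetical placeholders; real risk evaluators would be considerably more robust than a substring match.

```python
def send_and_flag(reply, send, checks):
    """Send the reply unconditionally, then return the names of all
    checks that flagged it so the team can follow up."""
    send(reply)  # sending never depends on the checks
    return [name for name, check in checks.items() if check(reply)]


# Hypothetical checks, only to illustrate the interface.
checks = {
    "mentions_refund": lambda reply: "refund" in reply.lower(),
    "mentions_legal": lambda reply: "lawsuit" in reply.lower(),
}

sent = []  # stand-in for the real delivery channel
flags = send_and_flag("We will refund your last invoice.", sent.append, checks)
print(sent)
print(flags)
```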

Evaluators

With these signals, we have a good setup to continuously improve the agent: bad conversations get surfaced, fixed, and turned into dataset items that every later version is tested against. The team can also use this setup to safely try out a newer, cheaper, or faster model and make an informed decision on whether to deploy it.

Conclusion

Customer-facing automations often don't launch because the risk is high. This example outlines a pragmatic way to launch one anyway: roll out in stages and keep learning and improving at every step.

Check out the other examples or the academy to learn more.
