Music streaming DJ
This is an example to illustrate the concepts of the Langfuse Academy.
Context
A music streaming app adds a DJ feature in beta. The DJ keeps a queue going based on what the user has played before and what the algorithm thinks the user will like. Every few songs it says a few words about what's coming up and why it chose it. Listeners can also steer the DJ by clicking a microphone button and talking to it.
Two characteristics drive how to approach the AI engineering setup:
- The risk of a bad output is low. The worst case is a skipped song or an awkward DJ comment, and the feature is labeled beta.
- This is a new feature, so there is no historical data to start from, and the team will need to learn from live listener behavior.
Because of this, it makes sense to go live as soon as possible and iterate based on live feedback from the beginning.
Tracing the DJ
There are two different kinds of traces that will together form a session:
| Trace name | Details |
|---|---|
| plan-next-set | Plans the next set of tracks and writes the commentary, triggered by the DJ itself every few songs or by a DJ request. Input: listening context and any instructions. Output: next tracks queued, plus the commentary line. |
| handle-dj-request | Handles a voice request from the listener, triggered by the microphone button. Input: voice clip and the current queue. Output: a short reply, plus instructions for the plan-next-set run it triggers. |
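To make the relationship between the two trace types concrete, here is a minimal sketch in plain Python. It does not use the Langfuse SDK: the `Trace` dataclass, the function names, and the placeholder outputs are all assumptions made for illustration. The point is only that both trace kinds share a session ID, and that a DJ request produces its own trace plus a follow-up plan-next-set trace carrying the instructions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional
import uuid


@dataclass
class Trace:
    """Hypothetical stand-in for a trace record; not the Langfuse SDK type."""
    name: str                 # "plan-next-set" or "handle-dj-request"
    session_id: str           # groups all traces of one listening session
    input: dict[str, Any]
    output: dict[str, Any]
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def plan_next_set(session_id: str, listening_context: dict[str, Any],
                  instructions: Optional[str] = None) -> Trace:
    # Placeholder output; a real planner would call the recommender and an LLM.
    output = {
        "queued_tracks": ["track-a", "track-b"],
        "commentary": "Two mellow ones coming up.",
    }
    trace_input = {"listening_context": listening_context,
                   "instructions": instructions}
    return Trace("plan-next-set", session_id, trace_input, output)


def handle_dj_request(session_id: str, voice_clip_ref: str,
                      current_queue: list[str]) -> list[Trace]:
    # Placeholder output; a real handler would transcribe and interpret the clip.
    output = {"reply": "Sure, switching it up.", "instructions": "more upbeat"}
    request_trace = Trace(
        "handle-dj-request", session_id,
        {"voice_clip": voice_clip_ref, "current_queue": current_queue},
        output,
    )
    # The request triggers a new plan-next-set run with the derived instructions.
    follow_up = plan_next_set(session_id, {"current_queue": current_queue},
                              output["instructions"])
    return [request_trace, follow_up]
```

In this sketch, a session is simply every trace that carries the same `session_id`, ordered by when it was created.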
A listening session and both trace types up close:
Capturing user behavior
To learn from users, the team logs relevant listener behavior as scores on the applicable traces, for example whether a track was skipped.
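As a rough sketch of what such signals could look like, the snippet below records a skip as a numeric score and the type of a DJ request as a categorical score, each attached to a trace ID. The `Score` dataclass, the 30-second skip threshold, and the example `message_type` values are assumptions for illustration, not part of the original setup or of any SDK.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Score:
    """Hypothetical score record attached to one trace."""
    trace_id: str
    name: str
    value: Union[float, str]


def score_skip(trace_id: str, seconds_played: float,
               threshold_seconds: float = 30.0) -> Score:
    # Assumption: a track counts as skipped if it played for less than
    # threshold_seconds. 1.0 means skipped, 0.0 means listened.
    skipped = seconds_played < threshold_seconds
    return Score(trace_id, "skipped", 1.0 if skipped else 0.0)


def score_message_type(trace_id: str, message_type: str) -> Score:
    # Categorical score on a handle-dj-request trace.
    return Score(trace_id, "message_type", message_type)


def message_type_distribution(scores: list[Score]) -> Counter:
    # Counts how often each message type occurs across the given scores.
    return Counter(s.value for s in scores if s.name == "message_type")
```

For example, `score_skip("trace-1", seconds_played=5.0)` yields a `skipped` score of `1.0`, which can later be averaged into a skip rate.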
In principle, the team could already iterate with only this in place. Tracing and monitoring form a small loop of their own. In the beginning, this is probably enough for the team to quickly improve the DJ feature.
As the setup matures and the team wants more structured testing, they can start building datasets and experiments on top of these signals.
Structured testing
With only the live signals, testing a change means shipping it and watching the scores. The team can add two more deliberate ways of testing: experiments on datasets, and A/B tests on live users.
Experiments on datasets
In this use case, testing end to end is very hard offline: whether a session was good only shows in live listening behavior, and taste differs per user, so no expected output holds for everyone. A single step like select-tracks can be tested, with the expectation describing the direction of the set rather than exact tracks:
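One possible shape for such a test is sketched below, assuming a dataset item stores the listening context as input and a "direction" (allowed genres and a tempo range) as the expectation. The `Track` and `DatasetItem` types, the field names, and the scoring rule are all hypothetical. The evaluator returns the fraction of selected tracks that fit the direction, so it never requires exact track matches.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Track:
    title: str
    genre: str
    bpm: int


@dataclass(frozen=True)
class DatasetItem:
    """Hypothetical dataset item for the select-tracks step."""
    input: dict[str, Any]               # listening context and instructions
    expected_direction: dict[str, Any]  # e.g. {"genres": {...}, "bpm_range": (lo, hi)}


def direction_score(tracks: list[Track], expected: dict[str, Any]) -> float:
    """Fraction of tracks whose genre and tempo match the expected direction."""
    if not tracks:
        return 0.0
    low, high = expected["bpm_range"]
    hits = sum(
        1 for t in tracks
        if t.genre in expected["genres"] and low <= t.bpm <= high
    )
    return hits / len(tracks)


item = DatasetItem(
    input={"instructions": "something calmer for working"},
    expected_direction={"genres": {"ambient", "lo-fi"}, "bpm_range": (60, 100)},
)
```

A team could then run select-tracks on each item's `input` and require, say, a minimum average `direction_score` before shipping a change. The threshold itself is a judgment call.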
A/B tests on live users
Some changes are hard to grade offline, like the tone of voice of the commentary. Since the risk is low, the team can give a small group of listeners the new version and compare the signal scores between the groups, like the skip rate and the message_type distribution. If the new version does better, they can roll it out to everyone.
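The comparison could be sketched as follows. The hash-based assignment, the 10% rollout default, and the variant names are assumptions chosen for illustration. Hashing the user ID keeps each listener in the same group across sessions without storing any assignment state.

```python
import hashlib


def assign_variant(user_id: str, rollout_percent: int = 10) -> str:
    """Deterministically place a user in the "new" or "control" group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "new" if bucket < rollout_percent else "control"


def skip_rate(skipped_flags: list[float]) -> float:
    """Mean of the skip scores (1.0 = skipped); 0.0 if there are none."""
    if not skipped_flags:
        return 0.0
    return sum(skipped_flags) / len(skipped_flags)


def compare_skip_rates(scores: list[tuple[str, float]],
                       rollout_percent: int = 10) -> dict[str, float]:
    """Group (user_id, skipped) pairs by variant and return each skip rate."""
    by_variant: dict[str, list[float]] = {"control": [], "new": []}
    for user_id, skipped in scores:
        by_variant[assign_variant(user_id, rollout_percent)].append(skipped)
    return {variant: skip_rate(flags) for variant, flags in by_variant.items()}
```

With a small group, differences in skip rate can easily be noise, so it is worth collecting enough sessions, or applying a significance test, before rolling out.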
With this in place, the full loop is running:
Conclusion
This is an example of a feature that is low risk and has no historical data to learn from. The best thing you can do in such cases is to get traces in as soon as possible and start monitoring them. Everything else (datasets, experiments, A/B tests) can be built on top of that over time.
Check out the other examples or the academy to learn more.