Music streaming DJ

This is an example to illustrate the concepts of the Langfuse Academy.

Context

A music streaming app adds a DJ feature in beta. The DJ keeps a queue going based on what the user has played before and what the algorithm thinks the user will like. Every few songs it says a few words about what's coming up and why it chose it. Listeners can also steer the DJ by clicking a microphone button and talking to it.

[Figure: three app screens. (1) Home: "Good evening, Alex", with a "Your DJ" card ("Mixing songs for your evening. It picks the tracks, talks you through them, and listens when you talk back."), a "Start DJ session" button, and "Recently played" and "Made for you" rows. (2) Now playing: "Slow Tide" by Marlowe, with a "DJ · why this song" note explaining the pick (a deep cut after an album the listener ran on repeat last week). (3) Talk to DJ: the DJ is listening to "Play something with a bit more energy", with suggested phrases "Who is this?", "Skip this one", and "More like this".]

Two characteristics drive how to approach the AI engineering setup:

The risk of a bad output is low. The worst case is a skipped song or an awkward DJ comment, and the feature is labeled beta.
This is a new feature, so there is no historical data to start from, and the team will need to learn from live listener behavior.

Because of this, it makes sense to go live as soon as possible and iterate based on live feedback from the beginning.

Tracing the DJ

There are two different kinds of traces that will together form a session:

Trace name	Details
`plan-next-set`	Plans the next set of tracks and writes the commentary. Triggered by the DJ itself every few songs or by a DJ request. Input: listening context and any instructions. Output: next tracks queued, plus the commentary line.
`handle-dj-request`	Handles a voice request from the listener. Triggered by the microphone button. Input: voice clip and the current queue. Output: a short reply, plus instructions for the `plan-next-set` run it triggers.

A listening session and both trace types up close:

Session listen_7f3e · user u_8841

Trace plan-next-set · "Kicking off with two favorites from your week." · 2.8s
Trace plan-next-set · "Staying in this lane with some mellow electronica." · 2.4s
Trace handle-dj-request · "Play something calmer." · 1.6s
Trace plan-next-set · "Calming it down, here is some ambient piano." · triggered by dj-request · 1.5s

Trace plan-next-set · session: listen_7f3e · user: u_8841 · 1.5s
input: taste: electronica, downtempo · recent: Tycho "Awake", Bonobo "Kerala" · instruction: "calmer"
output: 4 tracks queued · commentary: "Calming it down, here is some ambient piano."

  Tool select-tracks · 0.6s
  input: taste: electronica, downtempo · recent: Tycho "Awake", Bonobo "Kerala" · instruction: "calmer"
  output: Nils Frahm "Says" · Ólafur Arnalds "Saman" · Max Richter "On the Nature of Daylight" · Joep Beving "Sleeping Lotus"

  Gen write-commentary · gpt-4.1-mini · 120 tok · $0.0002 · 0.9s
  input: the 4 selected tracks · instruction: "calmer"
  output: "Calming it down, here is some ambient piano."

Trace handle-dj-request · session: listen_7f3e · user: u_8841 · 1.6s
input: voice clip (2s)
output: reply: "Got it, calming things down." · instruction: "calmer"

  Tool transcribe-request · 0.4s
  input: voice clip (2s)
  output: "Play something calmer."

  Gen interpret-request · gpt-4.1-mini · 210 tok · $0.0004 · 1.2s
  input: "Play something calmer." · current queue
  output: reply: "Got it, calming things down." · instruction: "calmer"

  Event trigger-plan-next-set @ 1.6s
  input: instruction: "calmer"
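The session structure above can be sketched in plain Python. This is only an illustration of how the two trace types and their nested steps relate; it uses a hypothetical in-memory `Trace` and `Observation` model rather than any real tracing SDK, and the track selection, commentary, and request interpretation are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Observation:
    kind: str  # "tool", "generation", or "event"
    name: str
    input: Any = None
    output: Any = None


@dataclass
class Trace:
    name: str
    session_id: str
    user_id: str
    input: Any = None
    output: Any = None
    triggered_by: Optional[str] = None
    observations: List[Observation] = field(default_factory=list)


def plan_next_set(session_id: str, user_id: str, context: dict,
                  instruction: Optional[str] = None,
                  triggered_by: Optional[str] = None) -> Trace:
    """Build one plan-next-set trace with its two nested steps."""
    trace = Trace("plan-next-set", session_id, user_id,
                  input={"context": context, "instruction": instruction},
                  triggered_by=triggered_by)
    tracks = ["track-a", "track-b"]  # placeholder for the real selection step
    trace.observations.append(
        Observation("tool", "select-tracks", trace.input, tracks))
    commentary = f"Here are {len(tracks)} tracks for you."  # placeholder for the LLM call
    trace.observations.append(
        Observation("generation", "write-commentary",
                    {"tracks": tracks, "instruction": instruction}, commentary))
    trace.output = {"tracks": tracks, "commentary": commentary}
    return trace


def handle_dj_request(session_id: str, user_id: str, transcript: str,
                      context: dict) -> List[Trace]:
    """Build a handle-dj-request trace and the plan-next-set trace it triggers."""
    trace = Trace("handle-dj-request", session_id, user_id, input="voice clip")
    trace.observations.append(
        Observation("tool", "transcribe-request", "voice clip", transcript))
    # Placeholder interpretation: pass the transcript through as the instruction.
    interpreted = {"reply": "Got it.", "instruction": transcript}
    trace.observations.append(
        Observation("generation", "interpret-request", transcript, interpreted))
    trace.observations.append(
        Observation("event", "trigger-plan-next-set",
                    {"instruction": interpreted["instruction"]}))
    trace.output = interpreted
    follow_up = plan_next_set(session_id, user_id, context,
                              instruction=interpreted["instruction"],
                              triggered_by="dj-request")
    return [trace, follow_up]


# All traces share one session_id, which is what groups them into a session.
session = [plan_next_set("listen_7f3e", "u_8841", {"taste": ["electronica"]})]
session += handle_dj_request("listen_7f3e", "u_8841",
                             "Play something calmer.", {"taste": ["electronica"]})
```

In a real setup the tracing SDK would record these objects; the point of the sketch is only that a voice request produces two traces in the same session, with the second one marked as triggered by the first.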

Capturing user behavior

To learn from listeners, the team logs behavior signals as scores on the applicable traces: skips, dj_replaced, and message_type.
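As a rough sketch, behavior signals could be turned into score records attached to a trace ID. The exact definition of each signal is not given here, so the code assumes that `skips` is a count of skipped tracks per set, `dj_replaced` is a yes/no flag for the listener replacing the DJ's queue, and `message_type` is a category label for a voice request. The `Score` class and both helper functions are hypothetical and not part of any SDK.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Score:
    trace_id: str
    name: str
    value: Union[float, str]  # numeric/boolean scores as float, categories as str


def scores_for_set(trace_id: str, skipped_tracks: int,
                   listener_replaced_queue: bool) -> List[Score]:
    """Scores for one plan-next-set trace (assumed meanings, see lead-in)."""
    return [
        Score(trace_id, "skips", float(skipped_tracks)),
        Score(trace_id, "dj_replaced", 1.0 if listener_replaced_queue else 0.0),
    ]


def score_for_request(trace_id: str, message_type: str) -> Score:
    """Categorical score for one handle-dj-request trace."""
    return Score(trace_id, "message_type", message_type)


# Hypothetical trace IDs and values, for illustration only.
set_scores = scores_for_set("trace-plan-1", skipped_tracks=2,
                            listener_replaced_queue=False)
request_score = score_for_request("trace-request-1", "steer")
```

Keeping every signal keyed by trace ID is what lets the team later go from a bad score straight to the trace that produced it.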


In principle, the team could already iterate with only this in place. Tracing and monitoring form a small loop of their own. In the beginning, this is probably enough for the team to quickly improve the DJ feature.

Deploy
Trace: every set and every request
Monitor: skips, dj_replaced, message_type
Build datasets: not yet
Experiment: not yet
Evaluate: not yet

As the setup matures and the team wants more structured testing in place, they can start building datasets and experiments on top of these signals.

Structured testing

With only the live signals, testing a change means shipping it and watching the scores. The team can add two more deliberate ways of testing: experiments on datasets, and A/B tests on live users.

Experiments on datasets

In this use case, end-to-end testing offline is very hard: whether a session was good only shows in live listening behavior, and taste differs per user, so no single expected output holds for everyone. A single step like select-tracks can still be tested, with the expectation describing the direction of the set rather than exact tracks.
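A direction-based expectation for select-tracks might look like the following sketch. It assumes each track has a numeric "energy" feature between 0 and 1, which is an illustrative assumption and not something the feature is stated to expose. `DatasetItem` and `evaluate_direction` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class DatasetItem:
    recent_energy: List[float]  # energy of the recently played tracks (assumed feature)
    instruction: str            # e.g. "calmer"
    expected_direction: str     # "lower" or "higher"


def evaluate_direction(item: DatasetItem, selected_energy: List[float]) -> bool:
    """Pass if the selected set moves in the expected direction on average."""
    before = mean(item.recent_energy)
    after = mean(selected_energy)
    if item.expected_direction == "lower":
        return after < before
    if item.expected_direction == "higher":
        return after > before
    raise ValueError(f"unknown direction: {item.expected_direction}")


# Hypothetical dataset item: after a "calmer" request, energy should go down.
item = DatasetItem(recent_energy=[0.6, 0.7], instruction="calmer",
                   expected_direction="lower")
calm_set_passes = evaluate_direction(item, [0.2, 0.3])
loud_set_passes = evaluate_direction(item, [0.8, 0.9])
```

The evaluator never names a specific track, so the same dataset item stays valid when the catalog or the selection algorithm changes.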


A/B tests on live users

Some changes are hard to grade offline, like the tone of voice of the commentary. Since the risk is low, the team can give a small group of listeners the new version and compare the signal scores between the groups, like the skip rate and the message_type distribution. If the new version does better, they can roll it out to everyone.
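Comparing the skip rate between the two groups could be sketched as below. The counts are made-up numbers for illustration, and the two-proportion z-statistic is one common way to judge whether a difference is larger than noise; the team may well prefer a different test.

```python
from math import sqrt


def compare_skip_rates(control_skips: int, control_plays: int,
                       treatment_skips: int, treatment_plays: int) -> dict:
    """Compare skip rates of two groups with a two-proportion z-statistic."""
    p_control = control_skips / control_plays
    p_treatment = treatment_skips / treatment_plays
    # Pooled skip rate under the assumption that both groups behave the same.
    pooled = (control_skips + treatment_skips) / (control_plays + treatment_plays)
    std_err = sqrt(pooled * (1 - pooled) * (1 / control_plays + 1 / treatment_plays))
    z = (p_treatment - p_control) / std_err if std_err > 0 else 0.0
    return {
        "control_rate": p_control,
        "treatment_rate": p_treatment,
        "difference": p_treatment - p_control,
        "z": z,
    }


# Hypothetical counts: 1000 plays per group.
result = compare_skip_rates(control_skips=200, control_plays=1000,
                            treatment_skips=150, treatment_plays=1000)
```

A negative difference means the new version is skipped less often. An absolute z-value above roughly 1.96 is the conventional threshold for significance at the 5% level.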

With this in place, the full loop is running:

Deploy
Trace: every set and every request
Monitor: skips, dj_replaced, message_type, A/B comparisons
Build datasets: from what we see while monitoring production
Experiment: selection algorithm, DJ prompt, ...
Evaluate: grade step outputs against their expectations

Conclusion

This is an example of a feature that is low risk and has no historical data to learn from. In such cases, the best approach is to get traces in as soon as possible and start monitoring them. Everything else (datasets, experiments, A/B tests) can be built on top of that over time.

Check out the other examples or the academy to learn more.
