Music streaming DJ

This is an example to illustrate the concepts of the Langfuse Academy.

Context

A music streaming app adds a DJ feature in beta. The DJ keeps a queue going based on what the user has played before and what the algorithm thinks the user will like. Every few songs, it says a few words about what's coming up and why it chose those tracks. Listeners can also steer the DJ by tapping a microphone button and talking to it.

[Figure: three screens of the app. Home screen: a "Your DJ" card ("It picks the tracks, talks you through them, and listens when you talk back.") with a "Start DJ session" button. Now playing screen: "Slow Tide" by Marlowe with the DJ's "why this song" commentary: "A deep cut after that album you ran on repeat last week, easing you in before I pick the pace up." Talk to DJ screen: the DJ listening to "Play something with a bit more energy", with the suggested phrases "Who is this?", "Skip this one", and "More like this".]

Two characteristics drive how to approach the AI engineering setup:

  1. The risk of a bad output is low. The worst case is a skipped song or an awkward DJ comment, and the feature is labeled beta.
  2. This is a new feature, so there is no historical data to start from, and the team will need to learn from live listener behavior.

Because of this, it makes sense to go live as soon as possible and iterate based on live feedback from the beginning.

Tracing the DJ

There are two different kinds of traces that will together form a session:

| Trace name | Details | Input | Output |
| --- | --- | --- | --- |
| plan-next-set | Plans the next set of tracks and writes the commentary. Triggered by the DJ itself every few songs or by a DJ request. | Listening context and any instructions | Next tracks queued, plus the commentary line |
| handle-dj-request | Handles a voice request from the listener. Triggered by the microphone button. | Voice clip and the current queue | A short reply, plus instructions for the plan-next-set run it triggers |

A listening session and both trace types up close:

Session listen_7f3e · user u_8841
  Trace plan-next-set · "Kicking off with two favorites from your week." · 2.8s
  Trace plan-next-set · "Staying in this lane with some mellow electronica." · 2.4s
  Trace handle-dj-request · "Play something calmer." · 1.6s
  Trace plan-next-set · "Calming it down, here is some ambient piano." · triggered by dj-request · 1.5s

Trace plan-next-set · session: listen_7f3e · user: u_8841 · 1.5s
  input: taste: electronica, downtempo · recent: Tycho "Awake", Bonobo "Kerala" · instruction: "calmer"
  output: 4 tracks queued · commentary: "Calming it down, here is some ambient piano."
  Tool select-tracks · 0.6s
    input: taste: electronica, downtempo · recent: Tycho "Awake", Bonobo "Kerala" · instruction: "calmer"
    output: Nils Frahm "Says" · Ólafur Arnalds "Saman" · Max Richter "On the Nature of Daylight" · Joep Beving "Sleeping Lotus"
  Gen write-commentary · gpt-4.1-mini · 120 tok · $0.0002 · 0.9s
    input: the 4 selected tracks · instruction: "calmer"
    output: "Calming it down, here is some ambient piano."

Trace handle-dj-request · session: listen_7f3e · user: u_8841 · 1.6s
  input: voice clip (2s)
  output: reply: "Got it, calming things down." · instruction: "calmer"
  Tool transcribe-request · 0.4s
    input: voice clip (2s)
    output: "Play something calmer."
  Gen interpret-request · gpt-4.1-mini · 210 tok · $0.0004 · 1.2s
    input: "Play something calmer." · current queue
    output: reply: "Got it, calming things down." · instruction: "calmer"
  Event trigger-plan-next-set · @ 1.6s
    input: instruction: "calmer"
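
To make the structure above concrete, here is a minimal sketch in plain Python of how the two trace types and their steps relate. It deliberately does not use the Langfuse SDK: the `Trace` and `Observation` classes and both functions are hypothetical stand-ins, and the names, timings, and texts are copied from the example session.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical model for illustration only, not the Langfuse SDK.
@dataclass
class Observation:
    kind: str  # "tool", "generation", or "event"
    name: str
    duration_s: float = 0.0


@dataclass
class Trace:
    name: str
    session_id: str
    user_id: str
    input: dict
    output: dict
    observations: list = field(default_factory=list)
    triggered_by: Optional[str] = None

    @property
    def duration_s(self) -> float:
        # The steps run one after another here, so the trace time is their sum.
        return round(sum(o.duration_s for o in self.observations), 3)


def handle_dj_request(session_id: str, user_id: str) -> Trace:
    """Voice request: transcribe, interpret, then trigger a new plan."""
    return Trace(
        name="handle-dj-request",
        session_id=session_id,
        user_id=user_id,
        input={"voice_clip": "2s clip"},
        output={"reply": "Got it, calming things down.", "instruction": "calmer"},
        observations=[
            Observation("tool", "transcribe-request", 0.4),
            Observation("generation", "interpret-request", 1.2),
            Observation("event", "trigger-plan-next-set"),
        ],
    )


def plan_next_set(session_id: str, user_id: str,
                  instruction: Optional[str] = None,
                  triggered_by: Optional[str] = None) -> Trace:
    """Planning run: select tracks, then write the commentary."""
    return Trace(
        name="plan-next-set",
        session_id=session_id,
        user_id=user_id,
        input={"taste": ["electronica", "downtempo"], "instruction": instruction},
        output={"tracks_queued": 4,
                "commentary": "Calming it down, here is some ambient piano."},
        observations=[
            Observation("tool", "select-tracks", 0.6),
            Observation("generation", "write-commentary", 0.9),
        ],
        triggered_by=triggered_by,
    )


# Both traces share the session id, which is what groups them into one session.
request = handle_dj_request("listen_7f3e", "u_8841")
followup = plan_next_set("listen_7f3e", "u_8841",
                         instruction=request.output["instruction"],
                         triggered_by=request.name)
session = [request, followup]
```

The point of the sketch is the hand-off: the instruction produced by handle-dj-request becomes part of the input of the plan-next-set run it triggers, and the shared session id ties both traces together.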

Capturing user behavior

To learn from listeners, we log behavioral signals as scores on the relevant traces, such as skips, dj_replaced, and message_type.
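
One possible way to represent these signals is as named scores attached to a trace. The sketch below is plain Python, not the Langfuse SDK. The signal names skips, dj_replaced, and message_type come from this example, but their exact meaning, the trace ids, and the category value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical score record for illustration only, not the Langfuse SDK.
@dataclass(frozen=True)
class Score:
    trace_id: str
    name: str
    value: Union[bool, int, float, str]
    data_type: str  # "BOOLEAN", "NUMERIC", or "CATEGORICAL"


def make_score(trace_id: str, name: str, value) -> Score:
    """Build a score and derive its data type from the Python value."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        data_type = "BOOLEAN"
    elif isinstance(value, (int, float)):
        data_type = "NUMERIC"
    elif isinstance(value, str):
        data_type = "CATEGORICAL"
    else:
        raise TypeError(f"unsupported score value: {value!r}")
    return Score(trace_id, name, value, data_type)


scores = [
    # Assumed meaning: tracks of this set that the listener skipped.
    make_score("trace-plan-4", "skips", 1),
    # Assumed meaning: the listener replaced the DJ's queue with their own pick.
    make_score("trace-plan-4", "dj_replaced", False),
    # Assumed meaning: the kind of voice request; "steer" is an invented category.
    make_score("trace-request-1", "message_type", "steer"),
]
```

Keeping the signals on the trace they belong to (set-level behavior on plan-next-set, request-level behavior on handle-dj-request) makes it possible to filter and aggregate them per trace type later.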

In principle, the team could already iterate with only this in place. Tracing and monitoring form a small loop of their own. In the beginning, this is probably enough for the team to quickly improve the DJ feature.

Deploy
Trace: every set and every request
Monitor: skips, dj_replaced, message_type
Build datasets: not yet
Experiment: not yet
Evaluate: not yet

As the setup matures and the team wants more structured testing, they can start building datasets and experiments on top of these signals.

Structured testing

With only the live signals, testing a change means shipping it and watching the scores. The team can add two more deliberate ways of testing: experiments on datasets, and A/B tests on live users.

Experiments on datasets

In this use case, testing end to end offline is very hard: whether a session was good only shows in live listening behavior, and taste differs per user, so no single expected output holds for everyone. A single step like select-tracks can still be tested offline, with the expectation describing the direction of the set (for example, calmer than what was just played) rather than exact tracks.
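
A directional expectation of this kind could be sketched as follows. Everything here is an assumption for illustration: the catalog, its energy values, the `DatasetItem` fields, and the `select_tracks` stand-in are invented, and the real step and evaluator would look different.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical catalog: track id -> energy in [0, 1]. All values are invented.
CATALOG = {"t1": 0.82, "t2": 0.74, "t3": 0.35, "t4": 0.28, "t5": 0.22, "t6": 0.55}


@dataclass(frozen=True)
class DatasetItem:
    recent: tuple            # track ids the listener just heard
    instruction: str         # steering instruction passed to the step
    expected_direction: str  # "lower_energy" or "higher_energy"


def select_tracks(recent, instruction, n=3):
    """Stand-in for the real select-tracks step: sort unheard tracks by energy."""
    candidates = [t for t in CATALOG if t not in recent]
    calmer = instruction == "calmer"
    return sorted(candidates, key=CATALOG.get, reverse=not calmer)[:n]


def direction_evaluator(item, selected):
    """Pass if the set moves in the expected direction, whatever the exact tracks."""
    before = mean(CATALOG[t] for t in item.recent)
    after = mean(CATALOG[t] for t in selected)
    if item.expected_direction == "lower_energy":
        return after < before
    return after > before


dataset = [
    DatasetItem(recent=("t1", "t2"), instruction="calmer",
                expected_direction="lower_energy"),
    DatasetItem(recent=("t4", "t5"), instruction="more energy",
                expected_direction="higher_energy"),
]

# One experiment run: apply the step to every item and grade each output.
results = [direction_evaluator(item, select_tracks(item.recent, item.instruction))
           for item in dataset]
pass_rate = sum(results) / len(results)
```

Because the evaluator only checks the direction of the change, a new selection algorithm can pick entirely different tracks and still pass, which fits a use case where no single set is correct for every listener.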

A/B tests on live users

Some changes are hard to grade offline, like the tone of voice of the commentary. Since the risk is low, the team can give a small group of listeners the new version and compare the signal scores between the groups, like the skip rate and the message_type distribution. If the new version does better, they can roll it out to everyone.
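
As a rough sketch of such a comparison, the code below assigns listeners to a group deterministically and computes the two signals per group. The function names, the 10% treatment share, the record fields, and all numbers are hypothetical. A real comparison would also need enough listeners per group to tell a real difference from noise.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Hash the user id into [0, 1) so a listener always lands in the same group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


def skip_rate(sets) -> float:
    """Share of queued tracks that were skipped, across all sets of a group."""
    queued = sum(s["tracks_queued"] for s in sets)
    skipped = sum(s["skips"] for s in sets)
    return skipped / queued if queued else 0.0


def message_type_distribution(requests) -> dict:
    """Relative frequency of each message_type value in a group."""
    counts = Counter(r["message_type"] for r in requests)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Invented toy data: one record per plan-next-set trace, grouped by variant.
control_sets = [{"tracks_queued": 4, "skips": 2}, {"tracks_queued": 4, "skips": 1}]
treatment_sets = [{"tracks_queued": 4, "skips": 1}, {"tracks_queued": 4, "skips": 0}]

# Invented toy data: one record per handle-dj-request trace in one group.
treatment_requests = [
    {"message_type": "steer"},
    {"message_type": "steer"},
    {"message_type": "question"},
    {"message_type": "skip"},
]

comparison = {
    "control_skip_rate": skip_rate(control_sets),
    "treatment_skip_rate": skip_rate(treatment_sets),
    "treatment_message_types": message_type_distribution(treatment_requests),
}
```

Hashing the user id instead of drawing a random number per session keeps each listener in one group for the whole test, so their scores are not split across both versions.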

With this in place, the full loop is running:

Deploy
Trace: every set and every request
Monitor: skips, dj_replaced, message_type, A/B comparisons
Build datasets: from what we see while monitoring production
Experiment: selection algorithm, DJ prompt, ...
Evaluate: grade step outputs against their expectations

Conclusion

This is an example of a feature that is low risk and has no historical data to learn from. The best thing you can do in such cases is to get traces in as soon as possible and start monitoring them. Everything else (datasets, experiments, A/B tests) can be built on top of that over time.

Check out the other examples or the academy to learn more.

