Baseline Support in Experiment Compare View
Compare experiment runs side-by-side with baseline designation to systematically identify regressions and improvements
Every prompt tweak, model swap, or config change is an experiment. Teams need to know if the candidate actually improves upon production. Without structured comparison, you’re either building spreadsheets or missing regressions.
What’s New
The experiment compare view now supports baseline designation. Select two experiment runs from the experiments table, click Compare, and set one as baseline. This enables side-by-side analysis of baseline versus candidate performance across all test cases.
Side-by-Side Comparison

- Matched rows: Each row pairs baseline and candidate outputs for the same dataset item, joined on stable identifiers for an apples-to-apples comparison (sketched after this list)
- Visual indicators: Green/red deltas for scores, cost, and latency make it easy to spot item-level changes
- Column headers: Summary stats in the column headers show aggregate performance differences between baseline and candidate
- Trace access: Click any row to open execution traces and debug behavioral changes
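Under the hood, matching behaves like a join on each dataset item's stable ID. Here is a minimal sketch of that joining and delta computation; the `RunResult` and `compare_rows` names are illustrative, not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Per-item result from one experiment run (illustrative structure)."""
    item_id: str        # stable dataset item identifier shared by both runs
    output: str
    score: float        # e.g. a hallucination or quality score from an evaluator
    cost_usd: float
    latency_ms: float


def compare_rows(baseline: list[RunResult], candidate: list[RunResult]) -> list[dict]:
    """Join baseline and candidate results on item_id and compute per-item deltas."""
    base_by_id = {r.item_id: r for r in baseline}
    rows = []
    for cand in candidate:
        base = base_by_id.get(cand.item_id)
        if base is None:
            continue  # item missing from the baseline run; nothing to compare against
        rows.append({
            "item_id": cand.item_id,
            "baseline_output": base.output,
            "candidate_output": cand.output,
            "candidate_score": cand.score,
            "score_delta": cand.score - base.score,
            "cost_delta_pct": (cand.cost_usd - base.cost_usd) / base.cost_usd
                              if base.cost_usd else 0.0,
            "latency_delta_ms": cand.latency_ms - base.latency_ms,
        })
    return rows
```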
Hunt for Regressions
Use column filters to build your regression worklist. Filter by score thresholds (e.g., Candidate Hallucination > 0.0) or performance deltas (e.g., Cost Delta > 10%). The filtered table becomes your work queue.
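Expressed against the matched rows from the sketch above, the same filters might look like this; the thresholds mirror the examples in the text and are placeholders you would tune per project.

```python
def regression_worklist(rows: list[dict],
                        score_threshold: float = 0.0,
                        cost_delta_pct: float = 0.10) -> list[dict]:
    """Keep rows whose candidate score or cost delta crosses a threshold."""
    return [
        row for row in rows
        if row["candidate_score"] > score_threshold   # e.g. Candidate Hallucination > 0.0
        or row["cost_delta_pct"] > cost_delta_pct     # e.g. Cost Delta > 10%
    ]
```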
For each item:
- Compare outputs: Review baseline vs. candidate behavior to see what changed
- Validate evaluators: Check if the evaluator score matches actual output quality. Broken evaluators create false signals—fix them before trusting results
- Add annotations: Use annotation mode to classify failures with structured scores
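A structured annotation is essentially a score plus a failure category attached to one item. The schema below is a rough sketch with field names chosen for illustration, not taken from the product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Annotation:
    """A structured annotation attached to one comparison row (illustrative schema)."""
    item_id: str
    failure_category: str   # e.g. "hallucination", "formatting", "refusal"
    score: float            # reviewer-assigned score, e.g. 0.0 (fail) to 1.0 (pass)
    note: str = ""
    annotated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example: classify one regression pulled from the filtered worklist
annotation = Annotation(
    item_id="item-042",
    failure_category="hallucination",
    score=0.0,
    note="Candidate invents a citation; the baseline answer was grounded.",
)
```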
Aggregate Metrics
The “Charts” tab shows high-level metric summaries. Compare baseline and candidate across quality scores, cost, and latency distributions to get a first signal on whether quality improvements come at an acceptable cost in latency or price.
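Reusing the `RunResult` records from the earlier sketch, comparable aggregates can be computed directly; mean score, mean cost, and p95 latency are common choices, though the actual charts may use different statistics.

```python
from statistics import mean, quantiles


def summarize(results: list[RunResult]) -> dict:
    """Aggregate one run's per-item results: mean score, mean cost, p95 latency."""
    return {
        "mean_score": mean(r.score for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "p95_latency_ms": quantiles([r.latency_ms for r in results], n=20)[-1],
    }


def aggregate_deltas(baseline: list[RunResult], candidate: list[RunResult]) -> dict:
    """Candidate-minus-baseline difference for each summary metric."""
    base, cand = summarize(baseline), summarize(candidate)
    return {key: cand[key] - base[key] for key in base}
```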
Getting Started
- Run two experiment versions using the same dataset
- Select both runs in the experiments table and click Compare
- Designate the production version as baseline
- Review aggregate metrics in the Charts tab, then drill into item-level differences in the Outputs tab (end-to-end sketch below)
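Putting the steps together: a hypothetical end-to-end sketch in plain Python, where `run_experiment` and `call_model` stand in for whatever SDK or harness your team already uses to execute runs, and the comparison helpers come from the sketches above.

```python
def run_experiment(prompt_template: str, dataset: list[dict], call_model) -> list[RunResult]:
    """Run one experiment version over a shared dataset (call_model is your own client)."""
    results = []
    for item in dataset:
        output, cost_usd, latency_ms = call_model(prompt_template.format(**item["inputs"]))
        results.append(RunResult(
            item_id=item["id"],   # the same stable ID is reused by both runs
            output=output,
            score=0.0,            # filled in afterwards by your evaluators
            cost_usd=cost_usd,
            latency_ms=latency_ms,
        ))
    return results


# baseline = run_experiment(PROD_PROMPT, dataset, call_model)
# candidate = run_experiment(NEW_PROMPT, dataset, call_model)
# print(aggregate_deltas(baseline, candidate))                   # Charts-style summary
# worklist = regression_worklist(compare_rows(baseline, candidate))
```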