Baseline Support in Experiment Compare View
Compare experiment runs side-by-side with baseline designation to systematically identify regressions and improvements
Every prompt tweak, model swap, or config change is an experiment. Teams need to know if the candidate actually improves upon production. Without structured comparison, you’re either building spreadsheets or missing regressions.
What’s New
The experiment compare view now supports baseline designation. Select two experiment runs from the experiments table, click Compare, and set one as baseline. This enables side-by-side analysis of baseline versus candidate performance across all test cases.
Side-by-Side Comparison

- Matched rows: Each row pairs baseline and candidate outputs for the same dataset item, joined on stable identifiers for an apples-to-apples comparison (sketched after this list)
- Visual indicators: Green/red deltas for scores, cost, and latency make it easy to spot item-level changes
- Column headers: Summary stats in the column headers show aggregate performance differences between baseline and candidate
- Trace access: Click any row to open execution traces and debug behavioral changes
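Under the hood, matching behaves like a join on each dataset item's stable ID. Here is a minimal sketch of that joining and delta computation; the `RunResult` and `compare_rows` names are illustrative, not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Per-item result from one experiment run (illustrative structure)."""
    item_id: str        # stable dataset item identifier shared by both runs
    output: str
    score: float        # e.g. a hallucination or quality score from an evaluator
    cost_usd: float
    latency_ms: float


def compare_rows(baseline: list[RunResult], candidate: list[RunResult]) -> list[dict]:
    """Join baseline and candidate results on item_id and compute per-item deltas."""
    base_by_id = {r.item_id: r for r in baseline}
    rows = []
    for cand in candidate:
        base = base_by_id.get(cand.item_id)
        if base is None:
            continue  # item missing from the baseline run; nothing to compare against
        rows.append({
            "item_id": cand.item_id,
            "baseline_output": base.output,
            "candidate_output": cand.output,
            "candidate_score": cand.score,
            "score_delta": cand.score - base.score,
            "cost_delta_pct": (cand.cost_usd - base.cost_usd) / base.cost_usd
                              if base.cost_usd else 0.0,
            "latency_delta_ms": cand.latency_ms - base.latency_ms,
        })
    return rows
```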
Hunt for Regressions
Use column filters to build your regression worklist. Filter by score thresholds (e.g., Candidate Hallucination > 0.0) or performance deltas (e.g., Cost Delta > 10%). The filtered table becomes your work queue.
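Expressed against the matched rows from the sketch above, the same filters might look like this; the thresholds mirror the examples in the text and are placeholders you would tune per project.

```python
def regression_worklist(rows: list[dict],
                        score_threshold: float = 0.0,
                        cost_delta_pct: float = 0.10) -> list[dict]:
    """Keep rows whose candidate score or cost delta crosses a threshold."""
    return [
        row for row in rows
        if row["candidate_score"] > score_threshold   # e.g. Candidate Hallucination > 0.0
        or row["cost_delta_pct"] > cost_delta_pct     # e.g. Cost Delta > 10%
    ]
```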
For each item:
- Compare outputs: Review baseline vs. candidate behavior to see what changed
- Validate evaluators: Check if the evaluator score matches actual output quality. Broken evaluators create false signals—fix them before trusting results
- Add annotations: Use annotation mode to classify failures with structured scores
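A structured annotation is essentially a score plus a failure category attached to one item. The schema below is a rough sketch with field names chosen for illustration, not taken from the product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Annotation:
    """A structured annotation attached to one comparison row (illustrative schema)."""
    item_id: str
    failure_category: str   # e.g. "hallucination", "formatting", "refusal"
    score: float            # reviewer-assigned score, e.g. 0.0 (fail) to 1.0 (pass)
    note: str = ""
    annotated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example: classify one regression pulled from the filtered worklist
annotation = Annotation(
    item_id="item-042",
    failure_category="hallucination",
    score=0.0,
    note="Candidate invents a citation; the baseline answer was grounded.",
)
```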
Aggregate Metrics
The “Charts” tab shows high-level metric summaries. Compare baseline and candidate across quality scores, cost, and latency distributions to get a first signal on whether quality improvements come at an acceptable cost in latency or price.
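Reusing the `RunResult` records from the earlier sketch, comparable aggregates can be computed directly; mean score, mean cost, and p95 latency are common choices, though the actual charts may use different statistics.

```python
from statistics import mean, quantiles


def summarize(results: list[RunResult]) -> dict:
    """Aggregate one run's per-item results: mean score, mean cost, p95 latency."""
    return {
        "mean_score": mean(r.score for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "p95_latency_ms": quantiles([r.latency_ms for r in results], n=20)[-1],
    }


def aggregate_deltas(baseline: list[RunResult], candidate: list[RunResult]) -> dict:
    """Candidate-minus-baseline difference for each summary metric."""
    base, cand = summarize(baseline), summarize(candidate)
    return {key: cand[key] - base[key] for key in base}
```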
Getting Started
- Run two experiment versions using the same dataset
- Select both runs in the experiments table and click Compare
- Designate the production version as baseline
- Review aggregate metrics in the Charts tab, then drill into item-level differences in the Outputs tab (end-to-end sketch below)
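Putting the steps together: a hypothetical end-to-end sketch in plain Python, where `run_experiment` and `call_model` stand in for whatever SDK or harness your team already uses to execute runs, and the comparison helpers come from the sketches above.

```python
def run_experiment(prompt_template: str, dataset: list[dict], call_model) -> list[RunResult]:
    """Run one experiment version over a shared dataset (call_model is your own client)."""
    results = []
    for item in dataset:
        output, cost_usd, latency_ms = call_model(prompt_template.format(**item["inputs"]))
        results.append(RunResult(
            item_id=item["id"],   # the same stable ID is reused by both runs
            output=output,
            score=0.0,            # filled in afterwards by your evaluators
            cost_usd=cost_usd,
            latency_ms=latency_ms,
        ))
    return results


# baseline = run_experiment(PROD_PROMPT, dataset, call_model)
# candidate = run_experiment(NEW_PROMPT, dataset, call_model)
# print(aggregate_deltas(baseline, candidate))                   # Charts-style summary
# worklist = regression_worklist(compare_rows(baseline, candidate))
```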