Baselines & References — ReplayCI
A reference (baseline) is your known-good standard. ReplayCI compares every new run against it to catch regressions — if model behavior drifts from what you've validated, you'll know immediately.
References build automatically from your runs. Once enough consistent runs prove a configuration is stable, it becomes eligible for promotion to an enforced reference.
The lifecycle
Every reference goes through four states:
CANDIDATE → ACTIVE → STALE → RETIRED
| State | What it means | CI behavior |
|---|---|---|
| CANDIDATE | Collecting runs, building confidence | Evidence-only (advisory) |
| ACTIVE | Promoted, full comparison enabled | Merge-blocking enforcement |
| STALE | Drift detected on an active reference | Advisory only (annotations) |
| RETIRED | Superseded by a newer promoted reference | Evidence-only |
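The lifecycle and the table above can be sketched as a small state machine. This is a minimal illustration, not ReplayCI's implementation: the `State`, `CI_BEHAVIOR`, `NEXT_STATE`, `advance`, and `blocks_merge` names are hypothetical, and it assumes the only transitions are the forward steps shown in the lifecycle line.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "CANDIDATE"
    ACTIVE = "ACTIVE"
    STALE = "STALE"
    RETIRED = "RETIRED"


# CI behavior per state, copied from the table above.
CI_BEHAVIOR = {
    State.CANDIDATE: "evidence-only",
    State.ACTIVE: "merge-blocking",
    State.STALE: "advisory",
    State.RETIRED: "evidence-only",
}

# Assumption: references only move forward along
# CANDIDATE -> ACTIVE -> STALE -> RETIRED.
NEXT_STATE = {
    State.CANDIDATE: State.ACTIVE,
    State.ACTIVE: State.STALE,
    State.STALE: State.RETIRED,
}


def advance(state: State) -> State:
    """Return the next lifecycle state, or raise if the state is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is terminal")
    return NEXT_STATE[state]


def blocks_merge(state: State) -> bool:
    """Only an ACTIVE reference is enforced as merge-blocking."""
    return CI_BEHAVIOR[state] == "merge-blocking"
```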
How references are created
References are created automatically when you submit runs. Each unique combination of these six fields produces a distinct baseline key:
| Field | Source |
|---|---|
| contract_hash | SHA-256 of your pack's contracts |
| provider | e.g., openai, anthropic |
| model_id | e.g., gpt-4o-mini, claude-sonnet-4-6 |
| runner_version | ReplayCI version |
| env_hash | Environment fingerprint |
| normalization_profile | Response normalization mode |
Changing any of these produces a new baseline key, which starts a new CANDIDATE.
What this means in practice
If you run the same pack with the same provider and model, all those runs accumulate toward the same reference. If you switch models or update your contracts, a new reference starts from scratch.
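To make the key behavior concrete, here is a minimal sketch of deriving a 16-character hex key from the six fields. The serialization (canonical JSON) and the use of truncated SHA-256 are assumptions made for illustration; this page does not specify how ReplayCI actually computes the key. `BaselineFields` and `baseline_key` are hypothetical names, and the field values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BaselineFields:
    """The six material fields listed in the table above."""
    contract_hash: str
    provider: str
    model_id: str
    runner_version: str
    env_hash: str
    normalization_profile: str


def baseline_key(fields: BaselineFields) -> str:
    # Assumption: canonical JSON hashed with SHA-256, truncated to 16 hex chars.
    # The real serialization and hash function may differ.
    canonical = json.dumps(asdict(fields), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Placeholder values for illustration only.
a = BaselineFields("c0ffee", "openai", "gpt-4o-mini", "1.0.0", "env123", "default")
b = BaselineFields("c0ffee", "openai", "gpt-4o-mini", "1.0.0", "env123", "default")
c = BaselineFields("c0ffee", "anthropic", "claude-sonnet-4-6", "1.0.0", "env123", "default")

print(baseline_key(a) == baseline_key(b))  # identical configuration, same key
print(baseline_key(a) == baseline_key(c))  # provider and model changed, new key
```

Because every field feeds the hash, any single change (a new runner version, for instance) yields a different key and therefore a fresh CANDIDATE.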
Promotion: CANDIDATE to ACTIVE
A CANDIDATE is promoted to ACTIVE when it meets all of these:
- 7 or more successful runs in a rolling 10-day window
- All qualifying runs share the same corpus manifest hash (same contracts)
- Only Succeeded runs count. Failed and non-reproducible runs are excluded from the tally, but they do not invalidate the runs already counted
Promotion happens automatically after a successful run when the threshold is met. You don't need to run a manual script or click a button.
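The promotion criteria above can be sketched as a simple eligibility check. This is an illustration under stated assumptions, not ReplayCI's code: the `Run` type, the status labels other than Succeeded, and the choice to pin the corpus manifest hash of the most recent successful run are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REQUIRED_RUNS = 7              # threshold from the criteria above
WINDOW = timedelta(days=10)    # rolling 10-day window


@dataclass(frozen=True)
class Run:
    finished_at: datetime
    status: str                # "Succeeded", "Failed", "NonReproducible" (labels assumed)
    corpus_manifest_hash: str


def qualifying_runs(runs: list[Run], now: datetime) -> list[Run]:
    """Succeeded runs inside the window that share one corpus manifest hash."""
    recent = [
        r for r in runs
        if r.status == "Succeeded" and now - WINDOW <= r.finished_at <= now
    ]
    if not recent:
        return []
    # Assumption: the most recent successful run decides which hash is pinned.
    pinned = max(recent, key=lambda r: r.finished_at).corpus_manifest_hash
    return [r for r in recent if r.corpus_manifest_hash == pinned]


def is_promotable(runs: list[Run], now: datetime) -> bool:
    return len(qualifying_runs(runs, now)) >= REQUIRED_RUNS


now = datetime(2025, 1, 10)
runs = [Run(now - timedelta(days=d), "Succeeded", "m1") for d in range(7)]
runs.append(Run(now, "Failed", "m1"))  # excluded from the count, nothing else changes
print(is_promotable(runs, now))
```

Failed and non-reproducible runs are filtered out before counting, so they never reduce the tally of runs that already qualified.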
Example timeline
Day 1: Run 1 (Pass) — CANDIDATE created, 1/7
Day 1: Run 2 (Pass) — 2/7
Day 2: Run 3 (Fail) — still 2/7 (failures don't count)
Day 2: Run 4 (Pass) — 3/7
Day 3: Run 5 (Pass) — 4/7
Day 3: Run 6 (Pass) — 5/7
Day 4: Run 7 (Pass) — 6/7
Day 4: Run 8 (Pass) — 7/7 → AUTO-PROMOTED to ACTIVE
After promotion, all future runs are compared against this reference. Regressions are flagged.
Drift: ACTIVE to STALE
When something material changes — the model version updates, or runtime behavior shifts — an ACTIVE reference moves to STALE.
What triggers STALE:
- Model version drift (provider ships a new model version)
- Runtime drift (e.g., latency characteristics change significantly)
What does NOT trigger STALE:
- Behavioral drift marked as evidence-only (advisory observations)
When a reference is STALE, comparisons against it become advisory — they flag differences but don't block merges. The hard gate still blocks on deterministic signals (like contract failures).
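The trigger rules above reduce to a small classification step. The sketch below uses hypothetical names (`Drift`, `STALE_TRIGGERS`, `state_after_drift`) and assumes the drift kinds listed in this section are the only ones.

```python
from enum import Enum


class Drift(Enum):
    MODEL_VERSION = "model_version"                        # provider ships a new model version
    RUNTIME = "runtime"                                    # e.g., latency shifts significantly
    BEHAVIORAL_EVIDENCE_ONLY = "behavioral_evidence_only"  # advisory observation


# Only these drift kinds move an ACTIVE reference to STALE.
STALE_TRIGGERS = {Drift.MODEL_VERSION, Drift.RUNTIME}


def state_after_drift(current: str, drift: Drift) -> str:
    """Return the reference state after a drift observation."""
    if current == "ACTIVE" and drift in STALE_TRIGGERS:
        return "STALE"
    return current


print(state_after_drift("ACTIVE", Drift.MODEL_VERSION))
print(state_after_drift("ACTIVE", Drift.BEHAVIORAL_EVIDENCE_ONLY))
```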
Recovery: building a replacement
When a reference goes STALE, you need a replacement. The process is:
- Keep submitting runs with the same pack and configuration
- A new CANDIDATE starts accumulating successful runs
- When the new CANDIDATE reaches 7 qualifying runs within the rolling 10-day window, it auto-promotes to ACTIVE
- The old STALE reference moves to RETIRED
This is the roll-forward pattern — new evidence replaces old evidence. You don't repair a stale reference; you build a new one.
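A minimal sketch of the roll-forward steps, assuming a hypothetical in-memory list of references for one configuration. `Reference` and `record_success` are illustrative names, and the 10-day window is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    ref_id: str
    state: str
    qualifying_runs: int = 0


def record_success(refs: list[Reference], threshold: int = 7) -> None:
    """Credit one qualifying run to the CANDIDATE; promote it and retire STALE ones."""
    candidate = next((r for r in refs if r.state == "CANDIDATE"), None)
    if candidate is None:
        # No candidate yet: start a replacement (ID scheme is hypothetical).
        candidate = Reference(f"ref-{len(refs) + 1}", "CANDIDATE")
        refs.append(candidate)
    candidate.qualifying_runs += 1
    if candidate.qualifying_runs >= threshold:
        for r in refs:
            if r.state == "STALE":
                r.state = "RETIRED"  # old evidence is replaced, not repaired
        candidate.state = "ACTIVE"


refs = [Reference("ref-1", "STALE", qualifying_runs=7)]
for _ in range(7):
    record_success(refs)
print([(r.ref_id, r.state) for r in refs])
# [('ref-1', 'RETIRED'), ('ref-2', 'ACTIVE')]
```

The stale reference is never edited back to ACTIVE; it only changes state once its replacement has been promoted.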
Dashboard
The References page shows your current trust posture:
Trust states
| Posture | Meaning |
|---|---|
| Healthy | Active reference enforced, no drift |
| Partially degraded | Active reference exists but drift detected on others |
| Action needed | No enforced reference, but a candidate is ready to promote |
| Building confidence | Candidates collecting runs toward the 7-run threshold |
What you'll see
- Current Trusted Reference — the active reference with provider, model, qualifying run count, and consistency score
- Collecting Runs — candidates with progress bars showing N/7 runs
- Needs Refresh — stale references that need replacement
- Retired — historical references (collapsed by default)
CI integration
References integrate with the two-lane CI system:
- Lane A (Hard Gate): Uses recorded fixtures, merge-blocking. When an ACTIVE reference exists, deterministic signals are enforced.
- Lane B (Evidence Lane): Uses live providers, advisory. Shows how current behavior compares to the reference.
When no ACTIVE reference exists, all comparisons are evidence-only. Once promoted, the reference becomes the enforcement standard.
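One way to read the two-lane rules is as a single merge decision. The sketch below is an interpretation, not ReplayCI's logic: `should_block_merge` and its parameters are hypothetical, and it assumes that deterministic contract failures block Lane A regardless of reference state, while comparisons against a reference block only when that reference is ACTIVE.

```python
from typing import Optional


def should_block_merge(
    lane: str,
    reference_state: Optional[str],
    contract_failed: bool,
    regressed_vs_reference: bool,
) -> bool:
    if lane == "B":
        # Lane B (Evidence Lane) is advisory and never blocks.
        return False
    if lane != "A":
        raise ValueError(f"unknown lane: {lane!r}")
    if contract_failed:
        # Assumption: deterministic signals block the hard gate
        # whatever state the reference is in.
        return True
    # Comparisons against a reference are enforced only while it is ACTIVE.
    return regressed_vs_reference and reference_state == "ACTIVE"


print(should_block_merge("A", "ACTIVE", False, True))  # enforced comparison
print(should_block_merge("A", "STALE", False, True))   # advisory comparison
```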
Key concepts
Baseline key
A 16-character hex hash computed from six material fields. Two runs with identical configurations produce the same baseline key. Changing the model, provider, or contracts produces a different key.
Corpus pinning
All qualifying runs must share the same corpus manifest hash. This ensures the reference was built from consistent test inputs — you can't mix runs from different contract versions.
Non-reproducible exclusion
Non-reproducible (NR) runs are excluded from promotion counting. They don't poison the count — they're simply ignored. NR results indicate the model's response was not comparable, which is noise, not signal.
FAQ
How many runs do I need before my first reference?
7 successful runs with the same configuration, within a 10-day window.
Do I need to promote manually?
No. Promotion is automatic after the 7th qualifying run.
What happens if I change my contracts?
A new baseline key is generated (because contract_hash changes). The old reference stays in its current state, and a new CANDIDATE starts for the updated contracts.
Can I have multiple active references?
Yes. Different baseline keys (e.g., different providers or models) can each have their own ACTIVE reference.
How do I check my reference status?
Open the References page in the dashboard, or query the /api/dashboard/baselines endpoint.
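A minimal sketch of querying the endpoint from a script, using only the Python standard library. Only the path /api/dashboard/baselines comes from this page; the base URL, the use of GET, the absence of authentication, and a JSON response body are all assumptions, so adjust them to match your deployment.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASELINES_PATH = "/api/dashboard/baselines"  # endpoint named above


def baselines_url(base_url: str) -> str:
    """Join a deployment base URL with the baselines path."""
    return urljoin(base_url.rstrip("/") + "/", BASELINES_PATH.lstrip("/"))


def fetch_baselines(base_url: str, timeout: float = 10.0):
    """GET the endpoint and decode the body as JSON (response shape is not documented here)."""
    request = urllib.request.Request(
        baselines_url(base_url),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical host, shown for illustration only:
# data = fetch_baselines("https://replayci.example")
# print(json.dumps(data, indent=2))
```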