Baselines & References — ReplayCI

A reference (baseline) is your known-good standard. ReplayCI compares every new run against it to catch regressions — if model behavior drifts from what you've validated, you'll know immediately.

References build automatically from your runs. Once enough consistent runs prove a configuration is stable, it becomes eligible for promotion to an enforced reference.

The lifecycle

A reference moves through up to four states, in this order:

CANDIDATE → ACTIVE → STALE → RETIRED

State	What it means	CI behavior
CANDIDATE	Collecting runs, building confidence	Evidence-only (advisory)
ACTIVE	Promoted, full comparison enabled	Merge-blocking enforcement
STALE	Drift detected on an active reference	Advisory only (annotations)
RETIRED	Superseded by a newer promoted reference	Evidence-only
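The table above can be read as a small state machine. The following is a minimal sketch, not ReplayCI's implementation: the names `ReferenceState`, `ALLOWED_TRANSITIONS`, and `CI_BEHAVIOR` are hypothetical, and the transitions take the lifecycle diagram literally (forward-only, one step at a time). Whether other transitions exist is not stated here.

```python
from enum import Enum


class ReferenceState(Enum):
    CANDIDATE = "CANDIDATE"
    ACTIVE = "ACTIVE"
    STALE = "STALE"
    RETIRED = "RETIRED"


# Assumption: only the forward steps drawn in the lifecycle diagram are legal.
ALLOWED_TRANSITIONS = {
    ReferenceState.CANDIDATE: {ReferenceState.ACTIVE},
    ReferenceState.ACTIVE: {ReferenceState.STALE},
    ReferenceState.STALE: {ReferenceState.RETIRED},
    ReferenceState.RETIRED: set(),
}

# CI behavior per state, copied from the table.
CI_BEHAVIOR = {
    ReferenceState.CANDIDATE: "evidence-only",
    ReferenceState.ACTIVE: "merge-blocking",
    ReferenceState.STALE: "advisory",
    ReferenceState.RETIRED: "evidence-only",
}


def can_transition(current: ReferenceState, target: ReferenceState) -> bool:
    """Return True if the lifecycle diagram allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_merge_blocking(state: ReferenceState) -> bool:
    """Only an ACTIVE reference enforces merge-blocking comparison."""
    return CI_BEHAVIOR[state] == "merge-blocking"
```

For example, `can_transition(ReferenceState.CANDIDATE, ReferenceState.ACTIVE)` is true, while nothing leaves `RETIRED`.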

How references are created

References are created automatically when you submit runs. Each unique combination of these six fields produces a distinct baseline key:

Field	Source
`contract_hash`	SHA-256 of your pack's contracts
`provider`	e.g., `openai`, `anthropic`
`model_id`	e.g., `gpt-4o-mini`, `claude-sonnet-4-6`
`runner_version`	ReplayCI version
`env_hash`	Environment fingerprint
`normalization_profile`	Response normalization mode

Changing any of these produces a new baseline key, which starts a new CANDIDATE.
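To make the keying rule concrete, here is a hedged sketch of how a key could be derived from the six fields. The exact derivation ReplayCI uses is not documented on this page; the canonical-JSON serialization, the use of SHA-256 for the key itself, and the function name `baseline_key` are assumptions for illustration. The 16-character length follows the description under Key concepts.

```python
import hashlib
import json

# The six material fields listed in the table above.
MATERIAL_FIELDS = (
    "contract_hash",
    "provider",
    "model_id",
    "runner_version",
    "env_hash",
    "normalization_profile",
)


def baseline_key(config: dict) -> str:
    """Derive a 16-character hex key from the six material fields.

    Assumption: fields are serialized as sorted, compact JSON and hashed
    with SHA-256; the real ReplayCI derivation may differ.
    """
    missing = [f for f in MATERIAL_FIELDS if f not in config]
    if missing:
        raise KeyError(f"missing material fields: {missing}")
    # Ignore any extra keys so only material fields influence the key.
    material = {f: config[f] for f in MATERIAL_FIELDS}
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Two configurations that agree on all six fields map to the same key; changing any single field, such as `model_id`, yields a different key and therefore a new CANDIDATE.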

What this means in practice

If you run the same pack with the same provider, model, runner version, environment, and normalization profile, all of those runs accumulate toward the same reference. If you switch models or update your contracts, a new reference starts from scratch.

Promotion: CANDIDATE to ACTIVE

A CANDIDATE is promoted to ACTIVE when it meets all of these:

7 or more successful runs in a rolling 10-day window
All qualifying runs share the same corpus manifest hash (same contracts)
Only Succeeded runs count; failed and non-reproducible runs are ignored and do not reset the count

Promotion happens automatically after a successful run when the threshold is met. You don't need to run a manual script or click a button.
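The promotion criteria can be expressed as a short eligibility check. This is a sketch under stated assumptions, not the actual promotion code: the `Run` type, its field names, and the status strings are hypothetical, and the check treats any mix of corpus manifest hashes among qualifying runs as "not yet promotable".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REQUIRED_RUNS = 7            # threshold from the criteria above
WINDOW = timedelta(days=10)  # rolling window from the criteria above


@dataclass(frozen=True)
class Run:
    finished_at: datetime
    status: str                # hypothetical labels: "succeeded", "failed", "non_reproducible"
    corpus_manifest_hash: str


def is_promotable(runs: list[Run], now: datetime) -> bool:
    """Return True if a CANDIDATE with these runs meets all promotion criteria."""
    window_start = now - WINDOW
    # Failed and non-reproducible runs are skipped, not penalized.
    qualifying = [
        r for r in runs
        if r.status == "succeeded" and window_start <= r.finished_at <= now
    ]
    if len(qualifying) < REQUIRED_RUNS:
        return False
    # Assumption: every qualifying run must carry the same corpus manifest hash.
    return len({r.corpus_manifest_hash for r in qualifying}) == 1
```

A check like this would run after each successful run, which matches the automatic promotion described above.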

Example timeline

Day 1: Run 1 (Pass) — CANDIDATE created, 1/7
Day 1: Run 2 (Pass) — 2/7
Day 2: Run 3 (Fail) — still 2/7 (failures don't count)
Day 2: Run 4 (Pass) — 3/7
Day 3: Run 5 (Pass) — 4/7
Day 3: Run 6 (Pass) — 5/7
Day 4: Run 7 (Pass) — 6/7
Day 4: Run 8 (Pass) — 7/7 → AUTO-PROMOTED to ACTIVE

After promotion, future runs with the same baseline key are compared against this reference, and regressions are flagged.

Drift: ACTIVE to STALE

When something material changes — the model version updates, or runtime behavior shifts — an ACTIVE reference moves to STALE.

What triggers STALE:

Model version drift (provider ships a new model version)
Runtime drift (e.g., latency characteristics change significantly)

What does NOT trigger STALE:

Behavioral drift marked as evidence-only (advisory observations)

When a reference is STALE, comparisons against it become advisory — they flag differences but don't block merges. The hard gate still blocks on deterministic signals (like contract failures).

Recovery: building a replacement

When a reference goes STALE, you need a replacement. The process is:

Keep submitting runs with the same pack and configuration
A new CANDIDATE starts accumulating successful runs
When the new CANDIDATE reaches 7 qualifying runs, it auto-promotes to ACTIVE
The old STALE reference moves to RETIRED

This is the roll-forward pattern — new evidence replaces old evidence. You don't repair a stale reference; you build a new one.
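The roll-forward step can be sketched as a single operation: promote the new CANDIDATE, then retire the STALE reference it replaces. The `Reference` type and `roll_forward` function are hypothetical, and matching the replacement to the stale reference by baseline key is an assumption; this page does not say how the two are linked.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    ref_id: str
    baseline_key: str
    state: str  # "CANDIDATE", "ACTIVE", "STALE", or "RETIRED"


def roll_forward(references: list[Reference], promoted_id: str) -> None:
    """Promote one CANDIDATE and retire the STALE reference it supersedes."""
    promoted = next(r for r in references if r.ref_id == promoted_id)
    if promoted.state != "CANDIDATE":
        raise ValueError(f"{promoted_id} is {promoted.state}, not CANDIDATE")
    promoted.state = "ACTIVE"
    for ref in references:
        # Assumption: the superseded reference shares the promoted one's baseline key.
        if (
            ref is not promoted
            and ref.baseline_key == promoted.baseline_key
            and ref.state == "STALE"
        ):
            ref.state = "RETIRED"
```

Note that the stale reference is never edited back to ACTIVE; it only ever moves forward to RETIRED.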

Dashboard

The References page shows your current trust posture:

Trust states

Posture	Meaning
Healthy	Active reference enforced, no drift
Partially degraded	Active reference exists but drift detected on others
Action needed	No enforced reference, but a candidate is ready to promote
Building confidence	Candidates collecting runs toward the 7-run threshold
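One way to read the posture table is as a function of the reference states you currently hold. The sketch below is an interpretation, not the dashboard's logic: the function name, the `candidate_ready` flag, and the precedence between rows are assumptions, and combinations the table does not cover return `None`.

```python
from typing import Optional


def trust_posture(states: list[str], candidate_ready: bool = False) -> Optional[str]:
    """Map a set of reference states to a posture label from the table above."""
    has_active = "ACTIVE" in states
    has_stale = "STALE" in states
    has_candidate = "CANDIDATE" in states

    if has_active:
        # An enforced reference exists; drift elsewhere degrades the posture.
        return "Partially degraded" if has_stale else "Healthy"
    if has_candidate and candidate_ready:
        return "Action needed"
    if has_candidate:
        return "Building confidence"
    # Not covered by the table (for example, only STALE or RETIRED references).
    return None
```

For instance, `trust_posture(["ACTIVE", "STALE"])` yields "Partially degraded", and `trust_posture(["CANDIDATE"])` yields "Building confidence".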

What you'll see

Current Trusted Reference — the active reference with provider, model, qualifying run count, and consistency score
Collecting Runs — candidates with progress bars showing N/7 runs
Needs Refresh — stale references that need replacement
Retired — historical references (collapsed by default)

CI integration

References integrate with the two-lane CI system:

Lane A (Hard Gate): Uses recorded fixtures, merge-blocking. When an ACTIVE reference exists, deterministic signals are enforced.
Lane B (Evidence Lane): Uses live providers, advisory. Shows how current behavior compares to the reference.

When no ACTIVE reference exists, all comparisons are evidence-only. Once promoted, the reference becomes the enforcement standard.
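The interaction between lanes and reference state can be summarized as a decision function. This is a hedged sketch: `blocks_merge` and the two finding labels are hypothetical, and treating contract failures as blocking in Lane A regardless of reference state is an assumption (this page states it explicitly only for the STALE case).

```python
def blocks_merge(lane: str, reference_state: str, finding: str) -> bool:
    """Decide whether a failing finding should block the merge.

    lane: "A" (hard gate) or "B" (evidence lane)
    reference_state: "CANDIDATE", "ACTIVE", "STALE", or "RETIRED"
    finding: "contract_failure" or "reference_regression" (hypothetical labels)
    """
    if lane != "A":
        # Lane B is advisory and never blocks.
        return False
    if finding == "contract_failure":
        # Assumption: deterministic contract failures always block in Lane A.
        return True
    if finding == "reference_regression":
        # Comparisons against a reference are enforced only while it is ACTIVE.
        return reference_state == "ACTIVE"
    raise ValueError(f"unknown finding: {finding}")
```

Under this reading, a regression against a STALE reference in Lane A produces an annotation rather than a blocked merge, while a contract failure in the same run still blocks.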

Key concepts

Baseline key

The baseline key is a 16-character hex hash computed from the six material fields. Two runs with identical configurations produce the same baseline key. Changing any field, such as the model, provider, or contracts, produces a different key.

Corpus pinning

All qualifying runs must share the same corpus manifest hash. This ensures the reference was built from consistent test inputs — you can't mix runs from different contract versions.

Non-reproducible exclusion

Non-reproducible (NR) runs are excluded from promotion counting. They don't poison the count — they're simply ignored. NR results indicate the model's response was not comparable, which is noise, not signal.

FAQ

How many runs do I need before my first reference? 7 successful runs with the same configuration, within a 10-day window.

Do I need to promote manually? No. Promotion is automatic after the 7th qualifying run.

What happens if I change my contracts? A new baseline key is generated (because `contract_hash` changes). The old reference stays in its current state, and a new CANDIDATE starts for the updated contracts.

Can I have multiple active references? Yes. Different baseline keys (e.g., different providers or models) can each have their own ACTIVE reference.

How do I check my reference status? Open the References page in the dashboard, or query the `/api/dashboard/baselines` endpoint.
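As a rough illustration of querying the endpoint from a script, the sketch below builds the HTTP request with Python's standard library. Only the path `/api/dashboard/baselines` comes from this page; the base URL, the bearer-token header, and the function name are assumptions, and the response format is not documented here.

```python
import urllib.request

BASELINES_PATH = "/api/dashboard/baselines"  # path taken from the FAQ above


def build_baselines_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the baselines endpoint.

    Assumptions: base_url is your ReplayCI host, and the API accepts a
    bearer token in the Authorization header.
    """
    url = base_url.rstrip("/") + BASELINES_PATH
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )
```

Passing the result to `urllib.request.urlopen` sends the request; inspect the response body before relying on any particular field names.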
