Baselines & References — ReplayCI

A reference (baseline) is your known-good standard. ReplayCI compares every new run against it to catch regressions — if model behavior drifts from what you've validated, you'll know immediately.

References build automatically from your runs. Once enough consistent runs prove a configuration is stable, it becomes eligible for promotion to an enforced reference.


The lifecycle

A reference is always in one of four states, normally reached in this order:

CANDIDATE → ACTIVE → STALE → RETIRED

| State | What it means | CI behavior |
|---|---|---|
| CANDIDATE | Collecting runs, building confidence | Evidence-only (advisory) |
| ACTIVE | Promoted, full comparison enabled | Merge-blocking enforcement |
| STALE | Drift detected on an active reference | Advisory only (annotations) |
| RETIRED | Superseded by a newer promoted reference | Evidence-only |
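
As a rough illustration, the lifecycle can be modeled as a small state machine. This is a sketch, not ReplayCI's implementation: the names `RefState`, `NEXT_STATE`, `CI_BEHAVIOR`, and `advance` are hypothetical, and it encodes only the linear path shown in the arrow above. Whether other transitions exist is not stated here.

```python
from enum import Enum


class RefState(Enum):
    CANDIDATE = "CANDIDATE"
    ACTIVE = "ACTIVE"
    STALE = "STALE"
    RETIRED = "RETIRED"


# Forward-only transitions, read off the lifecycle arrow.
NEXT_STATE = {
    RefState.CANDIDATE: RefState.ACTIVE,  # promotion
    RefState.ACTIVE: RefState.STALE,      # drift detected
    RefState.STALE: RefState.RETIRED,     # superseded by a newer reference
}

# CI behavior per state, copied from the table.
CI_BEHAVIOR = {
    RefState.CANDIDATE: "evidence-only",
    RefState.ACTIVE: "merge-blocking",
    RefState.STALE: "advisory",
    RefState.RETIRED: "evidence-only",
}


def advance(state: RefState) -> RefState:
    """Return the next lifecycle state, or raise if the state is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is terminal")
    return NEXT_STATE[state]
```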

How references are created

References are created automatically when you submit runs. Each unique combination of these six fields produces a distinct baseline key:

| Field | Source |
|---|---|
| contract_hash | SHA-256 of your pack's contracts |
| provider | e.g., openai, anthropic |
| model_id | e.g., gpt-4o-mini, claude-sonnet-4-6 |
| runner_version | ReplayCI version |
| env_hash | Environment fingerprint |
| normalization_profile | Response normalization mode |

Changing any of these produces a new baseline key, which starts a new CANDIDATE.
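
To make the idea concrete, here is one way such a key could be derived. This is a sketch under stated assumptions: the six field names come from the table above, but the hashing scheme (SHA-256 over canonical JSON, truncated to the 16 hex characters described under Key concepts) and the function name `baseline_key` are illustrative guesses, not ReplayCI's documented algorithm.

```python
import hashlib
import json

MATERIAL_FIELDS = (
    "contract_hash",
    "provider",
    "model_id",
    "runner_version",
    "env_hash",
    "normalization_profile",
)


def baseline_key(config: dict) -> str:
    """Derive a 16-character hex key from the six material fields (assumed scheme)."""
    # Select exactly the six material fields; a missing field raises KeyError.
    material = {name: config[name] for name in MATERIAL_FIELDS}
    # Canonical JSON so that key order and whitespace never affect the digest.
    payload = json.dumps(material, sort_keys=True, separators=(",", ":"))
    # Assumption: SHA-256 truncated to 16 hex characters.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Under this scheme, two runs with identical values for the six fields map to the same key, and changing any one value (for example `model_id`) yields a different key, so a new CANDIDATE would start.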

What this means in practice

If you run the same pack with the same provider and model, and the runner version, environment, and normalization profile are unchanged, all of those runs accumulate toward the same reference. If you switch models or update your contracts, a new reference starts from scratch.


Promotion: CANDIDATE to ACTIVE

A CANDIDATE is promoted to ACTIVE when it meets all of these:

  1. 7 or more successful runs in a rolling 10-day window
  2. All qualifying runs share the same corpus manifest hash (same contracts)
  3. Only Succeeded runs count. Failed and non-reproducible runs are left out of the count, but they do not reset or invalidate it

Promotion happens automatically after a successful run when the threshold is met. You don't need to run a manual script or click a button.
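
The three rules above can be sketched as a single eligibility check. This is an illustration only: the `Run` type, its field names, the status labels other than Succeeded, and `is_promotable` are hypothetical, and it reads rule 2 strictly (any mix of corpus manifest hashes among the qualifying runs blocks promotion).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REQUIRED_RUNS = 7
WINDOW = timedelta(days=10)


@dataclass(frozen=True)
class Run:
    finished_at: datetime
    status: str  # "Succeeded"; other labels such as "Failed" are assumed
    corpus_manifest_hash: str


def is_promotable(runs: list[Run], now: datetime) -> bool:
    """Return True if a CANDIDATE with these runs meets the promotion rules."""
    # Rules 1 and 3: keep only Succeeded runs inside the rolling 10-day window.
    qualifying = [
        r for r in runs
        if r.status == "Succeeded" and now - WINDOW <= r.finished_at <= now
    ]
    if len(qualifying) < REQUIRED_RUNS:
        return False
    # Rule 2: every qualifying run must share one corpus manifest hash.
    return len({r.corpus_manifest_hash for r in qualifying}) == 1
```

Failed and non-reproducible runs are simply filtered out, so they neither advance nor reset the count.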

Example timeline

Day 1: Run 1 (Pass) — CANDIDATE created, 1/7
Day 1: Run 2 (Pass) — 2/7
Day 2: Run 3 (Fail) — still 2/7 (failures don't count)
Day 2: Run 4 (Pass) — 3/7
Day 3: Run 5 (Pass) — 4/7
Day 3: Run 6 (Pass) — 5/7
Day 4: Run 7 (Pass) — 6/7
Day 4: Run 8 (Pass) — 7/7 → AUTO-PROMOTED to ACTIVE

After promotion, all future runs are compared against this reference. Regressions are flagged.


Drift: ACTIVE to STALE

When something material changes — the model version updates, or runtime behavior shifts — an ACTIVE reference moves to STALE.

What triggers STALE:

  • Model version drift (provider ships a new model version)
  • Runtime drift (e.g., latency characteristics change significantly)

What does NOT trigger STALE:

  • Behavioral drift marked as evidence-only (advisory observations)

When a reference is STALE, comparisons against it become advisory — they flag differences but don't block merges. The hard gate still blocks on deterministic signals (like contract failures).
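
The difference between ACTIVE and STALE enforcement can be sketched as follows. The function name, parameters, and return shape are hypothetical. The sketch covers only the two states discussed in this section, and it assumes that a deterministic failure blocks under an ACTIVE reference just as it does under a STALE one.

```python
def gate_decision(state: str, contract_failed: bool, comparison_differs: bool) -> dict:
    """Decide whether a run blocks the merge or only produces an advisory annotation."""
    if state not in ("ACTIVE", "STALE"):
        raise ValueError("this sketch only models ACTIVE and STALE references")
    # Comparisons against the reference are enforced only while it is ACTIVE.
    enforce_comparison = state == "ACTIVE"
    # Deterministic signals (such as contract failures) block in both states.
    block = contract_failed or (comparison_differs and enforce_comparison)
    # Under STALE, a differing comparison is reported but does not block.
    advisory = comparison_differs and not enforce_comparison
    return {"block_merge": block, "advisory_annotation": advisory}
```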


Recovery: building a replacement

When a reference goes STALE, you need a replacement. The process is:

  1. Keep submitting runs with the same pack and configuration
  2. A new CANDIDATE starts accumulating successful runs
  3. When the new CANDIDATE reaches 7 qualifying runs, it auto-promotes to ACTIVE
  4. The old STALE reference moves to RETIRED

This is the roll-forward pattern — new evidence replaces old evidence. You don't repair a stale reference; you build a new one.
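
A minimal sketch of the roll-forward step, assuming references are plain dictionaries with hypothetical `state` and `qualifying_runs` fields. It is meant to show the shape of the transition, not how ReplayCI stores references.

```python
REQUIRED_RUNS = 7


def roll_forward(stale: dict, candidate: dict) -> tuple[dict, dict]:
    """Promote the candidate and retire the stale reference once the threshold is met."""
    if stale["state"] != "STALE" or candidate["state"] != "CANDIDATE":
        raise ValueError("expected a STALE reference and a CANDIDATE replacement")
    if candidate["qualifying_runs"] < REQUIRED_RUNS:
        # Not enough evidence yet: both references stay as they are.
        return stale, candidate
    # New evidence replaces old evidence; the stale reference is never repaired.
    return {**stale, "state": "RETIRED"}, {**candidate, "state": "ACTIVE"}
```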


Dashboard

The References page shows your current trust posture:

Trust states

| Posture | Meaning |
|---|---|
| Healthy | Active reference enforced, no drift |
| Partially degraded | Active reference exists but drift detected on others |
| Action needed | No enforced reference, but a candidate is ready to promote |
| Building confidence | Candidates collecting runs toward the 7-run threshold |
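
One way to read the posture table is as a function of the reference states present. The sketch below is an interpretation, not the dashboard's actual logic: `trust_posture` and its parameters are hypothetical, the precedence order is assumed, and combinations the table does not cover return `None`.

```python
from typing import Optional


def trust_posture(states: list[str], candidate_ready: bool = False) -> Optional[str]:
    """Map the set of reference states to a posture label from the table."""
    has_active = "ACTIVE" in states
    has_stale = "STALE" in states
    has_candidate = "CANDIDATE" in states
    if has_active:
        # An enforced reference exists; drift elsewhere degrades the posture.
        return "Partially degraded" if has_stale else "Healthy"
    if has_candidate and candidate_ready:
        return "Action needed"
    if has_candidate:
        return "Building confidence"
    # The table does not say what is shown in any other situation.
    return None
```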

What you'll see

  • Current Trusted Reference — the active reference with provider, model, qualifying run count, and consistency score
  • Collecting Runs — candidates with progress bars showing N/7 runs
  • Needs Refresh — stale references that need replacement
  • Retired — historical references (collapsed by default)

CI integration

References integrate with the two-lane CI system:

  • Lane A (Hard Gate): Uses recorded fixtures, merge-blocking. When an ACTIVE reference exists, deterministic signals are enforced.
  • Lane B (Evidence Lane): Uses live providers, advisory. Shows how current behavior compares to the reference.

When no ACTIVE reference exists, all comparisons are evidence-only. Once promoted, the reference becomes the enforcement standard.


Key concepts

Baseline key

A 16-character hex hash computed from six material fields. Two runs with identical configurations produce the same baseline key. Changing any of the six fields, such as the model, provider, or contracts, produces a different key.

Corpus pinning

All qualifying runs must share the same corpus manifest hash. This ensures the reference was built from consistent test inputs — you can't mix runs from different contract versions.

Non-reproducible exclusion

Non-reproducible (NR) runs are excluded from promotion counting. They don't poison the count — they're simply ignored. NR results indicate the model's response was not comparable, which is noise, not signal.


FAQ

How many runs do I need before my first reference?
7 successful runs with the same configuration, within a 10-day window.

Do I need to promote manually?
No. Promotion is automatic after the 7th qualifying run.

What happens if I change my contracts?
A new baseline key is generated (because contract_hash changes). The old reference stays in its current state, and a new CANDIDATE starts for the updated contracts.

Can I have multiple active references?
Yes. Different baseline keys (e.g., different providers or models) can each have their own ACTIVE reference.

How do I check my reference status?
Open the References page in the dashboard, or query the /api/dashboard/baselines endpoint.
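
For scripted checks, a request to that endpoint might look like the sketch below. Only the path `/api/dashboard/baselines` comes from this page. The base URL, the bearer-token header, and the assumption that the response is JSON are all placeholders, so check your own deployment for the real host, authentication scheme, and response shape.

```python
import json
import urllib.request
from typing import Optional

BASELINES_PATH = "/api/dashboard/baselines"  # path taken from this page


def baselines_url(base_url: str) -> str:
    """Join a deployment's base URL (assumed) with the baselines path."""
    return base_url.rstrip("/") + BASELINES_PATH


def fetch_baselines(base_url: str, token: Optional[str] = None):
    """Fetch the endpoint and decode the body, assuming it returns JSON."""
    request = urllib.request.Request(
        baselines_url(base_url), headers={"Accept": "application/json"}
    )
    if token:
        # Assumption: bearer-token auth; the real scheme is not documented here.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```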