Shadow Mode — ReplayCI
Shadow mode lets you compare two LLM providers side-by-side without affecting production. Run your primary provider as normal, and ReplayCI silently replays the same requests against a shadow provider, then reports what would break if you switched.
When to use shadow mode
- Evaluating a provider switch — will Anthropic pass the same contracts as OpenAI?
- Testing a model upgrade — does gpt-4o behave differently from gpt-4o-mini?
- Cost optimization — can a cheaper model pass the same contracts?
- Regression detection — does a new model version change tool-calling behavior?
Quick start
```shell
# Primary: OpenAI, Shadow: Anthropic
export REPLAYCI_PROVIDER_KEY="sk-..."            # Primary provider key
export REPLAYCI_SHADOW_PROVIDER_KEY="sk-ant-..." # Shadow provider key
export REPLAYCI_API_KEY="rci_live_..."           # Push results to dashboard

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6
```
The CLI runs all contracts against OpenAI (primary), captures each request, replays them against Anthropic (shadow), then compares the results.
Shadow mode makes real API calls to both providers. Each shadow run costs the same as a normal run against that provider. For cost-sensitive environments, run shadow comparisons on a schedule or manually rather than on every PR.
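One way to keep per-PR costs down is a thin wrapper that only appends the shadow flags when a scheduler opts in. The sketch below is illustrative: `RUN_SHADOW` is an assumed variable your CI scheduler would set, and the script echoes the command instead of executing it.

```shell
# Hypothetical wrapper: add the shadow flags only on scheduled runs, so
# every-PR CI stays cheap. RUN_SHADOW is an assumed scheduler-set variable.
build_cmd() {
  set -- --pack packs/my-pack --provider openai --model gpt-4o-mini
  if [ "${RUN_SHADOW:-0}" = "1" ]; then
    set -- "$@" --shadow-capture --shadow-provider anthropic --shadow-model claude-sonnet-4-6
  fi
  echo "npx replayci $*"   # echo instead of exec: the sketch is side-effect free
}

RUN_SHADOW=1
cmd=$(build_cmd)
echo "$cmd"
```

In a real pipeline you would replace the `echo` with the actual `npx replayci` invocation.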
How it works
Shadow mode runs in four stages:
1. Capture
During primary provider execution, each tool-call request is captured — the messages, tool definitions, and tool choice mode. Requests are redacted through SecurityGate before capture.
2. Execute
Each captured request is replayed against the shadow provider. The shadow provider receives the exact same input the primary provider saw.
3. Compare
For each request, ReplayCI checks:
- Comparability — did the context match? If the tool schema or system prompt differed between providers, the pair is marked INCOMPARABLE (this shouldn't happen in normal usage, but protects against false conclusions).
- Contract validation — did both providers pass the same contract assertions?
- Diff — which response fields changed between primary and shadow?
4. Persist
The comparison report is pushed to the dashboard alongside the primary run. You can view results on the Shadow page.
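The comparability check in stage 3 can be pictured as comparing what each provider actually saw. The checksum-based comparison below is a guess at the mechanism for illustration only; ReplayCI's real implementation may differ.

```shell
# Illustrative sketch only: a pair is comparable when the shadow provider
# received the same tool schema and system prompt as the primary.
# Comparing serialized contexts via cksum is an assumption, not ReplayCI's API.
primary_ctx='{"tools":["get_weather"],"system":"You are a helpful assistant."}'
shadow_ctx='{"tools":["get_weather"],"system":"You are a helpful assistant."}'

if [ "$(printf '%s' "$primary_ctx" | cksum)" = "$(printf '%s' "$shadow_ctx" | cksum)" ]; then
  verdict_ctx=COMPARABLE
else
  verdict_ctx=INCOMPARABLE
fi
echo "$verdict_ctx"
```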
CLI flags
| Flag | Description |
|---|---|
| --shadow-capture | Enable shadow mode (required) |
| --shadow-provider | Shadow provider name (openai, anthropic) |
| --shadow-model | Shadow model ID (e.g., claude-sonnet-4-6) |
All three flags are required together. Shadow mode is ignored when --provider recorded is used.
Environment variables
| Variable | Description |
|---|---|
| REPLAYCI_SHADOW_PROVIDER_KEY | API key for the shadow provider |
| REPLAYCI_PROVIDER_KEY | Used as a fallback when the shadow key is not set (for same-provider model comparisons) |
For cross-provider shadow (e.g., OpenAI primary, Anthropic shadow), always set REPLAYCI_SHADOW_PROVIDER_KEY separately.
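The fallback behavior described above amounts to standard parameter-expansion defaulting. The key values here are placeholders, not real credentials:

```shell
# Sketch of the documented fallback: use the dedicated shadow key when present,
# otherwise reuse the primary key (safe only for same-provider comparisons).
REPLAYCI_PROVIDER_KEY="sk-primary-example"   # placeholder value
unset REPLAYCI_SHADOW_PROVIDER_KEY           # simulate: no shadow key set

shadow_key="${REPLAYCI_SHADOW_PROVIDER_KEY:-$REPLAYCI_PROVIDER_KEY}"
echo "$shadow_key"
```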
Reading the output
```json
{
  "shadow_pipeline": {
    "executed": 5,
    "compared": 5,
    "persisted": true,
    "summary": {
      "comparable_n": 5,
      "incomparable_n": 0,
      "confidence_label": "HIGH_CONFIDENCE",
      "would_fail": 2,
      "differs": 4,
      "success_rate_primary": 1.0,
      "success_rate_shadow": 0.6,
      "success_rate_delta": -0.4
    }
  }
}
```
| Field | Meaning |
|---|---|
| comparable_n | How many requests could be fairly compared |
| incomparable_n | How many had context mismatches (should be 0) |
| confidence_label | HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, or LOW_CONFIDENCE based on incomparable rate |
| would_fail | How many contracts the shadow provider would have failed |
| differs | How many responses differed (even if both passed) |
| success_rate_delta | Shadow success rate minus primary. Negative means shadow is worse |
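The delta in the sample report above checks out as shadow minus primary:

```shell
# Recomputing success_rate_delta from the sample report:
# shadow (0.6) minus primary (1.0).
primary=1.0
shadow=0.6
delta=$(awk -v p="$primary" -v s="$shadow" 'BEGIN { printf "%.1f", s - p }')
echo "$delta"   # -0.4: the shadow provider passes fewer contracts
```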
Interpreting the results
- would_fail: 0, differs: 0 — shadow provider behaves identically. Safe to switch.
- would_fail: 0, differs: N — shadow passes all contracts but produces different responses. Check if the differences matter for your use case.
- would_fail: N — shadow would break N contracts. Review which contracts fail before switching. You may need provider-specific contracts.
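A CI job can gate on would_fail directly. The report path and single-line JSON shape below are assumptions for the sketch; a real pipeline would use a proper JSON parser such as jq.

```shell
# Hypothetical CI gate: block when the shadow provider would break any contract.
# shadow-report.json is an assumed output path, written here as sample data.
cat > shadow-report.json <<'EOF'
{"shadow_pipeline":{"summary":{"would_fail":2,"differs":4}}}
EOF

would_fail=$(sed -n 's/.*"would_fail":\([0-9][0-9]*\).*/\1/p' shadow-report.json)
if [ "${would_fail:-0}" -gt 0 ]; then
  gate=BLOCK
else
  gate=PASS
fi
echo "would_fail=$would_fail gate=$gate"
```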
Same-provider model comparison
Shadow mode also works for comparing models from the same provider:
```shell
# Compare gpt-4o-mini vs gpt-4o (same provider, different models)
export REPLAYCI_PROVIDER_KEY="sk-..."

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider openai --shadow-model gpt-4o
```
When both providers are the same, REPLAYCI_PROVIDER_KEY is used for both (no need for REPLAYCI_SHADOW_PROVIDER_KEY).
Safety guarantees
Shadow mode follows GR-09: shadow calls never affect production.
- Shadow failures are non-fatal — if the shadow provider errors, the primary run still succeeds
- Shadow results are stored in separate database tables (shadow_runs, shadow_comparisons)
- Shadow execution never blocks the primary provider run
- If persistence fails, the primary run response is still returned
Dashboard
Shadow results appear on the Shadow page in the dashboard.
Verdict system
Each primary–shadow pair is assigned a verdict that tells you what to do:
| Verdict | Meaning | Action |
|---|---|---|
| needs_review | Shadow failed contracts the primary passed | Investigate before switching |
| needs_better_evidence | Not enough comparable data for a conclusion | Run more shadow comparisons |
| changed_but_passing | Responses differ but both pass contracts | Review if differences matter for your use case |
| clean | Both providers behave identically | Safe to switch |
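The verdict-to-action mapping above can be automated in a pipeline step. The action names in this sketch are illustrative, not part of ReplayCI:

```shell
# Map a verdict from the comparison report to a (hypothetical) CI action.
verdict="changed_but_passing"

case "$verdict" in
  clean)                 action="safe-to-switch" ;;
  changed_but_passing)   action="review-diffs" ;;
  needs_better_evidence) action="collect-more-runs" ;;
  needs_review)          action="investigate" ;;
  *)                     action="unknown" ;;
esac
echo "$action"
```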
What you'll see
- Overview — all candidate pairs with verdict badges, window run counts, and quick filtering by verdict scope
- Decision guidance — per-pair summary explaining what the comparison means
- Confidence label — based on how many comparable runs exist
- Incomparable breakdown — why some runs couldn't be compared (schema mismatches, etc.)
- Step-by-step comparison — per-contract results for primary vs. shadow with field-level diffs
See Dashboard Guide for a full walkthrough of the Shadow page.