
Shadow Mode — ReplayCI

Shadow mode lets you compare two LLM providers side-by-side without affecting production. Run your primary provider as normal, and ReplayCI silently replays the same requests against a shadow provider, then reports what would break if you switched.


When to use shadow mode

  • Evaluating a provider switch — will Anthropic pass the same contracts as OpenAI?
  • Testing a model upgrade — does gpt-4o behave differently from gpt-4o-mini?
  • Cost optimization — can a cheaper model pass the same contracts?
  • Regression detection — does a new model version change tool-calling behavior?

Quick start

# Primary: OpenAI, Shadow: Anthropic
export REPLAYCI_PROVIDER_KEY="sk-..." # Primary provider key
export REPLAYCI_SHADOW_PROVIDER_KEY="sk-ant-..." # Shadow provider key
export REPLAYCI_API_KEY="rci_live_..." # Push results to dashboard

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6

The CLI runs all contracts against OpenAI (primary), captures each request, replays them against Anthropic (shadow), then compares the results.

Cost

Shadow mode makes real API calls to both providers. Each shadow run costs the same as a normal run against that provider. For cost-sensitive environments, run shadow comparisons on a schedule or manually rather than on every PR.


How it works

Shadow mode runs in four stages:

1. Capture

During primary provider execution, each tool-call request is captured — the messages, tool definitions, and tool choice mode. Requests are redacted through SecurityGate before capture.
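To make the three captured pieces concrete, a captured request might look like the following. The shape and field names here are hypothetical, for illustration only; they are not ReplayCI's actual internal schema.

```python
# Hypothetical shape of a captured tool-call request (illustrative only,
# not ReplayCI's real schema): messages, tool definitions, tool choice mode.
captured_request = {
    "messages": [
        {"role": "system", "content": "You are a billing assistant."},
        {"role": "user", "content": "Refund order 123"},
    ],
    "tools": [
        {
            "name": "issue_refund",
            "description": "Issue a refund for an order",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }
    ],
    "tool_choice": "auto",  # the tool choice mode the primary provider saw
}
```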

2. Execute

Each captured request is replayed against the shadow provider. The shadow provider receives the exact same input the primary provider saw.

3. Compare

For each request, ReplayCI checks:

  • Comparability — did the context match? If the tool schema or system prompt differed between providers, the pair is marked INCOMPARABLE (this shouldn't happen in normal usage, but protects against false conclusions).
  • Contract validation — did both providers pass the same contract assertions?
  • Diff — which response fields changed between primary and shadow?
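The three checks above can be sketched as follows. Function and field names are assumptions made for illustration, not ReplayCI's internals, and the stub contract stands in for real contract assertions.

```python
# Illustrative sketch of the three comparison checks (not ReplayCI internals).

def passes_contract(response):
    # Stub contract: a real contract asserts on tool names, arguments, etc.
    return response.get("tool_name") == "issue_refund"

def compare_pair(primary, shadow):
    # 1. Comparability: both providers must have seen the same context.
    if primary["tools"] != shadow["tools"] or primary["system"] != shadow["system"]:
        return {"status": "INCOMPARABLE"}
    # 2. Contract validation: run the same assertions on both responses.
    primary_passed = passes_contract(primary["response"])
    shadow_passed = passes_contract(shadow["response"])
    # 3. Diff: which response fields changed between primary and shadow?
    changed = sorted(
        k for k in set(primary["response"]) | set(shadow["response"])
        if primary["response"].get(k) != shadow["response"].get(k)
    )
    return {
        "status": "COMPARABLE",
        "would_fail": primary_passed and not shadow_passed,
        "changed_fields": changed,
    }
```

A pair where the shadow provider picks a different tool would come back COMPARABLE with would_fail set, while a pair whose tool schemas differ would come back INCOMPARABLE without any contract check running.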

4. Persist

The comparison report is pushed to the dashboard alongside the primary run. You can view results on the Shadow page.


CLI flags

  • --shadow-capture: Enable shadow mode (required)
  • --shadow-provider: Shadow provider name (openai, anthropic)
  • --shadow-model: Shadow model ID (e.g., claude-sonnet-4-6)

All three flags are required together. Shadow mode is ignored when --provider recorded is used.

Environment variables

  • REPLAYCI_SHADOW_PROVIDER_KEY: API key for the shadow provider
  • REPLAYCI_PROVIDER_KEY: Used as a fallback when the shadow key is not set (for same-provider model comparisons)

For cross-provider shadow (e.g., OpenAI primary, Anthropic shadow), always set REPLAYCI_SHADOW_PROVIDER_KEY separately.


Reading the output

{
  "shadow_pipeline": {
    "executed": 5,
    "compared": 5,
    "persisted": true,
    "summary": {
      "comparable_n": 5,
      "incomparable_n": 0,
      "confidence_label": "HIGH_CONFIDENCE",
      "would_fail": 2,
      "differs": 4,
      "success_rate_primary": 1.0,
      "success_rate_shadow": 0.6,
      "success_rate_delta": -0.4
    }
  }
}
  • comparable_n: How many requests could be fairly compared
  • incomparable_n: How many had context mismatches (should be 0)
  • confidence_label: HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, or LOW_CONFIDENCE, based on the incomparable rate
  • would_fail: How many contracts the shadow provider would have failed
  • differs: How many responses differed (even if both passed)
  • success_rate_delta: Shadow success rate minus primary. Negative means shadow is worse
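The summary fields are simple aggregates over the comparable pairs. The sketch below reproduces the numbers in the example output above; the confidence thresholds are assumptions for illustration, since ReplayCI's actual cutoffs are not documented here.

```python
def summarize(pairs):
    """Roll per-pair comparison results into a summary (illustrative sketch)."""
    comparable = [p for p in pairs if p["status"] == "COMPARABLE"]
    incomparable_n = len(pairs) - len(comparable)
    n = len(comparable)
    primary_pass = sum(p["primary_passed"] for p in comparable)
    shadow_pass = sum(p["shadow_passed"] for p in comparable)
    incomp_rate = incomparable_n / len(pairs) if pairs else 1.0
    # Hypothetical thresholds; the real cutoffs may differ.
    if incomp_rate == 0:
        label = "HIGH_CONFIDENCE"
    elif incomp_rate < 0.5:
        label = "MEDIUM_CONFIDENCE"
    else:
        label = "LOW_CONFIDENCE"
    return {
        "comparable_n": n,
        "incomparable_n": incomparable_n,
        "confidence_label": label,
        # Contracts the shadow would have failed that the primary passed.
        "would_fail": sum(
            p["primary_passed"] and not p["shadow_passed"] for p in comparable
        ),
        "differs": sum(p["differs"] for p in comparable),
        "success_rate_primary": primary_pass / n if n else 0.0,
        "success_rate_shadow": shadow_pass / n if n else 0.0,
        # Negative delta means the shadow provider is worse.
        "success_rate_delta": (shadow_pass - primary_pass) / n if n else 0.0,
    }
```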

Interpreting the results

  • would_fail: 0, differs: 0 — shadow provider behaves identically. Safe to switch.
  • would_fail: 0, differs: N — shadow passes all contracts but produces different responses. Check if the differences matter for your use case.
  • would_fail: N — shadow would break N contracts. Review which contracts fail before switching. You may need provider-specific contracts.

Same-provider model comparison

Shadow mode also works for comparing models from the same provider:

# Compare gpt-4o-mini vs gpt-4o (same provider, different models)
export REPLAYCI_PROVIDER_KEY="sk-..."

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider openai --shadow-model gpt-4o

When the primary and shadow providers are the same, REPLAYCI_PROVIDER_KEY is used for both, so REPLAYCI_SHADOW_PROVIDER_KEY is not needed.


Safety guarantees

Shadow mode follows GR-09: shadow calls never affect production.

  • Shadow failures are non-fatal — if the shadow provider errors, the primary run still succeeds
  • Shadow results are stored in separate database tables (shadow_runs, shadow_comparisons)
  • Shadow execution never blocks the primary provider run
  • If persistence fails, the primary run response is still returned

Dashboard

Shadow results appear on the Shadow page in the dashboard.

Verdict system

Each primary–shadow pair is assigned a verdict that tells you what to do:

  • needs_review: Shadow failed contracts the primary passed. Action: investigate before switching.
  • needs_better_evidence: Not enough comparable data for a conclusion. Action: run more shadow comparisons.
  • changed_but_passing: Responses differ but both pass contracts. Action: review whether the differences matter for your use case.
  • clean: Both providers behave identically. Action: safe to switch.
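One plausible reading of the verdict rules as code, as a sketch only: the minimum-evidence threshold below is an assumption, and the shipped logic may weigh additional signals such as the confidence label.

```python
def verdict(summary, min_comparable=3):
    """Map a comparison summary to a verdict (illustrative sketch;
    min_comparable is an assumed threshold, not ReplayCI's value)."""
    if summary["comparable_n"] < min_comparable:
        return "needs_better_evidence"   # not enough comparable data
    if summary["would_fail"] > 0:
        return "needs_review"            # shadow broke contracts primary passed
    if summary["differs"] > 0:
        return "changed_but_passing"     # different responses, same pass/fail
    return "clean"                       # identical behavior
```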

What you'll see

  • Overview — all candidate pairs with verdict badges, window run counts, and quick filtering by verdict scope
  • Decision guidance — per-pair summary explaining what the comparison means
  • Confidence label — based on how many comparable runs exist
  • Incomparable breakdown — why some runs couldn't be compared (schema mismatches, etc.)
  • Step-by-step comparison — per-contract results for primary vs. shadow with field-level diffs

See Dashboard Guide for a full walkthrough of the Shadow page.