
Shadow Mode — ReplayCI

Shadow mode lets you compare two LLM providers side-by-side without affecting production. Run your primary provider as normal, and ReplayCI silently replays the same requests against a shadow provider, then reports what would break if you switched.


When to use shadow mode

  • Evaluating a provider switch — will Anthropic pass the same contracts as OpenAI?
  • Testing a model upgrade — does gpt-4o behave differently from gpt-4o-mini?
  • Cost optimization — can a cheaper model pass the same contracts?
  • Regression detection — does a new model version change tool-calling behavior?

Quick start

# Primary: OpenAI, Shadow: Anthropic
export REPLAYCI_PROVIDER_KEY="sk-..." # Primary provider key
export REPLAYCI_SHADOW_PROVIDER_KEY="sk-ant-..." # Shadow provider key
export REPLAYCI_API_KEY="rci_live_..." # Push results to dashboard

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6

The CLI runs all contracts against OpenAI (primary), captures each request, replays them against Anthropic (shadow), then compares the results.

Cost

Shadow mode makes real API calls to both providers. Each shadow run costs the same as a normal run against that provider. For cost-sensitive environments, run shadow comparisons on a schedule or manually rather than on every PR.


How it works

Shadow mode runs in four stages:

1. Capture

During primary provider execution, each tool-call request is captured — the messages, tool definitions, and tool choice mode. Requests are redacted through SecurityGate before capture.
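To make the three captured pieces concrete, a captured request might look like the following. The shape and field names here are hypothetical, for illustration only; they are not ReplayCI's actual internal schema.

```python
# Hypothetical shape of a captured tool-call request (illustrative only,
# not ReplayCI's real schema): messages, tool definitions, tool choice mode.
captured_request = {
    "messages": [
        {"role": "system", "content": "You are a billing assistant."},
        {"role": "user", "content": "Refund order 123"},
    ],
    "tools": [
        {
            "name": "issue_refund",
            "description": "Issue a refund for an order",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }
    ],
    "tool_choice": "auto",  # the tool choice mode the primary provider saw
}
```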

2. Execute

Each captured request is replayed against the shadow provider. The shadow provider receives the exact same input the primary provider saw.

3. Compare

For each request, ReplayCI checks:

  • Comparability — did the context match? If the tool schema or system prompt differed between providers, the pair is marked INCOMPARABLE (this shouldn't happen in normal usage, but protects against false conclusions).
  • Contract validation — did both providers pass the same contract assertions?
  • Diff — which response fields changed between primary and shadow?
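The three checks above can be sketched as follows. Function and field names are assumptions made for illustration, not ReplayCI's internals, and the stub contract stands in for real contract assertions.

```python
# Illustrative sketch of the three comparison checks (not ReplayCI internals).

def passes_contract(response):
    # Stub contract: a real contract asserts on tool names, arguments, etc.
    return response.get("tool_name") == "issue_refund"

def compare_pair(primary, shadow):
    # 1. Comparability: both providers must have seen the same context.
    if primary["tools"] != shadow["tools"] or primary["system"] != shadow["system"]:
        return {"status": "INCOMPARABLE"}
    # 2. Contract validation: run the same assertions on both responses.
    primary_passed = passes_contract(primary["response"])
    shadow_passed = passes_contract(shadow["response"])
    # 3. Diff: which response fields changed between primary and shadow?
    changed = sorted(
        k for k in set(primary["response"]) | set(shadow["response"])
        if primary["response"].get(k) != shadow["response"].get(k)
    )
    return {
        "status": "COMPARABLE",
        "would_fail": primary_passed and not shadow_passed,
        "changed_fields": changed,
    }
```

A pair where the shadow provider picks a different tool would come back COMPARABLE with would_fail set, while a pair whose tool schemas differ would come back INCOMPARABLE without any contract check running.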

4. Persist

The comparison report is pushed to the dashboard alongside the primary run. You can view results on the Shadow page.


CLI flags

  • --shadow-capture: Enable shadow mode (required)
  • --shadow-provider: Shadow provider name (openai, anthropic)
  • --shadow-model: Shadow model ID (e.g., claude-sonnet-4-6)

All three flags are required together. Shadow mode is ignored when --provider recorded is used.

Environment variables

  • REPLAYCI_SHADOW_PROVIDER_KEY: API key for the shadow provider
  • REPLAYCI_PROVIDER_KEY: Used as a fallback when the shadow key is not set (for same-provider model comparisons)

For cross-provider shadow (e.g., OpenAI primary, Anthropic shadow), always set REPLAYCI_SHADOW_PROVIDER_KEY separately.


Reading the output

{
  "shadow_pipeline": {
    "executed": 5,
    "compared": 5,
    "persisted": true,
    "summary": {
      "comparable_n": 5,
      "incomparable_n": 0,
      "confidence_label": "HIGH_CONFIDENCE",
      "would_fail": 2,
      "differs": 4,
      "success_rate_primary": 1.0,
      "success_rate_shadow": 0.6,
      "success_rate_delta": -0.4
    }
  }
}
  • comparable_n: How many requests could be fairly compared
  • incomparable_n: How many had context mismatches (should be 0)
  • confidence_label: HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, or LOW_CONFIDENCE, based on the incomparable rate
  • would_fail: How many contracts the shadow provider would have failed
  • differs: How many responses differed (even if both passed)
  • success_rate_delta: Shadow success rate minus primary. Negative means shadow is worse
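The summary fields are simple aggregates over the comparable pairs. The sketch below reproduces the numbers in the example output above; the confidence thresholds are assumptions for illustration, since ReplayCI's actual cutoffs are not documented here.

```python
def summarize(pairs):
    """Roll per-pair comparison results into a summary (illustrative sketch)."""
    comparable = [p for p in pairs if p["status"] == "COMPARABLE"]
    incomparable_n = len(pairs) - len(comparable)
    n = len(comparable)
    primary_pass = sum(p["primary_passed"] for p in comparable)
    shadow_pass = sum(p["shadow_passed"] for p in comparable)
    incomp_rate = incomparable_n / len(pairs) if pairs else 1.0
    # Hypothetical thresholds; the real cutoffs may differ.
    if incomp_rate == 0:
        label = "HIGH_CONFIDENCE"
    elif incomp_rate < 0.5:
        label = "MEDIUM_CONFIDENCE"
    else:
        label = "LOW_CONFIDENCE"
    return {
        "comparable_n": n,
        "incomparable_n": incomparable_n,
        "confidence_label": label,
        # Contracts the shadow would have failed that the primary passed.
        "would_fail": sum(
            p["primary_passed"] and not p["shadow_passed"] for p in comparable
        ),
        "differs": sum(p["differs"] for p in comparable),
        "success_rate_primary": primary_pass / n if n else 0.0,
        "success_rate_shadow": shadow_pass / n if n else 0.0,
        # Negative delta means the shadow provider is worse.
        "success_rate_delta": (shadow_pass - primary_pass) / n if n else 0.0,
    }
```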

Interpreting the results

  • would_fail: 0, differs: 0 — shadow provider behaves identically. Safe to switch.
  • would_fail: 0, differs: N — shadow passes all contracts but produces different responses. Check if the differences matter for your use case.
  • would_fail: N — shadow would break N contracts. Review which contracts fail before switching. You may need provider-specific contracts.

Same-provider model comparison

Shadow mode also works for comparing models from the same provider:

# Compare gpt-4o-mini vs gpt-4o (same provider, different models)
export REPLAYCI_PROVIDER_KEY="sk-..."

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider openai --shadow-model gpt-4o

When the primary and shadow providers are the same, REPLAYCI_PROVIDER_KEY is used for both, so REPLAYCI_SHADOW_PROVIDER_KEY is not needed.


Safety guarantees

Shadow mode follows GR-09: shadow calls never affect production.

  • Shadow failures are non-fatal — if the shadow provider errors, the primary run still succeeds
  • Shadow results are stored in separate database tables (shadow_runs, shadow_comparisons)
  • Shadow execution never blocks the primary provider run
  • If persistence fails, the primary run response is still returned

Dashboard

Shadow results appear on the Shadow page in the dashboard.

Verdict system

Each primary–shadow pair is assigned a verdict that tells you what to do:

  • needs_review: Shadow failed contracts the primary passed. Action: investigate before switching.
  • needs_better_evidence: Not enough comparable data for a conclusion. Action: run more shadow comparisons.
  • changed_but_passing: Responses differ but both pass contracts. Action: review whether the differences matter for your use case.
  • clean: Both providers behave identically. Action: safe to switch.
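One plausible reading of the verdict rules as code, as a sketch only: the minimum-evidence threshold below is an assumption, and the shipped logic may weigh additional signals such as the confidence label.

```python
def verdict(summary, min_comparable=3):
    """Map a comparison summary to a verdict (illustrative sketch;
    min_comparable is an assumed threshold, not ReplayCI's value)."""
    if summary["comparable_n"] < min_comparable:
        return "needs_better_evidence"   # not enough comparable data
    if summary["would_fail"] > 0:
        return "needs_review"            # shadow broke contracts primary passed
    if summary["differs"] > 0:
        return "changed_but_passing"     # different responses, same pass/fail
    return "clean"                       # identical behavior
```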

What you'll see

  • Overview — all candidate pairs with verdict badges, window run counts, and quick filtering by verdict scope
  • Decision guidance — per-pair summary explaining what the comparison means
  • Confidence label — based on how many comparable runs exist
  • Incomparable breakdown — why some runs couldn't be compared (schema mismatches, etc.)
  • Step-by-step comparison — per-contract results for primary vs. shadow with field-level diffs

See Dashboard Guide for a full walkthrough of the Shadow page.