Troubleshooting — ReplayCI
Solutions to common issues you'll encounter when writing contracts, running tests, and setting up CI.
My test shows "NonReproducible" — what does that mean?
A NonReproducible (NR) result means ReplayCI can't meaningfully evaluate the test. It's not a pass or fail — it's an "I can't tell" result.
Common causes
Recording not found
NonReproducible recording_not_found
You're running with --provider recorded but there's no recording file for this fixture. Capture one first:
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
Tool definitions changed (SCHEMA_DRIFT)
NonReproducible context_variance
Your tool definitions have changed since the recording was captured. The tool_schema_hash in the recording no longer matches. Re-capture:
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
Messages changed (NON_DETERMINISTIC_INPUT)
Same as above but for the messages array. If you changed your system prompt or user message, the messages_hash won't match. Re-capture your recordings.
Missing API key (MISSING_PREREQUISITES)
NonReproducible context_variance
You're running against a live provider but REPLAYCI_PROVIDER_KEY isn't set:
export REPLAYCI_PROVIDER_KEY="sk-..."
Model non-determinism
NonReproducible model_nondeterminism
The model returned different outputs on identical inputs across multiple runs. This is normal for LLMs — set temperature: 0 in your fixture to minimize variance, but some models still exhibit non-determinism.
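For example, a fixture request pinned to zero temperature might look like the sketch below. The exact placement of the temperature field is an assumption modeled on the fixture shape shown elsewhere in this guide.

```json
{
  "request": {
    "messages": [
      { "role": "user", "content": "Check the weather in San Francisco." }
    ],
    "temperature": 0
  }
}
```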
NR in CI
NonReproducible results in the recorded gate (Lane A) are treated as failures — your CI will block. This is intentional: if a recorded test can't be evaluated, something is wrong with your fixtures.
To allowlist a known NR case, add an entry to nr-allowlist.json in your pack:
[
  {
    "contract_path": "contracts/flaky_test.yaml",
    "reason_code": "model_nondeterminism",
    "owner": "your-name",
    "expiry_date": "2026-06-01",
    "rationale": "Known model variance on this prompt, tracking in JIRA-123"
  }
]
Allowlist entries require all five fields and must have a future expiry date. Expired entries cause CI failures.
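The five-field rule and the expiry check can be sketched as a quick local lint. This is illustrative only, not ReplayCI's actual validator; validateAllowlist is a hypothetical helper.

```javascript
// Illustrative lint, not ReplayCI's actual validator: every entry needs all
// five fields, and the expiry date must be in the future.
const REQUIRED_FIELDS = ["contract_path", "reason_code", "owner", "expiry_date", "rationale"];

function validateAllowlist(entries, now = new Date()) {
  const problems = [];
  for (const entry of entries) {
    for (const field of REQUIRED_FIELDS) {
      if (!entry[field]) problems.push(`missing ${field}`);
    }
    if (entry.expiry_date && new Date(entry.expiry_date) <= now) {
      problems.push(`expired on ${entry.expiry_date}`);
    }
  }
  return problems; // empty array means the allowlist is valid
}
```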
My contract passes with OpenAI but fails with Anthropic
This is usually caused by differences in how providers handle tool calls. Here are the most common issues:
The model doesn't call the tool
Different models have different thresholds for when to use tools vs. respond with text. If your prompt is ambiguous, one model might call the tool while another responds with text.
Fix: Make your prompt more explicit about using tools:
{
  "role": "user",
  "content": "Use the get_weather tool to check the weather in San Francisco."
}
Tool choice "none" doesn't work with Anthropic
Anthropic doesn't support tool_choice: "none". If your fixture uses this setting, the Anthropic adapter omits it entirely, which may produce different behavior.
Fix: Avoid tool_choice: "none" in cross-provider contracts. If you need a text-only response test, use provider_modes: ["recorded"] to restrict it to recorded playback.
Argument format differences
Both providers normalize to arguments: string (JSON), but the exact formatting may differ. OpenAI might return {"location": "SF"} while Anthropic returns {"location":"SF"} (no spaces).
Fix: Use contains or regex instead of equals for argument assertions where exact formatting doesn't matter:
output_invariants:
  - path: "$.tool_calls[0].arguments"
    contains: "San Francisco"
Or use nested paths to check specific argument fields:
output_invariants:
  - path: "$.tool_calls[0].arguments.location"
    contains: "San Francisco"
ReplayCI auto-parses JSON string arguments during path traversal, so $.tool_calls[0].arguments.location works even though arguments is a JSON string in the normalized response. Use exists: true (not type: "object") when asserting on $.tool_calls[0].arguments directly.
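A minimal sketch of the traversal idea (not ReplayCI's actual implementation; getPath is a hypothetical helper): when a path segment lands on a string that parses as JSON, parse it and keep walking.

```javascript
// Illustrative sketch of JSON-string auto-parsing during path traversal;
// this is not ReplayCI's actual implementation.
function getPath(obj, segments) {
  let current = obj;
  for (const seg of segments) {
    // If we hit a JSON-encoded string (like normalized arguments),
    // parse it before descending further.
    if (typeof current === "string") {
      try {
        current = JSON.parse(current);
      } catch {
        return undefined;
      }
    }
    if (current == null) return undefined;
    current = current[seg];
  }
  return current;
}

const response = {
  tool_calls: [{ name: "get_weather", arguments: '{"location":"San Francisco"}' }],
};
// getPath(response, ["tool_calls", 0, "arguments", "location"]) yields
// "San Francisco", regardless of whether the provider serialized the JSON
// with or without spaces.
```

Comparing parsed values this way also sidesteps the OpenAI/Anthropic formatting differences described above.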
How do I debug a schema_violation?
A schema_violation means the model called the right tool but passed arguments that don't match the expected schema.
Find what went wrong
Run with --json and inspect the step details:
npx replayci --json | jq '.provider_run.steps[] | select(.status == "Fail")'
Look at the invariant_failures array for the specific assertion that failed:
{
  "path": "$.tool_calls[0].arguments.replicas",
  "rule": "type",
  "detail": "expected number, got string"
}
Common causes
- String instead of number — the model returns "5" instead of 5. Add a type assertion to catch this:
  - path: "$.tool_calls[0].arguments.replicas"
    type: "number"
- Missing required field — the model omits a required argument. Use exists:
  - path: "$.tool_calls[0].arguments.environment"
    exists: true
- Extra fields — the model includes unexpected arguments. This usually isn't a problem unless your downstream code rejects them.
What does "path_not_found" mean?
path_not_found: path does not exist in response
An assertion references a JSON path that doesn't exist in the model's response. This is different from a value mismatch — the path itself is missing.
Common causes
- Provider changed response shape — a model update altered the response structure
- Wrong path expression — typo in the JSON path (e.g., $.tool_call instead of $.tool_calls)
- Model didn't call a tool — you're asserting on $.tool_calls[0].name but tool_calls is empty
Fix
Check that your JSON path matches the actual response structure. Run with --json to see the raw response:
npx replayci --json | jq '.provider_run.steps[0]'
If the path is correct but the response shape changed, you may need to re-capture recordings or update your assertions.
What are fingerprints and why do they matter?
Every test result gets a fingerprint — an 8-character hash that uniquely identifies the outcome. Same input + same output = same fingerprint.
Why fingerprints matter
- Regression detection — if a fingerprint changes, the model's behavior changed
- Deduplication — identical failures produce identical fingerprints, so you can count unique issues
- Baseline comparison — ReplayCI tracks fingerprints over time to detect drift
When fingerprints change
A fingerprint changes when:
- The model's response changes (different tool call, different arguments, different text)
- Your tool definitions change (affects the boundary hash)
- Your messages change
A fingerprint does not change when:
- The model version is updated but produces identical output
- Latency varies
- Token usage varies
Response shape hash
In addition to the full fingerprint, each result includes a response shape hash — a structural fingerprint that only captures keys and types, not values. This lets you detect when a provider changes its response structure (e.g., adding a new field) independently from value changes.
My CI gate fails with "unknown_rate_threshold"
unknown_rate_threshold: unknown classification rate exceeds 20%
This means more than 20% of your failing tests have an "unknown" failure classification. The classifier couldn't determine why they failed.
What to do
- Run locally to see the failures:
  npx replayci --provider recorded --json | jq '.provider_run.steps[] | select(.status == "Fail")'
- Look at the failure details. "Unknown" usually means the error doesn't match any known pattern (auth, rate limit, timeout, schema violation, etc.)
- Common fixes:
  - If it's a new failure type, add a negative test case to capture it
  - If it's a transient issue, add an allowed_errors entry to your contract
  - If the model is genuinely misbehaving, update your contract assertions
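The gate's arithmetic is simply the share of failing steps classified as "unknown". A sketch (unknownRate is a hypothetical helper, not ReplayCI's code):

```javascript
// Sketch of the rate the gate checks; unknownRate is a hypothetical helper.
function unknownRate(failingSteps) {
  if (failingSteps.length === 0) return 0;
  const unknown = failingSteps.filter((s) => s.classification === "unknown").length;
  return unknown / failingSteps.length;
}
// A rate above 0.20 trips the unknown_rate_threshold gate.
```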
My recordings keep going stale
Recordings go stale when the boundary hashes don't match. This happens when you change:
- Tool definitions (parameters, descriptions, names)
- Messages (system prompt, user message content)
- Tool choice setting
Prevention
- Keep tool definitions stable. When you need to change them, update recordings in a dedicated commit.
- Use a separate CI step to re-capture recordings on a schedule:
# Weekly recording refresh
replayci-refresh:
  schedule:
    - cron: '0 6 * * 1'  # Monday 6 AM
  steps:
    - run: |
        npx replayci --provider openai --model gpt-4o-mini --capture-recordings
        git add packs/*/recordings/
        git diff --cached --quiet || git commit -m "Refresh recorded fixtures"
Quick re-capture
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
npx replayci --provider recorded # verify they work
git add packs/*/recordings/
git commit -m "Update recorded fixtures"
How do I test that a model does NOT call a tool?
Use the text_response pattern — a contract where the model should respond with text instead of calling a tool:
tool: "text_response_check"
assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_lte: 0
    - path: "$.content"
      exists: true
The tool field here is a descriptive label (not a literal tool name to match) because the contract uses length_lte: 0 to assert that no tools were called.
For the fixture, create a request where the model should naturally respond with text:
{
  "request": {
    "messages": [
      { "role": "system", "content": "You are a deployment assistant." },
      { "role": "user", "content": "Explain what a blue-green deployment is." }
    ],
    "tools": [
      { "name": "deploy_service", "description": "Deploy a service", "..." : "..." }
    ]
  },
  "response": {
    "success": true,
    "tool_calls": [],
    "content": "A blue-green deployment is a release strategy..."
  }
}
How do I validate multi-tool responses?
When a prompt triggers multiple tool calls, use expect_tools instead of a single tool name:
tool: "deployment_pipeline"
expect_tools:
- "deploy_service"
- "check_health"
By default, tools can appear in any order. To enforce order:
expect_tools:
- "deploy_service"
- "check_health"
tool_order: "strict"
For partial success (e.g., 2 of 3 tools is acceptable):
expect_tools:
- "deploy_service"
- "check_health"
- "notify_slack"
pass_threshold: 0.66
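The threshold math: 2 of 3 expected tools gives about 0.667, which clears a pass_threshold of 0.66. A sketch (meetsThreshold is a hypothetical helper, not ReplayCI's code):

```javascript
// Hypothetical helper showing the pass_threshold arithmetic.
function meetsThreshold(expectedTools, calledTools, threshold) {
  const matched = expectedTools.filter((t) => calledTools.includes(t)).length;
  return matched / expectedTools.length >= threshold;
}
// 2 of 3 expected tools: 2 / 3 is about 0.667, which meets 0.66.
```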
See Writing Tests — Multi-tool contracts for full details.
Failure classification reference
When a test fails, ReplayCI classifies the failure:
| Classification | What happened | Typical cause |
|---|---|---|
| tool_not_invoked | Model returned text instead of calling a tool | Ambiguous prompt, model chose not to use tools |
| malformed_arguments | Tool arguments aren't valid JSON | Model generated broken JSON |
| schema_violation | Arguments don't match the expected schema | Wrong types, missing fields |
| wrong_tool | Model called a different tool than expected | Ambiguous tool descriptions, model confusion |
| path_not_found | An assertion path doesn't exist in the response | Response shape changed, typo in path |
| unexpected_error | Provider returned an error | Auth failure, rate limit, timeout |
Each failure gets a fingerprint. If the same failure recurs across runs, it produces the same fingerprint — useful for tracking whether an issue is new or recurring.
SDK troubleshooting
observe() doesn't capture anything
No API key: observe() requires REPLAYCI_API_KEY or an explicit apiKey option. Without one, it returns a no-op handle silently.
export REPLAYCI_API_KEY="rci_live_..."
Disabled by env: Check that REPLAYCI_DISABLE is not set to true.
Unsupported client: The SDK auto-detects OpenAI and Anthropic clients. If detection fails, pass a diagnostics callback to see why:
observe(openai, {
  apiKey: "...",
  diagnostics: (event) => console.log(event),
});
Double wrap: Calling observe() twice on the same client is a no-op. The diagnostics callback will emit { type: "double_wrap" }.
Circuit breaker triggered
If the capture API fails 5 times in a row, the SDK auto-disables for 10 minutes. Check:
- Is REPLAYCI_API_KEY valid? (Try curl -H "Authorization: Bearer rci_live_..." https://app.replayci.com/api/v1/captures)
- Is the endpoint reachable? (Check REPLAYCI_API_URL if set)
- Is your email verified? (Unverified accounts get 403 on /api/v1/)
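The breaker behavior described above can be sketched like this (illustrative only, not the SDK's actual implementation):

```javascript
// Illustrative circuit-breaker sketch, not the SDK's actual code:
// 5 consecutive capture failures disable capture for 10 minutes.
class CaptureBreaker {
  constructor(maxFailures = 5, cooldownMs = 10 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.disabledUntil = 0;
  }
  allowed(now = Date.now()) {
    return now >= this.disabledUntil;
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      // Trip the breaker and start the cooldown window.
      this.disabledUntil = now + this.cooldownMs;
      this.failures = 0;
    }
  }
  recordSuccess() {
    this.failures = 0; // any success resets the count
  }
}
```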
validate() returns unexpected failures
Provider format mismatch: validate() auto-detects OpenAI and Anthropic response formats. If you're passing a pre-processed response, make sure it has a tool_calls array:
validate({
  tool_calls: [{ id: "1", name: "get_weather", arguments: '{"location":"SF"}' }],
}, { contracts });
Unmatched tools: By default, tool calls with no matching contract cause a failure. Use unmatchedPolicy: "allow" if your response includes tools not covered by your contracts:
validate(response, { contracts, unmatchedPolicy: "allow" });
Captures show in dashboard but no contracts generated
Contracts need at least one tool call with arguments to generate assertions. If captures only have metadata-level data (tool names only), increase the capture level:
observe(openai, {
  captureLevel: "redacted", // default — includes arguments
});
Model resolution errors
"Model not found in registry; using defaults"
⚠ Model "my-custom-model" not found in registry; using openai defaults
This is a warning, not an error. Your model ID didn't match any registered family pattern, so ReplayCI uses the provider's default request profile. The run proceeds normally.
This happens when:
- You're using a new model not yet in the registry (e.g., a fine-tuned model or a newly released variant)
- There's a typo in the model name
Fix: Check npx replayci models --provider openai to see registered families. Any model matching a family prefix works automatically. If you want unregistered models to be a hard error, use --strict-model.
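Prefix matching presumably works along these lines; this is an assumption based on the note that any model matching a family prefix works automatically, and both the registry entries and the resolveFamily helper here are hypothetical:

```javascript
// Hypothetical registry entries, for illustration only; the real registry
// lives in artifacts/model-registry/registry.json.
const FAMILIES = {
  "gpt-4": "GPT-4",
  "gpt-5": "GPT-5",
  "o1": "O-series",
};

// Assumed mechanism: a model resolves to the longest registered family
// prefix it starts with; no match means provider defaults are used.
function resolveFamily(modelId) {
  const match = Object.keys(FAMILIES)
    .filter((prefix) => modelId.startsWith(prefix))
    .sort((x, y) => y.length - x.length)[0];
  return match ? FAMILIES[match] : null;
}
// resolveFamily("gpt-4o-mini") matches the "gpt-4" prefix;
// resolveFamily("my-custom-model") matches nothing and falls back to defaults.
```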
"No matching family" with --strict-model
Error: Model "my-custom-model" has no matching family for provider openai
Known families: GPT-5, GPT-4, O-series
You used --strict-model and the model doesn't match any family pattern. This is intentional — strict mode prevents accidental use of unregistered models.
Fix:
- Check for typos: npx replayci --provider openai --model 4o-mini --resolve to verify resolution
- If this is a new model, remove --strict-model until the registry is updated
- Use npx replayci models --check to see if the registry needs updating
"Access denied" with --probe
npx replayci models --provider openai --probe
✗ gpt-5.2 denied
✓ gpt-4o-mini accessible
The --probe command checks which registered models your API key can access. "Denied" means the provider's model listing API didn't include that model — your account may not have access.
Fix:
- Check your API plan or billing tier with your provider
- Some models require enrollment or waitlist access
- Verify your API key is correct:
echo $REPLAYCI_PROVIDER_KEY
Registry load failure
⚠ Failed to load model registry; model resolution disabled
The registry file (artifacts/model-registry/registry.json) couldn't be loaded. Resolution degrades to raw passthrough — model IDs are sent directly to the provider without alias expansion or family matching.
Fix:
- This is non-fatal. Runs still work; you just lose alias and profile features.
- If you installed via npm, ensure artifacts/ is included in the package. Run npm ls @replayci/cli to check the installed version.
Still stuck?
- Check the CLI Reference for all available flags and options
- Review your contract YAML against the Writing Tests format
- Run with --json to see full structured output for debugging
- Check the Dashboard Guide for understanding dashboard features
- File an issue at github.com/sparshbaluni07/replayci/issues