Troubleshooting — ReplayCI
Solutions to common issues you'll encounter when writing contracts, running tests, and setting up CI.
My test shows "NonReproducible" — what does that mean?
A NonReproducible (NR) result means ReplayCI can't meaningfully evaluate the test. It's not a pass or fail — it's an "I can't tell" result.
Common causes
Recording not found
NonReproducible recording_not_found
You're running with --provider recorded but there's no recording file for this fixture. Capture one first:
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
Tool definitions changed (SCHEMA_DRIFT)
NonReproducible context_variance
Your tool definitions have changed since the recording was captured. The tool_schema_hash in the recording no longer matches. Re-capture:
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
Messages changed (NON_DETERMINISTIC_INPUT)
Same as above but for the messages array. If you changed your system prompt or user message, the messages_hash won't match. Re-capture your recordings.
Missing API key (MISSING_PREREQUISITES)
NonReproducible context_variance
You're running against a live provider but REPLAYCI_PROVIDER_KEY isn't set:
export REPLAYCI_PROVIDER_KEY="sk-..."
Model non-determinism
NonReproducible model_nondeterminism
The model returned different outputs on identical inputs across multiple runs. This is normal for LLMs — set temperature: 0 in your fixture to minimize variance, but some models still exhibit non-determinism.
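For example, a fixture request pinned to zero temperature might look like the sketch below. The exact placement of the temperature field is an assumption modeled on the fixture shape shown elsewhere in this guide.

```json
{
  "request": {
    "messages": [
      { "role": "user", "content": "Check the weather in San Francisco." }
    ],
    "temperature": 0
  }
}
```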
NR in CI
NonReproducible results in the recorded gate (Lane A) are treated as failures — your CI will block. This is intentional: if a recorded test can't be evaluated, something is wrong with your fixtures.
To allowlist a known NR case, add an entry to nr-allowlist.json in your pack:
[
  {
    "contract_path": "contracts/flaky_test.yaml",
    "reason_code": "model_nondeterminism",
    "owner": "your-name",
    "expiry_date": "2026-06-01",
    "rationale": "Known model variance on this prompt, tracking in JIRA-123"
  }
]
Allowlist entries require all five fields and must have a future expiry date. Expired entries cause CI failures.
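The five-field rule and the expiry check can be sketched as a quick local lint. This is illustrative only, not ReplayCI's actual validator; validateAllowlist is a hypothetical helper.

```javascript
// Illustrative lint, not ReplayCI's actual validator: every entry needs all
// five fields, and the expiry date must be in the future.
const REQUIRED_FIELDS = ["contract_path", "reason_code", "owner", "expiry_date", "rationale"];

function validateAllowlist(entries, now = new Date()) {
  const problems = [];
  for (const entry of entries) {
    for (const field of REQUIRED_FIELDS) {
      if (!entry[field]) problems.push(`missing ${field}`);
    }
    if (entry.expiry_date && new Date(entry.expiry_date) <= now) {
      problems.push(`expired on ${entry.expiry_date}`);
    }
  }
  return problems; // empty array means the allowlist is valid
}
```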
My contract passes with OpenAI but fails with Anthropic
This is usually caused by differences in how providers handle tool calls. Here are the most common issues:
The model doesn't call the tool
Different models have different thresholds for when to use tools vs. respond with text. If your prompt is ambiguous, one model might call the tool while another responds with text.
Fix: Make your prompt more explicit about using tools:
{
  "role": "user",
  "content": "Use the get_weather tool to check the weather in San Francisco."
}
Tool choice "none" doesn't work with Anthropic
Anthropic doesn't support tool_choice: "none". If your fixture uses this setting, the Anthropic adapter omits it entirely, which may produce different behavior.
Fix: Avoid tool_choice: "none" in cross-provider contracts. If you need a text-only response test, use provider_modes: ["recorded"] to restrict it to recorded playback.
Argument format differences
Both providers normalize to arguments: string (JSON), but the exact formatting may differ. OpenAI might return {"location": "SF"} while Anthropic returns {"location":"SF"} (no spaces).
Fix: Use contains or regex instead of equals for argument assertions where exact formatting doesn't matter:
output_invariants:
  - path: "$.tool_calls[0].arguments"
    contains: "San Francisco"
Or use nested paths to check specific argument fields:
output_invariants:
  - path: "$.tool_calls[0].arguments.location"
    contains: "San Francisco"
ReplayCI auto-parses JSON string arguments during path traversal, so $.tool_calls[0].arguments.location works even though arguments is a JSON string in the normalized response. Use exists: true (not type: "object") when asserting on $.tool_calls[0].arguments directly.
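A minimal sketch of the traversal idea (not ReplayCI's actual implementation; getPath is a hypothetical helper): when a path segment lands on a string that parses as JSON, parse it and keep walking.

```javascript
// Illustrative sketch of JSON-string auto-parsing during path traversal;
// this is not ReplayCI's actual implementation.
function getPath(obj, segments) {
  let current = obj;
  for (const seg of segments) {
    // If we hit a JSON-encoded string (like normalized arguments),
    // parse it before descending further.
    if (typeof current === "string") {
      try {
        current = JSON.parse(current);
      } catch {
        return undefined;
      }
    }
    if (current == null) return undefined;
    current = current[seg];
  }
  return current;
}

const response = {
  tool_calls: [{ name: "get_weather", arguments: '{"location":"San Francisco"}' }],
};
// getPath(response, ["tool_calls", 0, "arguments", "location"]) yields
// "San Francisco", regardless of whether the provider serialized the JSON
// with or without spaces.
```

Comparing parsed values this way also sidesteps the OpenAI/Anthropic formatting differences described above.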
How do I debug a schema_violation?
A schema_violation means the model called the right tool but passed arguments that don't match the expected schema.
Find what went wrong
Run with --json and inspect the step details:
npx replayci --json | jq '.provider_run.steps[] | select(.status == "Fail")'
Look at the invariant_failures array for the specific assertion that failed:
{
  "path": "$.tool_calls[0].arguments.replicas",
  "rule": "type",
  "detail": "expected number, got string"
}
Common causes
- String instead of number — the model returns "5" instead of 5. Add a type assertion to catch this:
  - path: "$.tool_calls[0].arguments.replicas"
    type: "number"
- Missing required field — the model omits a required argument. Use exists:
  - path: "$.tool_calls[0].arguments.environment"
    exists: true
- Extra fields — the model includes unexpected arguments. This usually isn't a problem unless your downstream code rejects them.
What does "path_not_found" mean?
path_not_found: path does not exist in response
An assertion references a JSON path that doesn't exist in the model's response. This is different from a value mismatch — the path itself is missing.
Common causes
- Provider changed response shape — a model update altered the response structure
- Wrong path expression — typo in the JSON path (e.g., $.tool_call instead of $.tool_calls)
- Model didn't call a tool — you're asserting on $.tool_calls[0].name but tool_calls is empty
Fix
Check that your JSON path matches the actual response structure. Run with --json to see the raw response:
npx replayci --json | jq '.provider_run.steps[0]'
If the path is correct but the response shape changed, you may need to re-capture recordings or update your assertions.
What are fingerprints and why do they matter?
Every test result gets a fingerprint — an 8-character hash that uniquely identifies the outcome. Same input + same output = same fingerprint.
Why fingerprints matter
- Regression detection — if a fingerprint changes, the model's behavior changed
- Deduplication — identical failures produce identical fingerprints, so you can count unique issues
- Baseline comparison — ReplayCI tracks fingerprints over time to detect drift
When fingerprints change
A fingerprint changes when:
- The model's response changes (different tool call, different arguments, different text)
- Your tool definitions change (affects the boundary hash)
- Your messages change
A fingerprint does not change when:
- The model version is updated but produces identical output
- Latency varies
- Token usage varies
Response shape hash
In addition to the full fingerprint, each result includes a response shape hash — a structural fingerprint that only captures keys and types, not values. This lets you detect when a provider changes its response structure (e.g., adding a new field) independently from value changes.
My CI gate fails with "unknown_rate_threshold"
unknown_rate_threshold: unknown classification rate exceeds 20%
This means more than 20% of your failing tests have an "unknown" failure classification. The classifier couldn't determine why they failed.
What to do
- Run locally to see the failures:
  npx replayci --provider recorded --json | jq '.provider_run.steps[] | select(.status == "Fail")'
- Look at the failure details. "Unknown" usually means the error doesn't match any known pattern (auth, rate limit, timeout, schema violation, etc.)
- Common fixes:
  - If it's a new failure type, add a negative test case to capture it
  - If it's a transient issue, add an allowed_errors entry to your contract
  - If the model is genuinely misbehaving, update your contract assertions
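The gate's arithmetic is simply the share of failing steps classified as "unknown". A sketch (unknownRate is a hypothetical helper, not ReplayCI's code):

```javascript
// Sketch of the rate the gate checks; unknownRate is a hypothetical helper.
function unknownRate(failingSteps) {
  if (failingSteps.length === 0) return 0;
  const unknown = failingSteps.filter((s) => s.classification === "unknown").length;
  return unknown / failingSteps.length;
}
// A rate above 0.20 trips the unknown_rate_threshold gate.
```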
My recordings keep going stale
Recordings go stale when the boundary hashes don't match. This happens when you change:
- Tool definitions (parameters, descriptions, names)
- Messages (system prompt, user message content)
- Tool choice setting
Prevention
- Keep tool definitions stable. When you need to change them, update recordings in a dedicated commit.
- Use a separate CI step to re-capture recordings on a schedule:
# Weekly recording refresh
replayci-refresh:
  schedule:
    - cron: '0 6 * * 1'  # Monday 6 AM
  steps:
    - run: |
        npx replayci --provider openai --model gpt-4o-mini --capture-recordings
        git add packs/*/recordings/
        git diff --cached --quiet || git commit -m "Refresh recorded fixtures"
Quick re-capture
npx replayci --provider openai --model gpt-4o-mini --capture-recordings
npx replayci --provider recorded # verify they work
git add packs/*/recordings/
git commit -m "Update recorded fixtures"
How do I test that a model does NOT call a tool?
Use the text_response pattern — a contract where the model should respond with text instead of calling a tool:
tool: "text_response_check"
assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_lte: 0
    - path: "$.content"
      exists: true
The tool field here is a descriptive label (not a literal tool name to match) because the contract uses length_lte: 0 to assert that no tools were called.
For the fixture, create a request where the model should naturally respond with text:
{
  "request": {
    "messages": [
      { "role": "system", "content": "You are a deployment assistant." },
      { "role": "user", "content": "Explain what a blue-green deployment is." }
    ],
    "tools": [
      { "name": "deploy_service", "description": "Deploy a service", "..." : "..." }
    ]
  },
  "response": {
    "success": true,
    "tool_calls": [],
    "content": "A blue-green deployment is a release strategy..."
  }
}
How do I validate multi-tool responses?
When a prompt triggers multiple tool calls, use expect_tools instead of a single tool name:
tool: "deployment_pipeline"
expect_tools:
- "deploy_service"
- "check_health"
By default, tools can appear in any order. To enforce order:
expect_tools:
- "deploy_service"
- "check_health"
tool_order: "strict"
For partial success (e.g., 2 of 3 tools is acceptable):
expect_tools:
- "deploy_service"
- "check_health"
- "notify_slack"
pass_threshold: 0.66
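The threshold math: 2 of 3 expected tools gives about 0.667, which clears a pass_threshold of 0.66. A sketch (meetsThreshold is a hypothetical helper, not ReplayCI's code):

```javascript
// Hypothetical helper showing the pass_threshold arithmetic.
function meetsThreshold(expectedTools, calledTools, threshold) {
  const matched = expectedTools.filter((t) => calledTools.includes(t)).length;
  return matched / expectedTools.length >= threshold;
}
// 2 of 3 expected tools: 2 / 3 is about 0.667, which meets 0.66.
```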
See Writing Tests — Multi-tool contracts for full details.
Failure classification reference
When a test fails, ReplayCI classifies the failure:
| Classification | What happened | Typical cause |
|---|---|---|
| tool_not_invoked | Model returned text instead of calling a tool | Ambiguous prompt, model chose not to use tools |
| malformed_arguments | Tool arguments aren't valid JSON | Model generated broken JSON |
| schema_violation | Arguments don't match the expected schema | Wrong types, missing fields |
| wrong_tool | Model called a different tool than expected | Ambiguous tool descriptions, model confusion |
| path_not_found | An assertion path doesn't exist in the response | Response shape changed, typo in path |
| unexpected_error | Provider returned an error | Auth failure, rate limit, timeout |
Each failure gets a fingerprint. If the same failure recurs across runs, it produces the same fingerprint — useful for tracking whether an issue is new or recurring.
SDK troubleshooting
observe() doesn't capture anything
No API key: observe() requires REPLAYCI_API_KEY or an explicit apiKey option. Without one, it returns a no-op handle silently.
export REPLAYCI_API_KEY="rci_live_..."
Disabled by env: Check that REPLAYCI_DISABLE is not set to true.
Unsupported client: The SDK auto-detects OpenAI and Anthropic clients. If detection fails, pass a diagnostics callback to see why:
observe(openai, {
  apiKey: "...",
  diagnostics: (event) => console.log(event),
});
Double wrap: Calling observe() twice on the same client is a no-op. The diagnostics callback will emit { type: "double_wrap" }.
Circuit breaker triggered
If the capture API fails 5 times in a row, the SDK auto-disables for 10 minutes. Check:
- Is REPLAYCI_API_KEY valid? (Try curl -H "Authorization: Bearer rci_live_..." https://app.replayci.com/api/v1/captures)
- Is the endpoint reachable? (Check REPLAYCI_API_URL if set)
- Is your email verified? (Unverified accounts get 403 on /api/v1/)
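The breaker behavior described above can be sketched like this (illustrative only, not the SDK's actual implementation):

```javascript
// Illustrative circuit-breaker sketch, not the SDK's actual code:
// 5 consecutive capture failures disable capture for 10 minutes.
class CaptureBreaker {
  constructor(maxFailures = 5, cooldownMs = 10 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.disabledUntil = 0;
  }
  allowed(now = Date.now()) {
    return now >= this.disabledUntil;
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      // Trip the breaker and start the cooldown window.
      this.disabledUntil = now + this.cooldownMs;
      this.failures = 0;
    }
  }
  recordSuccess() {
    this.failures = 0; // any success resets the count
  }
}
```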
validate() returns unexpected failures
Provider format mismatch: validate() auto-detects OpenAI and Anthropic response formats. If you're passing a pre-processed response, make sure it has a tool_calls array:
validate({
  tool_calls: [{ id: "1", name: "get_weather", arguments: '{"location":"SF"}' }],
}, { contracts });
Unmatched tools: By default, tool calls with no matching contract cause a failure. Use unmatchedPolicy: "allow" if your response includes tools not covered by your contracts:
validate(response, { contracts, unmatchedPolicy: "allow" });
Captures show in dashboard but no contracts generated
Contracts need at least one tool call with arguments to generate assertions. If captures only have metadata-level data (tool names only), increase the capture level:
observe(openai, {
  captureLevel: "redacted", // default — includes arguments
});
Model resolution errors
"Model not found in registry; using defaults"
⚠ Model "my-custom-model" not found in registry; using openai defaults
This is a warning, not an error. Your model ID didn't match any registered family pattern, so ReplayCI uses the provider's default request profile. The run proceeds normally.
This happens when:
- You're using a new model not yet in the registry (e.g., a fine-tuned model or a newly released variant)
- There's a typo in the model name
Fix: Check npx replayci models --provider openai to see registered families. Any model matching a family prefix works automatically. If you want unregistered models to be a hard error, use --strict-model.
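Prefix matching presumably works along these lines; this is an assumption based on the note that any model matching a family prefix works automatically, and both the registry entries and the resolveFamily helper here are hypothetical:

```javascript
// Hypothetical registry entries, for illustration only; the real registry
// lives in artifacts/model-registry/registry.json.
const FAMILIES = {
  "gpt-4": "GPT-4",
  "gpt-5": "GPT-5",
  "o1": "O-series",
};

// Assumed mechanism: a model resolves to the longest registered family
// prefix it starts with; no match means provider defaults are used.
function resolveFamily(modelId) {
  const match = Object.keys(FAMILIES)
    .filter((prefix) => modelId.startsWith(prefix))
    .sort((x, y) => y.length - x.length)[0];
  return match ? FAMILIES[match] : null;
}
// resolveFamily("gpt-4o-mini") matches the "gpt-4" prefix;
// resolveFamily("my-custom-model") matches nothing and falls back to defaults.
```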
"No matching family" with --strict-model
Error: Model "my-custom-model" has no matching family for provider openai
Known families: GPT-5, GPT-4, O-series
You used --strict-model and the model doesn't match any family pattern. This is intentional — strict mode prevents accidental use of unregistered models.
Fix:
- Check for typos: npx replayci --provider openai --model 4o-mini --resolve to verify resolution
- If this is a new model, remove --strict-model until the registry is updated
- Use npx replayci models --check to see if the registry needs updating
"Access denied" with --probe
npx replayci models --provider openai --probe
✗ gpt-5.2 denied
✓ gpt-4o-mini accessible
The --probe command checks which registered models your API key can access. "Denied" means the provider's model listing API didn't include that model — your account may not have access.
Fix:
- Check your API plan or billing tier with your provider
- Some models require enrollment or waitlist access
- Verify your API key is correct:
echo $REPLAYCI_PROVIDER_KEY
Registry load failure
⚠ Failed to load model registry; model resolution disabled
The registry file (artifacts/model-registry/registry.json) couldn't be loaded. Resolution degrades to raw passthrough — model IDs are sent directly to the provider without alias expansion or family matching.
Fix:
- This is non-fatal. Runs still work; you just lose alias and profile features.
- If you installed via npm, ensure artifacts/ is included in the package. Run npm ls @replayci/cli to check the installed version.
Still stuck?
- Check the CLI Reference for all available flags and options
- Review your contract YAML against the Writing Tests format
- Run with --json to see full structured output for debugging
- Check the Dashboard Guide for understanding dashboard features
- File an issue at github.com/sparshbaluni07/replayci/issues