Writing Tests — ReplayCI
ReplayCI tests are YAML files called contracts. Each contract describes one tool-call behavior to validate. Contracts live in a pack — a directory of related contracts with their fixture files.
Contract structure
A contract has three core parts: the tool name, assertions, and golden cases. Multi-tool contracts (like the example below) add a few more fields, such as expect_tools and per-tool argument_invariants.
# Does my AI follow the incident response protocol?
tool: multi_tool_call
expect_tools:
  - check_service_health
  - pull_service_logs
  - search_past_incidents
  - create_incident_ticket
tool_order: strict
tool_call_match_mode: strict
pass_threshold: 1.0
expected_tool_calls:
  - name: check_service_health
    argument_invariants:
      - path: $.service_id
        exists: true
      - path: $.service_id
        equals: "payment-gateway-us-east-1"
  - name: pull_service_logs
    argument_invariants:
      - path: $.service_id
        equals: "payment-gateway-us-east-1"
      - path: $.start_time
        exists: true
        type: string
  - name: search_past_incidents
    argument_invariants:
      - path: $.query
        exists: true
      - path: $.query
        contains: "connection timeout"
  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"
      - path: $.description
        contains: "503"
assertions:
  output_invariants:
    - path: $.tool_calls
      exists: true
    - path: $.tool_calls
      length_gte: 4
golden_cases:
  - id: incident_response_success
    input_ref: incident_response.success.json
    expect_ok: true
    provider_modes: ["recorded"]
  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
tool
The tool identifier. For single-tool contracts, this is the tool name the model should call. For multi-tool contracts, use multi_tool_call and specify individual tools via expect_tools.
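For comparison, a minimal single-tool contract might look like this sketch (the get_weather tool name and fixture filename are hypothetical):

tool: get_weather
assertions:
  output_invariants:
    - path: $.tool_calls[0].name
      equals: "get_weather"
golden_cases:
  - id: weather_basic
    input_ref: weather.success.json
    expect_ok: true
    provider_modes: ["recorded"]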
assertions
JSON path checks on the request and the model's response. Two types:
- output_invariants — checks on the model's response (tool call name, arguments)
- input_invariants — checks on the request sent to the provider (messages array, tools array)
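input_invariants use the same shape as output_invariants but evaluate against the request. A minimal sketch, assuming the request payload is the root object so paths like $.messages and $.tools resolve against it:

assertions:
  input_invariants:
    - path: $.tools
      length_gte: 1
    - path: $.messages[0].role
      equals: "system"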
Each assertion uses a JSON path expression:
| Field | Description |
|---|---|
| path | JSON path expression (e.g. $.tool_calls[0].name) |
| equals | Exact value match |
| exists | Check that the path exists (true/false) |
| type | Expected type ("array", "object", "string", "number") |
| contains | Substring match (string values) |
| equals_env | Compare value to an environment variable |
| one_of | Value must be one of the given values |
| regex | Value must match a regular expression |
| gte | Numeric value must be >= threshold |
| lte | Numeric value must be <= threshold |
| length_gte | Array or string length must be >= threshold |
| length_lte | Array or string length must be <= threshold |
Multiple operators can be combined on a single assertion (e.g. gte + lte for a range check).
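For example, a range check on a numeric argument (the limit field is hypothetical):

- path: "$.tool_calls[0].arguments.limit"
  gte: 1
  lte: 100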
Choosing the right operator
ReplayCI assertions are exact and deterministic by design — there's no fuzzy or semantic matching. This means equals: "CH" will fail if the model returns "Switzerland", even though they mean the same thing. Choosing the right operator prevents false failures.
Quick guide:
| Scenario | Recommended operator | Why |
|---|---|---|
| Tool name, fixed status code | equals | Value is fully controlled and stable |
| Country codes, enums with known variants | one_of | Model may return any valid representation |
| Free text, error messages | contains | Wording varies across models and runs |
| Timestamps, IDs with known patterns | regex | Format is predictable, exact value isn't |
| Numeric ranges (scores, counts) | gte / lte | Exact value varies, range is meaningful |
| "Field must exist, value varies" | exists: true + type | Assert structure without pinning values |
Common pitfall — exact matching on variable output:
# Too strict — fails when model returns "Switzerland" instead of "CH"
- path: "$.tool_calls[0].arguments.country"
  equals: "CH"

# Better — accept all valid representations
- path: "$.tool_calls[0].arguments.country"
  one_of: ["CH", "Switzerland", "Swiss Confederation"]

# Or check structurally if the exact value doesn't matter
- path: "$.tool_calls[0].arguments.country"
  type: "string"
  exists: true
Cross-provider tip: Different models format arguments differently. OpenAI and Anthropic may return "San Francisco, CA" vs "San Francisco". Use contains for substring checks or structural assertions (exists + type) when testing across providers.
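For instance, a location argument that providers format differently is safer to check with contains than equals (illustrative path):

- path: "$.tool_calls[0].arguments.location"
  contains: "San Francisco"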
Principle: assert structure, not exact values. Use equals only for values you fully control (tool names, fixed codes). For everything else, prefer one_of, contains, regex, or structural checks. When in doubt, start loose (exists + type) and tighten only after you understand your model's output patterns.
golden_cases
Specific input/output pairs to test. Each golden case references a fixture file.
| Field | Description |
|---|---|
| id | Unique identifier for this test case |
| input_ref | Path to fixture file (relative to the pack's golden directory) |
| expect_ok | Whether this case should pass (true) or fail (false) |
| expected_error | For failure cases: the expected failure classification |
| provider_modes | Restrict to specific modes: ["recorded"] for offline-only cases |
Golden fixture files
A golden fixture is a JSON file containing the request and expected response. Here's a simplified view of the starter pack's incident_response.success.json:
{
"boundary_version": "1.0",
"provider": "openai",
"model_id": "gpt-4o-mini",
"tool_schema_hash": "starter_ir_tool_001",
"messages_hash": "starter_ir_msg_001",
"tool_choice_mode": "required",
"request": {
"messages": [
{ "role": "system", "content": "You are a production incident response coordinator..." },
{ "role": "user", "content": "INCIDENT REPORT #IR-2026-0847\n\nService: payment-gateway-us-east-1..." }
],
"tools": [
{ "name": "check_service_health", "description": "Check service health", "parameters": { "..." : "..." } },
{ "name": "pull_service_logs", "description": "Retrieve service logs", "parameters": { "..." : "..." } },
{ "name": "search_past_incidents", "description": "Search historical incidents", "parameters": { "..." : "..." } },
{ "name": "create_incident_ticket", "description": "Create incident ticket", "parameters": { "..." : "..." } }
],
"tool_choice": "required",
"temperature": 0,
"max_tokens": 4096
},
"response": {
"success": true,
"tool_calls": [
{ "id": "call_health_001", "name": "check_service_health", "arguments": { "service_id": "payment-gateway-us-east-1" } },
{ "id": "call_logs_002", "name": "pull_service_logs", "arguments": { "service_id": "payment-gateway-us-east-1", "start_time": "2026-03-03T14:23:00-05:00", "end_time": "2026-03-03T14:47:00-05:00" } },
{ "id": "call_search_003", "name": "search_past_incidents", "arguments": { "query": "connection timeout after 30000ms to upstream payment-processor.internal:8443" } },
{ "id": "call_ticket_004", "name": "create_incident_ticket", "arguments": { "severity": "P1", "service_id": "payment-gateway-us-east-1", "description": "503 errors on /api/v1/checkout..." } }
],
"content": null
}
}
Boundary fields
Identifying fields at the top level of the fixture. Used for fingerprinting and determinism proof.
| Field | Description |
|---|---|
| boundary_version | Schema version (always "1.0") |
| provider | Provider name ("openai", "anthropic") |
| model_id | The model this fixture targets |
| tool_schema_hash | Hash of the tool schema (for change detection) |
| system_prompt_hash | Hash of the system prompt (nullable) |
| messages_hash | Hash of the messages array |
| tool_choice_mode | Tool choice mode ("auto", "none", "required") |
request
The request payload sent to the provider. Note: model is not inside request — it's the top-level model_id field.
Tool format note: Both the flat format ({name, description, parameters}) and the OpenAI-wrapped format ({type: "function", function: {name, ...}}) are accepted in the tools array. The runner normalizes to flat format before evaluation and hashing, so both produce identical results. The starter pack uses flat format.
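For illustration, here is the same tool entry in both accepted formats (flat first, then OpenAI-wrapped; parameters elided):

{ "name": "check_service_health", "description": "Check service health", "parameters": { "...": "..." } }

{ "type": "function", "function": { "name": "check_service_health", "description": "Check service health", "parameters": { "...": "..." } } }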
response
The expected response. Tool calls use flat format:
- name — the tool name (flat, not nested under function)
- arguments — may be authored as either a parsed JSON object or a JSON string in fixture files
ReplayCI normalizes tool-call arguments to a JSON string in runtime output. When evaluating nested paths like $.tool_calls[0].arguments.location, ReplayCI auto-parses that JSON string during path traversal.
Tip: Use exists: true on $.tool_calls[0].arguments (not type: "object"). Use nested paths like $.tool_calls[0].arguments.location for field-level checks.
For success cases: success: true with a tool_calls array. For failure cases: success: false or a response that violates the contract assertions.
Recording files are optional. When running with --provider recorded, ReplayCI first looks for a sidecar recording file (recordings/*.recording.json). If that file is missing, it falls back to the fixture's embedded response field. This means hand-authored fixtures with a response section work without a separate recording file — useful for negative test cases. If neither a recording file nor an embedded response exists, the result is recording_not_found.
Negative test cases
Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.
Example negative fixture (negative/incident_response.not_invoked.json) — the model returns text instead of calling any tools:
{
"boundary_version": "1.0",
"provider": "openai",
"model_id": "gpt-4o-mini",
"tool_choice_mode": "auto",
"request": {
"messages": [
{ "role": "user", "content": "INCIDENT REPORT #IR-2026-0847..." }
],
"tools": [
{ "name": "check_service_health", "parameters": { "..." : "..." } }
],
"tool_choice": "auto",
"temperature": 0,
"max_tokens": 1024
},
"response": {
"success": false,
"tool_calls": [],
"content": "I'll help you investigate this incident. Based on the report, it appears there are 503 errors...",
"error": {
"code": "tool_not_invoked",
"message": "Model did not invoke the expected tool"
}
}
}
Mark negative cases with expect_ok: false and expected_error in the contract:
golden_cases:
  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures, not live providers (since you can't force a live model to fail in a specific way).
Multi-tool contracts
When your prompt triggers multiple tool calls, use expect_tools to validate all of them in a single contract.
expect_tools
A list of tool names that must all appear in the model's response:
tool: "deployment_pipeline"
expect_tools:
- "deploy_service"
- "check_health"
assertions:
output_invariants:
- path: "$.tool_calls"
length_gte: 2
By default, tools can appear in any order. Use tool_order to enforce ordering:
expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "strict"  # tools must appear in this exact order
| Field | Description |
|---|---|
| expect_tools | List of tool names that must all be called |
| tool_order | "any" (default) or "strict" — whether order matters |
pass_threshold
For contracts where partial success is acceptable, set a threshold between 0 and 1:
expect_tools:
  - "deploy_service"
  - "check_health"
  - "notify_slack"
pass_threshold: 0.66  # pass if at least 2 of 3 tools are called
Without pass_threshold (or with pass_threshold: 1.0), all expected tools must be called. pass_threshold is only used when expect_tools is set.
Failure classifications
When a test fails, ReplayCI assigns a classification:
| Classification | Meaning |
|---|---|
| tool_not_invoked | Model returned text instead of calling the tool |
| malformed_arguments | Tool arguments aren't valid JSON |
| schema_violation | Arguments don't match the expected schema |
| wrong_tool | Model called a different tool than expected |
| path_not_found | An invariant path does not exist in the response |
| unexpected_error | Provider returned an error |
Each failure also gets a fingerprint — a short hash that identifies the specific failure. If the same failure recurs, it produces the same fingerprint.
Pack structure
A pack is a directory containing contracts and their fixtures:
packs/my-pack/
  pack.yaml                               # Pack metadata
  nr-allowlist.json                       # Non-reproducible allowlist (optional)
  contracts/
    incident_response.yaml                # Contract files
    search.yaml
  golden/
    incident_response.success.json        # Success fixtures
    search.success.json
    negative/
      incident_response.not_invoked.json  # Negative fixtures
pack.yaml
pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"
provider: "openai"
default_model: "gpt-4o-mini"
paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"
contracts:
  - incident_response.yaml
  - search.yaml
nr-allowlist.json (optional)
An allowlist for non-reproducible results. Most packs don't need this file — if it's missing, ReplayCI treats the pack as having zero NR exceptions (the default, correct state for deterministic tests).
If you do include it, use:
{
"version": "1.0",
"entries": []
}
When you need to allowlist a specific non-reproducible result (e.g. a model that's intentionally non-deterministic for a particular contract), add an entry:
{
"version": "1.0",
"entries": [
{
"contract_path": "contracts/creative_response.yaml",
"reason_code": "model_nondeterminism",
"owner": "team-ml",
"expiry_date": "2026-06-01T00:00:00Z",
"rationale": "Creative responses are intentionally non-deterministic"
}
]
}
Every entry requires all five fields. Expired entries (past expiry_date) are automatically rejected at runtime.
Optional contract fields
Contracts support additional configuration for timeouts, retries, and rate limit handling:
timeouts:
  total_ms: 30000
retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"
rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60
allowed_errors:
  - "rate_limit"
  - "timeout"
| Field | Description |
|---|---|
| timeouts.total_ms | Maximum time for the provider call (default: 30000) |
| retries.max_attempts | Number of retries on transient errors |
| retries.retry_on | Error types that trigger a retry |
| rate_limits.on_429 | How to handle rate limit responses |
| allowed_errors | Error types that don't count as test failures |
Two-source contracts
Contracts can come from two sources:
- Inferred — auto-generated by the server from captured tool calls (via the SDK or CLI observe). These contain structure-based assertions (exists, type, schema-derived bounds) and are refreshed automatically as new data arrives.
- Customer — manually written or edited by you. These are never modified by auto-refresh — your edits always take precedence.
When both exist for the same tool, they're merged. Customer assertions override inferred ones on conflicts; inferred assertions fill gaps.
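As an illustration only (the merge itself happens server-side; this is a sketch of the effective checks, not actual output), suppose both sources assert on the same argument path:

# Inferred (structure-based, auto-generated)
- path: $.tool_calls[0].arguments.query
  exists: true
  type: string

# Customer (hand-written)
- path: $.tool_calls[0].arguments.query
  contains: "connection timeout"

# Effective checks after merge: the customer's contains check applies,
# and the inferred exists/type checks fill the structural gap. If both
# sources set the same operator on the same path, the customer value wins.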
Confidence tiers
Auto-generated contracts have a confidence tier based on sample count:
| Tier | Samples | Meaning |
|---|---|---|
| Low | < 5 | Too few samples — contract may not represent typical behavior |
| Medium | 5–9 | Growing confidence — worth reviewing |
| High | ≥ 10 | Enough evidence — safe to promote to enforced truth |
Review auto-generated contracts in the Dashboard Contracts page before relying on them in CI.
Running your tests
Point ReplayCI at your pack:
# Using .replayci.yml
pack: "./packs/my-pack"
# Or via CLI flag
npx replayci --pack packs/my-pack
Run against recorded fixtures (offline, deterministic):
npx replayci --pack packs/my-pack --provider recorded
Run against a live provider:
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini
You can also validate responses directly in your application code using the SDK. See SDK Integration.
See CLI Reference for all CLI flags.