
Writing Tests — ReplayCI

ReplayCI tests are YAML files called contracts. Each contract describes one tool-call behavior to validate. Contracts live in a pack — a directory of related contracts with their fixture files.


Contract structure

A contract has three parts: the tool name, assertions, and golden cases.

# Does my AI follow the incident response protocol?

tool: multi_tool_call

expect_tools:
  - check_service_health
  - pull_service_logs
  - search_past_incidents
  - create_incident_ticket

tool_order: strict
tool_call_match_mode: strict
pass_threshold: 1.0

expected_tool_calls:
  - name: check_service_health
    argument_invariants:
      - path: $.service_id
        exists: true
      - path: $.service_id
        equals: "payment-gateway-us-east-1"

  - name: pull_service_logs
    argument_invariants:
      - path: $.service_id
        equals: "payment-gateway-us-east-1"
      - path: $.start_time
        exists: true
        type: string

  - name: search_past_incidents
    argument_invariants:
      - path: $.query
        exists: true
      - path: $.query
        contains: "connection timeout"

  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"
      - path: $.description
        contains: "503"

assertions:
  output_invariants:
    - path: $.tool_calls
      exists: true
    - path: $.tool_calls
      length_gte: 4

golden_cases:
  - id: incident_response_success
    input_ref: incident_response.success.json
    expect_ok: true
    provider_modes: ["recorded"]

  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]

tool

The tool identifier. For single-tool contracts, this is the tool name the model should call. For multi-tool contracts, use multi_tool_call and specify individual tools via expect_tools.

assertions

JSON path checks on the request/response exchange. Two types:

  • output_invariants — checks on the model's response (tool call name, arguments)
  • input_invariants — checks on the request sent to the provider (messages array, tools array)

Each assertion uses a JSON path expression:

| Field | Description |
| --- | --- |
| path | JSON path expression (e.g. $.tool_calls[0].name) |
| equals | Exact value match |
| exists | Check that the path exists (true/false) |
| type | Expected type ("array", "object", "string", "number") |
| contains | Substring match (string values) |
| equals_env | Compare value to an environment variable |
| one_of | Value must be one of the given values |
| regex | Value must match a regular expression |
| gte | Numeric value must be >= threshold |
| lte | Numeric value must be <= threshold |
| length_gte | Array or string length must be >= threshold |
| length_lte | Array or string length must be <= threshold |

Multiple operators can be combined on a single assertion (e.g. gte + lte for a range check).
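To make the combination rule concrete, here is a minimal sketch (not ReplayCI's internal code) of how multiple operators on one assertion can be evaluated: each operator is a predicate on the resolved value, and the assertion passes only if every predicate holds.

```python
import re

# Map the contract's type names onto Python types (assumed mapping).
TYPE_MAP = {"array": list, "object": dict, "string": str, "number": (int, float)}

def check_assertion(value, assertion):
    """True only if `value` satisfies every operator in `assertion`."""
    ops = {
        "equals":     lambda v, a: v == a,
        "contains":   lambda v, a: isinstance(v, str) and a in v,
        "one_of":     lambda v, a: v in a,
        "regex":      lambda v, a: isinstance(v, str) and re.search(a, v) is not None,
        "gte":        lambda v, a: v >= a,
        "lte":        lambda v, a: v <= a,
        "length_gte": lambda v, a: len(v) >= a,
        "length_lte": lambda v, a: len(v) <= a,
        "type":       lambda v, a: isinstance(v, TYPE_MAP[a]),
    }
    # `path` and `exists` are resolved during traversal, so skip them here.
    return all(ops[op](value, arg) for op, arg in assertion.items() if op in ops)

# A range check combining gte + lte on one assertion:
assert check_assertion(0.85, {"path": "$.score", "gte": 0.5, "lte": 1.0})
assert not check_assertion(1.2, {"path": "$.score", "gte": 0.5, "lte": 1.0})
```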

Choosing the right operator

ReplayCI assertions are exact and deterministic by design — there's no fuzzy or semantic matching. This means equals: "CH" will fail if the model returns "Switzerland", even though they mean the same thing. Choosing the right operator prevents false failures.

Quick guide:

| Scenario | Recommended operator | Why |
| --- | --- | --- |
| Tool name, fixed status code | equals | Value is fully controlled and stable |
| Country codes, enums with known variants | one_of | Model may return any valid representation |
| Free text, error messages | contains | Wording varies across models and runs |
| Timestamps, IDs with known patterns | regex | Format is predictable, exact value isn't |
| Numeric ranges (scores, counts) | gte / lte | Exact value varies, range is meaningful |
| "Field must exist, value varies" | exists: true + type | Assert structure without pinning values |

Common pitfall — exact matching on variable output:

# Too strict — fails when model returns "Switzerland" instead of "CH"
- path: "$.tool_calls[0].arguments.country"
  equals: "CH"

# Better — accept all valid representations
- path: "$.tool_calls[0].arguments.country"
  one_of: ["CH", "Switzerland", "Swiss Confederation"]

# Or check structurally if the exact value doesn't matter
- path: "$.tool_calls[0].arguments.country"
  type: "string"
  exists: true

Cross-provider tip: Different models format arguments differently. OpenAI and Anthropic may return "San Francisco, CA" vs "San Francisco". Use contains for substring checks or structural assertions (exists + type) when testing across providers.

Principle: assert structure, not exact values. Use equals only for values you fully control (tool names, fixed codes). For everything else, prefer one_of, contains, regex, or structural checks. When in doubt, start loose (exists + type) and tighten only after you understand your model's output patterns.

golden_cases

Specific input/output pairs to test. Each golden case references a fixture file.

| Field | Description |
| --- | --- |
| id | Unique identifier for this test case |
| input_ref | Path to fixture file (relative to the pack's golden directory) |
| expect_ok | Whether this case should pass (true) or fail (false) |
| expected_error | For failure cases: the expected failure classification |
| provider_modes | Restrict to specific modes: ["recorded"] for offline-only cases |

Golden fixture files

A golden fixture is a JSON file containing the request and expected response. Here's a simplified view of the starter pack's incident_response.success.json:

{
  "boundary_version": "1.0",
  "provider": "openai",
  "model_id": "gpt-4o-mini",
  "tool_schema_hash": "starter_ir_tool_001",
  "messages_hash": "starter_ir_msg_001",
  "tool_choice_mode": "required",
  "request": {
    "messages": [
      { "role": "system", "content": "You are a production incident response coordinator..." },
      { "role": "user", "content": "INCIDENT REPORT #IR-2026-0847\n\nService: payment-gateway-us-east-1..." }
    ],
    "tools": [
      { "name": "check_service_health", "description": "Check service health", "parameters": { "..." : "..." } },
      { "name": "pull_service_logs", "description": "Retrieve service logs", "parameters": { "..." : "..." } },
      { "name": "search_past_incidents", "description": "Search historical incidents", "parameters": { "..." : "..." } },
      { "name": "create_incident_ticket", "description": "Create incident ticket", "parameters": { "..." : "..." } }
    ],
    "tool_choice": "required",
    "temperature": 0,
    "max_tokens": 4096
  },
  "response": {
    "success": true,
    "tool_calls": [
      { "id": "call_health_001", "name": "check_service_health", "arguments": { "service_id": "payment-gateway-us-east-1" } },
      { "id": "call_logs_002", "name": "pull_service_logs", "arguments": { "service_id": "payment-gateway-us-east-1", "start_time": "2026-03-03T14:23:00-05:00", "end_time": "2026-03-03T14:47:00-05:00" } },
      { "id": "call_search_003", "name": "search_past_incidents", "arguments": { "query": "connection timeout after 30000ms to upstream payment-processor.internal:8443" } },
      { "id": "call_ticket_004", "name": "create_incident_ticket", "arguments": { "severity": "P1", "service_id": "payment-gateway-us-east-1", "description": "503 errors on /api/v1/checkout..." } }
    ],
    "content": null
  }
}

Boundary fields

Identifying fields at the top level of the fixture. Used for fingerprinting and determinism proof.

| Field | Description |
| --- | --- |
| boundary_version | Schema version (always "1.0") |
| provider | Provider name ("openai", "anthropic") |
| model_id | The model this fixture targets |
| tool_schema_hash | Hash of the tool schema (for change detection) |
| system_prompt_hash | Hash of the system prompt (nullable) |
| messages_hash | Hash of the messages array |
| tool_choice_mode | Tool choice mode ("auto", "none", "required") |
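ReplayCI does not document how these hashes are constructed, and the starter pack uses hand-authored identifiers. Purely as an illustration of the change-detection idea, one common scheme is a truncated SHA-256 over a canonical (sorted-keys, compact) JSON serialization, so any edit to the hashed object changes the hash:

```python
# Hypothetical sketch only — not ReplayCI's actual hash construction.
import hashlib
import json

def canonical_hash(obj) -> str:
    """Stable short hash of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

messages = [{"role": "user", "content": "INCIDENT REPORT #IR-2026-0847..."}]
assert canonical_hash(messages) == canonical_hash(messages)   # deterministic
assert canonical_hash(messages) != canonical_hash([{"role": "user", "content": "edited"}])
```

The useful property for fixtures is that an unchanged schema or messages array always produces the same hash, so drift is detectable by comparison alone.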

request

The request payload sent to the provider. Note: model is not inside request — it's the top-level model_id field.

Tool format note: Both flat format ({name, description, parameters}) and OpenAI-wrapped format ({type: "function", function: {name, ...}}) are accepted in the tools array. The runner normalizes to flat format before evaluation and hashing, so both produce identical results. The starter pack uses flat format.
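A minimal sketch of the normalization described in the note above (the real runner's code is not shown here): unwrap the OpenAI envelope if present, then emit the flat shape, so both formats compare and hash identically.

```python
# Sketch of the assumed normalization; field defaults are illustrative.
def normalize_tool(tool: dict) -> dict:
    """Normalize a tool entry to flat {name, description, parameters} form."""
    if tool.get("type") == "function" and "function" in tool:
        tool = tool["function"]  # unwrap {type: "function", function: {...}}
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "parameters": tool.get("parameters", {}),
    }

flat = {"name": "check_service_health", "description": "Check service health",
        "parameters": {"type": "object"}}
wrapped = {"type": "function", "function": flat}
assert normalize_tool(flat) == normalize_tool(wrapped)
```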

response

The expected response. Tool calls use flat format:

  • name — the tool name (flat, not nested under function)
  • arguments — may be authored as either a parsed JSON object or a JSON string in fixture files

ReplayCI normalizes tool-call arguments to a JSON string in runtime output. When evaluating nested paths like $.tool_calls[0].arguments.location, ReplayCI auto-parses that JSON string during path traversal.

Tip: Use exists: true on $.tool_calls[0].arguments (not type: "object"). Use nested paths like $.tool_calls[0].arguments.location for field-level checks.
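The auto-parsing behavior described above can be sketched as a simple path resolver (illustrative only, not ReplayCI's resolver): walk the path segment by segment and, whenever traversal lands on a string that still needs to be descended into, try parsing it as JSON first.

```python
import json
import re

def resolve(doc, path: str):
    """Resolve a simple $.a.b[0].c path, auto-parsing embedded JSON strings."""
    current = doc
    for seg in re.findall(r"[A-Za-z_]\w*|\[\d+\]", path):
        if isinstance(current, str):
            current = json.loads(current)  # e.g. arguments stored as a JSON string
        if seg.startswith("["):
            current = current[int(seg[1:-1])]
        else:
            current = current[seg]
    return current

response = {"tool_calls": [{"name": "get_weather",
                            "arguments": '{"location": "San Francisco"}'}]}
assert resolve(response, "$.tool_calls[0].name") == "get_weather"
assert resolve(response, "$.tool_calls[0].arguments.location") == "San Francisco"
```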

For success cases: success: true with a tool_calls array. For failure cases: success: false or a response that violates the contract assertions.

Recording files are optional. When running with --provider recorded, ReplayCI first looks for a sidecar recording file (recordings/*.recording.json). If that file is missing, it falls back to the fixture's embedded response field. This means hand-authored fixtures with a response section work without a separate recording file — useful for negative test cases. If neither a recording file nor an embedded response exists, the result is recording_not_found.
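The lookup order above can be sketched as follows (the file path is hypothetical and the function is illustrative, not ReplayCI's API): sidecar recording first, embedded response second, otherwise the run is classified recording_not_found.

```python
import json
from pathlib import Path

def resolve_response(fixture: dict, recording_path: Path):
    """Return (response, source) following the recorded-mode fallback chain."""
    if recording_path.exists():                   # 1. sidecar recording wins
        return json.loads(recording_path.read_text()), "recording"
    if "response" in fixture:                     # 2. fall back to embedded response
        return fixture["response"], "embedded"
    return None, "recording_not_found"            # 3. nothing to replay

fixture = {"request": {}, "response": {"success": True, "tool_calls": []}}
resp, source = resolve_response(fixture, Path("recordings/missing.recording.json"))
assert source == "embedded" and resp["success"] is True
```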


Negative test cases

Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.

Example negative fixture (negative/incident_response.not_invoked.json) — the model returns text instead of calling any tools:

{
  "boundary_version": "1.0",
  "provider": "openai",
  "model_id": "gpt-4o-mini",
  "tool_choice_mode": "auto",
  "request": {
    "messages": [
      { "role": "user", "content": "INCIDENT REPORT #IR-2026-0847..." }
    ],
    "tools": [
      { "name": "check_service_health", "parameters": { "..." : "..." } }
    ],
    "tool_choice": "auto",
    "temperature": 0,
    "max_tokens": 1024
  },
  "response": {
    "success": false,
    "tool_calls": [],
    "content": "I'll help you investigate this incident. Based on the report, it appears there are 503 errors...",
    "error": {
      "code": "tool_not_invoked",
      "message": "Model did not invoke the expected tool"
    }
  }
}

Mark negative cases with expect_ok: false and expected_error in the contract:

golden_cases:
  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]

The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures, not live providers (since you can't force a live model to fail in a specific way).


Multi-tool contracts

When your prompt triggers multiple tool calls, use expect_tools to validate all of them in a single contract.

expect_tools

A list of tool names that must all appear in the model's response:

tool: "deployment_pipeline"

expect_tools:
  - "deploy_service"
  - "check_health"

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 2

By default, tools can appear in any order. Use tool_order to enforce ordering:

expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "strict" # tools must appear in this exact order

| Field | Description |
| --- | --- |
| expect_tools | List of tool names that must all be called |
| tool_order | "any" (default) or "strict" — whether order matters |
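A small sketch of the two ordering modes (the semantics are assumed from the descriptions above, and the function is illustrative): "any" requires every expected tool to appear somewhere in the call list; "strict" additionally requires them to appear in the listed order.

```python
def tools_satisfied(expected, called, order="any"):
    """Check expect_tools against the tools the model actually called."""
    if order == "strict":
        positions = []
        for name in expected:
            if name not in called:
                return False
            positions.append(called.index(name))
        return positions == sorted(positions)  # listed order preserved
    return all(name in called for name in expected)

called = ["check_health", "deploy_service"]
assert tools_satisfied(["deploy_service", "check_health"], called, "any")
assert not tools_satisfied(["deploy_service", "check_health"], called, "strict")
assert tools_satisfied(["check_health", "deploy_service"], called, "strict")
```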

pass_threshold

For contracts where partial success is acceptable, set a threshold between 0 and 1:

expect_tools:
  - "deploy_service"
  - "check_health"
  - "notify_slack"
pass_threshold: 0.66 # pass if at least 2 of 3 tools are called

Without pass_threshold (or with pass_threshold: 1.0), all expected tools must be called. pass_threshold is only used when expect_tools is set.
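The threshold rule reduces to simple arithmetic: the fraction of expected tools actually called must meet or exceed pass_threshold. A sketch, assuming that interpretation:

```python
def passes(expect_tools, called, pass_threshold=1.0):
    """True if the called fraction of expected tools meets the threshold."""
    hit = sum(1 for t in expect_tools if t in called)
    return hit / len(expect_tools) >= pass_threshold

expected = ["deploy_service", "check_health", "notify_slack"]
assert passes(expected, ["deploy_service", "check_health"], 0.66)  # 2/3 >= 0.66
assert not passes(expected, ["deploy_service"], 0.66)              # 1/3 < 0.66
assert not passes(expected, ["deploy_service", "check_health"])    # default 1.0
```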


Failure classifications

When a test fails, ReplayCI assigns a classification:

| Classification | Meaning |
| --- | --- |
| tool_not_invoked | Model returned text instead of calling the tool |
| malformed_arguments | Tool arguments aren't valid JSON |
| schema_violation | Arguments don't match the expected schema |
| wrong_tool | Model called a different tool than expected |
| path_not_found | An invariant path does not exist in the response |
| unexpected_error | Provider returned an error |

Each failure also gets a fingerprint — a short hash that identifies the specific failure. If the same failure recurs, it produces the same fingerprint.
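ReplayCI does not document what goes into a fingerprint; the stated property is only that the same failure produces the same hash. A hypothetical illustration of how such a stable short hash could be derived (inputs and lengths here are assumptions, not ReplayCI's scheme):

```python
import hashlib

def fingerprint(classification: str, contract: str, path: str = "") -> str:
    """Short, stable hash identifying a specific failure (illustrative)."""
    key = f"{classification}|{contract}|{path}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

a = fingerprint("tool_not_invoked", "contracts/incident_response.yaml")
b = fingerprint("tool_not_invoked", "contracts/incident_response.yaml")
assert a == b and len(a) == 12   # same failure -> same fingerprint
assert a != fingerprint("wrong_tool", "contracts/incident_response.yaml")
```

The point of the stability property is deduplication: a recurring failure can be grouped or suppressed by its fingerprint across CI runs.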


Pack structure

A pack is a directory containing contracts and their fixtures:

packs/my-pack/
  pack.yaml                                # Pack metadata
  nr-allowlist.json                        # Non-reproducible allowlist (optional)
  contracts/
    incident_response.yaml                 # Contract files
    search.yaml
  golden/
    incident_response.success.json         # Success fixtures
    search.success.json
    negative/
      incident_response.not_invoked.json   # Negative fixtures

pack.yaml

pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"

provider: "openai"
default_model: "gpt-4o-mini"

paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"

contracts:
  - incident_response.yaml
  - search.yaml

nr-allowlist.json (optional)

An allowlist for non-reproducible results. Most packs don't need this file — if it's missing, ReplayCI treats the pack as having zero NR exceptions (the default, correct state for deterministic tests).

If you do include it, use:

{
  "version": "1.0",
  "entries": []
}

When you need to allowlist a specific non-reproducible result (e.g. a model that's intentionally non-deterministic for a particular contract), add an entry:

{
  "version": "1.0",
  "entries": [
    {
      "contract_path": "contracts/creative_response.yaml",
      "reason_code": "model_nondeterminism",
      "owner": "team-ml",
      "expiry_date": "2026-06-01T00:00:00Z",
      "rationale": "Creative responses are intentionally non-deterministic"
    }
  ]
}

Every entry requires all five fields. Expired entries (past expiry_date) are automatically rejected at runtime.
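The expiry rule can be sketched as a filter over the allowlist (illustrative implementation, assuming expiry_date is an RFC 3339 timestamp as in the example above):

```python
from datetime import datetime, timezone

def active_entries(entries, now=None):
    """Keep only allowlist entries whose expiry_date is still in the future."""
    now = now or datetime.now(timezone.utc)
    return [e for e in entries
            if datetime.fromisoformat(e["expiry_date"].replace("Z", "+00:00")) > now]

entries = [{"contract_path": "contracts/creative_response.yaml",
            "expiry_date": "2026-06-01T00:00:00Z"}]
assert active_entries(entries, now=datetime(2026, 7, 1, tzinfo=timezone.utc)) == []
assert len(active_entries(entries, now=datetime(2026, 5, 1, tzinfo=timezone.utc))) == 1
```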


Optional contract fields

Contracts support additional configuration for timeouts, retries, and rate limit handling:

timeouts:
  total_ms: 30000

retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"

rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60

allowed_errors:
  - "rate_limit"
  - "timeout"
| Field | Description |
| --- | --- |
| timeouts.total_ms | Maximum time for the provider call (default: 30000) |
| retries.max_attempts | Number of retries on transient errors |
| retries.retry_on | Error types that trigger a retry |
| rate_limits.on_429 | How to handle rate limit responses |
| allowed_errors | Error types that don't count as test failures |

Two-source contracts

Contracts can come from two sources:

  • Inferred — auto-generated by the server from captured tool calls (via the SDK or CLI observe). These contain structure-based assertions: exists, type, schema-derived bounds. Refreshed automatically as new data arrives.
  • Customer — manually written or edited by you. These are immutable to auto-refresh — your edits always take precedence.

When both exist for the same tool, they're merged. Customer assertions override inferred ones on conflicts; inferred assertions fill gaps.

Confidence tiers

Auto-generated contracts have a confidence tier based on sample count:

| Tier | Samples | Meaning |
| --- | --- | --- |
| Low | < 5 | Too few samples — contract may not represent typical behavior |
| Medium | 5–9 | Growing confidence — worth reviewing |
| High | ≥ 10 | Enough evidence — safe to promote to enforced truth |
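The tier boundaries above map directly to a sample-count check; a trivial sketch of that mapping:

```python
def confidence_tier(samples: int) -> str:
    """Map a contract's sample count to its confidence tier."""
    if samples >= 10:
        return "High"
    if samples >= 5:
        return "Medium"
    return "Low"

assert confidence_tier(3) == "Low"
assert confidence_tier(9) == "Medium"
assert confidence_tier(10) == "High"
```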

Review auto-generated contracts in the Dashboard Contracts page before relying on them in CI.


Running your tests

Point ReplayCI at your pack:

# Using .replayci.yml
pack: "./packs/my-pack"

# Or via CLI flag
npx replayci --pack packs/my-pack

Run against recorded fixtures (offline, deterministic):

npx replayci --pack packs/my-pack --provider recorded

Run against a live provider:

npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

You can also validate responses directly in your application code using the SDK. See SDK Integration.

See CLI Reference for all CLI flags.