Writing Tests — ReplayCI
ReplayCI tests are YAML files called contracts. Each contract describes one tool-call behavior to validate. Contracts live in a pack — a directory of related contracts with their fixture files.
Contract structure
A contract has three core parts: the tool name, assertions, and golden cases. Multi-tool contracts (like the example below) add a few more fields, such as expect_tools and per-tool argument_invariants.
# Does my AI follow the incident response protocol?
tool: multi_tool_call
expect_tools:
  - check_service_health
  - pull_service_logs
  - search_past_incidents
  - create_incident_ticket
tool_order: strict
tool_call_match_mode: strict
pass_threshold: 1.0
expected_tool_calls:
  - name: check_service_health
    argument_invariants:
      - path: $.service_id
        exists: true
      - path: $.service_id
        equals: "payment-gateway-us-east-1"
  - name: pull_service_logs
    argument_invariants:
      - path: $.service_id
        equals: "payment-gateway-us-east-1"
      - path: $.start_time
        exists: true
        type: string
  - name: search_past_incidents
    argument_invariants:
      - path: $.query
        exists: true
      - path: $.query
        contains: "connection timeout"
  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"
      - path: $.description
        contains: "503"
assertions:
  output_invariants:
    - path: $.tool_calls
      exists: true
    - path: $.tool_calls
      length_gte: 4
golden_cases:
  - id: incident_response_success
    input_ref: incident_response.success.json
    expect_ok: true
    provider_modes: ["recorded"]
  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
tool
The tool identifier. For single-tool contracts, this is the tool name the model should call. For multi-tool contracts, use multi_tool_call and specify individual tools via expect_tools.
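For comparison, a minimal single-tool contract might look like this sketch (the get_weather tool name and fixture filename are hypothetical):

tool: get_weather
assertions:
  output_invariants:
    - path: $.tool_calls[0].name
      equals: "get_weather"
golden_cases:
  - id: weather_basic
    input_ref: weather.success.json
    expect_ok: true
    provider_modes: ["recorded"]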
assertions
JSON path checks on the request and the model's response. Two types:
- output_invariants — checks on the model's response (tool call name, arguments)
- input_invariants — checks on the request sent to the provider (messages array, tools array)
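input_invariants use the same shape as output_invariants but evaluate against the request. A minimal sketch, assuming the request payload is the root object so paths like $.messages and $.tools resolve against it:

assertions:
  input_invariants:
    - path: $.tools
      length_gte: 1
    - path: $.messages[0].role
      equals: "system"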
Each assertion uses a JSON path expression:
| Field | Description |
|---|---|
| path | JSON path expression (e.g. $.tool_calls[0].name) |
| equals | Exact value match |
| exists | Check that the path exists (true/false) |
| type | Expected type ("array", "object", "string", "number") |
| contains | Substring match (string values) |
| equals_env | Compare value to an environment variable |
| one_of | Value must be one of the given values |
| regex | Value must match a regular expression |
| gte | Numeric value must be >= threshold |
| lte | Numeric value must be <= threshold |
| length_gte | Array or string length must be >= threshold |
| length_lte | Array or string length must be <= threshold |
Multiple operators can be combined on a single assertion (e.g. gte + lte for a range check).
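For example, a range check on a numeric argument (the limit field is hypothetical):

- path: "$.tool_calls[0].arguments.limit"
  gte: 1
  lte: 100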
Choosing the right operator
ReplayCI assertions are exact and deterministic by design — there's no fuzzy or semantic matching. This means equals: "CH" will fail if the model returns "Switzerland", even though they mean the same thing. Choosing the right operator prevents false failures.
Quick guide:
| Scenario | Recommended operator | Why |
|---|---|---|
| Tool name, fixed status code | equals | Value is fully controlled and stable |
| Country codes, enums with known variants | one_of | Model may return any valid representation |
| Free text, error messages | contains | Wording varies across models and runs |
| Timestamps, IDs with known patterns | regex | Format is predictable, exact value isn't |
| Numeric ranges (scores, counts) | gte / lte | Exact value varies, range is meaningful |
| "Field must exist, value varies" | exists: true + type | Assert structure without pinning values |
Common pitfall — exact matching on variable output:
# Too strict — fails when model returns "Switzerland" instead of "CH"
- path: "$.tool_calls[0].arguments.country"
  equals: "CH"

# Better — accept all valid representations
- path: "$.tool_calls[0].arguments.country"
  one_of: ["CH", "Switzerland", "Swiss Confederation"]

# Or check structurally if the exact value doesn't matter
- path: "$.tool_calls[0].arguments.country"
  type: "string"
  exists: true
Cross-provider tip: Different models format arguments differently. OpenAI and Anthropic may return "San Francisco, CA" vs "San Francisco". Use contains for substring checks or structural assertions (exists + type) when testing across providers.
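For instance, a location argument that providers format differently is safer to check with contains than equals (illustrative path):

- path: "$.tool_calls[0].arguments.location"
  contains: "San Francisco"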
Principle: assert structure, not exact values. Use equals only for values you fully control (tool names, fixed codes). For everything else, prefer one_of, contains, regex, or structural checks. When in doubt, start loose (exists + type) and tighten only after you understand your model's output patterns.
golden_cases
Specific input/output pairs to test. Each golden case references a fixture file.
| Field | Description |
|---|---|
| id | Unique identifier for this test case |
| input_ref | Path to fixture file (relative to the pack's golden directory) |
| expect_ok | Whether this case should pass (true) or fail (false) |
| expected_error | For failure cases: the expected failure classification |
| provider_modes | Restrict to specific modes: ["recorded"] for offline-only cases |
Golden fixture files
A golden fixture is a JSON file containing the request and expected response. Here's a simplified view of the starter pack's incident_response.success.json:
{
"boundary_version": "1.0",
"provider": "openai",
"model_id": "gpt-4o-mini",
"tool_schema_hash": "starter_ir_tool_001",
"messages_hash": "starter_ir_msg_001",
"tool_choice_mode": "required",
"request": {
"messages": [
{ "role": "system", "content": "You are a production incident response coordinator..." },
{ "role": "user", "content": "INCIDENT REPORT #IR-2026-0847\n\nService: payment-gateway-us-east-1..." }
],
"tools": [
{ "name": "check_service_health", "description": "Check service health", "parameters": { "..." : "..." } },
{ "name": "pull_service_logs", "description": "Retrieve service logs", "parameters": { "..." : "..." } },
{ "name": "search_past_incidents", "description": "Search historical incidents", "parameters": { "..." : "..." } },
{ "name": "create_incident_ticket", "description": "Create incident ticket", "parameters": { "..." : "..." } }
],
"tool_choice": "required",
"temperature": 0,
"max_tokens": 4096
},
"response": {
"success": true,
"tool_calls": [
{ "id": "call_health_001", "name": "check_service_health", "arguments": { "service_id": "payment-gateway-us-east-1" } },
{ "id": "call_logs_002", "name": "pull_service_logs", "arguments": { "service_id": "payment-gateway-us-east-1", "start_time": "2026-03-03T14:23:00-05:00", "end_time": "2026-03-03T14:47:00-05:00" } },
{ "id": "call_search_003", "name": "search_past_incidents", "arguments": { "query": "connection timeout after 30000ms to upstream payment-processor.internal:8443" } },
{ "id": "call_ticket_004", "name": "create_incident_ticket", "arguments": { "severity": "P1", "service_id": "payment-gateway-us-east-1", "description": "503 errors on /api/v1/checkout..." } }
],
"content": null
}
}
Boundary fields
Identifying fields at the top level of the fixture. Used for fingerprinting and determinism proof.
| Field | Description |
|---|---|
| boundary_version | Schema version (always "1.0") |
| provider | Provider name ("openai", "anthropic") |
| model_id | The model this fixture targets |
| tool_schema_hash | Hash of the tool schema (for change detection) |
| system_prompt_hash | Hash of the system prompt (nullable) |
| messages_hash | Hash of the messages array |
| tool_choice_mode | Tool choice mode ("auto", "none", "required") |
request
The request payload sent to the provider. Note: model is not inside request — it's the top-level model_id field.
Tool format note: Both the flat format ({name, description, parameters}) and the OpenAI-wrapped format ({type: "function", function: {name, ...}}) are accepted in the tools array. The runner normalizes to flat format before evaluation and hashing, so both produce identical results. The starter pack uses flat format.
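For illustration, here is the same tool entry in both accepted formats (flat first, then OpenAI-wrapped; parameters elided):

{ "name": "check_service_health", "description": "Check service health", "parameters": { "...": "..." } }

{ "type": "function", "function": { "name": "check_service_health", "description": "Check service health", "parameters": { "...": "..." } } }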
response
The expected response. Tool calls use flat format:
- name — the tool name (flat, not nested under function)
- arguments — may be authored as either a parsed JSON object or a JSON string in fixture files
ReplayCI normalizes tool-call arguments to a JSON string in runtime output. When evaluating nested paths like $.tool_calls[0].arguments.location, ReplayCI auto-parses that JSON string during path traversal.
Tip: Use exists: true on $.tool_calls[0].arguments (not type: "object"). Use nested paths like $.tool_calls[0].arguments.location for field-level checks.
For success cases: success: true with a tool_calls array. For failure cases: success: false or a response that violates the contract assertions.
Recording files are optional. When running with --provider recorded, ReplayCI first looks for a sidecar recording file (recordings/*.recording.json). If that file is missing, it falls back to the fixture's embedded response field. This means hand-authored fixtures with a response section work without a separate recording file — useful for negative test cases. If neither a recording file nor an embedded response exists, the result is recording_not_found.
Negative test cases
Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.
Example negative fixture (negative/incident_response.not_invoked.json) — the model returns text instead of calling any tools:
{
"boundary_version": "1.0",
"provider": "openai",
"model_id": "gpt-4o-mini",
"tool_choice_mode": "auto",
"request": {
"messages": [
{ "role": "user", "content": "INCIDENT REPORT #IR-2026-0847..." }
],
"tools": [
{ "name": "check_service_health", "parameters": { "..." : "..." } }
],
"tool_choice": "auto",
"temperature": 0,
"max_tokens": 1024
},
"response": {
"success": false,
"tool_calls": [],
"content": "I'll help you investigate this incident. Based on the report, it appears there are 503 errors...",
"error": {
"code": "tool_not_invoked",
"message": "Model did not invoke the expected tool"
}
}
}
Mark negative cases with expect_ok: false and expected_error in the contract:
golden_cases:
  - id: incident_response_not_invoked
    input_ref: negative/incident_response.not_invoked.json
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures, not live providers (since you can't force a live model to fail in a specific way).
Multi-tool contracts
When your prompt triggers multiple tool calls, use expect_tools to validate all of them in a single contract.
expect_tools
A list of tool names that must all appear in the model's response:
tool: "deployment_pipeline"
expect_tools:
- "deploy_service"
- "check_health"
assertions:
output_invariants:
- path: "$.tool_calls"
length_gte: 2
By default, tools can appear in any order. Use tool_order to enforce ordering:
expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "strict"  # tools must appear in this exact order
| Field | Description |
|---|---|
| expect_tools | List of tool names that must all be called |
| tool_order | "any" (default) or "strict" — whether order matters |
pass_threshold
For contracts where partial success is acceptable, set a threshold between 0 and 1:
expect_tools:
  - "deploy_service"
  - "check_health"
  - "notify_slack"
pass_threshold: 0.66  # pass if at least 2 of 3 tools are called
Without pass_threshold (or with pass_threshold: 1.0), all expected tools must be called. pass_threshold is only used when expect_tools is set.
Failure classifications
When a test fails, ReplayCI assigns a classification:
| Classification | Meaning |
|---|---|
| tool_not_invoked | Model returned text instead of calling the tool |
| malformed_arguments | Tool arguments aren't valid JSON |
| schema_violation | Arguments don't match the expected schema |
| wrong_tool | Model called a different tool than expected |
| path_not_found | An invariant path does not exist in the response |
| unexpected_error | Provider returned an error |
Each failure also gets a fingerprint — a short hash that identifies the specific failure. If the same failure recurs, it produces the same fingerprint.
Pack structure
A pack is a directory containing contracts and their fixtures:
packs/my-pack/
  pack.yaml                               # Pack metadata
  nr-allowlist.json                       # Non-reproducible allowlist (optional)
  contracts/
    incident_response.yaml                # Contract files
    search.yaml
  golden/
    incident_response.success.json        # Success fixtures
    search.success.json
    negative/
      incident_response.not_invoked.json  # Negative fixtures
pack.yaml
pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"
provider: "openai"
default_model: "gpt-4o-mini"
paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"
contracts:
  - incident_response.yaml
  - search.yaml
nr-allowlist.json (optional)
An allowlist for non-reproducible results. Most packs don't need this file — if it's missing, ReplayCI treats the pack as having zero NR exceptions (the default, correct state for deterministic tests).
If you do include it, use:
{
"version": "1.0",
"entries": []
}
When you need to allowlist a specific non-reproducible result (e.g. a model that's intentionally non-deterministic for a particular contract), add an entry:
{
"version": "1.0",
"entries": [
{
"contract_path": "contracts/creative_response.yaml",
"reason_code": "model_nondeterminism",
"owner": "team-ml",
"expiry_date": "2026-06-01T00:00:00Z",
"rationale": "Creative responses are intentionally non-deterministic"
}
]
}
Every entry requires all five fields. Expired entries (past expiry_date) are automatically rejected at runtime.
Optional contract fields
Contracts support additional configuration for timeouts, retries, and rate limit handling:
timeouts:
  total_ms: 30000
retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"
rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60
allowed_errors:
  - "rate_limit"
  - "timeout"
| Field | Description |
|---|---|
| timeouts.total_ms | Maximum time for the provider call (default: 30000) |
| retries.max_attempts | Number of retries on transient errors |
| retries.retry_on | Error types that trigger a retry |
| rate_limits.on_429 | How to handle rate limit responses |
| allowed_errors | Error types that don't count as test failures |
Two-source contracts
Contracts can come from two sources:
- Inferred — auto-generated by the server from captured tool calls (via the SDK or CLI observe). These contain structure-based assertions (exists, type, schema-derived bounds) and are refreshed automatically as new data arrives.
- Customer — manually written or edited by you. These are never modified by auto-refresh — your edits always take precedence.
When both exist for the same tool, they're merged. Customer assertions override inferred ones on conflicts; inferred assertions fill gaps.
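As an illustration only (the merge itself happens server-side; this is a sketch of the effective checks, not actual output), suppose both sources assert on the same argument path:

# Inferred (structure-based, auto-generated)
- path: $.tool_calls[0].arguments.query
  exists: true
  type: string

# Customer (hand-written)
- path: $.tool_calls[0].arguments.query
  contains: "connection timeout"

# Effective checks after merge: the customer's contains check applies,
# and the inferred exists/type checks fill the structural gap. If both
# sources set the same operator on the same path, the customer value wins.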
Confidence tiers
Auto-generated contracts have a confidence tier based on sample count:
| Tier | Samples | Meaning |
|---|---|---|
| Low | < 5 | Too few samples — contract may not represent typical behavior |
| Medium | 5–9 | Growing confidence — worth reviewing |
| High | ≥ 10 | Enough evidence — safe to promote to enforced truth |
Review auto-generated contracts in the Dashboard Contracts page before relying on them in CI.
Running your tests
Point ReplayCI at your pack:
# Using .replayci.yml
pack: "./packs/my-pack"
# Or via CLI flag
npx replayci --pack packs/my-pack
Run against recorded fixtures (offline, deterministic):
npx replayci --pack packs/my-pack --provider recorded
Run against a live provider:
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini
You can also validate responses directly in your application code using the SDK. See SDK Integration.
See CLI Reference for all CLI flags.