# Advanced Contracts — ReplayCI
This guide covers advanced contract patterns beyond the basics in Writing Tests. Use these when you need per-tool argument validation, cross-provider testing, or auto-generated contracts from live observations.
## Expected tool calls with argument matchers

`expect_tools` validates that certain tools were called, but it can't check what arguments each tool received. `expected_tool_calls` adds per-tool argument-level invariants.
### Basic example

```yaml
tool: multi_tool_call
expect_tools:
  - deploy_service
  - check_health
expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        exists: true
      - path: $.environment
        one_of: ["staging", "production", "canary"]
      - path: $.replicas
        type: number
        gte: 1
  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true
        type: string
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
Each entry in `expected_tool_calls` specifies a tool name and optional `argument_invariants`. The invariants use the same operators as output invariants (`exists`, `type`, `equals`, `one_of`, `regex`, `gte`, `lte`, `length_gte`, `length_lte`, `contains`).
### Relative paths

Argument invariant paths are relative to the parsed arguments object used by the matcher (the runtime parses the JSON string arguments before evaluating), not the full response. Use `$.field`, not `$.tool_calls[0].arguments.field`.

```yaml
# Correct — relative to the tool call's arguments
argument_invariants:
  - path: $.service_name
    exists: true

# Wrong — indexed paths don't work in expected_tool_calls
argument_invariants:
  - path: $.tool_calls[0].arguments.service_name
    exists: true
```
This makes `expected_tool_calls` order-independent by default. The matcher finds any tool call matching the name and argument invariants, regardless of position.
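To make the parse-then-match behavior concrete, here is a minimal illustrative sketch (not the actual ReplayCI runtime; the function name and the dotted-path handling are assumptions) of checking an `exists` invariant against a tool call's JSON-string arguments:

```python
import json

# Illustrative sketch: the raw `arguments` field arrives as a JSON string,
# so it is parsed first, and the relative path is resolved against the
# resulting object rather than against the full response.
def check_exists(tool_call: dict, rel_path: str) -> bool:
    args = json.loads(tool_call["arguments"])   # parse before evaluating
    keys = rel_path.lstrip("$.").split(".")     # "$.service_name" -> ["service_name"]
    node = args
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

call = {"name": "check_health", "arguments": '{"service_name": "api"}'}
check_exists(call, "$.service_name")  # True
check_exists(call, "$.region")        # False
```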
### Nested arguments

Invariants support nested objects and arrays:

```yaml
expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        type: string
      - path: $.frameworks
        type: array
        length_gte: 1
      - path: $.frameworks[0]
        type: string
        one_of: ["owasp_top10", "cwe_top25", "sans_top25"]
      - path: $.severity_threshold
        one_of: ["critical", "high", "medium", "low"]
```
## Match modes

### tool_call_match_mode

Controls how `expected_tool_calls` matches actual tool calls.

| Mode | Behavior |
|---|---|
| `"any"` (default) | Each expected call can match any actual call, in any order |
| `"strict"` | Matched calls must appear in the order declared |
```yaml
expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        equals: "staging"
  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true
tool_call_match_mode: "strict"  # deploy_service must come before check_health
```
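As an illustrative sketch of the ordering semantics (an assumed model for explanation, not ReplayCI's actual matcher), strict mode can be thought of as a forward scan through the actual calls:

```python
# Illustrative sketch of strict-mode matching: each expected name must
# appear after the position where the previous expected name matched.
def strict_match(expected: list, actual: list) -> bool:
    pos = 0
    for name in expected:
        try:
            pos = actual.index(name, pos) + 1  # must appear at or after pos
        except ValueError:
            return False
    return True

strict_match(["deploy_service", "check_health"],
             ["deploy_service", "check_health"])  # True
strict_match(["deploy_service", "check_health"],
             ["check_health", "deploy_service"])  # False: wrong order
```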
### Consuming multiset matching

Both `expect_tools` and `expected_tool_calls` use consuming multiset matching. Each actual tool call satisfies at most one expected entry. This matters when you expect duplicate tool names:
```yaml
expect_tools:
  - search
  - search
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
This requires the model to call `search` at least twice. A single `search` call satisfies one entry, leaving one unmatched.
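A compact illustrative model of this consuming behavior (the function below is an assumption for explanation, not the real matcher):

```python
# Illustrative sketch of consuming multiset matching: each actual tool call
# can satisfy at most one expected entry, so duplicate expected names must
# be covered by distinct actual calls.
def count_matched(expected: list, actual: list) -> int:
    remaining = list(actual)
    matched = 0
    for name in expected:
        if name in remaining:
            remaining.remove(name)  # consumed: can't satisfy another entry
            matched += 1
    return matched

count_matched(["search", "search"], ["search"])            # 1: one entry unmatched
count_matched(["search", "search"], ["search", "search"])  # 2: both satisfied
```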
### pass_threshold with expected_tool_calls

`pass_threshold` works with both `expect_tools` and `expected_tool_calls`. When set, the contract passes if the ratio of matched tool calls meets the threshold.

```yaml
expected_tool_calls:
  - name: fetch_data
    argument_invariants:
      - path: $.source
        exists: true
  - name: transform_data
    argument_invariants:
      - path: $.format
        exists: true
  - name: publish_results
    argument_invariants:
      - path: $.destination
        exists: true
pass_threshold: 0.66  # pass if 2 of 3 expected calls match
```
Without `pass_threshold` (or with it set to `1.0`), all expected tool calls must match.
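The threshold arithmetic can be sketched as follows (illustrative; `passes` is a hypothetical helper, not part of any ReplayCI API):

```python
# Illustrative sketch: a contract with a pass_threshold passes when the
# ratio of matched expected calls meets the threshold. The default of 1.0
# reproduces the all-must-match behavior.
def passes(matched: int, expected: int, pass_threshold: float = 1.0) -> bool:
    if expected == 0:
        return True
    return matched / expected >= pass_threshold

passes(2, 3, pass_threshold=0.66)  # True:  0.666... >= 0.66
passes(2, 3)                       # False: default threshold is 1.0
```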
## Combining operators

A single invariant can combine multiple operators for range checks, type + constraint checks, or existence + validation:

```yaml
assertions:
  output_invariants:
    # Range check: 1 to 10 replicas
    - path: $.tool_calls[0].arguments.replicas
      type: number
      gte: 1
      lte: 10
    # Array length bounds
    - path: $.tool_calls
      length_gte: 1
      length_lte: 5
    # Existence + type + pattern
    - path: $.tool_calls[0].arguments.environment
      exists: true
      type: string
      regex: "^(staging|production|canary)$"
```
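Conceptually, an invariant with several operators passes only if every operator holds. Here is a hedged Python sketch of such an evaluator, assuming the operator semantics documented above (the implementation itself is illustrative, not ReplayCI's):

```python
import re

# Illustrative operator evaluator: all operators present on an invariant
# must hold for the invariant to pass. Only a subset of operators is shown.
TYPE_MAP = {"number": (int, float), "string": str, "array": list}

def check_invariant(value, inv: dict) -> bool:
    if inv.get("exists") and value is None:
        return False
    if "type" in inv and not isinstance(value, TYPE_MAP[inv["type"]]):
        return False
    if "gte" in inv and not value >= inv["gte"]:
        return False
    if "lte" in inv and not value <= inv["lte"]:
        return False
    if "regex" in inv and not re.search(inv["regex"], value):
        return False
    return True

check_invariant(3, {"type": "number", "gte": 1, "lte": 10})  # True
check_invariant("staging", {"exists": True, "type": "string",
                            "regex": "^(staging|production|canary)$"})  # True
check_invariant(11, {"type": "number", "gte": 1, "lte": 10})  # False
```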
## Cross-provider testing

The same contract works across OpenAI, Anthropic, and other providers. ReplayCI normalizes provider responses to a common format, so `$.tool_calls[0].name` works regardless of whether the model returns OpenAI-style `function.name` or Anthropic-style `name`.
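The shapes being normalized look roughly like this. The provider field names below follow OpenAI's `function.name`/`function.arguments` and Anthropic's `tool_use` block with `name`/`input`; the normalizer function itself is an illustrative assumption about how such a layer could work, not ReplayCI's actual code:

```python
import json

# Illustrative normalization: map both provider shapes to one common
# {name, arguments} form so the same contract paths apply to both.
def normalize_tool_call(provider: str, raw: dict) -> dict:
    if provider == "openai":
        # OpenAI: {"type": "function", "function": {"name": ..., "arguments": "<json string>"}}
        return {"name": raw["function"]["name"],
                "arguments": json.loads(raw["function"]["arguments"])}
    if provider == "anthropic":
        # Anthropic: {"type": "tool_use", "name": ..., "input": {...}}
        return {"name": raw["name"], "arguments": raw["input"]}
    raise ValueError(f"unknown provider: {provider}")

openai_call = {"type": "function",
               "function": {"name": "check_health",
                            "arguments": '{"service_name": "api"}'}}
anthropic_call = {"type": "tool_use", "name": "check_health",
                  "input": {"service_name": "api"}}
# Both normalize to the same shape, so one contract covers both providers.
normalize_tool_call("openai", openai_call) == normalize_tool_call("anthropic", anthropic_call)  # True
```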
### Running the same pack against different providers

```bash
# Test against OpenAI
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

# Same pack, Anthropic
npx replayci --pack packs/my-pack --provider anthropic --model claude-sonnet-4-6
```
### Shadow mode for side-by-side comparison

Run your primary provider normally, and ReplayCI replays the same requests against a shadow provider:

```bash
npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6
```

See Shadow Mode for details.
### Provider-specific golden cases

Some golden cases only make sense for recorded fixtures (e.g., negative cases that force specific failure modes). Use `provider_modes` to restrict them:

```yaml
golden_cases:
  - id: "success_case"
    input_ref: "tool_call.success.json"
    expect_ok: true
  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]  # only runs against recorded fixtures
```
## Auto-generating contracts with observe

Instead of writing YAML by hand, use `replayci observe` to generate contracts from live provider responses. See the Observe Guide for the full walkthrough.
### The two-tier model

Generated contracts have `status: observed` — they're drafts based on structure, not enforced truth. Promote them to truth contracts after review:

- Observe — auto-generate contracts from live provider responses
- Review — inspect the generated YAML, adjust invariants
- Run — validate with `npx replayci --pack packs/my-observed-pack`
- Promote — remove `status: observed` to make them enforcement-ready
### What observe infers
The inference engine analyzes each response and generates invariants based on structure, not values:
| Pattern | What's inferred |
|---|---|
| Single tool call | exists, type on each argument |
| Multiple tool calls | expected_tool_calls with relative argument invariants |
| Schema enums | one_of constraints from tool parameter definitions |
| Nested objects | Recursive exists + type down to depth 5 |
| Arrays | type: array, length_gte: 1, plus element-level checks |
| Text-only response (no tool calls) | Warning — review needed |
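The structure-over-values idea can be sketched as a small recursive walk (illustrative Python, assuming the depth-5 recursion and operators described in the table; this is not the actual inference engine):

```python
# Illustrative sketch of structural inference: for each observed argument
# value, emit exists/type invariants based on its shape (never its content),
# recursing into nested objects and arrays up to a fixed depth.
def infer_invariants(value, path="$", depth=0, max_depth=5):
    type_name = {dict: "object", list: "array", str: "string",
                 int: "number", float: "number", bool: "boolean"}[type(value)]
    invs = [{"path": path, "exists": True, "type": type_name}]
    if isinstance(value, dict) and depth < max_depth:
        for key, child in value.items():
            invs += infer_invariants(child, f"{path}.{key}", depth + 1, max_depth)
    elif isinstance(value, list) and value and depth < max_depth:
        invs[0]["length_gte"] = 1  # non-empty observed array
        invs += infer_invariants(value[0], f"{path}[0]", depth + 1, max_depth)
    return invs

infer_invariants({"repository": "acme/api", "frameworks": ["owasp_top10"]})
```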
### Example: generated multi-tool contract

Running `replayci observe` against a security audit prompt might produce:

```yaml
# Auto-generated by replayci observe
# Status: OBSERVED (draft)
tool: multi_tool_call
status: observed
expect_tools:
  - scan_vulnerabilities
  - generate_compliance_report
tool_order: any
expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        exists: true
        type: string
      - path: $.scan_type
        one_of: ["sast", "dast", "dependency", "secret_scan", "full"]
  - name: generate_compliance_report
    argument_invariants:
      - path: $.report_type
        one_of: ["soc2_type1", "soc2_type2", "iso27001", "pci_dss"]
      - path: $.output_format
        one_of: ["pdf", "json", "markdown"]
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
This captures structural expectations without hardcoding specific values. A model that calls `scan_vulnerabilities` with `scan_type: "dast"` and `generate_compliance_report` with `report_type: "iso27001"` still passes, because the `one_of` invariants constrain the allowed values rather than pinning exact ones.
## Contract field reference

Full list of contract fields, including advanced ones not covered in Writing Tests:

| Field | Type | Description |
|---|---|---|
| `tool` | string | Tool name, or `"multi_tool_call"` for multi-tool contracts |
| `status` | string | `"observed"` for auto-generated drafts (optional) |
| `expect_tools` | string[] | List of tool names that must all be called |
| `tool_order` | `"any"` or `"strict"` | Whether tool call order matters (default: `"any"`) |
| `expected_tool_calls` | object[] | Per-tool argument invariants (see above) |
| `tool_call_match_mode` | `"any"` or `"strict"` | Matching order for `expected_tool_calls` (default: `"any"`) |
| `pass_threshold` | number | 0.0 to 1.0 — minimum match ratio for partial success |
| `side_effect` | `"read"`, `"write"`, or `"destructive"` | Declare the tool's side effect level |
| `assertions` | object | `input_invariants` and `output_invariants` arrays |
| `golden_cases` | object[] | Test cases with fixture references |
| `timeouts` | object | `total_ms` for provider call timeout |
| `retries` | object | `max_attempts`, `retry_on` for transient errors |
| `rate_limits` | object | `on_429` handling configuration |
| `allowed_errors` | object[] | Error types that don't count as failures |
## Dashboard contract review
The dashboard provides an interactive workspace for reviewing and editing contracts. See Dashboard Guide — Contracts for details on:
- Coverage map — which response fields have assertions and which don't
- Field samples — observed values and distributions for each JSON path
- Impact preview — test contract changes against recent captures before committing
- Smart defaults — suggested assertions from schema bounds
- Pin/unpin — lock field choices so auto-refresh doesn't overwrite them
- Confidence tiers — low/medium/high based on sample count
You can also validate contracts directly in your application code using the SDK. See SDK Integration.