# Advanced Contracts — ReplayCI
This guide covers advanced contract patterns beyond the basics in Writing Tests. Use these when you need per-tool argument validation, cross-provider testing, or auto-generated contracts from live observations.
## Expected tool calls with argument matchers

`expect_tools` validates that certain tools were called, but it can't check what arguments each tool received. `expected_tool_calls` adds per-tool argument-level invariants.
### Basic example

```yaml
tool: multi_tool_call
expect_tools:
  - deploy_service
  - check_health
expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        exists: true
      - path: $.environment
        one_of: ["staging", "production", "canary"]
      - path: $.replicas
        type: number
        gte: 1
  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true
        type: string
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
Each entry in `expected_tool_calls` specifies a tool name and optional `argument_invariants`. The invariants use the same operators as output invariants (`exists`, `type`, `equals`, `one_of`, `regex`, `gte`, `lte`, `length_gte`, `length_lte`, `contains`).
### Relative paths

Argument invariant paths are relative to the parsed arguments object used by the matcher (the runtime parses the JSON string arguments before evaluating), not the full response. Use `$.field`, not `$.tool_calls[0].arguments.field`.

```yaml
# Correct — relative to the tool call's arguments
argument_invariants:
  - path: $.service_name
    exists: true

# Wrong — indexed paths don't work in expected_tool_calls
argument_invariants:
  - path: $.tool_calls[0].arguments.service_name
    exists: true
```
This makes `expected_tool_calls` order-independent by default. The matcher finds any tool call matching the name and argument invariants, regardless of position.
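To make the parse-then-match behavior concrete, here is a minimal illustrative sketch (not the actual ReplayCI runtime; the function name and the dotted-path handling are assumptions) of checking an `exists` invariant against a tool call's JSON-string arguments:

```python
import json

# Illustrative sketch: the raw `arguments` field arrives as a JSON string,
# so it is parsed first, and the relative path is resolved against the
# resulting object rather than against the full response.
def check_exists(tool_call: dict, rel_path: str) -> bool:
    args = json.loads(tool_call["arguments"])   # parse before evaluating
    keys = rel_path.lstrip("$.").split(".")     # "$.service_name" -> ["service_name"]
    node = args
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

call = {"name": "check_health", "arguments": '{"service_name": "api"}'}
check_exists(call, "$.service_name")  # True
check_exists(call, "$.region")        # False
```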
### Nested arguments

Invariants support nested objects and arrays:

```yaml
expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        type: string
      - path: $.frameworks
        type: array
        length_gte: 1
      - path: $.frameworks[0]
        type: string
        one_of: ["owasp_top10", "cwe_top25", "sans_top25"]
      - path: $.severity_threshold
        one_of: ["critical", "high", "medium", "low"]
```
## Match modes

### tool_call_match_mode

Controls how `expected_tool_calls` matches actual tool calls.

| Mode | Behavior |
|---|---|
| `"any"` (default) | Each expected call can match any actual call, in any order |
| `"strict"` | Matched calls must appear in the order declared |
```yaml
expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        equals: "staging"
  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true
tool_call_match_mode: "strict"  # deploy_service must come before check_health
```
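As an illustrative sketch of the ordering semantics (an assumed model for explanation, not ReplayCI's actual matcher), strict mode can be thought of as a forward scan through the actual calls:

```python
# Illustrative sketch of strict-mode matching: each expected name must
# appear after the position where the previous expected name matched.
def strict_match(expected: list, actual: list) -> bool:
    pos = 0
    for name in expected:
        try:
            pos = actual.index(name, pos) + 1  # must appear at or after pos
        except ValueError:
            return False
    return True

strict_match(["deploy_service", "check_health"],
             ["deploy_service", "check_health"])  # True
strict_match(["deploy_service", "check_health"],
             ["check_health", "deploy_service"])  # False: wrong order
```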
### Consuming multiset matching

Both `expect_tools` and `expected_tool_calls` use consuming multiset matching. Each actual tool call satisfies at most one expected entry. This matters when you expect duplicate tool names:
```yaml
expect_tools:
  - search
  - search
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
This requires the model to call `search` at least twice. A single `search` call satisfies one entry, leaving one unmatched.
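A compact illustrative model of this consuming behavior (the function below is an assumption for explanation, not the real matcher):

```python
# Illustrative sketch of consuming multiset matching: each actual tool call
# can satisfy at most one expected entry, so duplicate expected names must
# be covered by distinct actual calls.
def count_matched(expected: list, actual: list) -> int:
    remaining = list(actual)
    matched = 0
    for name in expected:
        if name in remaining:
            remaining.remove(name)  # consumed: can't satisfy another entry
            matched += 1
    return matched

count_matched(["search", "search"], ["search"])            # 1: one entry unmatched
count_matched(["search", "search"], ["search", "search"])  # 2: both satisfied
```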
### pass_threshold with expected_tool_calls

`pass_threshold` works with both `expect_tools` and `expected_tool_calls`. When set, the contract passes if the ratio of matched tool calls meets the threshold.

```yaml
expected_tool_calls:
  - name: fetch_data
    argument_invariants:
      - path: $.source
        exists: true
  - name: transform_data
    argument_invariants:
      - path: $.format
        exists: true
  - name: publish_results
    argument_invariants:
      - path: $.destination
        exists: true
pass_threshold: 0.66  # pass if 2 of 3 expected calls match
```
Without `pass_threshold` (or with it set to `1.0`), all expected tool calls must match.
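The threshold arithmetic can be sketched as follows (illustrative; `passes` is a hypothetical helper, not part of any ReplayCI API):

```python
# Illustrative sketch: a contract with a pass_threshold passes when the
# ratio of matched expected calls meets the threshold. The default of 1.0
# reproduces the all-must-match behavior.
def passes(matched: int, expected: int, pass_threshold: float = 1.0) -> bool:
    if expected == 0:
        return True
    return matched / expected >= pass_threshold

passes(2, 3, pass_threshold=0.66)  # True:  0.666... >= 0.66
passes(2, 3)                       # False: default threshold is 1.0
```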
## Combining operators

A single invariant can combine multiple operators for range checks, type + constraint checks, or existence + validation:

```yaml
assertions:
  output_invariants:
    # Range check: 1 to 10 replicas
    - path: $.tool_calls[0].arguments.replicas
      type: number
      gte: 1
      lte: 10
    # Array length bounds
    - path: $.tool_calls
      length_gte: 1
      length_lte: 5
    # Existence + type + pattern
    - path: $.tool_calls[0].arguments.environment
      exists: true
      type: string
      regex: "^(staging|production|canary)$"
```
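Conceptually, an invariant with several operators passes only if every operator holds. Here is a hedged Python sketch of such an evaluator, assuming the operator semantics documented above (the implementation itself is illustrative, not ReplayCI's):

```python
import re

# Illustrative operator evaluator: all operators present on an invariant
# must hold for the invariant to pass. Only a subset of operators is shown.
TYPE_MAP = {"number": (int, float), "string": str, "array": list}

def check_invariant(value, inv: dict) -> bool:
    if inv.get("exists") and value is None:
        return False
    if "type" in inv and not isinstance(value, TYPE_MAP[inv["type"]]):
        return False
    if "gte" in inv and not value >= inv["gte"]:
        return False
    if "lte" in inv and not value <= inv["lte"]:
        return False
    if "regex" in inv and not re.search(inv["regex"], value):
        return False
    return True

check_invariant(3, {"type": "number", "gte": 1, "lte": 10})  # True
check_invariant("staging", {"exists": True, "type": "string",
                            "regex": "^(staging|production|canary)$"})  # True
check_invariant(11, {"type": "number", "gte": 1, "lte": 10})  # False
```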
## Cross-provider testing

The same contract works across OpenAI, Anthropic, and other providers. ReplayCI normalizes provider responses to a common format, so `$.tool_calls[0].name` works regardless of whether the model returns OpenAI-style `function.name` or Anthropic-style `name`.
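The shapes being normalized look roughly like this. The provider field names below follow OpenAI's `function.name`/`function.arguments` and Anthropic's `tool_use` block with `name`/`input`; the normalizer function itself is an illustrative assumption about how such a layer could work, not ReplayCI's actual code:

```python
import json

# Illustrative normalization: map both provider shapes to one common
# {name, arguments} form so the same contract paths apply to both.
def normalize_tool_call(provider: str, raw: dict) -> dict:
    if provider == "openai":
        # OpenAI: {"type": "function", "function": {"name": ..., "arguments": "<json string>"}}
        return {"name": raw["function"]["name"],
                "arguments": json.loads(raw["function"]["arguments"])}
    if provider == "anthropic":
        # Anthropic: {"type": "tool_use", "name": ..., "input": {...}}
        return {"name": raw["name"], "arguments": raw["input"]}
    raise ValueError(f"unknown provider: {provider}")

openai_call = {"type": "function",
               "function": {"name": "check_health",
                            "arguments": '{"service_name": "api"}'}}
anthropic_call = {"type": "tool_use", "name": "check_health",
                  "input": {"service_name": "api"}}
# Both normalize to the same shape, so one contract covers both providers.
normalize_tool_call("openai", openai_call) == normalize_tool_call("anthropic", anthropic_call)  # True
```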
### Running the same pack against different providers

```bash
# Test against OpenAI
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

# Same pack, Anthropic
npx replayci --pack packs/my-pack --provider anthropic --model claude-sonnet-4-6
```
### Shadow mode for side-by-side comparison

Run your primary provider normally, and ReplayCI replays the same requests against a shadow provider:

```bash
npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6
```

See Shadow Mode for details.
### Provider-specific golden cases

Some golden cases only make sense for recorded fixtures (e.g., negative cases that force specific failure modes). Use `provider_modes` to restrict them:

```yaml
golden_cases:
  - id: "success_case"
    input_ref: "tool_call.success.json"
    expect_ok: true
  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]  # only runs against recorded fixtures
```
## Auto-generating contracts with observe

Instead of writing YAML by hand, use `replayci observe` to generate contracts from live provider responses. See the Observe Guide for the full walkthrough.
### The two-tier model

Generated contracts have `status: observed` — they're drafts based on structure, not enforced truth. Promote them to truth contracts after review:

- Observe — auto-generate contracts from live provider responses
- Review — inspect the generated YAML, adjust invariants
- Run — validate with `npx replayci --pack packs/my-observed-pack`
- Promote — remove `status: observed` to make them enforcement-ready
### What observe infers
The inference engine analyzes each response and generates invariants based on structure, not values:
| Pattern | What's inferred |
|---|---|
| Single tool call | exists, type on each argument |
| Multiple tool calls | expected_tool_calls with relative argument invariants |
| Schema enums | one_of constraints from tool parameter definitions |
| Nested objects | Recursive exists + type down to depth 5 |
| Arrays | type: array, length_gte: 1, plus element-level checks |
| Text-only response (no tool calls) | Warning — review needed |
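The structure-over-values idea can be sketched as a small recursive walk (illustrative Python, assuming the depth-5 recursion and operators described in the table; this is not the actual inference engine):

```python
# Illustrative sketch of structural inference: for each observed argument
# value, emit exists/type invariants based on its shape (never its content),
# recursing into nested objects and arrays up to a fixed depth.
def infer_invariants(value, path="$", depth=0, max_depth=5):
    type_name = {dict: "object", list: "array", str: "string",
                 int: "number", float: "number", bool: "boolean"}[type(value)]
    invs = [{"path": path, "exists": True, "type": type_name}]
    if isinstance(value, dict) and depth < max_depth:
        for key, child in value.items():
            invs += infer_invariants(child, f"{path}.{key}", depth + 1, max_depth)
    elif isinstance(value, list) and value and depth < max_depth:
        invs[0]["length_gte"] = 1  # non-empty observed array
        invs += infer_invariants(value[0], f"{path}[0]", depth + 1, max_depth)
    return invs

infer_invariants({"repository": "acme/api", "frameworks": ["owasp_top10"]})
```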
### Example: generated multi-tool contract

Running `replayci observe` against a security audit prompt might produce:

```yaml
# Auto-generated by replayci observe
# Status: OBSERVED (draft)
tool: multi_tool_call
status: observed
expect_tools:
  - scan_vulnerabilities
  - generate_compliance_report
tool_order: any
expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        exists: true
        type: string
      - path: $.scan_type
        one_of: ["sast", "dast", "dependency", "secret_scan", "full"]
  - name: generate_compliance_report
    argument_invariants:
      - path: $.report_type
        one_of: ["soc2_type1", "soc2_type2", "iso27001", "pci_dss"]
      - path: $.output_format
        one_of: ["pdf", "json", "markdown"]
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2
```
This captures structural expectations without hardcoding specific values. A model that calls `scan_vulnerabilities` with `scan_type: "dast"` and `generate_compliance_report` with `report_type: "iso27001"` still passes, because the `one_of` invariants constrain the allowed values rather than pinning exact ones.
## Contract field reference

Full list of contract fields, including advanced ones not covered in Writing Tests:

| Field | Type | Description |
|---|---|---|
| `tool` | string | Tool name, or `"multi_tool_call"` for multi-tool contracts |
| `status` | string | `"observed"` for auto-generated drafts (optional) |
| `expect_tools` | string[] | List of tool names that must all be called |
| `tool_order` | `"any"` or `"strict"` | Whether tool call order matters (default: `"any"`) |
| `expected_tool_calls` | object[] | Per-tool argument invariants (see above) |
| `tool_call_match_mode` | `"any"` or `"strict"` | Matching order for `expected_tool_calls` (default: `"any"`) |
| `pass_threshold` | number | 0.0 to 1.0 — minimum match ratio for partial success |
| `side_effect` | `"read"`, `"write"`, or `"destructive"` | Declare the tool's side effect level |
| `assertions` | object | `input_invariants` and `output_invariants` arrays |
| `golden_cases` | object[] | Test cases with fixture references |
| `timeouts` | object | `total_ms` for provider call timeout |
| `retries` | object | `max_attempts`, `retry_on` for transient errors |
| `rate_limits` | object | `on_429` handling configuration |
| `allowed_errors` | object[] | Error types that don't count as failures |
## Dashboard contract review
The dashboard provides an interactive workspace for reviewing and editing contracts. See Dashboard Guide — Contracts for details on:
- Coverage map — which response fields have assertions and which don't
- Field samples — observed values and distributions for each JSON path
- Impact preview — test contract changes against recent captures before committing
- Smart defaults — suggested assertions from schema bounds
- Pin/unpin — lock field choices so auto-refresh doesn't overwrite them
- Confidence tiers — low/medium/high based on sample count
You can also validate contracts directly in your application code using the SDK. See SDK Integration.