Advanced Contracts — ReplayCI

This guide covers advanced contract patterns beyond the basics in Writing Tests. Use these when you need per-tool argument validation, cross-provider testing, or auto-generated contracts from live observations.


Expected tool calls with argument matchers

expect_tools validates that certain tools were called, but it can't check what arguments each tool received. expected_tool_calls adds per-tool argument-level invariants.

Basic example

tool: multi_tool_call

expect_tools:
  - deploy_service
  - check_health

expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        exists: true
      - path: $.environment
        one_of: ["staging", "production", "canary"]
      - path: $.replicas
        type: number
        gte: 1

  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true
        type: string

assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2

Each entry in expected_tool_calls specifies a tool name and optional argument_invariants. The invariants use the same operators as output invariants (exists, type, equals, one_of, regex, gte, lte, length_gte, length_lte, contains).
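Two of those operators, regex and contains, don't appear in the example above but follow the same shape. A hedged sketch with hypothetical fields ($.image_tag, $.labels), assuming contains checks array membership:

```yaml
argument_invariants:
  - path: $.image_tag
    type: string
    regex: "^v[0-9]+\\.[0-9]+\\.[0-9]+$"   # semver-style tag
  - path: $.labels
    type: array
    contains: "team:platform"
```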

Relative paths

Argument invariant paths are relative to each tool call's parsed arguments object, not to the full response (the runtime parses the JSON-string arguments before evaluating invariants). Use $.field, not $.tool_calls[0].arguments.field.

# Correct — relative to the tool call's arguments
argument_invariants:
  - path: $.service_name
    exists: true

# Wrong — indexed paths don't work in expected_tool_calls
argument_invariants:
  - path: $.tool_calls[0].arguments.service_name
    exists: true

This makes expected_tool_calls order-independent by default. The matcher finds any tool call matching the name and argument invariants, regardless of position.
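The parsing step mentioned above can be sketched in Python. This is a minimal illustration of the idea, not ReplayCI's implementation; the check_exists helper is hypothetical:

```python
import json

def check_exists(arguments_json: str, field: str) -> bool:
    # The runtime parses the JSON-string arguments first, then
    # evaluates paths relative to the resulting object.
    args = json.loads(arguments_json)
    return field in args

tool_call = {"name": "check_health", "arguments": '{"service_name": "api"}'}
print(check_exists(tool_call["arguments"], "service_name"))  # True
```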

Nested arguments

Invariants support nested objects and arrays:

expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        type: string
      - path: $.frameworks
        type: array
        length_gte: 1
      - path: $.frameworks[0]
        type: string
        one_of: ["owasp_top10", "cwe_top25", "sans_top25"]
      - path: $.severity_threshold
        one_of: ["critical", "high", "medium", "low"]

Match modes

tool_call_match_mode

Controls how expected_tool_calls matches actual tool calls.

  Mode              Behavior
  "any" (default)   Each expected call can match any actual call, in any order
  "strict"          Matched calls must appear in the order declared

expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        equals: "staging"
  - name: check_health
    argument_invariants:
      - path: $.service_name
        exists: true

tool_call_match_mode: "strict" # deploy_service must come before check_health

Consuming multiset matching

Both expect_tools and expected_tool_calls use consuming multiset matching. Each actual tool call satisfies at most one expected entry. This matters when you expect duplicate tool names:

expect_tools:
  - search
  - search

assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2

This requires the model to call search at least twice. A single search call satisfies one entry, leaving one unmatched.
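The consuming rule can be sketched in a few lines of Python. This illustrates the behavior only; it is not ReplayCI's matcher, and it ignores argument invariants for brevity:

```python
def match_expected(expected: list[str], actual: list[str]) -> list[str]:
    """Return the expected tool names left unmatched after consuming matching."""
    remaining = list(actual)
    unmatched = []
    for name in expected:
        if name in remaining:
            remaining.remove(name)  # consume: each actual call matches at most once
        else:
            unmatched.append(name)
    return unmatched

# A single search call satisfies one entry, leaving one unmatched:
print(match_expected(["search", "search"], ["search"]))             # ['search']
# Two search calls satisfy both entries:
print(match_expected(["search", "search"], ["search", "search"]))   # []
```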


pass_threshold with expected_tool_calls

pass_threshold works with both expect_tools and expected_tool_calls. When set, the contract passes if the ratio of matched tool calls meets the threshold.

expected_tool_calls:
  - name: fetch_data
    argument_invariants:
      - path: $.source
        exists: true
  - name: transform_data
    argument_invariants:
      - path: $.format
        exists: true
  - name: publish_results
    argument_invariants:
      - path: $.destination
        exists: true

pass_threshold: 0.66 # pass if 2 of 3 expected calls match

Without pass_threshold (or set to 1.0), all expected tool calls must match.
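The pass rule reduces to a ratio comparison; a minimal sketch of the arithmetic (not ReplayCI's code):

```python
def contract_passes(matched: int, expected: int, pass_threshold: float = 1.0) -> bool:
    # Pass when the ratio of matched expected calls meets the threshold.
    return matched / expected >= pass_threshold

print(contract_passes(2, 3, 0.66))  # True: 2/3 is about 0.667, which meets 0.66
print(contract_passes(2, 3))        # False: the default threshold of 1.0 requires all matches
```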


Combining operators

A single invariant can combine multiple operators for range checks, type + constraint checks, or existence + validation:

assertions:
  output_invariants:
    # Range check: 1 to 10 replicas
    - path: $.tool_calls[0].arguments.replicas
      type: number
      gte: 1
      lte: 10

    # Array length bounds
    - path: $.tool_calls
      length_gte: 1
      length_lte: 5

    # Existence + type + pattern
    - path: $.tool_calls[0].arguments.environment
      exists: true
      type: string
      regex: "^(staging|production|canary)$"

Cross-provider testing

The same contract works across OpenAI, Anthropic, and other providers. ReplayCI normalizes provider responses to a common format, so $.tool_calls[0].name works regardless of whether the model returns OpenAI-style function.name or Anthropic-style name.
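As a hedged sketch, the normalized shape those paths address might look like the following. Only tool_calls, name, and the JSON-string arguments appear elsewhere in this guide; any other detail of the shape is an assumption:

```json
{
  "tool_calls": [
    {
      "name": "deploy_service",
      "arguments": "{\"environment\": \"staging\", \"replicas\": 2}"
    }
  ]
}
```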

Running the same pack against different providers

# Test against OpenAI
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

# Same pack, Anthropic
npx replayci --pack packs/my-pack --provider anthropic --model claude-sonnet-4-6

Shadow mode for side-by-side comparison

Run your primary provider normally, and ReplayCI replays the same requests against a shadow provider:

npx replayci --pack packs/my-pack \
  --provider openai --model gpt-4o-mini \
  --shadow-capture \
  --shadow-provider anthropic --shadow-model claude-sonnet-4-6

See Shadow Mode for details.

Provider-specific golden cases

Some golden cases only make sense for recorded fixtures (e.g., negative cases that force specific failure modes). Use provider_modes to restrict where a case runs:

golden_cases:
  - id: "success_case"
    input_ref: "tool_call.success.json"
    expect_ok: true

  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"] # only runs against recorded fixtures

Auto-generating contracts with observe

Instead of writing YAML by hand, use replayci observe to generate contracts from live provider responses. See the Observe Guide for the full walkthrough.

The two-tier model

Generated contracts have status: observed — they're drafts based on structure, not enforced truth. Promote them to truth contracts after review:

  1. Observe — auto-generate contracts from live provider responses
  2. Review — inspect the generated YAML, adjust invariants
  3. Run — validate with npx replayci --pack packs/my-observed-pack
  4. Promote — remove status: observed to make them enforcement-ready

What observe infers

The inference engine analyzes each response and generates invariants based on structure, not values:

  Pattern                              What's inferred
  Single tool call                     exists, type on each argument
  Multiple tool calls                  expected_tool_calls with relative argument invariants
  Schema enums                         one_of constraints from tool parameter definitions
  Nested objects                       Recursive exists + type down to depth 5
  Arrays                               type: array, length_gte: 1, plus element-level checks
  Text-only response (no tool calls)   Warning: review needed

Example: generated multi-tool contract

Running replayci observe against a security audit prompt might produce:

# Auto-generated by replayci observe
# Status: OBSERVED (draft)
tool: multi_tool_call
status: observed
expect_tools:
  - scan_vulnerabilities
  - generate_compliance_report
tool_order: any
expected_tool_calls:
  - name: scan_vulnerabilities
    argument_invariants:
      - path: $.repository
        exists: true
        type: string
      - path: $.scan_type
        one_of: ["sast", "dast", "dependency", "secret_scan", "full"]
  - name: generate_compliance_report
    argument_invariants:
      - path: $.report_type
        one_of: ["soc2_type1", "soc2_type2", "iso27001", "pci_dss"]
      - path: $.output_format
        one_of: ["pdf", "json", "markdown"]
assertions:
  output_invariants:
    - path: $.tool_calls
      length_gte: 2

This captures structural expectations without hardcoding specific values. A model that calls scan_vulnerabilities with scan_type: "dast" and generate_compliance_report with report_type: "iso27001" passes, as would any other combination of the allowed values.


Contract field reference

Full list of contract fields, including advanced ones not covered in Writing Tests:

  Field                  Type                               Description
  tool                   string                             Tool name, or "multi_tool_call" for multi-tool contracts
  status                 string                             "observed" for auto-generated drafts (optional)
  expect_tools           string[]                           List of tool names that must all be called
  tool_order             "any" or "strict"                  Whether tool call order matters (default: "any")
  expected_tool_calls    object[]                           Per-tool argument invariants (see above)
  tool_call_match_mode   "any" or "strict"                  Matching order for expected_tool_calls (default: "any")
  pass_threshold         number                             0.0 to 1.0, minimum match ratio for partial success
  side_effect            "read", "write", or "destructive"  Declares the tool's side-effect level
  assertions             object                             input_invariants and output_invariants arrays
  golden_cases           object[]                           Test cases with fixture references
  timeouts               object                             total_ms for provider call timeout
  retries                object                             max_attempts, retry_on for transient errors
  rate_limits            object                             on_429 handling configuration
  allowed_errors         object[]                           Error types that don't count as failures
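The reliability-related fields might combine like this. A hedged sketch only: the nested keys shown above (total_ms, max_attempts, retry_on, on_429) come from this reference, but the specific values and the error/retry identifiers are assumptions:

```yaml
side_effect: "read"

timeouts:
  total_ms: 30000          # provider call timeout

retries:
  max_attempts: 3
  retry_on: ["timeout"]    # retry_on identifiers are hypothetical

rate_limits:
  on_429: "backoff"        # on_429 value is hypothetical

allowed_errors:
  - type: "timeout"        # entry shape is hypothetical
```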

Dashboard contract review

The dashboard provides an interactive workspace for reviewing and editing contracts. See Dashboard Guide — Contracts for details on:

  • Coverage map — which response fields have assertions and which don't
  • Field samples — observed values and distributions for each JSON path
  • Impact preview — test contract changes against recent captures before committing
  • Smart defaults — suggested assertions from schema bounds
  • Pin/unpin — lock field choices so auto-refresh doesn't overwrite them
  • Confidence tiers — low/medium/high based on sample count

You can also validate contracts directly in your application code using the SDK. See SDK Integration.