Promoting Contracts — ReplayCI
You've observed your model's behavior and generated a contract. Now you want to enforce it — across providers, across models, in CI. This guide covers the full loop: review what observe generated, promote it to a truth contract, test it, and lock it down.
The Observe-Promote-Enforce Loop
ReplayCI contracts move through three stages:
- Observe — `replayci observe` calls your LLM with real prompts and tools, captures the response, and generates a contract pack. These contracts are drafts: `status: observed`, replay-only.
- Promote — You review what the model did, decide what to enforce, and create a truth contract. Truth contracts define your quality bar — the tools, arguments, and structure your agent must produce.
- Enforce — Run truth contracts against any model. Gate merges in CI. Catch regressions before they reach production.
The key insight: observed contracts describe what the model did. Truth contracts describe what it should do. Promotion is where you make that editorial decision.
Step 1: Review Your Observed Pack
After running replayci observe, you have a pack directory (default: packs/observed/) with these files:
packs/observed/
pack.yaml # Pack metadata
contracts/get_weather.yaml # Contract with inferred invariants
golden/get_weather.success.json # Golden fixture with boundary hashes
recordings/get_weather.success.recording.json # Raw provider response
NeverNormalize.json # Normalization exclusions
nr-allowlist.json # Non-reproducible allowlist
Open the contract YAML. Here's what the key fields mean:
status: observed
This contract was auto-generated, not hand-written. Observed contracts:
- Run against recorded fixtures only (safe to test without API calls)
- Are never merge-blocking in CI
- Serve as starting points — not final enforcement gates
provider_modes: ["recorded"]
On golden cases, this restricts the test to recorded replay. The model's original response is replayed without calling the live API. This prevents an observed contract from accidentally running against a live provider and failing because the model made different choices on a second call.
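In contract YAML, this restriction lives on each golden case. A sketch (illustrative IDs, shaped like the worked example later on this page):

    golden_cases:
      - id: get_weather_success
        input_ref: get_weather.success.json
        expect_ok: true
        provider_modes:
          - recorded   # replay the captured response; never call a live API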
Invariants
Observe infers structure, not values. You'll see checks like:
- `exists: true` — the field is present
- `type: "string"` — the field has the right type
- `one_of: [...]` — the value is in the schema's enum list
- `gte`, `lte`, `length_gte`, `length_lte` — schema-derived bounds
- `regex` — matches a schema-defined pattern
You will not see equals checks on argument values — those require human judgment about what the "correct" answer is.
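Put together, an inferred invariant block typically looks something like this (a sketch with hypothetical field names; the operators are the ones listed above):

    expected_tool_calls:
      - name: get_weather
        argument_invariants:
          - path: $.location
            exists: true
            type: "string"
            length_gte: 1                        # schema-derived bound
          - path: $.units
            one_of: ["celsius", "fahrenheit"]    # from the schema's enum
    # No `equals` checks — observe can't know which value is "correct"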
Step 2: Create a Truth Contract
Two paths: automated with replayci promote, or manual editing.
Option A: replayci promote (recommended)
npx replayci promote --from packs/observed --to packs/my-truth
This copies the entire observed pack and applies promotion transforms:
- Removes `status: observed` from contract YAMLs
- Expands `provider_modes` from `["recorded"]` to `["recorded", "openai", "anthropic"]`
- Adds a promotion comment at the top of each contract
You'll see a checklist of next steps:
Promoted packs/observed -> packs/my-truth
Next steps:
[ ] Review expect_tools — add/remove required tools
[ ] Review expected_tool_calls — tighten argument invariants
[ ] Adjust pass_threshold (currently 1.0)
[ ] Test: npx replayci --pack packs/my-truth --provider openai --model gpt-4o
Then edit the contracts to match your quality bar.
Option B: Manual editing
Copy the pack and edit the contract YAML directly:
cp -r packs/observed packs/my-truth
In each contract file under packs/my-truth/contracts/:
- Remove `status: observed` (or change to `status: truth`)
- Update `provider_modes` on golden cases — change `["recorded"]` to include your target providers:

      provider_modes:
        - recorded
        - openai
        - anthropic

- Tighten invariants — add the checks observe couldn't infer (see below)
- Adjust `pass_threshold` if partial success is acceptable
What to add
Observe generates structural checks. For enforcement, you typically want to add:
Tool selection requirements — which tools must be called:
expect_tools:
- "deploy_service"
- "check_health"
Argument value checks — specific values that matter:
expected_tool_calls:
- name: deploy_service
argument_invariants:
- path: $.environment
equals: "production"
- path: $.service_name
exists: true
Ordering — when tool call sequence matters:
tool_order: "strict"
tool_call_match_mode: "strict"
Threshold — when partial success is acceptable:
# 5 of 6 expected tools must match (5/6 ≈ 0.83)
pass_threshold: 0.8
What to remove
Not every inferred invariant is worth enforcing:
- Remove `exists: true` on optional arguments the model might skip
- Remove `type` checks on fields where the type varies legitimately
- Remove `length_gte: 1` on `tool_calls` if you're already using `expect_tools`
Quick test without promoting: --draft
If you want to test an observed contract against a live provider before deciding whether to promote, use --draft:
npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft
This bypasses the provider_modes: ["recorded"] restriction without modifying the contract. Useful for quick exploratory checks — the contract stays observed, nothing is committed.
Step 3: Test Your Truth Contract
Test in stages — recorded first, then live.
Against recorded fixtures
npx replayci --pack packs/my-truth --provider recorded
This replays the original model response against your tightened contract. If it fails here, your invariants are stricter than what the model actually produced — review and adjust.
Against a known-good model
export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci --pack packs/my-truth --provider openai --model gpt-4o
This calls the live API. The model may produce different argument values or tool ordering — that's expected. Structural checks (exists, type, expect_tools) should still pass if your contract is well-calibrated.
Against your target model
npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini
If this fails, you've found a reliability gap. The failure output shows exactly which invariants broke and a fingerprint for tracking.
Step 4: Compare Models
Save run output as JSON and compare:
npx replayci --pack packs/my-truth --provider openai --model gpt-4o --json > /tmp/baseline.json
npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini --json > /tmp/candidate.json
npx replayci compare --baseline /tmp/baseline.json --candidate /tmp/candidate.json
The comparison shows per-contract pass/fail for each model and a verdict on which is more reliable for your contract.
Step 5: Enforce in CI
Once your truth contract passes against your target model, add it to your CI pipeline. See CI Integration for GitHub Actions and GitLab setup.
The short version:
# .github/workflows/replayci.yml
- run: npx replayci --pack packs/my-truth --provider recorded
Recorded-provider runs are deterministic, free, and fast — no API keys needed in CI. They replay the golden fixture and check your contract invariants.
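Expanded into a complete workflow file, the GitHub Actions setup might look like this (standard Actions syntax; the job name, Node version, and pack path are assumptions to adjust for your repo):

    # .github/workflows/replayci.yml
    name: replayci
    on:
      pull_request:

    jobs:
      contracts:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with:
              node-version: 20
          # Recorded runs are deterministic and need no API keys
          - run: npx replayci --pack packs/my-truth --provider recorded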
Common Patterns
"The model must pick the right tools"
Use expect_tools with a threshold:
expect_tools:
- "analyze_logs"
- "check_metrics"
- "create_ticket"
pass_threshold: 1.0 # all three required
"Arguments must be complete"
Use expected_tool_calls with exists checks:
expected_tool_calls:
- name: analyze_logs
argument_invariants:
- path: $.service_name
exists: true
- path: $.time_range
exists: true
- path: $.severity
exists: true
"Argument values must be valid"
Combine type, one_of, and regex:
expected_tool_calls:
- name: create_ticket
argument_invariants:
- path: $.priority
one_of: ["P1", "P2", "P3", "P4"]
- path: $.title
type: "string"
length_gte: 5
"Tools must run in order"
Use strict ordering:
expect_tools:
- "validate_input"
- "process_data"
- "send_notification"
tool_order: "strict"
tool_call_match_mode: "strict"
Worked Example: Cybersecurity APT Triage
A SOC agent must triage an APT incident involving supply chain compromise, DLL sideloading, and DNS-over-HTTPS C2. The model has 20 tools available, including several "trap" tools that sound correct but analyze the wrong data source.
Before: Observed contract (auto-generated)
tool: "multi_tool_call"
status: observed
expect_tools:
- "query_siem"
- "analyze_doh_traffic"
- "scan_memory_dump"
- "check_binary_sideloading"
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"
assertions:
output_invariants:
- path: "$.tool_calls"
length_gte: 6
golden_cases:
- id: apt_triage_success
input_ref: apt_triage.success.json
expect_ok: true
provider_modes:
- recorded
This contract documents what the model did, but it doesn't enforce what it should do. A weaker model might call analyze_dns_traffic instead of analyze_doh_traffic and still pass (both are among the 20 tools available to the model).
After: Truth contract (promoted and tightened)
# Promoted from packs/observed. Review and tighten invariants.
tool: "multi_tool_call"
expect_tools:
- "query_siem"
- "analyze_doh_traffic" # NOT analyze_dns_traffic (port 53 != HTTPS C2)
- "scan_memory_dump" # NOT scan_disk_image (fileless attack)
- "check_binary_sideloading" # NOT check_binary_signature (DLL is the threat)
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"
pass_threshold: 0.8 # 5/6 ≈ 0.83 required — allows one miss
expected_tool_calls:
- name: check_binary_sideloading
argument_invariants:
- path: $.suspected_dll_path
exists: true
- path: $.process_name
contains: "rundll32"
- name: scan_memory_dump
argument_invariants:
- path: $.pid
exists: true
type: "number"
- name: isolate_host
argument_invariants:
- path: $.hostname
exists: true
golden_cases:
- id: apt_triage_success
input_ref: apt_triage.success.json
expect_ok: true
provider_modes:
- recorded
- openai
- anthropic
What changed and why:
| Change | Reason |
|---|---|
| Removed `status: observed` | Contract is now enforceable |
| Added comments on tool names | Documents why each tool is correct (and what the trap is) |
| Added `pass_threshold` | Allows one missed tool — pragmatic for complex triage |
| Added `expected_tool_calls` | Checks that critical arguments are present and valid |
| Added `contains: "rundll32"` | The sideloading check must target the right binary |
| Expanded `provider_modes` | Enables live testing against OpenAI and Anthropic |
This truth contract will now catch models that fall for the trap tools or omit critical arguments — the exact failure modes discovered during cross-model testing.
Next steps
- Writing Tests — full contract format, all assertion operators
- CI Integration — enforce contracts in your pipeline
- CLI Reference — `promote`, `validate`, `compare`, and all flags