
Promoting Contracts — ReplayCI

You've observed your model's behavior and generated a contract. Now you want to enforce it — across providers, across models, in CI. This guide covers the full loop: review what observe generated, promote it to a truth contract, test it, and lock it down.


The Observe-Promote-Enforce Loop

ReplayCI contracts move through three stages:

  1. Observe — replayci observe calls your LLM with real prompts and tools, captures the response, and generates a contract pack. These contracts are drafts: status: observed, replay-only.
  2. Promote — You review what the model did, decide what to enforce, and create a truth contract. Truth contracts define your quality bar — the tools, arguments, and structure your agent must produce.
  3. Enforce — Run truth contracts against any model. Gate merges in CI. Catch regressions before they reach production.

The key insight: observed contracts describe what the model did. Truth contracts describe what it should do. Promotion is where you make that editorial decision.


Step 1: Review Your Observed Pack

After running replayci observe, you have a pack directory (default: packs/observed/) with these files:

packs/observed/
  pack.yaml                                      # Pack metadata
  contracts/get_weather.yaml                     # Contract with inferred invariants
  golden/get_weather.success.json                # Golden fixture with boundary hashes
  recordings/get_weather.success.recording.json  # Raw provider response
  NeverNormalize.json                            # Normalization exclusions
  nr-allowlist.json                              # Non-reproducible allowlist

Open the contract YAML. Here's what the key fields mean:

status: observed

This contract was auto-generated, not hand-written. Observed contracts:

  • Run against recorded fixtures only (safe to test without API calls)
  • Are never merge-blocking in CI
  • Serve as starting points — not final enforcement gates

provider_modes: ["recorded"]

On golden cases, this restricts the test to recorded replay. The model's original response is replayed without calling the live API. This prevents an observed contract from accidentally running against a live provider and failing because the model made different choices on a second call.

Invariants

Observe infers structure, not values. You'll see checks like:

  • exists: true — the field is present
  • type: "string" — the field has the right type
  • one_of: [...] — the value is in the schema's enum list
  • gte, lte, length_gte, length_lte — schema-derived bounds
  • regex — matches a schema-defined pattern

You will not see equals checks on argument values — those require human judgment about what the "correct" answer is.
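To make these checks concrete, here is a minimal sketch of how structural invariants like the ones above can be evaluated against a tool call's argument values. This is illustrative Python, not ReplayCI's implementation; the `check_invariant` helper and its argument shape are assumptions.

```python
import re

def check_invariant(value, present, inv):
    """Evaluate one structural invariant against a single argument value.

    `inv` holds one of the check keys listed above, e.g. {"type": "string"}
    or {"one_of": [...]}. Hypothetical helper, for illustration only.
    """
    if "exists" in inv:
        return present == inv["exists"]
    if not present:
        return False  # every other check requires the field to be present
    if "type" in inv:
        expected = {"string": str, "number": (int, float), "boolean": bool}[inv["type"]]
        return isinstance(value, expected)
    if "one_of" in inv:
        return value in inv["one_of"]
    if "gte" in inv:
        return value >= inv["gte"]
    if "lte" in inv:
        return value <= inv["lte"]
    if "length_gte" in inv:
        return len(value) >= inv["length_gte"]
    if "length_lte" in inv:
        return len(value) <= inv["length_lte"]
    if "regex" in inv:
        return re.fullmatch(inv["regex"], value) is not None
    raise ValueError(f"unknown invariant: {inv}")

args = {"priority": "P2", "title": "Disk alert"}
print(check_invariant(args.get("priority"), "priority" in args,
                      {"one_of": ["P1", "P2", "P3", "P4"]}))  # True
```

Note that every check except exists implicitly requires the field to be present — which is why removing a redundant exists: true rarely loosens a contract that already has a type or one_of check on the same path.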


Step 2: Create a Truth Contract

Two paths: automated with replayci promote, or manual editing.

npx replayci promote --from packs/observed --to packs/my-truth

This copies the entire observed pack and applies promotion transforms:

  • Removes status: observed from contract YAMLs
  • Expands provider_modes from ["recorded"] to ["recorded", "openai", "anthropic"]
  • Adds a promotion comment at the top of each contract

You'll see a checklist of next steps:

Promoted packs/observed -> packs/my-truth

Next steps:
[ ] Review expect_tools — add/remove required tools
[ ] Review expected_tool_calls — tighten argument invariants
[ ] Adjust pass_threshold (currently 1.0)
[ ] Test: npx replayci --pack packs/my-truth --provider openai --model gpt-4o

Then edit the contracts to match your quality bar.

Option B: Manual editing

Copy the pack and edit the contract YAML directly:

cp -r packs/observed packs/my-truth

In each contract file under packs/my-truth/contracts/:

  1. Remove status: observed (or change to status: truth)
  2. Update provider_modes on golden cases — change ["recorded"] to include your target providers:
    provider_modes:
    - recorded
    - openai
    - anthropic
  3. Tighten invariants — add the checks observe couldn't infer (see below)
  4. Adjust pass_threshold if partial success is acceptable
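The mechanical part of steps 1-2 can be scripted. Below is a rough line-based Python sketch; `promote_contract` is a hypothetical helper, not part of ReplayCI, and it assumes the contract YAML uses the exact `status: observed` and `- recorded` spellings shown in this guide.

```python
def promote_contract(text: str, providers=("openai", "anthropic")) -> str:
    """Apply the manual promotion edits (steps 1-2 above) to contract YAML text.

    Deliberately line-based: drops the `status: observed` line and appends
    live providers after each `- recorded` entry. A sketch, not the
    `replayci promote` implementation; it would mis-fire if `- recorded`
    appeared outside a provider_modes list.
    """
    out = []
    for line in text.splitlines():
        if line.strip() == "status: observed":
            continue  # step 1: remove the draft marker
        out.append(line)
        if line.strip() == "- recorded":
            indent = line[: len(line) - len(line.lstrip())]
            out.extend(f"{indent}- {p}" for p in providers)  # step 2: expand modes
    return "\n".join(out) + "\n"

contract = """\
tool: "get_weather"
status: observed
golden_cases:
  - id: success
    provider_modes:
      - recorded
"""
print(promote_contract(contract))
```

Steps 3-4 (tightening invariants and setting the threshold) stay manual — they encode your quality bar, which is the point of promotion.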

What to add

Observe generates structural checks. For enforcement, you typically want to add:

Tool selection requirements — which tools must be called:

expect_tools:
- "deploy_service"
- "check_health"

Argument value checks — specific values that matter:

expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        equals: "production"
      - path: $.service_name
        exists: true

Ordering — when tool call sequence matters:

tool_order: "strict"
tool_call_match_mode: "strict"

Threshold — when partial success is acceptable:

# 5 of 6 expected tools must match (5/6 ≈ 0.83)
pass_threshold: 0.83
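A note on picking the number, assuming the threshold is compared against the fraction of expected tools that matched (an assumption about ReplayCI's semantics, not confirmed by its source): a threshold meant to allow one miss out of six must sit at or below 5/6 ≈ 0.833.

```python
def passes(matched: int, expected: int, threshold: float) -> bool:
    """Assumed semantics: matched fraction must reach the threshold."""
    return matched / expected >= threshold

print(passes(5, 6, 0.83))  # True: one miss out of six is tolerated
print(passes(5, 6, 1.0))   # False: all six required
```

A threshold of 0.85 would silently require all six tools, since 5/6 falls just under it.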

What to remove

Not every inferred invariant is worth enforcing:

  • Remove exists: true on optional arguments the model might skip
  • Remove type checks on fields where the type varies legitimately
  • Remove length_gte: 1 on tool_calls if you're already using expect_tools

Quick test without promoting: --draft

If you want to test an observed contract against a live provider before deciding whether to promote, use --draft:

npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft

This bypasses the provider_modes: ["recorded"] restriction without modifying the contract. Useful for quick exploratory checks — the contract stays observed, nothing is committed.


Step 3: Test Your Truth Contract

Test in stages — recorded first, then live.

Against recorded fixtures

npx replayci --pack packs/my-truth --provider recorded

This replays the original model response against your tightened contract. If it fails here, your invariants are stricter than what the model actually produced — review and adjust.

Against a known-good model

export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci --pack packs/my-truth --provider openai --model gpt-4o

This calls the live API. The model may produce different argument values or tool ordering — that's expected. Structural checks (exists, type, expect_tools) should still pass if your contract is well-calibrated.

Against your target model

npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini

If this fails, you've found a reliability gap. The failure output shows exactly which invariants broke and a fingerprint for tracking.


Step 4: Compare Models

Save run output as JSON and compare:

npx replayci --pack packs/my-truth --provider openai --model gpt-4o --json > /tmp/baseline.json
npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini --json > /tmp/candidate.json
npx replayci compare --baseline /tmp/baseline.json --candidate /tmp/candidate.json

The comparison shows per-contract pass/fail for each model and a verdict on which is more reliable for your contract.
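If you want to post-process the two JSON files yourself, the core of the comparison is a per-contract diff. The sketch below assumes a simplified result shape (`{"results": [{"contract": ..., "passed": ...}]}`); the real --json schema may differ, and `compare_runs` is a hypothetical helper, not the replayci compare implementation.

```python
import json

def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Diff per-contract pass/fail between two runs (assumed result shape)."""
    b = {r["contract"]: r["passed"] for r in baseline["results"]}
    c = {r["contract"]: r["passed"] for r in candidate["results"]}
    return {
        "regressions": sorted(k for k in b if b[k] and not c.get(k, False)),
        "improvements": sorted(k for k in b if not b[k] and c.get(k, False)),
    }

baseline = {"results": [{"contract": "get_weather", "passed": True},
                        {"contract": "apt_triage", "passed": True}]}
candidate = {"results": [{"contract": "get_weather", "passed": True},
                         {"contract": "apt_triage", "passed": False}]}
print(json.dumps(compare_runs(baseline, candidate)))
```

A contract that passes on the baseline model but fails on the candidate is a regression; the reverse is an improvement.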


Step 5: Enforce in CI

Once your truth contract passes against your target model, add it to your CI pipeline. See CI Integration for GitHub Actions and GitLab setup.

The short version:

# .github/workflows/replayci.yml
- run: npx replayci --pack packs/my-truth --provider recorded

Recorded-provider runs are deterministic, free, and fast — no API keys needed in CI. They replay the golden fixture and check your contract invariants.


Common Patterns

"The model must pick the right tools"

Use expect_tools with a threshold:

expect_tools:
- "analyze_logs"
- "check_metrics"
- "create_ticket"
pass_threshold: 1.0 # all three required

"Arguments must be complete"

Use expected_tool_calls with exists checks:

expected_tool_calls:
  - name: analyze_logs
    argument_invariants:
      - path: $.service_name
        exists: true
      - path: $.time_range
        exists: true
      - path: $.severity
        exists: true

"Argument values must be valid"

Combine type, one_of, and regex:

expected_tool_calls:
  - name: create_ticket
    argument_invariants:
      - path: $.priority
        one_of: ["P1", "P2", "P3", "P4"]
      - path: $.title
        type: "string"
        length_gte: 5

"Tools must run in order"

Use strict ordering:

expect_tools:
- "validate_input"
- "process_data"
- "send_notification"
tool_order: "strict"
tool_call_match_mode: "strict"
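The difference between strict and any ordering comes down to a subsequence check versus a set-membership check. A minimal Python sketch of the assumed semantics (not ReplayCI's source):

```python
def tools_match(actual: list, expected: list, order: str = "strict") -> bool:
    """order="strict": expected tools must appear in order (as a subsequence);
    order="any": every expected tool must appear somewhere in the run.
    Assumed semantics, for illustration only."""
    if order == "any":
        return set(expected) <= set(actual)
    it = iter(actual)
    return all(name in it for name in expected)  # consumes `it` left to right

calls = ["validate_input", "fetch_config", "process_data", "send_notification"]
print(tools_match(calls, ["validate_input", "process_data", "send_notification"]))  # True
print(tools_match(calls, ["process_data", "validate_input"]))  # False: wrong order
```

Note the subsequence check tolerates extra calls in between — strict ordering constrains relative order, not exclusivity.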

Worked Example: Cybersecurity APT Triage

A SOC agent must triage an APT incident involving supply chain compromise, DLL sideloading, and DNS-over-HTTPS C2. The model has 20 tools available, including several "trap" tools that sound correct but analyze the wrong data source.

Before: Observed contract (auto-generated)

tool: "multi_tool_call"
status: observed

expect_tools:
- "query_siem"
- "analyze_doh_traffic"
- "scan_memory_dump"
- "check_binary_sideloading"
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 6

golden_cases:
  - id: apt_triage_success
    input_ref: apt_triage.success.json
    expect_ok: true
    provider_modes:
      - recorded
This contract documents what the model did, but it doesn't enforce what it should do. A weaker model might call analyze_dns_traffic instead of analyze_doh_traffic and still pass (both are in the tool list).

After: Truth contract (promoted and tightened)

# Promoted from packs/observed. Review and tighten invariants.
tool: "multi_tool_call"

expect_tools:
- "query_siem"
- "analyze_doh_traffic" # NOT analyze_dns_traffic (port 53 != HTTPS C2)
- "scan_memory_dump" # NOT scan_disk_image (fileless attack)
- "check_binary_sideloading" # NOT check_binary_signature (DLL is the threat)
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"
pass_threshold: 0.83 # 5 of 6 required (5/6 ≈ 0.83) — allows one miss

expected_tool_calls:
  - name: check_binary_sideloading
    argument_invariants:
      - path: $.suspected_dll_path
        exists: true
      - path: $.process_name
        contains: "rundll32"
  - name: scan_memory_dump
    argument_invariants:
      - path: $.pid
        exists: true
        type: "number"
  - name: isolate_host
    argument_invariants:
      - path: $.hostname
        exists: true

golden_cases:
  - id: apt_triage_success
    input_ref: apt_triage.success.json
    expect_ok: true
    provider_modes:
      - recorded
      - openai
      - anthropic

What changed and why:

  • Removed status: observed — the contract is now enforceable
  • Added comments on tool names — documents why each tool is correct (and what the trap is)
  • Added pass_threshold — allows one missed tool, pragmatic for complex triage
  • Added expected_tool_calls — checks that critical arguments are present and valid
  • Added contains: "rundll32" — the sideloading check must target the right binary
  • Expanded provider_modes — enables live testing against OpenAI and Anthropic

This truth contract will now catch models that fall for the trap tools or omit critical arguments — the exact failure modes discovered during cross-model testing.


Next steps