
Promoting Contracts — ReplayCI

You've observed your model's behavior and generated a contract. Now you want to enforce it — across providers, across models, in CI. This guide covers the full loop: review what observe generated, promote it to a truth contract, test it, and lock it down.


The Observe-Promote-Enforce Loop

ReplayCI contracts move through three stages:

  1. Observe — replayci observe calls your LLM with real prompts and tools, captures the response, and generates a contract pack. These contracts are drafts: status: observed, replay-only.
  2. Promote — You review what the model did, decide what to enforce, and create a truth contract. Truth contracts define your quality bar — the tools, arguments, and structure your agent must produce.
  3. Enforce — Run truth contracts against any model. Gate merges in CI. Catch regressions before they reach production.

The key insight: observed contracts describe what the model did. Truth contracts describe what it should do. Promotion is where you make that editorial decision.


Step 1: Review Your Observed Pack

After running replayci observe, you have a pack directory (default: packs/observed/) with these files:

packs/observed/
  pack.yaml                                      # Pack metadata
  contracts/get_weather.yaml                     # Contract with inferred invariants
  golden/get_weather.success.json                # Golden fixture with boundary hashes
  recordings/get_weather.success.recording.json  # Raw provider response
  NeverNormalize.json                            # Normalization exclusions
  nr-allowlist.json                              # Non-reproducible allowlist

Open the contract YAML. Here's what the key fields mean:

status: observed

This contract was auto-generated, not hand-written. Observed contracts:

  • Run against recorded fixtures only (safe to test without API calls)
  • Are never merge-blocking in CI
  • Serve as starting points — not final enforcement gates

provider_modes: ["recorded"]

On golden cases, this restricts the test to recorded replay. The model's original response is replayed without calling the live API. This prevents an observed contract from accidentally running against a live provider and failing because the model made different choices on a second call.

Invariants

Observe infers structure, not values. You'll see checks like:

  • exists: true — the field is present
  • type: "string" — the field has the right type
  • one_of: [...] — the value is in the schema's enum list
  • gte, lte, length_gte, length_lte — schema-derived bounds
  • regex — matches a schema-defined pattern

You will not see equals checks on argument values — those require human judgment about what the "correct" answer is.
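To make these checks concrete, here is a minimal sketch of how structural invariants like the ones above can be evaluated against a tool call's argument values. This is illustrative Python, not ReplayCI's implementation; the `check_invariant` helper and its argument shape are assumptions.

```python
import re

def check_invariant(value, present, inv):
    """Evaluate one structural invariant against a single argument value.

    `inv` holds one of the check keys listed above, e.g. {"type": "string"}
    or {"one_of": [...]}. Hypothetical helper, for illustration only.
    """
    if "exists" in inv:
        return present == inv["exists"]
    if not present:
        return False  # every other check requires the field to be present
    if "type" in inv:
        expected = {"string": str, "number": (int, float), "boolean": bool}[inv["type"]]
        return isinstance(value, expected)
    if "one_of" in inv:
        return value in inv["one_of"]
    if "gte" in inv:
        return value >= inv["gte"]
    if "lte" in inv:
        return value <= inv["lte"]
    if "length_gte" in inv:
        return len(value) >= inv["length_gte"]
    if "length_lte" in inv:
        return len(value) <= inv["length_lte"]
    if "regex" in inv:
        return re.fullmatch(inv["regex"], value) is not None
    raise ValueError(f"unknown invariant: {inv}")

args = {"priority": "P2", "title": "Disk alert"}
print(check_invariant(args.get("priority"), "priority" in args,
                      {"one_of": ["P1", "P2", "P3", "P4"]}))  # True
```

Note that every check except exists implicitly requires the field to be present — which is why removing a redundant exists: true rarely loosens a contract that already has a type or one_of check on the same path.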


Step 2: Create a Truth Contract

Two paths: automated with replayci promote, or manual editing.

npx replayci promote --from packs/observed --to packs/my-truth

This copies the entire observed pack and applies promotion transforms:

  • Removes status: observed from contract YAMLs
  • Expands provider_modes from ["recorded"] to ["recorded", "openai", "anthropic"]
  • Adds a promotion comment at the top of each contract

You'll see a checklist of next steps:

Promoted packs/observed -> packs/my-truth

Next steps:
[ ] Review expect_tools — add/remove required tools
[ ] Review expected_tool_calls — tighten argument invariants
[ ] Adjust pass_threshold (currently 1.0)
[ ] Test: npx replayci --pack packs/my-truth --provider openai --model gpt-4o

Then edit the contracts to match your quality bar.

Option B: Manual editing

Copy the pack and edit the contract YAML directly:

cp -r packs/observed packs/my-truth

In each contract file under packs/my-truth/contracts/:

  1. Remove status: observed (or change to status: truth)
  2. Update provider_modes on golden cases — change ["recorded"] to include your target providers:
    provider_modes:
    - recorded
    - openai
    - anthropic
  3. Tighten invariants — add the checks observe couldn't infer (see below)
  4. Adjust pass_threshold if partial success is acceptable
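The mechanical part of steps 1-2 can be scripted. Below is a rough line-based Python sketch; `promote_contract` is a hypothetical helper, not part of ReplayCI, and it assumes the contract YAML uses the exact `status: observed` and `- recorded` spellings shown in this guide.

```python
def promote_contract(text: str, providers=("openai", "anthropic")) -> str:
    """Apply the manual promotion edits (steps 1-2 above) to contract YAML text.

    Deliberately line-based: drops the `status: observed` line and appends
    live providers after each `- recorded` entry. A sketch, not the
    `replayci promote` implementation; it would mis-fire if `- recorded`
    appeared outside a provider_modes list.
    """
    out = []
    for line in text.splitlines():
        if line.strip() == "status: observed":
            continue  # step 1: remove the draft marker
        out.append(line)
        if line.strip() == "- recorded":
            indent = line[: len(line) - len(line.lstrip())]
            out.extend(f"{indent}- {p}" for p in providers)  # step 2: expand modes
    return "\n".join(out) + "\n"

contract = """\
tool: "get_weather"
status: observed
golden_cases:
  - id: success
    provider_modes:
      - recorded
"""
print(promote_contract(contract))
```

Steps 3-4 (tightening invariants and setting the threshold) stay manual — they encode your quality bar, which is the point of promotion.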

What to add

Observe generates structural checks. For enforcement, you typically want to add:

Tool selection requirements — which tools must be called:

expect_tools:
- "deploy_service"
- "check_health"

Argument value checks — specific values that matter:

expected_tool_calls:
  - name: deploy_service
    argument_invariants:
      - path: $.environment
        equals: "production"
      - path: $.service_name
        exists: true

Ordering — when tool call sequence matters:

tool_order: "strict"
tool_call_match_mode: "strict"

Threshold — when partial success is acceptable:

# 5 of 6 expected tools must match (5/6 ≈ 0.83)
pass_threshold: 0.83
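A note on picking the number, assuming the threshold is compared against the fraction of expected tools that matched (an assumption about ReplayCI's semantics, not confirmed by its source): a threshold meant to allow one miss out of six must sit at or below 5/6 ≈ 0.833.

```python
def passes(matched: int, expected: int, threshold: float) -> bool:
    """Assumed semantics: matched fraction must reach the threshold."""
    return matched / expected >= threshold

print(passes(5, 6, 0.83))  # True: one miss out of six is tolerated
print(passes(5, 6, 1.0))   # False: all six required
```

A threshold of 0.85 would silently require all six tools, since 5/6 falls just under it.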

What to remove

Not every inferred invariant is worth enforcing:

  • Remove exists: true on optional arguments the model might skip
  • Remove type checks on fields where the type varies legitimately
  • Remove length_gte: 1 on tool_calls if you're already using expect_tools

Quick test without promoting: --draft

If you want to test an observed contract against a live provider before deciding whether to promote, use --draft:

npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft

This bypasses the provider_modes: ["recorded"] restriction without modifying the contract. Useful for quick exploratory checks — the contract stays observed, nothing is committed.


Step 3: Test Your Truth Contract

Test in stages — recorded first, then live.

Against recorded fixtures

npx replayci --pack packs/my-truth --provider recorded

This replays the original model response against your tightened contract. If it fails here, your invariants are stricter than what the model actually produced — review and adjust.

Against a known-good model

export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci --pack packs/my-truth --provider openai --model gpt-4o

This calls the live API. The model may produce different argument values or tool ordering — that's expected. Structural checks (exists, type, expect_tools) should still pass if your contract is well-calibrated.

Against your target model

npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini

If this fails, you've found a reliability gap. The failure output shows exactly which invariants broke and a fingerprint for tracking.


Step 4: Compare Models

Save run output as JSON and compare:

npx replayci --pack packs/my-truth --provider openai --model gpt-4o --json > /tmp/baseline.json
npx replayci --pack packs/my-truth --provider openai --model gpt-4o-mini --json > /tmp/candidate.json
npx replayci compare --baseline /tmp/baseline.json --candidate /tmp/candidate.json

The comparison shows per-contract pass/fail for each model and a verdict on which is more reliable for your contract.
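If you want to post-process the two JSON files yourself, the core of the comparison is a per-contract diff. The sketch below assumes a simplified result shape (`{"results": [{"contract": ..., "passed": ...}]}`); the real --json schema may differ, and `compare_runs` is a hypothetical helper, not the replayci compare implementation.

```python
import json

def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Diff per-contract pass/fail between two runs (assumed result shape)."""
    b = {r["contract"]: r["passed"] for r in baseline["results"]}
    c = {r["contract"]: r["passed"] for r in candidate["results"]}
    return {
        "regressions": sorted(k for k in b if b[k] and not c.get(k, False)),
        "improvements": sorted(k for k in b if not b[k] and c.get(k, False)),
    }

baseline = {"results": [{"contract": "get_weather", "passed": True},
                        {"contract": "apt_triage", "passed": True}]}
candidate = {"results": [{"contract": "get_weather", "passed": True},
                         {"contract": "apt_triage", "passed": False}]}
print(json.dumps(compare_runs(baseline, candidate)))
```

A contract that passes on the baseline model but fails on the candidate is a regression; the reverse is an improvement.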


Step 5: Enforce in CI

Once your truth contract passes against your target model, add it to your CI pipeline. See CI Integration for GitHub Actions and GitLab setup.

The short version:

# .github/workflows/replayci.yml
- run: npx replayci --pack packs/my-truth --provider recorded

Recorded-provider runs are deterministic, free, and fast — no API keys needed in CI. They replay the golden fixture and check your contract invariants.


Common Patterns

"The model must pick the right tools"

Use expect_tools with a threshold:

expect_tools:
- "analyze_logs"
- "check_metrics"
- "create_ticket"
pass_threshold: 1.0 # all three required

"Arguments must be complete"

Use expected_tool_calls with exists checks:

expected_tool_calls:
  - name: analyze_logs
    argument_invariants:
      - path: $.service_name
        exists: true
      - path: $.time_range
        exists: true
      - path: $.severity
        exists: true

"Argument values must be valid"

Combine type, one_of, and regex:

expected_tool_calls:
  - name: create_ticket
    argument_invariants:
      - path: $.priority
        one_of: ["P1", "P2", "P3", "P4"]
      - path: $.title
        type: "string"
        length_gte: 5

"Tools must run in order"

Use strict ordering:

expect_tools:
- "validate_input"
- "process_data"
- "send_notification"
tool_order: "strict"
tool_call_match_mode: "strict"
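The difference between strict and any ordering comes down to a subsequence check versus a set-membership check. A minimal Python sketch of the assumed semantics (not ReplayCI's source):

```python
def tools_match(actual: list, expected: list, order: str = "strict") -> bool:
    """order="strict": expected tools must appear in order (as a subsequence);
    order="any": every expected tool must appear somewhere in the run.
    Assumed semantics, for illustration only."""
    if order == "any":
        return set(expected) <= set(actual)
    it = iter(actual)
    return all(name in it for name in expected)  # consumes `it` left to right

calls = ["validate_input", "fetch_config", "process_data", "send_notification"]
print(tools_match(calls, ["validate_input", "process_data", "send_notification"]))  # True
print(tools_match(calls, ["process_data", "validate_input"]))  # False: wrong order
```

Note the subsequence check tolerates extra calls in between — strict ordering constrains relative order, not exclusivity.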

Worked Example: Cybersecurity APT Triage

A SOC agent must triage an APT incident involving supply chain compromise, DLL sideloading, and DNS-over-HTTPS C2. The model has 20 tools available, including several "trap" tools that sound correct but analyze the wrong data source.

Before: Observed contract (auto-generated)

tool: "multi_tool_call"
status: observed

expect_tools:
- "query_siem"
- "analyze_doh_traffic"
- "scan_memory_dump"
- "check_binary_sideloading"
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 6

golden_cases:
  - id: apt_triage_success
    input_ref: apt_triage.success.json
    expect_ok: true
    provider_modes:
      - recorded
This contract documents what the model did, but it doesn't enforce what it should do. A weaker model might call analyze_dns_traffic instead of analyze_doh_traffic and still pass (both are in the tool list).

After: Truth contract (promoted and tightened)

# Promoted from packs/observed. Review and tighten invariants.
tool: "multi_tool_call"

expect_tools:
- "query_siem"
- "analyze_doh_traffic" # NOT analyze_dns_traffic (port 53 != HTTPS C2)
- "scan_memory_dump" # NOT scan_disk_image (fileless attack)
- "check_binary_sideloading" # NOT check_binary_signature (DLL is the threat)
- "trace_supply_chain"
- "isolate_host"
tool_order: "any"
pass_threshold: 0.83 # 5 of 6 required (5/6 ≈ 0.83) — allows one miss

expected_tool_calls:
  - name: check_binary_sideloading
    argument_invariants:
      - path: $.suspected_dll_path
        exists: true
      - path: $.process_name
        contains: "rundll32"
  - name: scan_memory_dump
    argument_invariants:
      - path: $.pid
        exists: true
        type: "number"
  - name: isolate_host
    argument_invariants:
      - path: $.hostname
        exists: true

golden_cases:
  - id: apt_triage_success
    input_ref: apt_triage.success.json
    expect_ok: true
    provider_modes:
      - recorded
      - openai
      - anthropic

What changed and why:

  • Removed status: observed — the contract is now enforceable
  • Added comments on tool names — documents why each tool is correct (and what the trap is)
  • Added pass_threshold — allows one missed tool, pragmatic for complex triage
  • Added expected_tool_calls — checks that critical arguments are present and valid
  • Added contains: "rundll32" — the sideloading check must target the right binary
  • Expanded provider_modes — enables live testing against OpenAI and Anthropic

This truth contract will now catch models that fall for the trap tools or omit critical arguments — the exact failure modes discovered during cross-model testing.


Next steps