
Observe Guide — Auto-Contract Generation

replayci observe generates contracts by watching your LLM make tool calls. Instead of writing YAML contracts by hand, you provide simple JSON files describing what to ask the model and which tools to offer — ReplayCI calls the provider, observes the response, and generates a full runnable contract pack.


Quick start

# 1. Create an observation spec
mkdir observe
cat > observe/get_weather.json << 'EOF'
{
  "messages": [
    { "role": "user", "content": "What's the weather in San Francisco?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}
EOF

# 2. Run observe against a live provider
export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci observe --provider openai --model gpt-4o-mini

# 3. Run the generated pack
npx replayci --pack packs/observed --provider recorded

Creating observation specs

An observation spec is a JSON file with two required fields: messages and tools.

messages

The conversation to send to the model. At minimum, one user message:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Deploy my-service to production." }
  ]
}

Each message needs role (string) and content (string).

tools

The tool definitions to make available to the model. These follow the standard function-calling schema:

{
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": {
            "type": "string",
            "enum": ["staging", "production"]
          }
        },
        "required": ["service_name", "environment"]
      }
    }
  ]
}

Each tool needs name, description, and parameters (a JSON Schema object).
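
Before sending a spec to observe, the two required fields can be checked with a few lines of TypeScript. This validator is a hypothetical helper based on the field requirements described above, not part of the ReplayCI API:

```typescript
// Hypothetical spec validator -- not part of the ReplayCI API.
// Checks the required fields described in this guide: messages and tools.
interface Message { role: string; content: string; }
interface Tool { name: string; description: string; parameters: object; }

function validateSpec(spec: { messages?: Message[]; tools?: Tool[] }): string[] {
  const errors: string[] = [];
  if (!Array.isArray(spec.messages) || spec.messages.length === 0) {
    errors.push("messages: at least one message is required");
  } else {
    spec.messages.forEach((m, i) => {
      if (typeof m.role !== "string") errors.push(`messages[${i}].role must be a string`);
      if (typeof m.content !== "string") errors.push(`messages[${i}].content must be a string`);
    });
  }
  if (!Array.isArray(spec.tools)) {
    errors.push("tools: an array of tool definitions is required");
  } else {
    spec.tools.forEach((t, i) => {
      if (typeof t.name !== "string") errors.push(`tools[${i}].name must be a string`);
      if (typeof t.description !== "string") errors.push(`tools[${i}].description must be a string`);
      if (typeof t.parameters !== "object" || t.parameters === null) {
        errors.push(`tools[${i}].parameters must be a JSON Schema object`);
      }
    });
  }
  return errors;
}
```

Running this over each file in your input directory before invoking observe catches malformed specs without burning a provider call.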

Optional fields

Field         Type                           Default   Description
tool_choice   "auto" | "none" | "required"   "auto"    Whether the model must call a tool
temperature   number                         0         Model temperature (0 = deterministic)
max_tokens    number                         1024      Maximum response tokens

Full example

{
  "messages": [
    { "role": "system", "content": "You are a DevOps assistant." },
    { "role": "user", "content": "Deploy my-service to production and check its health." }
  ],
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": { "type": "string", "enum": ["staging", "production"] }
        },
        "required": ["service_name", "environment"]
      }
    },
    {
      "name": "check_health",
      "description": "Check health status of a service",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" }
        },
        "required": ["service_name"]
      }
    }
  ],
  "tool_choice": "required",
  "temperature": 0
}

Running observe

npx replayci observe --provider openai --model gpt-4o-mini

Required flags:

  • --provider — openai or anthropic
  • --model — the model ID (e.g. gpt-4o-mini, claude-sonnet-4-6)

Optional flags:

  • --input <dir> — directory with spec files (default: observe/)
  • --output <dir> — where to write the generated pack (default: packs/observed)
  • --timeout_ms <ms> — provider call timeout (default: 30000)
  • --json — force JSON output

The REPLAYCI_PROVIDER_KEY environment variable must be set with your provider API key.


How inference works

ReplayCI infers structure, not values. Generated invariants check that the right keys exist and have the right types — they never assert on specific argument values (which change between runs).

Single tool call

When the model calls one tool, ReplayCI infers:

  • The tool name matches (equals on $.tool_calls[0].name)
  • Arguments exist (exists: true on $.tool_calls[0].arguments)
  • Each argument key exists and has the correct type (from the tool schema)
  • If the schema defines enum values, a one_of check is added

Example generated contract:

tool: "get_weather"
status: observed

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      type: "string"
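
The single-call inference rules can be sketched as a small function. This is an illustrative reconstruction, not ReplayCI's implementation; the invariant shapes follow the generated contract shown above:

```typescript
// Illustrative sketch of single-tool-call inference -- not ReplayCI's code.
// From a tool's JSON Schema, produce structural invariants: name equality,
// arguments existence, per-key exists/type, and one_of for enum properties.
type Invariant = { path: string } & Record<string, unknown>;

interface ParamSchema {
  type: string;
  properties?: Record<string, { type?: string; enum?: string[] }>;
}

function inferSingleCall(toolName: string, params: ParamSchema): Invariant[] {
  const base = "$.tool_calls[0]";
  const invariants: Invariant[] = [
    { path: `${base}.name`, equals: toolName },
    { path: `${base}.arguments`, exists: true },
  ];
  for (const [key, prop] of Object.entries(params.properties ?? {})) {
    invariants.push({ path: `${base}.arguments.${key}`, exists: true });
    if (prop.type) invariants.push({ path: `${base}.arguments.${key}`, type: prop.type });
    if (prop.enum) invariants.push({ path: `${base}.arguments.${key}`, one_of: prop.enum });
  }
  return invariants;
}
```

Note that nothing here looks at the observed argument values — only the tool name and the schema shape feed the invariants, which is what keeps them stable across runs.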

Multi-tool calls

When the model calls two or more tools, ReplayCI infers:

  • The contract uses expect_tools with all called tool names
  • tool_order: "any" — tools can appear in any order
  • A length_gte check on the tool_calls array
  • Per-tool name and argument invariants for each call

Example:

tool: "multi_tool_call"
status: observed

expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "any"

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 2
    - path: "$.tool_calls[0].name"
      equals: "deploy_service"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[1].name"
      equals: "check_health"
    - path: "$.tool_calls[1].arguments"
      exists: true
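
To make the invariant semantics concrete, here is a minimal evaluator for the checks used in these examples. It is a hypothetical sketch, not ReplayCI's engine, and its path resolver only handles the simple `$.a.b[0].c` shapes shown in this guide:

```typescript
// Illustrative invariant evaluator -- not ReplayCI's actual engine.
// Resolves simple JSONPath shapes ($.tool_calls[0].name) and applies
// the equals / exists / length_gte checks used in this guide.
function resolve(obj: unknown, path: string): unknown {
  const parts = path.replace(/^\$\.?/, "").split(/[.\[\]]+/).filter(Boolean);
  return parts.reduce<unknown>((cur, p) => {
    if (cur === null || cur === undefined) return undefined;
    return (cur as Record<string, unknown>)[p];
  }, obj);
}

interface Check { path: string; equals?: unknown; exists?: boolean; length_gte?: number; }

function check(output: unknown, inv: Check): boolean {
  const value = resolve(output, inv.path);
  if (inv.equals !== undefined) return value === inv.equals;
  if (inv.exists !== undefined) return (value !== undefined) === inv.exists;
  if (inv.length_gte !== undefined) return Array.isArray(value) && value.length >= inv.length_gte;
  return false;
}
```

Run against a recorded output, each invariant either passes or fails independently, which is why per-tool checks can be added or removed one at a time during review.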

Text-only response

When the model returns text instead of calling a tool, ReplayCI generates a contract that asserts the tool_calls array is empty:

tool: "get_weather"
status: observed

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_lte: 0

This is useful for documenting cases where the model declines to use a tool.

Schema enum inference

If your tool schema defines enum values for a property, the generated contract includes a one_of check:

{
  "environment": {
    "type": "string",
    "enum": ["staging", "production"]
  }
}

Generates:

- path: "$.tool_calls[0].arguments.environment"
  one_of:
    - "staging"
    - "production"

Schema bounds inference

If your tool schema defines numeric bounds (minimum, maximum), string length bounds (minLength, maxLength), array size bounds (minItems, maxItems), or regex patterns (pattern), the generated contract includes corresponding checks.

Numeric bounds:

{
  "timeout_minutes": {
    "type": "number",
    "minimum": 1,
    "maximum": 60
  }
}

Generates:

- path: "$.timeout_minutes"
  gte: 1
- path: "$.timeout_minutes"
  lte: 60

String length bounds:

{
  "title": {
    "type": "string",
    "maxLength": 120
  }
}

Generates:

- path: "$.title"
  length_lte: 120

Array size bounds:

{
  "tags": {
    "type": "array",
    "items": { "type": "string" },
    "minItems": 2,
    "maxItems": 10
  }
}

Generates:

- path: "$.tags"
  length_gte: 2
- path: "$.tags"
  length_lte: 10

Regex patterns:

{
  "start_time": {
    "type": "string",
    "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
  }
}

Generates:

- path: "$.start_time"
  regex: "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
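
The keyword-to-check mapping used throughout this section can be summarized in code. This is a hypothetical sketch; the mapping itself (minimum → gte, maxLength → length_lte, and so on) is taken directly from the examples above:

```typescript
// Illustrative mapping from JSON Schema constraint keywords to invariant
// checks, following the examples in this section -- not ReplayCI's code.
interface PropertySchema {
  minimum?: number; maximum?: number;
  minLength?: number; maxLength?: number;
  minItems?: number; maxItems?: number;
  pattern?: string;
  enum?: string[];
}

function boundsToChecks(path: string, prop: PropertySchema): Array<Record<string, unknown>> {
  const checks: Array<Record<string, unknown>> = [];
  if (prop.minimum !== undefined) checks.push({ path, gte: prop.minimum });
  if (prop.maximum !== undefined) checks.push({ path, lte: prop.maximum });
  if (prop.minLength !== undefined) checks.push({ path, length_gte: prop.minLength });
  if (prop.maxLength !== undefined) checks.push({ path, length_lte: prop.maxLength });
  if (prop.minItems !== undefined) checks.push({ path, length_gte: prop.minItems });
  if (prop.maxItems !== undefined) checks.push({ path, length_lte: prop.maxItems });
  if (prop.pattern !== undefined) checks.push({ path, regex: prop.pattern });
  if (prop.enum !== undefined) checks.push({ path, one_of: prop.enum });
  return checks;
}
```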

Tip: The richer your tool schemas, the tighter your auto-generated contracts. Adding minimum, maximum, pattern, and other JSON Schema constraints to your tool definitions gives observe more to work with — no manual YAML editing needed.

What observe does NOT infer

Observe generates contracts from schema structure, not from prompt understanding. The following require manual addition:

  • Exact value checks (equals) — observe cannot determine which specific value is the "correct" answer for a given prompt
  • Substring matching (contains) — requires understanding the prompt content
  • Strict tool ordering (tool_order: strict) — observed order may vary across runs
  • Forbidden fields (exists: false) — optional field inclusion varies between runs
  • Cross-field reasoning — e.g., severity P1 because payment service + 5xx + >5% error rate

These are the checks you add when promoting an observed contract to a truth contract.


Reviewing generated contracts

After running observe, review the generated contracts in packs/observed/contracts/. Contracts are drafts — you should tighten or relax them based on your needs.

What to tighten

  • Add equals checks on argument values you know should be constant (e.g. a specific service name)
  • Add contains checks for partial string matches
  • Change tool_order from "any" to "strict" if tool call ordering matters
  • Add negative test cases for failure scenarios

What to relax

  • Remove per-key exists checks for optional arguments the model might not always include
  • Add pass_threshold for multi-tool contracts where partial success is acceptable
  • Widen type checks if the model might return different types in edge cases

Generated files

For each observation, three files are generated:

File                                        Purpose
contracts/<name>.yaml                       Contract with inferred invariants
golden/<name>.success.json                  Golden fixture with boundary hashes for recorded replay
recordings/<name>.success.recording.json    Raw recording of the provider call

Pack-level files:

File                 Purpose
pack.yaml            Pack metadata listing all contracts
NeverNormalize.json  Fields excluded from normalization
nr-allowlist.json    Non-reproducible result allowlist (empty by default)

Observed vs Truth: the two-tier model

ReplayCI uses a two-tier contract system:

Truth contracts (enforced) are your source of validation truth. They run in CI, gate merges, and represent your team's agreed-upon behavior expectations. Truth contracts live in hand-curated packs (e.g. packs/openai-v0.1).

Observed contracts (draft/evidence) are auto-generated by replayci observe. They represent what the model did, not what it should do. Observed contracts have:

  • status: observed in the YAML
  • provider_modes: ["recorded"] on golden cases — preventing accidental live enforcement
  • A header comment marking them as auto-generated

This separation ensures that auto-generated contracts never accidentally become merge-blocking gates. Observed contracts are for exploration, coverage analysis, and as starting points for truth contracts.


Promoting observed to truth

Promotion from observed to truth is a manual process:

  1. Review the generated contract — check that invariants match your intent
  2. Remove status: observed from the YAML (or change to status: truth)
  3. Remove provider_modes: ["recorded"] from golden cases if you want live enforcement
  4. Add negative test cases — observed contracts only have success cases
  5. Move the contract to your truth pack directory (e.g. packs/openai-v0.1/contracts/)
  6. Update pack.yaml in the truth pack to include the new contract
  7. Run the full pack against a live provider to verify

The key principle: no silent relaxations. If a generated invariant is looser than what you want, tighten it. If it's too strict, document why you relaxed it.


Multiple observations

Place multiple spec files in the input directory. Each .json file becomes one observation:

observe/
  get_weather.json
  deploy_service.json
  search_docs.json

npx replayci observe --provider openai --model gpt-4o-mini

This produces contracts, fixtures, and recordings for all three observations in a single pack.

If two specs produce calls to the same tool name, the generator handles collisions automatically (e.g. get_weather.yaml, get_weather_02.yaml).
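
The exact numbering scheme beyond that example is not documented; a minimal sketch that reproduces the get_weather.yaml / get_weather_02.yaml pattern might look like this (hypothetical, not the generator's actual code):

```typescript
// Hypothetical collision-avoiding filename generator matching the pattern
// shown above (get_weather.yaml, get_weather_02.yaml, ...).
function contractFileName(base: string, taken: Set<string>): string {
  let name = `${base}.yaml`;
  let n = 2;
  while (taken.has(name)) {
    name = `${base}_${String(n).padStart(2, "0")}.yaml`;
    n += 1;
  }
  taken.add(name);
  return name;
}
```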


Config file support

You can set observe defaults in .replayci.yml:

observe_input: "./observe"
observe_output: "./packs/observed"

CLI flags override config values. See CLI Reference for all options.


What's Next: Test Against Another Model

Now that you have an observed contract, test it against a different model to see how it compares.

Quick test with --draft

Use --draft to run observed contracts against a live provider without promoting first:

npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft

This bypasses the provider_modes: ["recorded"] restriction, so you can immediately see how a model performs against the observed contract. Useful for exploratory testing before committing to a truth contract.

Promote for enforcement

When you're ready to enforce the contract in CI, promote it to a truth contract:

npx replayci promote --from packs/observed --to packs/my-truth

This copies the pack and applies promotion transforms (removes status: observed, expands provider_modes). Then review the contracts and tighten the invariants to match your requirements.

See Promoting Contracts for the full workflow — from review through enforcement.


Alternative: SDK observe

Instead of CLI-based observation, you can capture tool calls passively from your application code using the SDK:

import OpenAI from "openai";
import { observe } from "@replayci/replay";

const openai = new OpenAI();
const handle = observe(openai, {
  apiKey: process.env.REPLAYCI_API_KEY,
  agent: "my-agent",
});

// Use your client normally — all tool calls are captured automatically
const response = await openai.chat.completions.create({ ... });

handle.restore(); // stop observing

The server auto-generates contracts from SDK captures the same way it does from CLI observations. Both approaches feed into the same Contracts page in the dashboard.

When to use which:

Approach      Best for
CLI observe   Exploring a new tool setup, generating initial contracts from scratch
SDK observe   Continuous capture from production or staging, building confidence over time

See SDK Integration for the full SDK guide.