
Observe Guide — Auto-Contract Generation

replayci observe generates contracts by watching your LLM make tool calls. Instead of writing YAML contracts by hand, you provide simple JSON files describing what to ask the model and which tools to offer — ReplayCI calls the provider, observes the response, and generates a full runnable contract pack.


Quick start

# 1. Create an observation spec
mkdir observe
cat > observe/get_weather.json << 'EOF'
{
  "messages": [
    { "role": "user", "content": "What's the weather in San Francisco?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}
EOF

# 2. Run observe against a live provider
export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci observe --provider openai --model gpt-4o-mini

# 3. Run the generated pack
npx replayci --pack packs/observed --provider recorded

Creating observation specs

An observation spec is a JSON file with two required fields: messages and tools.

messages

The conversation to send to the model. At minimum, one user message:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Deploy my-service to production." }
  ]
}

Each message needs role (string) and content (string).

tools

The tool definitions to make available to the model. These follow the standard function-calling schema:

{
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": {
            "type": "string",
            "enum": ["staging", "production"]
          }
        },
        "required": ["service_name", "environment"]
      }
    }
  ]
}

Each tool needs name, description, and parameters (a JSON Schema object).
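
Before sending a spec to observe, the two required fields can be checked with a few lines of TypeScript. This validator is a hypothetical helper based on the field requirements described above, not part of the ReplayCI API:

```typescript
// Hypothetical spec validator -- not part of the ReplayCI API.
// Checks the required fields described in this guide: messages and tools.
interface Message { role: string; content: string; }
interface Tool { name: string; description: string; parameters: object; }

function validateSpec(spec: { messages?: Message[]; tools?: Tool[] }): string[] {
  const errors: string[] = [];
  if (!Array.isArray(spec.messages) || spec.messages.length === 0) {
    errors.push("messages: at least one message is required");
  } else {
    spec.messages.forEach((m, i) => {
      if (typeof m.role !== "string") errors.push(`messages[${i}].role must be a string`);
      if (typeof m.content !== "string") errors.push(`messages[${i}].content must be a string`);
    });
  }
  if (!Array.isArray(spec.tools)) {
    errors.push("tools: an array of tool definitions is required");
  } else {
    spec.tools.forEach((t, i) => {
      if (typeof t.name !== "string") errors.push(`tools[${i}].name must be a string`);
      if (typeof t.description !== "string") errors.push(`tools[${i}].description must be a string`);
      if (typeof t.parameters !== "object" || t.parameters === null) {
        errors.push(`tools[${i}].parameters must be a JSON Schema object`);
      }
    });
  }
  return errors;
}
```

Running this over each file in your input directory before invoking observe catches malformed specs without burning a provider call.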

Optional fields

Field         Type                           Default   Description
tool_choice   "auto" | "none" | "required"   "auto"    Whether the model must call a tool
temperature   number                         0         Model temperature (0 = deterministic)
max_tokens    number                         1024      Maximum response tokens

Full example

{
  "messages": [
    { "role": "system", "content": "You are a DevOps assistant." },
    { "role": "user", "content": "Deploy my-service to production and check its health." }
  ],
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": { "type": "string", "enum": ["staging", "production"] }
        },
        "required": ["service_name", "environment"]
      }
    },
    {
      "name": "check_health",
      "description": "Check health status of a service",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" }
        },
        "required": ["service_name"]
      }
    }
  ],
  "tool_choice": "required",
  "temperature": 0
}

Running observe

npx replayci observe --provider openai --model gpt-4o-mini

Required flags:

  • --provider — openai or anthropic
  • --model — the model ID (e.g. gpt-4o-mini, claude-sonnet-4-6)

Optional flags:

  • --input <dir> — directory with spec files (default: observe/)
  • --output <dir> — where to write the generated pack (default: packs/observed)
  • --timeout_ms <ms> — provider call timeout (default: 30000)
  • --json — force JSON output

The REPLAYCI_PROVIDER_KEY environment variable must be set with your provider API key.


How inference works

ReplayCI infers structure, not values. Generated invariants check that the right keys exist and have the right types — they never assert on specific argument values (which change between runs).

Single tool call

When the model calls one tool, ReplayCI infers:

  • The tool name matches (equals on $.tool_calls[0].name)
  • Arguments exist (exists: true on $.tool_calls[0].arguments)
  • Each argument key exists and has the correct type (from the tool schema)
  • If the schema defines enum values, a one_of check is added

Example generated contract:

tool: "get_weather"
status: observed

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      type: "string"
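
The single-call inference rules can be sketched as a small function. This is an illustrative reconstruction, not ReplayCI's implementation; the invariant shapes follow the generated contract shown above:

```typescript
// Illustrative sketch of single-tool-call inference -- not ReplayCI's code.
// From a tool's JSON Schema, produce structural invariants: name equality,
// arguments existence, per-key exists/type, and one_of for enum properties.
type Invariant = { path: string } & Record<string, unknown>;

interface ParamSchema {
  type: string;
  properties?: Record<string, { type?: string; enum?: string[] }>;
}

function inferSingleCall(toolName: string, params: ParamSchema): Invariant[] {
  const base = "$.tool_calls[0]";
  const invariants: Invariant[] = [
    { path: `${base}.name`, equals: toolName },
    { path: `${base}.arguments`, exists: true },
  ];
  for (const [key, prop] of Object.entries(params.properties ?? {})) {
    invariants.push({ path: `${base}.arguments.${key}`, exists: true });
    if (prop.type) invariants.push({ path: `${base}.arguments.${key}`, type: prop.type });
    if (prop.enum) invariants.push({ path: `${base}.arguments.${key}`, one_of: prop.enum });
  }
  return invariants;
}
```

Note that nothing here looks at the observed argument values — only the tool name and the schema shape feed the invariants, which is what keeps them stable across runs.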

Multi-tool calls

When the model calls two or more tools, ReplayCI infers:

  • The contract uses expect_tools with all called tool names
  • tool_order: "any" — tools can appear in any order
  • A length_gte check on the tool_calls array
  • Per-tool name and argument invariants for each call

Example:

tool: "multi_tool_call"
status: observed

expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "any"

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 2
    - path: "$.tool_calls[0].name"
      equals: "deploy_service"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[1].name"
      equals: "check_health"
    - path: "$.tool_calls[1].arguments"
      exists: true
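
To make the invariant semantics concrete, here is a minimal evaluator for the checks used in these examples. It is a hypothetical sketch, not ReplayCI's engine, and its path resolver only handles the simple `$.a.b[0].c` shapes shown in this guide:

```typescript
// Illustrative invariant evaluator -- not ReplayCI's actual engine.
// Resolves simple JSONPath shapes ($.tool_calls[0].name) and applies
// the equals / exists / length_gte checks used in this guide.
function resolve(obj: unknown, path: string): unknown {
  const parts = path.replace(/^\$\.?/, "").split(/[.\[\]]+/).filter(Boolean);
  return parts.reduce<unknown>((cur, p) => {
    if (cur === null || cur === undefined) return undefined;
    return (cur as Record<string, unknown>)[p];
  }, obj);
}

interface Check { path: string; equals?: unknown; exists?: boolean; length_gte?: number; }

function check(output: unknown, inv: Check): boolean {
  const value = resolve(output, inv.path);
  if (inv.equals !== undefined) return value === inv.equals;
  if (inv.exists !== undefined) return (value !== undefined) === inv.exists;
  if (inv.length_gte !== undefined) return Array.isArray(value) && value.length >= inv.length_gte;
  return false;
}
```

Run against a recorded output, each invariant either passes or fails independently, which is why per-tool checks can be added or removed one at a time during review.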

Text-only response

When the model returns text instead of calling a tool, ReplayCI generates a contract that asserts the tool_calls array is empty:

tool: "get_weather"
status: observed

assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_lte: 0

This is useful for documenting cases where the model declines to use a tool.

Schema enum inference

If your tool schema defines enum values for a property, the generated contract includes a one_of check:

{
  "environment": {
    "type": "string",
    "enum": ["staging", "production"]
  }
}

Generates:

- path: "$.tool_calls[0].arguments.environment"
  one_of:
    - "staging"
    - "production"

Schema bounds inference

If your tool schema defines numeric bounds (minimum, maximum), string length bounds (minLength, maxLength), array size bounds (minItems, maxItems), or regex patterns (pattern), the generated contract includes corresponding checks.

Numeric bounds:

{
  "timeout_minutes": {
    "type": "number",
    "minimum": 1,
    "maximum": 60
  }
}

Generates:

- path: "$.timeout_minutes"
  gte: 1
- path: "$.timeout_minutes"
  lte: 60

String length bounds:

{
  "title": {
    "type": "string",
    "maxLength": 120
  }
}

Generates:

- path: "$.title"
  length_lte: 120

Array size bounds:

{
  "tags": {
    "type": "array",
    "items": { "type": "string" },
    "minItems": 2,
    "maxItems": 10
  }
}

Generates:

- path: "$.tags"
  length_gte: 2
- path: "$.tags"
  length_lte: 10

Regex patterns:

{
  "start_time": {
    "type": "string",
    "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
  }
}

Generates:

- path: "$.start_time"
  regex: "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
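
The keyword-to-check mapping used throughout this section can be summarized in code. This is a hypothetical sketch; the mapping itself (minimum → gte, maxLength → length_lte, and so on) is taken directly from the examples above:

```typescript
// Illustrative mapping from JSON Schema constraint keywords to invariant
// checks, following the examples in this section -- not ReplayCI's code.
interface PropertySchema {
  minimum?: number; maximum?: number;
  minLength?: number; maxLength?: number;
  minItems?: number; maxItems?: number;
  pattern?: string;
  enum?: string[];
}

function boundsToChecks(path: string, prop: PropertySchema): Array<Record<string, unknown>> {
  const checks: Array<Record<string, unknown>> = [];
  if (prop.minimum !== undefined) checks.push({ path, gte: prop.minimum });
  if (prop.maximum !== undefined) checks.push({ path, lte: prop.maximum });
  if (prop.minLength !== undefined) checks.push({ path, length_gte: prop.minLength });
  if (prop.maxLength !== undefined) checks.push({ path, length_lte: prop.maxLength });
  if (prop.minItems !== undefined) checks.push({ path, length_gte: prop.minItems });
  if (prop.maxItems !== undefined) checks.push({ path, length_lte: prop.maxItems });
  if (prop.pattern !== undefined) checks.push({ path, regex: prop.pattern });
  if (prop.enum !== undefined) checks.push({ path, one_of: prop.enum });
  return checks;
}
```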

Tip: The richer your tool schemas, the tighter your auto-generated contracts. Adding minimum, maximum, pattern, and other JSON Schema constraints to your tool definitions gives observe more to work with — no manual YAML editing needed.

What observe does NOT infer

Observe generates contracts from schema structure, not from prompt understanding. The following require manual addition:

  • Exact value checks (equals) — observe cannot determine which specific value is the "correct" answer for a given prompt
  • Substring matching (contains) — requires understanding the prompt content
  • Strict tool ordering (tool_order: strict) — observed order may vary across runs
  • Forbidden fields (exists: false) — optional field inclusion varies between runs
  • Cross-field reasoning — e.g., severity P1 because payment service + 5xx + >5% error rate

These are the checks you add when promoting an observed contract to a truth contract.


Reviewing generated contracts

After running observe, review the generated contracts in packs/observed/contracts/. Contracts are drafts — you should tighten or relax them based on your needs.

What to tighten

  • Add equals checks on argument values you know should be constant (e.g. a specific service name)
  • Add contains checks for partial string matches
  • Change tool_order from "any" to "strict" if tool call ordering matters
  • Add negative test cases for failure scenarios

What to relax

  • Remove per-key exists checks for optional arguments the model might not always include
  • Add pass_threshold for multi-tool contracts where partial success is acceptable
  • Widen type checks if the model might return different types in edge cases

Generated files

For each observation, three files are generated:

File                                        Purpose
contracts/<name>.yaml                       Contract with inferred invariants
golden/<name>.success.json                  Golden fixture with boundary hashes for recorded replay
recordings/<name>.success.recording.json    Raw recording of the provider call

Pack-level files:

File                 Purpose
pack.yaml            Pack metadata listing all contracts
NeverNormalize.json  Fields excluded from normalization
nr-allowlist.json    Non-reproducible result allowlist (empty by default)

Observed vs Truth: the two-tier model

ReplayCI uses a two-tier contract system:

Truth contracts (enforced) are your source of validation truth. They run in CI, gate merges, and represent your team's agreed-upon behavior expectations. Truth contracts live in hand-curated packs (e.g. packs/openai-v0.1).

Observed contracts (draft/evidence) are auto-generated by replayci observe. They represent what the model did, not what it should do. Observed contracts have:

  • status: observed in the YAML
  • provider_modes: ["recorded"] on golden cases — preventing accidental live enforcement
  • A header comment marking them as auto-generated

This separation ensures that auto-generated contracts never accidentally become merge-blocking gates. Observed contracts are for exploration, coverage analysis, and as starting points for truth contracts.


Promoting observed to truth

Promotion from observed to truth is a manual process:

  1. Review the generated contract — check that invariants match your intent
  2. Remove status: observed from the YAML (or change to status: truth)
  3. Remove provider_modes: ["recorded"] from golden cases if you want live enforcement
  4. Add negative test cases — observed contracts only have success cases
  5. Move the contract to your truth pack directory (e.g. packs/openai-v0.1/contracts/)
  6. Update pack.yaml in the truth pack to include the new contract
  7. Run the full pack against a live provider to verify

The key principle: no silent relaxations. If a generated invariant is looser than what you want, tighten it. If it's too strict, document why you relaxed it.


Multiple observations

Place multiple spec files in the input directory. Each .json file becomes one observation:

observe/
  get_weather.json
  deploy_service.json
  search_docs.json

npx replayci observe --provider openai --model gpt-4o-mini

This produces contracts, fixtures, and recordings for all three observations in a single pack.

If two specs produce calls to the same tool name, the generator handles collisions automatically (e.g. get_weather.yaml, get_weather_02.yaml).
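
The exact numbering scheme beyond that example is not documented; a minimal sketch that reproduces the get_weather.yaml / get_weather_02.yaml pattern might look like this (hypothetical, not the generator's actual code):

```typescript
// Hypothetical collision-avoiding filename generator matching the pattern
// shown above (get_weather.yaml, get_weather_02.yaml, ...).
function contractFileName(base: string, taken: Set<string>): string {
  let name = `${base}.yaml`;
  let n = 2;
  while (taken.has(name)) {
    name = `${base}_${String(n).padStart(2, "0")}.yaml`;
    n += 1;
  }
  taken.add(name);
  return name;
}
```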


Config file support

You can set observe defaults in .replayci.yml:

observe_input: "./observe"
observe_output: "./packs/observed"

CLI flags override config values. See CLI Reference for all options.


What's Next: Test Against Another Model

Now that you have an observed contract, test it against a different model to see how it compares.

Quick test with --draft

Use --draft to run observed contracts against a live provider without promoting first:

npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft

This bypasses the provider_modes: ["recorded"] restriction, so you can immediately see how a model performs against the observed contract. Useful for exploratory testing before committing to a truth contract.

Promote for enforcement

When you're ready to enforce the contract in CI, promote it to a truth contract:

npx replayci promote --from packs/observed --to packs/my-truth

This copies the pack and applies promotion transforms (removes status: observed, expands provider_modes). Then review the contracts and tighten the invariants to match your requirements.

See Promoting Contracts for the full workflow — from review through enforcement.


Alternative: SDK observe

Instead of CLI-based observation, you can capture tool calls passively from your application code using the SDK:

import OpenAI from "openai";
import { observe } from "@replayci/replay";

const openai = new OpenAI();
const handle = observe(openai, {
  apiKey: process.env.REPLAYCI_API_KEY,
  agent: "my-agent",
});

// Use your client normally — all tool calls are captured automatically
const response = await openai.chat.completions.create({ ... });

handle.restore(); // stop observing

The server auto-generates contracts from SDK captures the same way it does from CLI observations. Both approaches feed into the same Contracts page in the dashboard.

When to use which:

Approach      Best for
CLI observe   Exploring a new tool setup, generating initial contracts from scratch
SDK observe   Continuous capture from production or staging, building confidence over time

See SDK Integration for the full SDK guide.