Observe Guide — Auto-Contract Generation
replayci observe generates contracts by watching your LLM make tool calls. Instead of writing YAML contracts by hand, you provide simple JSON files describing what to ask the model and which tools to offer — ReplayCI calls the provider, observes the response, and generates a full runnable contract pack.
Quick start
# 1. Create an observation spec
mkdir observe
cat > observe/get_weather.json << 'EOF'
{
  "messages": [
    { "role": "user", "content": "What's the weather in San Francisco?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}
EOF
# 2. Run observe against a live provider
export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci observe --provider openai --model gpt-4o-mini
# 3. Run the generated pack
npx replayci --pack packs/observed --provider recorded
Creating observation specs
An observation spec is a JSON file with two required fields: messages and tools.
messages
The conversation to send to the model. At minimum, one user message:
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Deploy my-service to production." }
  ]
}
Each message needs `role` (string) and `content` (string).
tools
The tool definitions to make available to the model. These follow the standard function-calling schema:
{
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": {
            "type": "string",
            "enum": ["staging", "production"]
          }
        },
        "required": ["service_name", "environment"]
      }
    }
  ]
}
Each tool needs `name`, `description`, and `parameters` (a JSON Schema object).
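Before calling a live provider, it can be worth checking a spec for these required fields. The sketch below is illustrative only; `validate_spec` is a hypothetical helper, not part of ReplayCI:

```python
# Hypothetical pre-flight check for an observation spec.
# Mirrors the documented requirements: messages need string
# role/content; tools need name, description, and parameters.
def validate_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec looks usable."""
    errors = []
    if not spec.get("messages"):
        errors.append("spec needs at least one message")
    for i, msg in enumerate(spec.get("messages", [])):
        if not isinstance(msg.get("role"), str) or not isinstance(msg.get("content"), str):
            errors.append(f"messages[{i}] needs string 'role' and 'content'")
    if not spec.get("tools"):
        errors.append("spec needs at least one tool")
    for i, tool in enumerate(spec.get("tools", [])):
        for field in ("name", "description", "parameters"):
            if field not in tool:
                errors.append(f"tools[{i}] is missing '{field}'")
    return errors

spec = {
    "messages": [{"role": "user", "content": "What's the weather?"}],
    "tools": [{"name": "get_weather", "description": "...", "parameters": {"type": "object"}}],
}
problems = validate_spec(spec)
```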
Optional fields
| Field | Type | Default | Description |
|---|---|---|---|
| `tool_choice` | `"auto"`, `"none"`, or `"required"` | `"auto"` | Whether the model must call a tool |
| `temperature` | number | `0` | Model temperature (`0` = deterministic) |
| `max_tokens` | number | `1024` | Maximum response tokens |
Full example
{
  "messages": [
    { "role": "system", "content": "You are a DevOps assistant." },
    { "role": "user", "content": "Deploy my-service to production and check its health." }
  ],
  "tools": [
    {
      "name": "deploy_service",
      "description": "Deploy a service to the specified environment",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "environment": { "type": "string", "enum": ["staging", "production"] }
        },
        "required": ["service_name", "environment"]
      }
    },
    {
      "name": "check_health",
      "description": "Check health status of a service",
      "parameters": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" }
        },
        "required": ["service_name"]
      }
    }
  ],
  "tool_choice": "required",
  "temperature": 0
}
Running observe
npx replayci observe --provider openai --model gpt-4o-mini
Required flags:
- `--provider` — `openai` or `anthropic`
- `--model` — the model ID (e.g. `gpt-4o-mini`, `claude-sonnet-4-6`)
Optional flags:
- `--input <dir>` — directory with spec files (default: `observe/`)
- `--output <dir>` — where to write the generated pack (default: `packs/observed`)
- `--timeout_ms <ms>` — provider call timeout (default: `30000`)
- `--json` — force JSON output
The REPLAYCI_PROVIDER_KEY environment variable must be set with your provider API key.
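For instance, a run that combines the optional flags might look like this (the directory names and timeout value here are made up for illustration):

```shell
export REPLAYCI_PROVIDER_KEY="sk-..."
npx replayci observe \
  --provider anthropic \
  --model claude-sonnet-4-6 \
  --input specs/ \
  --output packs/observed-claude \
  --timeout_ms 60000
```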
How inference works
ReplayCI infers structure, not values. Generated invariants check that the right keys exist and have the right types — they never assert on specific argument values (which change between runs).
Single tool call
When the model calls one tool, ReplayCI infers:
- The tool name matches (`equals` on `$.tool_calls[0].name`)
- Arguments exist (`exists: true` on `$.tool_calls[0].arguments`)
- Each argument key exists and has the correct type (from the tool schema)
- If the schema defines `enum` values, a `one_of` check is added
Example generated contract:
tool: "get_weather"
status: observed
assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      exists: true
    - path: "$.tool_calls[0].arguments.location"
      type: "string"
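The single-call inference described above can be pictured as a walk over the tool's JSON Schema. This is a simplified sketch, not ReplayCI's actual implementation; `infer_invariants` and the dict shapes are hypothetical:

```python
# Simplified sketch of single-tool-call invariant inference:
# structural checks only (name, existence, type, enum), never
# specific argument values.
def infer_invariants(tool_schema: dict) -> list[dict]:
    """Derive structural checks from one tool's JSON Schema."""
    base = "$.tool_calls[0]"
    invariants = [
        {"path": f"{base}.name", "equals": tool_schema["name"]},
        {"path": f"{base}.arguments", "exists": True},
    ]
    props = tool_schema.get("parameters", {}).get("properties", {})
    for key, prop in props.items():
        arg_path = f"{base}.arguments.{key}"
        invariants.append({"path": arg_path, "exists": True})
        if "type" in prop:
            invariants.append({"path": arg_path, "type": prop["type"]})
        if "enum" in prop:
            invariants.append({"path": arg_path, "one_of": prop["enum"]})
    return invariants

schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}
checks = infer_invariants(schema)
```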
Multi-tool calls
When the model calls two or more tools, ReplayCI infers:
- The contract uses `expect_tools` with all called tool names
- `tool_order: "any"` — tools can appear in any order
- A `length_gte` check on the `tool_calls` array
- Per-tool name and argument invariants for each call
Example:
tool: "multi_tool_call"
status: observed
expect_tools:
  - "deploy_service"
  - "check_health"
tool_order: "any"
assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_gte: 2
    - path: "$.tool_calls[0].name"
      equals: "deploy_service"
    - path: "$.tool_calls[0].arguments"
      exists: true
    - path: "$.tool_calls[1].name"
      equals: "check_health"
    - path: "$.tool_calls[1].arguments"
      exists: true
Text-only response
When the model returns text instead of calling a tool, ReplayCI generates a contract that asserts the tool_calls array is empty:
tool: "get_weather"
status: observed
assertions:
  output_invariants:
    - path: "$.tool_calls"
      length_lte: 0
This is useful for documenting cases where the model declines to use a tool.
Schema enum inference
If your tool schema defines enum values for a property, the generated contract includes a one_of check:
{
  "environment": {
    "type": "string",
    "enum": ["staging", "production"]
  }
}
Generates:
- path: "$.tool_calls[0].arguments.environment"
  one_of:
    - "staging"
    - "production"
Schema bounds inference
If your tool schema defines numeric bounds (minimum, maximum), string length bounds (minLength, maxLength), array size bounds (minItems, maxItems), or regex patterns (pattern), the generated contract includes corresponding checks.
Numeric bounds:
{
  "timeout_minutes": {
    "type": "number",
    "minimum": 1,
    "maximum": 60
  }
}
Generates:
- path: "$.tool_calls[0].arguments.timeout_minutes"
  gte: 1
- path: "$.tool_calls[0].arguments.timeout_minutes"
  lte: 60
String length bounds:
{
  "title": {
    "type": "string",
    "maxLength": 120
  }
}
Generates:
- path: "$.tool_calls[0].arguments.title"
  length_lte: 120
Array size bounds:
{
  "tags": {
    "type": "array",
    "items": { "type": "string" },
    "minItems": 2,
    "maxItems": 10
  }
}
Generates:
- path: "$.tool_calls[0].arguments.tags"
  length_gte: 2
- path: "$.tool_calls[0].arguments.tags"
  length_lte: 10
Regex patterns:
{
  "start_time": {
    "type": "string",
    "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
  }
}
Generates:
- path: "$.tool_calls[0].arguments.start_time"
  regex: "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}"
The richer your tool schemas, the tighter your auto-generated contracts. Adding minimum, maximum, pattern, and other JSON Schema constraints to your tool definitions gives observe more to work with — no manual YAML editing needed.
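The keyword-to-check mapping described in this section can be sketched as a small lookup table. This is an illustration of the idea only; `constraints_to_checks` and `CONSTRAINT_MAP` are hypothetical names, not ReplayCI internals:

```python
# Hypothetical sketch: translate JSON Schema bound/pattern keywords
# on one property into the corresponding generated checks.
CONSTRAINT_MAP = {
    "minimum": "gte",
    "maximum": "lte",
    "minLength": "length_gte",
    "maxLength": "length_lte",
    "minItems": "length_gte",
    "maxItems": "length_lte",
    "pattern": "regex",
}

def constraints_to_checks(path: str, prop: dict) -> list[dict]:
    """Emit one check per constraint keyword present on the property."""
    return [
        {"path": path, check: prop[keyword]}
        for keyword, check in CONSTRAINT_MAP.items()
        if keyword in prop
    ]

checks = constraints_to_checks(
    "$.tool_calls[0].arguments.timeout_minutes",
    {"type": "number", "minimum": 1, "maximum": 60},
)
```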
What observe does NOT infer
Observe generates contracts from schema structure, not from prompt understanding. The following require manual addition:
- Exact value checks (`equals`) — observe cannot determine which specific value is the "correct" answer for a given prompt
- Substring matching (`contains`) — requires understanding the prompt content
- Strict tool ordering (`tool_order: strict`) — observed order may vary across runs
- Forbidden fields (`exists: false`) — optional field inclusion varies between runs
- Cross-field reasoning — e.g. severity P1 because payment service + 5xx + >5% error rate
These are the checks you add when promoting an observed contract to a truth contract.
Reviewing generated contracts
After running observe, review the generated contracts in packs/observed/contracts/. Contracts are drafts — you should tighten or relax them based on your needs.
What to tighten
- Add `equals` checks on argument values you know should be constant (e.g. a specific service name)
- Add `contains` checks for partial string matches
- Change `tool_order` from `"any"` to `"strict"` if tool call ordering matters
- Add negative test cases for failure scenarios
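For example, a hand-tightened version of a generated `get_weather` contract might pin the argument value the prompt implies. The added lines here are illustrative edits, not observe output:

```yaml
tool: "get_weather"
status: truth
assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    # Added by hand: the prompt names the city, so assert on it
    - path: "$.tool_calls[0].arguments.location"
      contains: "San Francisco"
```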
What to relax
- Remove per-key `exists` checks for optional arguments the model might not always include
- Add `pass_threshold` for multi-tool contracts where partial success is acceptable
- Widen `type` checks if the model might return different types in edge cases
Generated files
For each observation, three files are generated:
| File | Purpose |
|---|---|
| `contracts/<name>.yaml` | Contract with inferred invariants |
| `golden/<name>.success.json` | Golden fixture with boundary hashes for recorded replay |
| `recordings/<name>.success.recording.json` | Raw recording of the provider call |
Pack-level files:
| File | Purpose |
|---|---|
| `pack.yaml` | Pack metadata listing all contracts |
| `NeverNormalize.json` | Fields excluded from normalization |
| `nr-allowlist.json` | Non-reproducible result allowlist (empty by default) |
Observed vs Truth: the two-tier model
ReplayCI uses a two-tier contract system:
Truth contracts (enforced) are your source of validation truth. They run in CI, gate merges, and represent your team's agreed-upon behavior expectations. Truth contracts live in hand-curated packs (e.g. packs/openai-v0.1).
Observed contracts (draft/evidence) are auto-generated by replayci observe. They represent what the model did, not what it should do. Observed contracts have:
- `status: observed` in the YAML
- `provider_modes: ["recorded"]` on golden cases — preventing accidental live enforcement
- A header comment marking them as auto-generated
This separation ensures that auto-generated contracts never accidentally become merge-blocking gates. Observed contracts are for exploration, coverage analysis, and as starting points for truth contracts.
Promoting observed to truth
Promotion from observed to truth is a manual process:
- Review the generated contract — check that invariants match your intent
- Remove `status: observed` from the YAML (or change to `status: truth`)
- Remove `provider_modes: ["recorded"]` from golden cases if you want live enforcement
- Add negative test cases — observed contracts only have success cases
- Move the contract to your truth pack directory (e.g. `packs/openai-v0.1/contracts/`)
- Update `pack.yaml` in the truth pack to include the new contract
- Run the full pack against a live provider to verify
The key principle: no silent relaxations. If a generated invariant is looser than what you want, tighten it. If it's too strict, document why you relaxed it.
Multiple observations
Place multiple spec files in the input directory. Each .json file becomes one observation:
observe/
get_weather.json
deploy_service.json
search_docs.json
npx replayci observe --provider openai --model gpt-4o-mini
This produces contracts, fixtures, and recordings for all three observations in a single pack.
If two specs produce calls to the same tool name, the generator handles collisions automatically (e.g. get_weather.yaml, get_weather_02.yaml).
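The collision handling can be pictured as a simple suffixing scheme. This is a sketch of the idea; `unique_name` is a hypothetical helper, and ReplayCI's actual naming logic may differ:

```python
# Sketch of collision-safe contract file naming: the first use keeps
# the bare name, later uses get a zero-padded counter suffix.
def unique_name(base: str, taken: set[str]) -> str:
    if base not in taken:
        return base
    n = 2
    while f"{base}_{n:02d}" in taken:
        n += 1
    return f"{base}_{n:02d}"

taken: set[str] = set()
names = []
for _ in range(3):
    name = unique_name("get_weather", taken)
    taken.add(name)
    names.append(name)
```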
Config file support
You can set observe defaults in .replayci.yml:
observe_input: "./observe"
observe_output: "./packs/observed"
CLI flags override config values. See CLI Reference for all options.
What's Next: Test Against Another Model
Now that you have an observed contract, test it against a different model to see how it compares.
Quick test with --draft
Use --draft to run observed contracts against a live provider without promoting first:
npx replayci --pack packs/observed --provider openai --model gpt-4o-mini --draft
This bypasses the provider_modes: ["recorded"] restriction, so you can immediately see how a model performs against the observed contract. Useful for exploratory testing before committing to a truth contract.
Promote for enforcement
When you're ready to enforce the contract in CI, promote it to a truth contract:
npx replayci promote --from packs/observed --to packs/my-truth
This copies the pack and applies promotion transforms (removes status: observed, expands provider_modes). Then review the contracts and tighten the invariants to match your requirements.
See Promoting Contracts for the full workflow — from review through enforcement.
Alternative: SDK observe
Instead of CLI-based observation, you can capture tool calls passively from your application code using the SDK:
import OpenAI from "openai";
import { observe } from "@replayci/replay";

const openai = new OpenAI();
const handle = observe(openai, {
  apiKey: process.env.REPLAYCI_API_KEY,
  agent: "my-agent",
});

// Use your client normally — all tool calls are captured automatically
const response = await openai.chat.completions.create({ ... });

handle.restore(); // stop observing
The server auto-generates contracts from SDK captures the same way it does from CLI observations. Both approaches feed into the same Contracts page in the dashboard.
When to use which:
| Approach | Best for |
|---|---|
| CLI observe | Exploring a new tool setup, generating initial contracts from scratch |
| SDK observe | Continuous capture from production or staging, building confidence over time |
See SDK Integration for the full SDK guide.