Dashboard Guide — ReplayCI
The ReplayCI dashboard at app.replayci.com gives you a control center for your LLM tool-call reliability. It shows health metrics, auto-generated contracts, validation coverage, model comparisons, and change history.
Today
The landing page shows a snapshot of your current reliability state.
What you'll see:
- Health metrics — determinism score, shadow readiness, contract confidence across your tools
- Run activity — recent pass/fail/incomparable counts
- Attention items — things that need action (failing contracts, low-confidence tools, pending reviews)
- Activity feed — recent events like baseline promotions and change dismissals
- Reference state — current baseline status across your tool contracts
The Today page surfaces what matters so you don't have to dig through individual runs.
Contracts
The Contracts page lists every tool the system has observed, with auto-generated contract drafts.
How contracts get created
When you use the SDK observe() or push runs via the CLI, the server automatically infers contracts from captured tool calls. No manual YAML writing required to get started.
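Conceptually, inference boils down to deriving assertions from the values observed across captures. A minimal sketch of the idea, with hypothetical names and a deliberately simplified contract shape (real inference also tracks bounds, patterns, and nested JSON paths):

```python
from collections import defaultdict

def infer_contract(captures: list[dict]) -> dict:
    """Infer a per-field contract draft from captured tool-call arguments.

    Hypothetical sketch: records which types were observed for each field
    and asserts the field's presence.
    """
    fields = defaultdict(set)
    for call in captures:
        for key, value in call.items():
            fields[key].add(type(value).__name__)
    return {
        key: {"types": sorted(types), "exists": True}
        for key, types in fields.items()
    }

captures = [
    {"location": "Paris", "units": "metric"},
    {"location": "Oslo", "units": "metric"},
]
contract = infer_contract(captures)
```

More captures narrow the observed value distributions, which is why confidence rises with sample count.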
Contract list
Each tool shows:
| Field | Description |
|---|---|
| Tool name | The tool function name observed |
| Agent | Which agent made the calls |
| Confidence | Low (< 5 samples), Medium (5–9), High (≥ 10) |
| Sample count | How many captures contributed to the contract |
| Status | Generated (contract exists) or Pending (waiting for more data) |
| Last captured | When the most recent call was observed |
Two-source contracts
Contracts are either inferred (auto-generated, refreshable) or customer (your edits, immutable to auto-refresh). When both exist, customer assertions take precedence. See Writing Tests — Two-source contracts for details.
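The precedence rule can be sketched as a simple merge where customer assertions override inferred ones on overlapping paths. All data shapes here are hypothetical; the server performs the actual merge:

```python
def merge_assertions(inferred: dict, customer: dict) -> dict:
    """Combine both contract sources; customer assertions win wherever
    both define the same JSON path. Hypothetical assertion shapes."""
    merged = dict(inferred)   # start from the auto-generated source
    merged.update(customer)   # customer edits take precedence
    return merged

inferred = {"$.location": {"op": "type", "value": "string"},
            "$.units": {"op": "exists"}}
customer = {"$.location": {"op": "pattern", "value": "^[A-Z]"}}
merged = merge_assertions(inferred, customer)
```

Note that inferred assertions on paths the customer never touched survive auto-refresh unchanged.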
Contract review workspace
Click a tool to open the review workspace:
- YAML editor — view and edit the contract with syntax highlighting
- Field samples — see observed values and their distribution for each JSON path
- Coverage map — which fields have assertions and which don't
- Smart defaults — suggested assertions based on schema bounds (minimum, maximum, pattern, etc.)
- Impact preview — before committing changes, see how the new contract would perform against recent captures
- Pin/unpin — lock specific field choices so they aren't overwritten by auto-refresh
Confidence tiers
Contracts are rated Low (< 5 samples), Medium (5–9), or High (≥ 10). High-confidence contracts are candidates for CI promotion. See Writing Tests — Confidence tiers for details.
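The tier boundaries above translate directly into a threshold check (thresholds taken from this page; the function name is hypothetical):

```python
def confidence_tier(sample_count: int) -> str:
    """Map a contract's sample count to its confidence tier."""
    if sample_count >= 10:
        return "High"    # candidate for CI promotion
    if sample_count >= 5:
        return "Medium"
    return "Low"
```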
Guard
The Guard page shows validation coverage across all captured tool calls.
Overview metrics
- Total observed calls — how many tool calls were captured in the time window
- Validated — how many were checked against contracts (pass + fail)
- Pass rate — percentage of validated calls that passed
- Pending — calls with no validation data yet
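The four metrics are related by simple arithmetic over per-call validation status. A sketch, assuming each call carries a status of pass, fail, or pending (hypothetical field names; the dashboard computes this server-side):

```python
def guard_metrics(calls: list[dict]) -> dict:
    """Recompute the Guard overview numbers from per-call status."""
    validated = [c for c in calls if c["status"] in ("pass", "fail")]
    passed = sum(1 for c in validated if c["status"] == "pass")
    return {
        "observed": len(calls),
        "validated": len(validated),
        "pass_rate": passed / len(validated) if validated else None,
        "pending": len(calls) - len(validated),
    }

calls = [{"status": "pass"}, {"status": "pass"},
         {"status": "fail"}, {"status": "pending"}]
metrics = guard_metrics(calls)
```

Note that pass rate is computed over validated calls only, so a large pending backlog does not drag it down.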
Time window
Select 24 hours, 7 days, or 30 days. The page recalculates all metrics for the selected window.
Per-tool breakdown
Each tool shows:
- Observed / pass / fail / pending counts
- Dominant validation source (SDK validation, server re-evaluation, or pending)
- Pass rate for validated calls
- Link to drill into failure details
Failure patterns
Common failures are ranked by frequency:
- Path — which JSON path failed (e.g., $.tool_calls[0].arguments.location)
- Operator — which check failed (e.g., type, exists, length_gte)
- Detail — what was expected vs. what was found
- Count — how many times this pattern occurred
Three-tier validation
Guard evaluates calls through three sources:
- Capture validation — the SDK validate() result embedded in the capture at call time
- Server re-evaluation — the server re-runs evaluation with the latest contract (catches drift since capture)
- Pending — no validation data available
The dominant source tells you whether most of your coverage comes from client-side or server-side validation.
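Determining the dominant source is a frequency count over per-call validation sources. A sketch, with hypothetical field names mirroring the three tiers above:

```python
from collections import Counter

def dominant_source(calls: list[dict]) -> str:
    """Return the validation source covering the most calls.

    Calls without validation data default to "pending".
    """
    counts = Counter(c.get("source", "pending") for c in calls)
    return counts.most_common(1)[0][0]

calls = [{"source": "capture"}, {"source": "capture"}, {"source": "server"}]
```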
Shadow
The Shadow page shows side-by-side comparisons between your primary and shadow providers.
Verdict system
Each primary–shadow pair gets a verdict:
| Verdict | Meaning | Action |
|---|---|---|
| needs_review | Shadow failed contracts the primary passed | Investigate before switching |
| needs_better_evidence | Not enough comparable data for a conclusion | Run more shadow comparisons |
| changed_but_passing | Responses differ but both pass contracts | Review if differences matter |
| clean | Both providers behave identically | Safe to switch |
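The verdict rules can be read as a decision ladder. The sketch below is an assumption about how the table composes: the comparable-data threshold and the precedence of the evidence check over the others are hypothetical, not documented behavior:

```python
MIN_COMPARABLE = 5  # hypothetical cutoff; the real threshold is server-defined

def shadow_verdict(comparable_runs: int,
                   shadow_failures: int,
                   responses_differ: bool) -> str:
    """Assumed decision order: evidence first, then failures, then drift."""
    if comparable_runs < MIN_COMPARABLE:
        return "needs_better_evidence"
    if shadow_failures > 0:
        return "needs_review"
    if responses_differ:
        return "changed_but_passing"
    return "clean"
```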
Overview
The Shadow overview shows:
- Candidate pairs — each primary/shadow provider+model combination
- Window runs — total shadow runs in the current window
- Needs review — pairs with contract failures on the shadow side
- Needs evidence — pairs without enough comparable data
Pair detail
Click a pair to see:
- Decision guidance — a summary of what the comparison means and what to do
- Confidence description — based on how many comparable runs exist
- Incomparable breakdown — why some runs couldn't be compared (schema mismatch, etc.)
- Step-by-step comparison table — per-contract results for primary vs. shadow
Scopes
Filter the overview by verdict:
- All — everything
- Needs review — only pairs requiring investigation
- Needs evidence — only pairs needing more data
- Changed — pairs where behavior differs
- Clean — pairs that are safe
Changes
The Changes page shows what changed across your contracts, runs, and baselines — and why it matters.
Semantic diffs
Instead of raw data diffs, Changes surfaces what changed and why:
- Contract assertion added or removed
- New failure pattern appeared
- Baseline promoted or went stale
- Shadow comparison revealed drift
Each change includes:
- Significance — low, medium, or high
- Guidance — what the change means for your reliability posture
- Related context — links to the affected contract, shadow pair, or baseline
Visibility states
- Visible — active change items
- Hidden — automatically faded after resolution
- Dismissed — manually dismissed with attribution (who dismissed, when)
References
The References page manages your trusted comparison points — baselines that determine what "normal" looks like.
Trust lifecycle
References progress through states:
| State | Meaning | CI behavior |
|---|---|---|
| Collecting | Building confidence toward the 7-run promotion threshold | Evidence-only |
| Ready to promote | Enough stable runs to become a trusted reference | Promotion CTA shown |
| Active | Promoted and enforced — every new run is compared against it | Merge-blocking |
| Stale | Drift detected — model behavior has shifted since promotion | Advisory only |
| Retired | Superseded by a newer promoted reference | Historical |
What you'll see
- Trust posture summary — overall health (Healthy, Partially degraded, Action needed, etc.)
- Current trusted reference — the active baseline with provider, model, promotion date, qualifying runs, and consistency score
- Ready to promote — candidates with enough runs, with a Promote button
- Needs refresh — stale references where drift was detected, with guidance
- Collecting runs — candidates building toward the 7-run threshold, with progress bars
- Retired — collapsed section of superseded references
Promotion
Promotion happens from this page via the Promote button. A candidate needs at least 7 successful runs with the same configuration. When promoted, any stale references with the same contract hash are automatically retired.
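The promotion rules above can be sketched as follows. Data shapes and names are hypothetical; the dashboard performs this server-side when you click Promote:

```python
PROMOTION_THRESHOLD = 7  # successful runs with the same configuration

def promote(candidate: dict, references: list[dict]) -> dict:
    """Promote a qualifying candidate to Active and retire stale
    references that share its contract hash."""
    if candidate["successful_runs"] < PROMOTION_THRESHOLD:
        raise ValueError("candidate has not reached the 7-run threshold")
    for ref in references:
        if (ref["state"] == "Stale"
                and ref["contract_hash"] == candidate["contract_hash"]):
            ref["state"] = "Retired"
    candidate["state"] = "Active"
    return candidate

refs = [{"state": "Stale", "contract_hash": "abc"},
        {"state": "Active", "contract_hash": "xyz"}]
candidate = {"successful_runs": 7, "contract_hash": "abc", "state": "Ready"}
promote(candidate, refs)
```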
Models
The Models page compares behavior across different LLM models.
Select two models to see a side-by-side comparison of their test results — pass rates, drift overlap, and per-run breakdown. Useful for deciding whether a newer or cheaper model can replace your current one.
Runs
The Runs page lists all test runs with filtering, search, and drilldown.
The list shows:
- Status (Succeeded, Failed, NonReproducible, RunAborted)
- Provider and model
- Fingerprint and baseline key
- Run mode (manual, ci) and lane (A or B)
- Created timestamp
Filter by: All, Failed, Needs review, or Live. Search by run ID, tool name, provider, or model. Sort by newest, oldest, or failures first.
Click a run to see the full traceability envelope, step-level results, contract pass/fail breakdown, and artifact details.
Settings
Manage your account:
- API keys — create (max 5 active), view, and revoke API keys
- Email verification — verify your email to unlock API access
- Password — change your password
Next steps
- Start capturing — SDK Integration
- Write contracts — Writing Tests
- Add to CI — CI Integration