Dashboard Guide — ReplayCI

The ReplayCI dashboard at app.replayci.com gives you a control center for your LLM tool-call reliability. It shows health metrics, auto-generated contracts, validation coverage, model comparisons, and change history.


Today

The landing page shows a snapshot of your current reliability state.

What you'll see:

  • Health metrics — determinism score, shadow readiness, contract confidence across your tools
  • Run activity — recent pass/fail/incomparable counts
  • Attention items — things that need action (failing contracts, low-confidence tools, pending reviews)
  • Activity feed — recent events like baseline promotions and change dismissals
  • Reference state — current baseline status across your tool contracts

The Today page surfaces what matters so you don't have to dig through individual runs.


Contracts

The Contracts page lists every tool the system has observed, with auto-generated contract drafts.

How contracts get created

When you use the SDK's observe() or push runs via the CLI, the server automatically infers contracts from the captured tool calls — no manual YAML is required to get started.
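As a sketch of the idea behind inference — not the actual ReplayCI engine, which runs server-side — a contract draft can be derived by collecting each field's observed types across captures. The function name and output shape here are illustrative assumptions:

```python
from collections import defaultdict

def infer_contract(samples):
    """Infer a draft contract from observed tool-call arguments.

    Sketch only: record each field's observed Python types and
    require the field to exist. The real inference engine and its
    contract format are server-side; this shape is an assumption.
    """
    fields = defaultdict(set)
    for call in samples:
        for key, value in call.items():
            fields[key].add(type(value).__name__)
    return {
        key: {
            # single observed type -> pin it; mixed types -> list them all
            "type": sorted(types)[0] if len(types) == 1 else sorted(types),
            "exists": True,
        }
        for key, types in fields.items()
    }

samples = [
    {"location": "Berlin", "units": "metric"},
    {"location": "Tokyo", "units": "metric"},
]
print(infer_contract(samples)["location"])  # {'type': 'str', 'exists': True}
```

More captures tighten the draft: a field that is sometimes a string and sometimes a number ends up with both types listed rather than a single pinned one.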

Contract list

Each tool shows:

  • Tool name — the tool function name observed
  • Agent — which agent made the calls
  • Confidence — Low (< 5 samples), Medium (5–9), or High (≥ 10)
  • Sample count — how many captures contributed to the contract
  • Status — Generated (contract exists) or Pending (waiting for more data)
  • Last captured — when the most recent call was observed

Two-source contracts

Contracts are either inferred (auto-generated, refreshable) or customer (your edits, which auto-refresh never overwrites). When both exist, customer assertions take precedence. See Writing Tests — Two-source contracts for details.
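The precedence rule amounts to a simple merge, assuming each contract source is a mapping of field paths to assertions (the dict shape is illustrative, not the ReplayCI format):

```python
def merge_contracts(inferred, customer):
    """Combine the two contract sources for one tool.

    Customer assertions win wherever both sources cover the same
    field; inferred assertions fill the remaining gaps. The
    dict-of-assertions shape is an assumption for illustration.
    """
    merged = dict(inferred)   # start from the refreshable inferred draft
    merged.update(customer)   # customer edits take precedence
    return merged

inferred = {"location": {"type": "str"}, "units": {"type": "str"}}
customer = {"units": {"type": "str", "pattern": "^(metric|imperial)$"}}
print(merge_contracts(inferred, customer)["units"])
```

Because the inferred layer is rebuilt on refresh while the customer layer is left alone, your hand-written assertions survive every auto-refresh.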

Contract review workspace

Click a tool to open the review workspace:

  • YAML editor — view and edit the contract with syntax highlighting
  • Field samples — see observed values and their distribution for each JSON path
  • Coverage map — which fields have assertions and which don't
  • Smart defaults — suggested assertions based on schema bounds (minimum, maximum, pattern, etc.)
  • Impact preview — before committing changes, see how the new contract would perform against recent captures
  • Pin/unpin — lock specific field choices so they aren't overwritten by auto-refresh

Confidence tiers

Contracts are rated Low (< 5 samples), Medium (5–9), or High (≥ 10). High-confidence contracts are candidates for CI promotion. See Writing Tests — Confidence tiers for details.
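The tier boundaries above translate directly into a lookup (thresholds are taken from the documentation; the function name is illustrative):

```python
def confidence_tier(sample_count):
    """Map a contract's sample count to its confidence tier,
    using the thresholds stated above: < 5 Low, 5-9 Medium, >= 10 High."""
    if sample_count >= 10:
        return "High"
    if sample_count >= 5:
        return "Medium"
    return "Low"

print(confidence_tier(4), confidence_tier(7), confidence_tier(12))
# Low Medium High
```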


Guard

The Guard page shows validation coverage across all captured tool calls.

Overview metrics

  • Total observed calls — how many tool calls were captured in the time window
  • Validated — how many were checked against contracts (pass + fail)
  • Pass rate — percentage of validated calls that passed
  • Pending — calls with no validation data yet
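The relationships between these four numbers can be made explicit — validated is pass plus fail, and pass rate is computed over validated calls only (the function and key names are illustrative):

```python
def guard_metrics(passed, failed, pending):
    """Recompute the Guard overview numbers from raw counts.

    Pending calls are excluded from the pass-rate denominator,
    matching the definitions above.
    """
    validated = passed + failed
    total = validated + pending
    pass_rate = round(100.0 * passed / validated, 1) if validated else 0.0
    return {"total": total, "validated": validated,
            "pass_rate": pass_rate, "pending": pending}

print(guard_metrics(90, 10, 25))
# {'total': 125, 'validated': 100, 'pass_rate': 90.0, 'pending': 25}
```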

Time window

Select 24 hours, 7 days, or 30 days. The page recalculates all metrics for the selected window.

Per-tool breakdown

Each tool shows:

  • Observed / pass / fail / pending counts
  • Dominant validation source (SDK validation, server re-evaluation, or pending)
  • Pass rate for validated calls
  • Link to drill into failure details

Failure patterns

Common failures are ranked by frequency:

  • Path — which JSON path failed (e.g., $.tool_calls[0].arguments.location)
  • Operator — which check failed (e.g., type, exists, length_gte)
  • Detail — what was expected vs. what was found
  • Count — how many times this pattern occurred
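The ranking is a group-and-count over (path, operator) pairs. A minimal sketch, assuming each failure record carries `path` and `operator` fields (the field names are assumptions):

```python
from collections import Counter

def rank_failure_patterns(failures):
    """Group failures by (path, operator) and rank them by frequency,
    most common first, like the Failure patterns list above."""
    counts = Counter((f["path"], f["operator"]) for f in failures)
    return [
        {"path": path, "operator": op, "count": n}
        for (path, op), n in counts.most_common()
    ]
```

For example, two `type` failures at `$.a` and one `exists` failure at `$.b` rank the `$.a` pattern first with a count of 2.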

Three-tier validation

Guard evaluates calls through three sources:

  1. Capture validation — the SDK validate() result embedded in the capture at call time
  2. Server re-evaluation — the server re-runs evaluation with the latest contract (catches drift since capture)
  3. Pending — no validation data available

The dominant source tells you whether most of your coverage comes from client-side or server-side validation.
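Determining the dominant source is a frequency count over calls. A sketch, assuming each call record carries an optional `source` field with values like `"sdk"` or `"server"` (both the field and the values are illustrative, not the ReplayCI data model):

```python
from collections import Counter

def dominant_source(calls):
    """Return the validation source that covers the most calls.

    Calls without validation data count as 'pending', mirroring
    tier 3 above. The record shape is an assumption.
    """
    counts = Counter(c.get("source", "pending") for c in calls)
    return counts.most_common(1)[0][0]

calls = [{"source": "sdk"}, {"source": "sdk"}, {"source": "server"}, {}]
print(dominant_source(calls))  # sdk
```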


Shadow

The Shadow page shows side-by-side comparisons between your primary and shadow providers.

Verdict system

Each primary–shadow pair gets a verdict:

  • needs_review — the shadow failed contracts that the primary passed. Action: investigate before switching.
  • needs_better_evidence — not enough comparable data for a conclusion. Action: run more shadow comparisons.
  • changed_but_passing — responses differ but both pass contracts. Action: review if the differences matter.
  • clean — both providers behave identically. Action: safe to switch.
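One way to picture the verdict logic is as a cascade of checks. Note that `min_runs` here is a made-up evidence threshold — the documentation does not state the real one — and the ordering of checks is likewise an assumption:

```python
def shadow_verdict(comparable_runs, shadow_failures, differing_runs, min_runs=5):
    """Assign a verdict to a primary-shadow pair.

    Sketch only: min_runs and the precedence of the checks are
    assumptions, not ReplayCI's actual decision rules.
    """
    if comparable_runs < min_runs:
        return "needs_better_evidence"
    if shadow_failures > 0:
        return "needs_review"
    if differing_runs > 0:
        return "changed_but_passing"
    return "clean"

print(shadow_verdict(comparable_runs=10, shadow_failures=0, differing_runs=0))
# clean
```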

Overview

The Shadow overview shows:

  • Candidate pairs — each primary/shadow provider+model combination
  • Window runs — total shadow runs in the current window
  • Needs review — pairs with contract failures on the shadow side
  • Needs evidence — pairs without enough comparable data

Pair detail

Click a pair to see:

  • Decision guidance — a summary of what the comparison means and what to do
  • Confidence description — based on how many comparable runs exist
  • Incomparable breakdown — why some runs couldn't be compared (schema mismatch, etc.)
  • Step-by-step comparison table — per-contract results for primary vs. shadow

Scopes

Filter the overview by verdict:

  • All — everything
  • Needs review — only pairs requiring investigation
  • Needs evidence — only pairs needing more data
  • Changed — pairs where behavior differs
  • Clean — pairs that are safe

Changes

The Changes page shows what changed across your contracts, runs, and baselines — and why it matters.

Semantic diffs

Instead of raw data diffs, Changes surfaces what changed and why:

  • Contract assertion added or removed
  • New failure pattern appeared
  • Baseline promoted or went stale
  • Shadow comparison revealed drift

Each change includes:

  • Significance — low, medium, or high
  • Guidance — what the change means for your reliability posture
  • Related context — links to the affected contract, shadow pair, or baseline

Visibility states

  • Visible — active change items
  • Hidden — automatically faded after resolution
  • Dismissed — manually dismissed with attribution (who dismissed, when)

References

The References page manages your trusted comparison points — baselines that determine what "normal" looks like.

Trust lifecycle

References progress through states:

  • Collecting — building confidence toward the 7-run promotion threshold. CI behavior: evidence-only.
  • Ready to promote — enough stable runs to become a trusted reference. CI behavior: promotion CTA shown.
  • Active — promoted and enforced; every new run is compared against it. CI behavior: merge-blocking.
  • Stale — drift detected; model behavior has shifted since promotion. CI behavior: advisory only.
  • Retired — superseded by a newer promoted reference. CI behavior: historical.

What you'll see

  • Trust posture summary — overall health (Healthy, Partially degraded, Action needed, etc.)
  • Current trusted reference — the active baseline with provider, model, promotion date, qualifying runs, and consistency score
  • Ready to promote — candidates with enough runs, with a Promote button
  • Needs refresh — stale references where drift was detected, with guidance
  • Collecting runs — candidates building toward the 7-run threshold, with progress bars
  • Retired — collapsed section of superseded references

Promotion

Promotion happens from this page via the Promote button. A candidate needs at least 7 successful runs with the same configuration. When promoted, any stale references with the same contract hash are automatically retired.
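The two promotion rules — the 7-run minimum and auto-retirement of matching stale references — can be sketched as follows. The 7-run threshold comes from the documentation; the dict shapes and field names are illustrative assumptions:

```python
def promote(candidate, references, min_runs=7):
    """Promote a candidate reference and retire matching stale ones.

    Encodes the rules above: a candidate needs at least 7 successful
    runs, and any stale reference sharing its contract hash is
    auto-retired. The record shapes are assumptions.
    """
    if candidate["successful_runs"] < min_runs:
        raise ValueError(f"candidate needs at least {min_runs} successful runs")
    for ref in references:
        if ref["state"] == "stale" and ref["contract_hash"] == candidate["contract_hash"]:
            ref["state"] = "retired"
    candidate["state"] = "active"
    return candidate
```

Retiring only the stale references that share the candidate's contract hash means baselines for unrelated contracts are untouched by a promotion.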


Models

The Models page compares behavior across different LLM models.

Select two models to see a side-by-side comparison of their test results — pass rates, drift overlap, and per-run breakdown. Useful for deciding whether a newer or cheaper model can replace your current one.


Runs

The Runs page lists all test runs with filtering, search, and drilldown.

The list shows:

  • Status (Succeeded, Failed, NonReproducible, RunAborted)
  • Provider and model
  • Fingerprint and baseline key
  • Run mode (manual, ci) and lane (A or B)
  • Created timestamp

Filter by: All, Failed, Needs review, or Live. Search by run ID, tool name, provider, or model. Sort by newest, oldest, or failures first.
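The three sort orders above amount to different sort keys over the run list. A sketch, assuming `created` is a sortable numeric timestamp and `status` is one of the values listed earlier (both field names are illustrative):

```python
def sort_runs(runs, order="newest"):
    """Sort a run list the way the Runs page does.

    'created' is assumed to be a numeric timestamp and 'status' one
    of the listed values; both field names are assumptions.
    """
    if order == "failures_first":
        # non-failed runs sort after failed ones; newest first within each group
        return sorted(runs, key=lambda r: (r["status"] != "Failed", -r["created"]))
    return sorted(runs, key=lambda r: r["created"], reverse=(order == "newest"))

runs = [{"created": 1, "status": "Succeeded"}, {"created": 2, "status": "Failed"}]
print([r["created"] for r in sort_runs(runs)])  # [2, 1]
```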

Click a run to see the full traceability envelope, step-level results, contract pass/fail breakdown, and artifact details.


Settings

Manage your account:

  • API keys — create (max 5 active), view, and revoke API keys
  • Email verification — verify your email to unlock API access
  • Password — change your password

Next steps