Dashboard Guide — ReplayCI

The ReplayCI dashboard at app.replayci.com gives you a control center for your LLM tool-call reliability. It shows health metrics, auto-generated contracts, validation coverage, model comparisons, and change history.


Today

The landing page shows a snapshot of your current reliability state.

What you'll see:

  • Health metrics — determinism score, shadow readiness, contract confidence across your tools
  • Run activity — recent pass/fail/incomparable counts
  • Attention items — things that need action (failing contracts, low-confidence tools, pending reviews)
  • Activity feed — recent events like baseline promotions and change dismissals
  • Reference state — current baseline status across your tool contracts

The Today page surfaces what matters so you don't have to dig through individual runs.


Contracts

The Contracts page lists every tool the system has observed, with auto-generated contract drafts.

How contracts get created

When you use the SDK's observe() or push runs via the CLI, the server automatically infers contracts from the captured tool calls — no manual YAML is required to get started.
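As a sketch of the idea behind inference — not the actual ReplayCI engine, which runs server-side — a contract draft can be derived by collecting each field's observed types across captures. The function name and output shape here are illustrative assumptions:

```python
from collections import defaultdict

def infer_contract(samples):
    """Infer a draft contract from observed tool-call arguments.

    Sketch only: record each field's observed Python types and
    require the field to exist. The real inference engine and its
    contract format are server-side; this shape is an assumption.
    """
    fields = defaultdict(set)
    for call in samples:
        for key, value in call.items():
            fields[key].add(type(value).__name__)
    return {
        key: {
            # single observed type -> pin it; mixed types -> list them all
            "type": sorted(types)[0] if len(types) == 1 else sorted(types),
            "exists": True,
        }
        for key, types in fields.items()
    }

samples = [
    {"location": "Berlin", "units": "metric"},
    {"location": "Tokyo", "units": "metric"},
]
print(infer_contract(samples)["location"])  # {'type': 'str', 'exists': True}
```

More captures tighten the draft: a field that is sometimes a string and sometimes a number ends up with both types listed rather than a single pinned one.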

Contract list

Each tool shows:

  • Tool name — the tool function name observed
  • Agent — which agent made the calls
  • Confidence — Low (< 5 samples), Medium (5–9), or High (≥ 10)
  • Sample count — how many captures contributed to the contract
  • Status — Generated (contract exists) or Pending (waiting for more data)
  • Last captured — when the most recent call was observed

Two-source contracts

Contracts are either inferred (auto-generated, refreshable) or customer (your edits, which auto-refresh never overwrites). When both exist, customer assertions take precedence. See Writing Tests — Two-source contracts for details.
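The precedence rule amounts to a simple merge, assuming each contract source is a mapping of field paths to assertions (the dict shape is illustrative, not the ReplayCI format):

```python
def merge_contracts(inferred, customer):
    """Combine the two contract sources for one tool.

    Customer assertions win wherever both sources cover the same
    field; inferred assertions fill the remaining gaps. The
    dict-of-assertions shape is an assumption for illustration.
    """
    merged = dict(inferred)   # start from the refreshable inferred draft
    merged.update(customer)   # customer edits take precedence
    return merged

inferred = {"location": {"type": "str"}, "units": {"type": "str"}}
customer = {"units": {"type": "str", "pattern": "^(metric|imperial)$"}}
print(merge_contracts(inferred, customer)["units"])
```

Because the inferred layer is rebuilt on refresh while the customer layer is left alone, your hand-written assertions survive every auto-refresh.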

Contract review workspace

Click a tool to open the review workspace:

  • YAML editor — view and edit the contract with syntax highlighting
  • Field samples — see observed values and their distribution for each JSON path
  • Coverage map — which fields have assertions and which don't
  • Smart defaults — suggested assertions based on schema bounds (minimum, maximum, pattern, etc.)
  • Impact preview — before committing changes, see how the new contract would perform against recent captures
  • Pin/unpin — lock specific field choices so they aren't overwritten by auto-refresh

Confidence tiers

Contracts are rated Low (< 5 samples), Medium (5–9), or High (≥ 10). High-confidence contracts are candidates for CI promotion. See Writing Tests — Confidence tiers for details.
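The tier boundaries above translate directly into a lookup (thresholds are taken from the documentation; the function name is illustrative):

```python
def confidence_tier(sample_count):
    """Map a contract's sample count to its confidence tier,
    using the thresholds stated above: < 5 Low, 5-9 Medium, >= 10 High."""
    if sample_count >= 10:
        return "High"
    if sample_count >= 5:
        return "Medium"
    return "Low"

print(confidence_tier(4), confidence_tier(7), confidence_tier(12))
# Low Medium High
```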


Guard

The Guard page shows validation coverage across all captured tool calls.

Overview metrics

  • Total observed calls — how many tool calls were captured in the time window
  • Validated — how many were checked against contracts (pass + fail)
  • Pass rate — percentage of validated calls that passed
  • Pending — calls with no validation data yet
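The relationships between these four numbers can be made explicit — validated is pass plus fail, and pass rate is computed over validated calls only (the function and key names are illustrative):

```python
def guard_metrics(passed, failed, pending):
    """Recompute the Guard overview numbers from raw counts.

    Pending calls are excluded from the pass-rate denominator,
    matching the definitions above.
    """
    validated = passed + failed
    total = validated + pending
    pass_rate = round(100.0 * passed / validated, 1) if validated else 0.0
    return {"total": total, "validated": validated,
            "pass_rate": pass_rate, "pending": pending}

print(guard_metrics(90, 10, 25))
# {'total': 125, 'validated': 100, 'pass_rate': 90.0, 'pending': 25}
```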

Time window

Select 24 hours, 7 days, or 30 days. The page recalculates all metrics for the selected window.

Per-tool breakdown

Each tool shows:

  • Observed / pass / fail / pending counts
  • Dominant validation source (SDK validation, server re-evaluation, or pending)
  • Pass rate for validated calls
  • Link to drill into failure details

Failure patterns

Common failures are ranked by frequency:

  • Path — which JSON path failed (e.g., $.tool_calls[0].arguments.location)
  • Operator — which check failed (e.g., type, exists, length_gte)
  • Detail — what was expected vs. what was found
  • Count — how many times this pattern occurred
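The ranking is a group-and-count over (path, operator) pairs. A minimal sketch, assuming each failure record carries `path` and `operator` fields (the field names are assumptions):

```python
from collections import Counter

def rank_failure_patterns(failures):
    """Group failures by (path, operator) and rank them by frequency,
    most common first, like the Failure patterns list above."""
    counts = Counter((f["path"], f["operator"]) for f in failures)
    return [
        {"path": path, "operator": op, "count": n}
        for (path, op), n in counts.most_common()
    ]
```

For example, two `type` failures at `$.a` and one `exists` failure at `$.b` rank the `$.a` pattern first with a count of 2.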

Three-tier validation

Guard evaluates calls through three sources:

  1. Capture validation — the SDK validate() result embedded in the capture at call time
  2. Server re-evaluation — the server re-runs evaluation with the latest contract (catches drift since capture)
  3. Pending — no validation data available

The dominant source tells you whether most of your coverage comes from client-side or server-side validation.
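Determining the dominant source is a frequency count over calls. A sketch, assuming each call record carries an optional `source` field with values like `"sdk"` or `"server"` (both the field and the values are illustrative, not the ReplayCI data model):

```python
from collections import Counter

def dominant_source(calls):
    """Return the validation source that covers the most calls.

    Calls without validation data count as 'pending', mirroring
    tier 3 above. The record shape is an assumption.
    """
    counts = Counter(c.get("source", "pending") for c in calls)
    return counts.most_common(1)[0][0]

calls = [{"source": "sdk"}, {"source": "sdk"}, {"source": "server"}, {}]
print(dominant_source(calls))  # sdk
```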


Shadow

The Shadow page shows side-by-side comparisons between your primary and shadow providers.

Verdict system

Each primary–shadow pair gets a verdict:

  • needs_review — the shadow failed contracts that the primary passed. Action: investigate before switching.
  • needs_better_evidence — not enough comparable data for a conclusion. Action: run more shadow comparisons.
  • changed_but_passing — responses differ but both pass contracts. Action: review if the differences matter.
  • clean — both providers behave identically. Action: safe to switch.
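One way to picture the verdict logic is as a cascade of checks. Note that `min_runs` here is a made-up evidence threshold — the documentation does not state the real one — and the ordering of checks is likewise an assumption:

```python
def shadow_verdict(comparable_runs, shadow_failures, differing_runs, min_runs=5):
    """Assign a verdict to a primary-shadow pair.

    Sketch only: min_runs and the precedence of the checks are
    assumptions, not ReplayCI's actual decision rules.
    """
    if comparable_runs < min_runs:
        return "needs_better_evidence"
    if shadow_failures > 0:
        return "needs_review"
    if differing_runs > 0:
        return "changed_but_passing"
    return "clean"

print(shadow_verdict(comparable_runs=10, shadow_failures=0, differing_runs=0))
# clean
```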

Overview

The Shadow overview shows:

  • Candidate pairs — each primary/shadow provider+model combination
  • Window runs — total shadow runs in the current window
  • Needs review — pairs with contract failures on the shadow side
  • Needs evidence — pairs without enough comparable data

Pair detail

Click a pair to see:

  • Decision guidance — a summary of what the comparison means and what to do
  • Confidence description — based on how many comparable runs exist
  • Incomparable breakdown — why some runs couldn't be compared (schema mismatch, etc.)
  • Step-by-step comparison table — per-contract results for primary vs. shadow

Scopes

Filter the overview by verdict:

  • All — everything
  • Needs review — only pairs requiring investigation
  • Needs evidence — only pairs needing more data
  • Changed — pairs where behavior differs
  • Clean — pairs that are safe

Changes

The Changes page shows what changed across your contracts, runs, and baselines — and why it matters.

Semantic diffs

Instead of raw data diffs, Changes surfaces what changed and why:

  • Contract assertion added or removed
  • New failure pattern appeared
  • Baseline promoted or went stale
  • Shadow comparison revealed drift

Each change includes:

  • Significance — low, medium, or high
  • Guidance — what the change means for your reliability posture
  • Related context — links to the affected contract, shadow pair, or baseline

Visibility states

  • Visible — active change items
  • Hidden — automatically faded after resolution
  • Dismissed — manually dismissed with attribution (who dismissed, when)

References

The References page manages your trusted comparison points — baselines that determine what "normal" looks like.

Trust lifecycle

References progress through states:

  • Collecting — building confidence toward the 7-run promotion threshold. CI behavior: evidence-only.
  • Ready to promote — enough stable runs to become a trusted reference. CI behavior: promotion CTA shown.
  • Active — promoted and enforced; every new run is compared against it. CI behavior: merge-blocking.
  • Stale — drift detected; model behavior has shifted since promotion. CI behavior: advisory only.
  • Retired — superseded by a newer promoted reference. CI behavior: historical.

What you'll see

  • Trust posture summary — overall health (Healthy, Partially degraded, Action needed, etc.)
  • Current trusted reference — the active baseline with provider, model, promotion date, qualifying runs, and consistency score
  • Ready to promote — candidates with enough runs, with a Promote button
  • Needs refresh — stale references where drift was detected, with guidance
  • Collecting runs — candidates building toward the 7-run threshold, with progress bars
  • Retired — collapsed section of superseded references

Promotion

Promotion happens from this page via the Promote button. A candidate needs at least 7 successful runs with the same configuration. When promoted, any stale references with the same contract hash are automatically retired.
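The two promotion rules — the 7-run minimum and auto-retirement of matching stale references — can be sketched as follows. The 7-run threshold comes from the documentation; the dict shapes and field names are illustrative assumptions:

```python
def promote(candidate, references, min_runs=7):
    """Promote a candidate reference and retire matching stale ones.

    Encodes the rules above: a candidate needs at least 7 successful
    runs, and any stale reference sharing its contract hash is
    auto-retired. The record shapes are assumptions.
    """
    if candidate["successful_runs"] < min_runs:
        raise ValueError(f"candidate needs at least {min_runs} successful runs")
    for ref in references:
        if ref["state"] == "stale" and ref["contract_hash"] == candidate["contract_hash"]:
            ref["state"] = "retired"
    candidate["state"] = "active"
    return candidate
```

Retiring only the stale references that share the candidate's contract hash means baselines for unrelated contracts are untouched by a promotion.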


Models

The Models page compares behavior across different LLM models.

Select two models to see a side-by-side comparison of their test results — pass rates, drift overlap, and per-run breakdown. Useful for deciding whether a newer or cheaper model can replace your current one.


Runs

The Runs page lists all test runs with filtering, search, and drilldown.

The list shows:

  • Status (Succeeded, Failed, NonReproducible, RunAborted)
  • Provider and model
  • Fingerprint and baseline key
  • Run mode (manual, ci) and lane (A or B)
  • Created timestamp

Filter by: All, Failed, Needs review, or Live. Search by run ID, tool name, provider, or model. Sort by newest, oldest, or failures first.
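The three sort orders above amount to different sort keys over the run list. A sketch, assuming `created` is a sortable numeric timestamp and `status` is one of the values listed earlier (both field names are illustrative):

```python
def sort_runs(runs, order="newest"):
    """Sort a run list the way the Runs page does.

    'created' is assumed to be a numeric timestamp and 'status' one
    of the listed values; both field names are assumptions.
    """
    if order == "failures_first":
        # non-failed runs sort after failed ones; newest first within each group
        return sorted(runs, key=lambda r: (r["status"] != "Failed", -r["created"]))
    return sorted(runs, key=lambda r: r["created"], reverse=(order == "newest"))

runs = [{"created": 1, "status": "Succeeded"}, {"created": 2, "status": "Failed"}]
print([r["created"] for r in sort_runs(runs)])  # [2, 1]
```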

Click a run to see the full traceability envelope, step-level results, contract pass/fail breakdown, and artifact details.


Settings

Manage your account:

  • API keys — create (max 5 active), view, and revoke API keys
  • Email verification — verify your email to unlock API access
  • Password — change your password

Next steps