# Testing Overview

The platform provides a testing and evaluation framework for verifying agent behavior before deployment, measuring quality in production, and detecting degradation over time.

## Testing Philosophy

Healthcare workflows are long and multi-step. A single patient interaction might span 20 or more steps across scheduling, insurance verification, clinical documentation, EHR writeback, and outbound follow-up calls. These workflows touch multiple external systems, each with its own availability characteristics, rate limits, and failure modes.

Testing these workflows against live systems repeatedly is impractical. External dependencies are unreliable, test cycles become slow, and results are flaky. Developers should not need access to a live EHR or a working telephony stack to verify that their agent logic is correct.

The platform addresses this by supporting four principles:

1. **Freeze the world model at a known state.** Simulations can run against a snapshot of a known patient population with known data. The agent sees the same world every time, so test results are deterministic and reproducible.
2. **Test agent logic in isolation.** Complex multi-step reasoning, branching, escalation rules, and context graph transitions can all be validated without calling external systems. The simulation framework exercises the agent's decision-making, not the availability of downstream infrastructure.
3. **Validate integration boundaries independently.** EHR writes, FHIR calls, outbound calls, and other external interactions have contract-based guarantees at their integration boundaries. These contracts define expected request shapes and response structures, and can be validated separately from the agent logic that produces them.
4. **Pre-flight connection probes.** Each integration can be probed independently to verify that credentials resolve and the upstream is reachable, without invoking any specific endpoint. The platform classifies probe results into actionable statuses - healthy, auth failed, unreachable, timeout, TLS error, or misconfigured - so teams can diagnose connectivity issues before they affect agent behavior. The most recent probe result is persisted on each integration, giving dashboards a health-at-a-glance view without re-running probes on every page load.
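
The probe classification in the fourth principle can be illustrated with a minimal sketch. It assumes a probe surfaces its outcome as a Python exception (or `None` on success); the `ProbeStatus` enum, the `CredentialsRejected` error, and `classify_probe` are hypothetical names chosen for illustration, not part of the platform API.

```python
import socket
import ssl
from enum import Enum
from typing import Optional


class ProbeStatus(str, Enum):
    HEALTHY = "healthy"
    AUTH_FAILED = "auth_failed"
    UNREACHABLE = "unreachable"
    TIMEOUT = "timeout"
    TLS_ERROR = "tls_error"
    MISCONFIGURED = "misconfigured"


class CredentialsRejected(Exception):
    """Hypothetical error: credentials resolved but the upstream refused them."""


def classify_probe(error: Optional[BaseException]) -> ProbeStatus:
    """Map the outcome of a connection probe to a single actionable status."""
    if error is None:
        return ProbeStatus.HEALTHY
    if isinstance(error, CredentialsRejected):
        return ProbeStatus.AUTH_FAILED
    # ssl.SSLError and timeouts are OSError subclasses, so check them
    # before the generic OSError branch.
    if isinstance(error, ssl.SSLError):
        return ProbeStatus.TLS_ERROR
    if isinstance(error, (TimeoutError, socket.timeout)):
        return ProbeStatus.TIMEOUT
    if isinstance(error, OSError):  # DNS failure, connection refused, ...
        return ProbeStatus.UNREACHABLE
    # Anything else (missing secret, malformed URL) is treated as bad configuration.
    return ProbeStatus.MISCONFIGURED
```

Persisting only the resulting status value, rather than the raw exception, is one way a dashboard could show health at a glance without re-running the probe.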

This gives developers fast, reliable test cycles. You can run a full regression suite in minutes without provisioning test environments, managing credentials for external systems, or coordinating with IT teams that control EHR access.

```mermaid
flowchart LR
    S["Simulations\n(pre-deployment)"] --> M["Metrics\n(per-interaction scoring)"]
    M --> D["Drift Detection\n(trend analysis)"]
    D -->|Degradation found| S
    M -->|Quality data| Dash["Dashboards +\nAlerts"]
```

## Simulations

Simulations let you test your agent against synthetic users in controlled scenarios. You define personas (who the simulated user is), scenarios (what situation they are in), and success criteria (what the agent should do). The platform runs these conversations automatically and scores the results. Personas and scenarios are managed through Agent Forge or the API.

Simulation-originated calls appear in the Developer Console call log as "Test Call" entries, distinguished from inbound and outbound production calls. Simulation usage is also metered separately from production usage: every billing meter carries a traffic class (production or simulation), and the billing system prices each class independently. As a result, testing activity does not inflate production cost reporting, and simulation traffic can carry different rates.
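
The per-class metering described above can be pictured with a small sketch. The meter name, the rates, and the `UsageEvent` shape are all hypothetical; the only point taken from the platform is that each usage record carries a traffic class and each class is priced on its own.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UsageEvent:
    meter: str           # hypothetical meter name, e.g. "voice_minutes"
    traffic_class: str   # "production" or "simulation"
    quantity: Decimal


# Hypothetical rates: one price per (meter, traffic class) pair.
RATES = {
    ("voice_minutes", "production"): Decimal("0.10"),
    ("voice_minutes", "simulation"): Decimal("0.02"),
}


def cost_by_class(events):
    """Total cost per traffic class, so simulation never mixes into production."""
    totals = defaultdict(Decimal)
    for event in events:
        rate = RATES[(event.meter, event.traffic_class)]
        totals[event.traffic_class] += event.quantity * rate
    return dict(totals)
```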

Simulated callers are automatically grounded in your workspace's real-world data. The platform resolves scheduling facts - locations, appointment types, visit reasons, providers, transfer targets, and escalation rules - from your service configuration, context graph, and connected clinical data sources. Simulated callers use only these facts when interacting with your agent, so test conversations reflect actual workspace data rather than invented details. When no facts are resolved for a category, simulated callers acknowledge uncertainty instead of fabricating information. This grounding ensures that simulation results are representative of real production conversations.

Simulation sessions generate call intelligence data that flows through the same analytical pipeline used for production calls. When a simulation run completes successfully, the platform emits one call intelligence event per session, carrying the full conversation transcript, tool call details, state traversal history, and scoring data. These events are ingested by the same metric evaluation pipeline that handles production calls, so custom metrics, quality scores, and analytics run against simulation results without any additional configuration. The unified data path ensures that what you measure in testing matches exactly what you measure in production.

Errored runs skip call intelligence emission because session data is incomplete and metric evaluation would produce misleading results. Only cleanly completed runs feed into the metric pipeline.
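
A minimal sketch of the emission rule described above: one event per session for cleanly completed runs, and nothing for errored runs. The `Session` and `SimulationRun` shapes and the event field names are assumptions made for illustration, not the platform's schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Session:
    session_id: str
    transcript: list[str]
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    state_history: list[str] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class SimulationRun:
    run_id: str
    status: str  # assumed values: "completed" or "errored"
    sessions: list[Session]


def call_intelligence_events(run: SimulationRun) -> list[dict[str, Any]]:
    """Return one event per session, or nothing if the run did not complete."""
    if run.status != "completed":
        # Incomplete session data would produce misleading metric results.
        return []
    return [
        {
            "run_id": run.run_id,
            "session_id": s.session_id,
            "transcript": s.transcript,
            "tool_calls": s.tool_calls,
            "state_history": s.state_history,
            "scores": s.scores,
        }
        for s in run.sessions
    ]
```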

Scenario generation streams results progressively - as each scenario is generated, it is persisted to the run immediately. UI clients polling the run see scenarios appear one by one rather than waiting for the entire batch to finish. This is especially useful for large scenario sets, where generation can take over a minute. The run also reports the target scenario count so clients can render progress (e.g. "3 of 10 scenarios generated") without waiting for the full batch.

Session execution starts per scenario, as each scenario arrives from the streaming generator, rather than waiting for all scenarios to finish generating before running any sessions. With a typical generation stream lasting around a minute, the first session begins within seconds of the first scenario arriving, which shortens perceived run time for large scenario sets. If generation fails partway through, sessions for scenarios that already arrived continue to run, and their results are preserved on the run record alongside the partial-generation failure.
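
The streaming behavior described above can be sketched with `asyncio`. This is an illustration under stated assumptions, not the platform's implementation: `generate_scenarios`, `run_session`, and the run record fields are hypothetical stand-ins.

```python
import asyncio
from typing import AsyncIterator, Optional


async def generate_scenarios(target: int, fail_after: Optional[int] = None) -> AsyncIterator[str]:
    """Stand-in for the streaming generator; optionally fails partway through."""
    for i in range(target):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("generation failed")
        await asyncio.sleep(0)  # placeholder for generation latency
        yield f"scenario-{i + 1}"


async def run_session(scenario: str) -> str:
    await asyncio.sleep(0)  # placeholder for the simulated conversation
    return f"{scenario}: done"


async def execute_run(target: int, fail_after: Optional[int] = None) -> dict:
    run = {"target_count": target, "scenarios": [], "results": [], "generation_error": None}
    tasks = []
    try:
        async for scenario in generate_scenarios(target, fail_after):
            run["scenarios"].append(scenario)  # visible to pollers immediately
            # Start the session right away instead of waiting for the batch.
            tasks.append(asyncio.create_task(run_session(scenario)))
    except RuntimeError as exc:
        run["generation_error"] = str(exc)  # record the partial failure
    # Sessions that already started still finish and keep their results.
    run["results"] = await asyncio.gather(*tasks)
    return run
```

Calling `asyncio.run(execute_run(10, fail_after=3))` yields a run record with three scenarios, three session results, and the generation error, which mirrors the partial-failure behavior. A polling client could compute progress from `len(run["scenarios"])` and `run["target_count"]`.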

Simulations answer the question: **does the agent handle this situation correctly?**

{% content-ref url="/pages/ge01EpnrtOpPWn6hpB7M" %}
[Simulations](/testing/testing/simulations.md)
{% endcontent-ref %}

## Metrics

Metrics measure the quality of agent conversations across dimensions that matter to your organization. You configure metrics for safety, clinical accuracy, empathy, goal completion, and any other dimension relevant to your use case. Metrics can be evaluated automatically after every session, during simulation runs, or through manual human review.

Metrics answer the question: **how well is the agent performing?**

{% content-ref url="/pages/R2FZnfpyXlTJjCCoPYwH" %}
[Metrics and Quality](/testing/testing/metrics.md)
{% endcontent-ref %}

## Voice Simulation

Voice simulation (VoiceSim) evaluates how changes to voice configuration parameters affect call quality. VoiceSim runs configurations across scenarios covering normal conversations, crisis situations, barge-ins, silence, and speech recognition failures, then scores results to identify optimal settings.

VoiceSim answers the question: **what voice configuration works best for this scenario?**

{% content-ref url="/pages/KaEECAS9Ciko9oisktO5" %}
[Voice Simulation](/testing/testing/voice-simulation.md)
{% endcontent-ref %}

## Agent Readiness

The Agent Readiness dashboard provides a structured view of how ready an agent is for production deployment. It evaluates agents against a tiered readiness rubric - basic, intermediate, and advanced - across categories like task completion, coverage, safety, and communication quality.

Each criterion is evaluated automatically from simulation run data, coverage graph state, and session history, and shows a pass, fail, or not-yet-measured status. Every criterion in a tier must pass before the agent advances to the next readiness level (levels run from 1 through 5). The dashboard surfaces the specific sessions, untested states, or quality gaps behind each result, so teams can identify exactly what needs improvement before going live.

The readiness rubric enforces minimum evidence thresholds - for example, sustained pass rate criteria require a minimum number of completed simulation runs before they are evaluated. This prevents premature pass/fail judgments on insufficient data.
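
As a minimal sketch of the evidence threshold described above, the function below returns a not-yet-measured status until enough completed runs exist, and only then compares the pass rate to the requirement. The names `CriterionStatus`, `evaluate_pass_rate`, and `tier_passes`, and the specific threshold values, are assumptions for illustration.

```python
from enum import Enum
from typing import Iterable, Sequence


class CriterionStatus(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_MEASURED = "not_yet_measured"


def evaluate_pass_rate(run_outcomes: Sequence[bool], min_runs: int, required_rate: float) -> CriterionStatus:
    """Evaluate a sustained pass rate criterion with a minimum evidence threshold."""
    # Too few completed runs: refuse to judge rather than guess.
    if len(run_outcomes) < max(min_runs, 1):
        return CriterionStatus.NOT_MEASURED
    rate = sum(run_outcomes) / len(run_outcomes)
    return CriterionStatus.PASS if rate >= required_rate else CriterionStatus.FAIL


def tier_passes(statuses: Iterable[CriterionStatus]) -> bool:
    """A tier is fully passing only when every criterion passes."""
    return all(status is CriterionStatus.PASS for status in statuses)
```

Treating "not yet measured" as distinct from "fail" keeps a tier from passing on missing data while still telling the team that more runs, not fixes, are what is needed.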

## Conversation Quality Check

The `forge quality check` command scans production conversations against behavioral detectors that catch agent issues automated metrics may miss: stuck loops where the agent repeats itself, character degeneration, repetitive patterns, incoherent output, and phantom success (the agent claims a tool call succeeded when it actually failed). Quality checks query conversation data directly and report findings with severity, timestamps, and optional message excerpts. See [Agent Forge CLI](/reference/agent-forge.md#conversation-quality-check) for usage.
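
To make the idea of a behavioral detector concrete, here is a deliberately simple sketch of a stuck-loop check. It is not the detector that `forge quality check` uses; the function name, the normalization step, and the repeat threshold are all assumptions.

```python
def find_stuck_loops(agent_messages: list[str], min_repeats: int = 3) -> list[tuple[int, int]]:
    """Return (start_index, run_length) for runs of repeated consecutive agent messages."""
    # Normalize case and whitespace so trivially different repeats still match.
    normalized = [" ".join(message.lower().split()) for message in agent_messages]
    findings = []
    start = 0
    for i in range(1, len(normalized) + 1):
        # Close the current run at the end of input or when the message changes.
        if i == len(normalized) or normalized[i] != normalized[start]:
            if i - start >= min_repeats:
                findings.append((start, i - start))
            start = i
    return findings
```

A production detector would likely also need fuzzy matching and severity scoring; this version only flags exact repeats after normalization.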

## Text Conversation Smoke Tests

Agent Forge includes smoke test commands for verifying text conversation endpoints. The `forge platform conversation send-message` command creates a conversation (when no conversation ID is provided) and sends a turn through the REST API, displaying the agent's response. If the conversation is created but the turn fails, the command surfaces the conversation ID for recovery. The `forge platform conversation text-ws-smoke` command opens a WebSocket streaming connection, sends a message, and verifies the agent responds. Both commands support JSON output for CI integration. See [Agent Forge CLI](/reference/agent-forge.md#text-conversation-smoke-tests) for usage.

The Developer Console playground provides interactive equivalents of these smoke tests. The unified playground lets you test agents through four modes - voice call, text chat simulation, REST API conversations, and real-time WebSocket streaming - all from a single interface. REST API mode shows the full request/response exchange for each turn, making it useful for debugging integration behavior. WebSocket mode streams messages in real time with automatic reconnection. See [Text Sessions](/channels/text-sessions.md) for details on the underlying channels.

## Scribe End-to-End Testing

The platform includes an end-to-end test harness for Scribe sessions that synthesizes a realistic clinical visit audio stream, sends it through the WebSocket pipeline, and evaluates the resulting transcripts and documentation against configurable quality thresholds. The harness supports configurable session duration, playback speed, silence intervals between visit segments, and minimum transcript and documentation expectations. Authentication uses dedicated environment variables and supports both query-parameter and subprotocol-based token delivery for compatibility with different proxy configurations.

The E2E script produces a structured JSON report indicating whether the session passed or failed, along with the thresholds that were applied and the observed values. Production long-session runs validate sustained audio streaming, clean session completion, transcript volume, SOAP section coverage, and connection health. This makes the harness suitable for CI pipelines and automated regression checks.
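
The shape of such a report can be sketched as follows. The field names (`passed`, `thresholds`, `observed`, `failed_checks`) and the example check names are hypothetical; the actual report schema is defined by the harness.

```python
import json


def build_report(thresholds: dict, observed: dict) -> dict:
    """Compare observed values against minimum thresholds and summarize the outcome."""
    failed = sorted(
        name
        for name, minimum in thresholds.items()
        if observed.get(name, 0) < minimum  # a missing observation counts as zero
    )
    return {
        "passed": not failed,
        "thresholds": thresholds,
        "observed": observed,
        "failed_checks": failed,
    }


def exit_code(report: dict) -> int:
    """Translate the report into a process exit code for a CI step."""
    return 0 if report["passed"] else 1


def render(report: dict) -> str:
    return json.dumps(report, indent=2, sort_keys=True)
```

Reporting both the applied thresholds and the observed values means a failed CI job explains itself without anyone re-running the session.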

## Tool Testing

The tool testing playground lets you execute individual context graph tools - world tools, skills, and integrations - in isolation without making a phone call or starting a text session. This is useful during development when you need to verify that a tool behaves correctly before wiring it into a full conversation flow.

### Test Call Error Reporting

When a test call cannot start because of a configuration problem - such as a missing agent version, an unpublished context graph, or a service that does not exist - the platform rejects the connection with a typed error code and a human-readable explanation. This gives developers immediate feedback about what to fix instead of silently starting a degraded session. Error codes are stable identifiers that frontends can use to display targeted banners or dialogs. Production calls are not affected by this behavior.

### Call Timeline

The call playback timeline provides a structured representation of a call's conversation flow, including turns, segments, and duration. This timeline powers the call detail visualization in the Developer Console and is available both as part of the full call detail response and as a standalone resource for timeline-only consumers such as embedded playback components.

The call detail view in the Developer Console includes a multi-track timeline that visualizes what happened during a call. The timeline organizes segments by actor role - Caller, Agent, Operator, Tools, and System - so you can see each participant's activity in a dedicated horizontal lane.

Each track displays colored blocks representing segments like speech, tool calls, state transitions, silence, and barge-in events. Block colors communicate meaning: caller blocks reflect conversational tone, agent blocks distinguish greetings from filler and interrupted speech, tool blocks indicate success or failure, and system blocks show infrastructure events like state transitions and silence gaps.

When a recording is available, a playhead tracks the current position across all lanes, and you can click or use keyboard controls to seek to any point in the call. The Caller and Agent tracks remain visible even when no segments are present for those actors, so the waveform visualization and playhead are always accessible.

### Turn Timeline

The turn timeline is a structured representation of everything that happened during a call, broken into segments that show who spoke, when, and what occurred. Each segment carries actor semantics - it identifies the participant responsible (agent, caller, operator, tool, or system) along with their role in the conversation. This makes it possible to filter, group, and visualize call activity by participant rather than just by time.

Segments are organized into tracks that correspond to the actors in the conversation. For example, caller speech appears on the caller track, agent responses on the agent track, and tool invocations on the tool track. System-level events like state transitions and processing gaps appear on a dedicated system track. This track structure supports multi-lane timeline visualizations in the Developer Console and through the API.

Actor information is inferred automatically for segments produced by older pipeline versions, so the timeline is consistent regardless of when the call was made.
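
The track structure and actor inference described above can be sketched like this. The segment fields (`kind`, `actor`, `start_ms`) and the kind-to-actor mapping are assumptions for illustration; the real timeline schema is documented in the API reference.

```python
TRACK_ORDER = ["caller", "agent", "operator", "tool", "system"]

# Hypothetical mapping used when an older segment carries no actor field.
KIND_TO_ACTOR = {
    "caller_speech": "caller",
    "agent_speech": "agent",
    "tool_call": "tool",
    "state_transition": "system",
    "silence": "system",
}


def build_tracks(segments: list[dict]) -> dict[str, list[dict]]:
    """Group segments into per-actor tracks, inferring the actor when it is missing."""
    # Every known track exists up front, so empty lanes can still be rendered.
    tracks: dict[str, list[dict]] = {actor: [] for actor in TRACK_ORDER}
    for segment in sorted(segments, key=lambda s: s["start_ms"]):
        actor = segment.get("actor") or KIND_TO_ACTOR.get(segment["kind"], "system")
        tracks.setdefault(actor, []).append(segment)
    return tracks
```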

The voice playground includes a turn-by-turn timeline that provides structured inspection of every conversation turn during a live or completed session. The timeline displays a master-detail layout: a compact turn list on the left with color-coded event summaries, and a detail panel on the right showing the selected action, state transitions, tool calls (with expandable input/output), latency breakdown, and caller emotion data.

The timeline supports bidirectional linking with the conversation transcript. Clicking an agent message in the transcript selects the corresponding turn in the timeline. Clicking a turn in the timeline scrolls and highlights the corresponding message in the transcript. When a state is clicked in the context graph panel, the timeline detail updates to show available actions for that state.

### Caller ID in Playground Sessions

Both voice and text playgrounds support setting a simulated caller phone number before starting a session. The caller identity is forwarded to the engine so it can run patient resolution, letting you test caller-specific behavior (such as greeting a known patient by name or loading their clinical context) without making a real phone call or connecting to a live telephony provider.

### Channel-Appropriate Output

Text simulation sessions automatically use the web channel profile. This means the agent omits voice-specific markup (such as TTS pronunciation hints and vocal annotations) that would otherwise appear as literal text in the chat interface. Voice playground sessions use a separate connection path and apply the voice channel profile. You do not need to configure this - the platform selects the correct profile based on which playground or simulation path you use.

### Caller ID in Simulation Commands

The Agent Forge CLI simulation commands (`session-create`, `smoke-test`, and `bridge`) accept a `--caller-id` flag to set a simulated caller phone number in E.164 format. When provided, the agent resolves the number as a known caller, so the session starts with full patient context. This is useful for testing caller-specific flows - greeting a known patient by name, loading clinical history, or verifying identity-dependent routing - without making a real phone call. Omit the flag to simulate an unknown caller. See [Agent Forge CLI](/reference/agent-forge.md#simulation-caller-id) for usage.
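
Because `--caller-id` expects E.164, it can help to validate numbers in test scripts before passing them to the CLI. The check below is a standalone sketch based on the E.164 format (a leading `+`, a non-zero first digit, and at most 15 digits in total); `is_e164` is an illustrative helper, not part of Agent Forge.

```python
import re

# "+" followed by 2 to 15 digits, where the first digit is not zero.
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the string is a syntactically valid E.164 number."""
    return E164_PATTERN.fullmatch(number) is not None
```

This checks syntax only; it does not confirm that the number is assigned or that it matches a patient in the workspace.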

In the playgrounds, the caller number is saved per mode (voice and text are stored independently) and restored on your next visit. Leaving the field blank omits the caller identity, which causes the engine to use its default caller with no patient match.

Tool tests in the tool testing playground run against the live world model, but with safety guardrails:

* **Source isolation** - All writes from tool tests are tagged with a dedicated test source, which is excluded from outbound sync. No real EHR writes, SMS deliveries, or external side effects happen.
* **Surface delivery blocking** - Surfaces created during tool tests are blocked from delivery to patients.
* **Dry run mode** - Write tools can be executed in dry run mode, which simulates the operation and reports what would have happened without persisting anything.
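
Two of the guardrails above, source isolation and dry run mode, can be sketched together. The `WorldModel` class, the `tool_test` source tag, and the result fields are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, field

TEST_SOURCE = "tool_test"  # hypothetical tag applied to every tool-test write


@dataclass
class WorldModel:
    writes: list[dict] = field(default_factory=list)

    def write(self, entity_id: str, payload: dict, source: str, dry_run: bool = False) -> dict:
        record = {"entity_id": entity_id, "payload": payload, "source": source}
        if dry_run:
            # Report what would have happened without persisting anything.
            return {"persisted": False, "would_write": record}
        self.writes.append(record)
        return {"persisted": True, "written": record}

    def outbound_sync_queue(self) -> list[dict]:
        """Writes eligible for external sync; test-sourced writes are excluded."""
        return [w for w in self.writes if w["source"] != TEST_SOURCE]
```

In this sketch a tool-test write is stored and visible locally, yet never reaches the outbound queue, while a dry run leaves no record at all.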

Simulation and playground sessions also produce visible entity state in the world model. Events from these sessions are included in entity state projections so that testing workflows reflect the same data the agent would see in production. Other analytical pipelines (metrics, encounter detection, gap detection) continue to filter these events out, keeping production analytics clean.

For each tool, the playground resolves the full tool definition (input schema, description, tier, write/read classification) and lets you provide custom input parameters and an optional entity context. After execution, you see the result, execution duration, any sub-tool calls that were made, and a list of blocked side effects.

Tool testing answers the question: **does this tool do what I expect with this input?**

{% hint style="info" %}
**Permissions**: Tool testing requires admin or owner access to the workspace. This is a developer workflow, not a production testing mechanism - use simulations for pre-deployment validation.
{% endhint %}

## Drift Detection

Drift detection tracks agent performance over time and alerts you when quality degrades. The platform monitors metric trends across conversation cohorts and compares actual behavior distributions against expected baselines. When drift is detected, it can trigger alerts, block promotions, or initiate automatic rollbacks.
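
The two comparisons mentioned above, metric trends and behavior distributions, can be illustrated with a short sketch. The statistics used here (a drop in the mean score and total variation distance) and the tolerance value are assumptions chosen for simplicity; the platform's actual drift statistics are described on the Drift Detection page.

```python
from statistics import mean


def metric_drift(baseline: list[float], recent: list[float], max_drop: float = 0.05) -> dict:
    """Flag drift when the recent cohort's mean score falls too far below baseline."""
    baseline_mean = mean(baseline)
    recent_mean = mean(recent)
    return {
        "baseline_mean": baseline_mean,
        "recent_mean": recent_mean,
        "drifted": (baseline_mean - recent_mean) > max_drop,
    }


def total_variation_distance(expected: dict[str, float], actual: dict[str, float]) -> float:
    """Distance between two behavior distributions: 0 is identical, 1 is disjoint."""
    keys = set(expected) | set(actual)
    return 0.5 * sum(abs(expected.get(k, 0.0) - actual.get(k, 0.0)) for k in keys)
```

A result such as `metric_drift(...)["drifted"]` could then feed whichever response is configured: an alert, a blocked promotion, or a rollback.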

Drift detection answers the question: **is the agent getting worse?**

{% content-ref url="/pages/ogceEl19CRp2g6SKFaRb" %}
[Drift Detection](/testing/testing/drift-detection.md)
{% endcontent-ref %}

***

## How the Pillars Work Together

Simulations, metrics, and drift detection form a continuous loop:

1. **Before deployment**, simulations verify that the agent handles target scenarios correctly and meets metric thresholds.
2. **In production**, metrics evaluate every conversation to track ongoing quality.
3. **Over time**, drift detection watches for degradation and triggers re-evaluation when performance shifts.

When drift is detected, you update your simulations and re-verify before promoting changes. This creates a feedback cycle where testing improves alongside the agent.

{% hint style="info" %}
**For Developers**: See the [REST API reference](https://docs.amigo.ai/developer-guide/core-api/metrics) and [Simulations reference](https://docs.amigo.ai/developer-guide/core-api/simulations/) for endpoint details, request/response schemas, and SDK code examples.
{% endhint %}

