> For the complete documentation index, see [llms.txt](https://docs.amigo.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.amigo.ai/testing/testing.md).

# Testing Overview

The platform provides a testing and evaluation framework for verifying agent behavior before deployment, measuring quality in production, and detecting degradation over time. The agent execution layer supports configurable agent harness selection, so simulation and production runs can target different harness backends and models as needed. The platform currently supports two harness backends - one optimized for interactive code generation and one for non-interactive batch execution - and the harness used for each run is recorded on the result for auditability.

## Testing Philosophy

Healthcare workflows are long and multi-step. A single patient interaction might span 20 or more steps across scheduling, insurance verification, EHR writeback, and outbound follow-up calls. These workflows touch multiple external systems, each with its own availability characteristics, rate limits, and failure modes.

Testing these workflows against live systems repeatedly is impractical. External dependencies are unreliable, test cycles become slow, and results are flaky. Developers should not need access to a live EHR or a working telephony stack to verify that their agent logic is correct.

The platform addresses this by supporting three principles:

1. **Freeze the world model at a known state.** Simulations can run against a snapshot of a known patient population with known data. The agent sees the same world every time, so test results are deterministic and reproducible.
2. **Test agent logic in isolation.** Complex multi-step reasoning, branching, escalation rules, and context graph transitions can all be validated without calling external systems. The simulation framework exercises the agent's decision-making, not the availability of downstream infrastructure.
3. **Validate integration boundaries independently.** EHR writes, outbound calls, and other external interactions have contract-based guarantees at their integration boundaries. These contracts define expected request shapes and response structures, and can be validated separately from the agent logic that produces them.

This gives developers fast, reliable test cycles. You can run a full regression suite in minutes without provisioning test environments, managing credentials for external systems, or coordinating with IT teams that control EHR access.
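As a rough illustration of the third principle, a contract check at an integration boundary can be as small as comparing a payload against expected field names and types. The contract below, including every field name, is hypothetical and is not the platform's actual EHR schema; it only sketches the idea of validating request shapes separately from agent logic.

```python
from typing import Any

# Hypothetical contract for an appointment write; field names are illustrative only.
APPOINTMENT_WRITE_CONTRACT: dict[str, type] = {
    "patient_id": str,
    "appointment_type": str,
    "start_time": str,
}


def contract_violations(payload: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return readable violations; an empty list means the payload conforms."""
    problems: list[str] = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems
```

A check like this can run against payloads captured from simulated conversations, so a malformed write is caught without contacting the external system.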

```mermaid
flowchart LR
    S["Simulations\n(pre-deployment)"] --> M["Metrics\n(per-interaction scoring)"]
    M --> D["Drift Detection\n(trend analysis)"]
    D -->|Degradation found| S
    M -->|Quality data| Dash["Dashboards +\nAlerts"]
```

## Simulations

Simulations let you test your agent against synthetic users in controlled scenarios. The Developer Console playground - found at the top of the Conversations section in the sidebar (note: the Calls page is accessible via direct URL but is not listed in the sidebar) - displays only active services in the order returned by the server, with pagination for workspaces with large service catalogs, and supports Voice, Text, and Realtime (streaming) modes for interactive testing.

The inline agent trace in the playground displays tool call durations next to each completed tool call, formatted in seconds with one decimal place, so you can see at a glance how long each tool invocation took during a conversation. The turn detail panel shows a loading indicator for available actions while a turn response is in-flight, so you always see accurate action state rather than stale data from a previous turn. The playground layout supports toggling side panels (context graph, event log) via keyboard shortcuts or toolbar buttons.

Text mode starts manually - you land on the playground page, optionally configure an entity ID, choose whether the user or the agent speaks first, and click start when ready. While a text message is in flight and the agent response is pending, the chat input displays an "Awaiting agent response" indicator and is disabled until the agent replies, providing clear feedback that the message was sent. The entity ID input validates that the value is either a UUID or a phone number before allowing the session to start - invalid formats are flagged inline with a descriptive error, and the start action is blocked until the value is corrected or cleared. If a text turn fails mid-session (for example, due to a transient backend error), the playground surfaces the error in a banner rather than silently showing no response.

When an integration with an approval gate parks a write during a Text playground conversation, an inline approval card appears above the chat input showing the tool name and Approve/Reject controls. Approving lets the agent proceed with the call; rejecting declines it with an optional reason the agent sees. The card clears after resolution, and the conversation continues. When a rejection is consumed, the agent narrates the decline on that same turn - it does not echo a stale "pending review" status before correcting itself. Approval resume is transport-agnostic - the decision is durably recorded and the agent consumes it on the conversation's next turn regardless of channel, so approval works in REST, streaming, and SMS playground modes without requiring an active streaming connection. When an approved write fires, it appears as a first-class tool call in the turn's tool-call logs - with its own call ID, input parameters, result, success status, and execution duration - so approved writes are visible alongside LLM-initiated calls in the playground's inline agent trace and tool-call history. The agent explicitly narrates the approval outcome, confirming to the user that the action was approved and completed (or approved but failed).

In user-first mode (the default), the session waits for the user to send the opening message. In agent-first mode, the platform sends an initial turn to the agent immediately after session creation, so the agent generates the opening message without waiting for user input. Simulation sessions also support an outbound (agent-first) conversation direction, where the agent speaks first against task context rather than waiting for the caller. Outbound sessions reuse the same opening path as production outbound voice calls, so the agent's first message reflects the task context (such as the target patient entity) rather than a generic inbound greeting. This is useful for testing proactive outreach workflows, appointment reminders, and other agent-initiated scenarios. Outbound mode is incompatible with lazy session initialization - outbound requires the agent to produce an opening message at session creation time, while lazy defers initialization to the first caller step.

Simulation sessions can also be bound to a specific patient entity at creation time, giving the simulated conversation access to the same patient context resolution used in production calls. When an entity ID is provided and matches a world entity in the workspace, the session resolves caller context directly from that entity rather than relying on phone-based lookup. If the entity does not exist (stale, deleted, or wrong workspace), the session falls back to phone lookup with no error - the simulation continues without interruption. The supplied caller phone number is still recorded and surfaced in greeting metadata regardless of which resolution path is used. Entity binding is workspace-scoped - an entity from one workspace cannot leak data into a simulation running in a different workspace. Bridge runs also accept an entity ID, which is forwarded to every scenario session and inherited by forked sessions, letting you pin an entire regression suite to a specific test patient. Entity IDs must be valid non-zero UUIDs - malformed values are rejected before the request reaches the agent engine.

Simulation-originated calls appear in the Developer Console call log as "Test Call" entries, distinguished from inbound and outbound production calls. Simulation usage is metered separately from production usage across all billing meters - every meter carries a traffic class (production or simulation), and the billing system prices each class independently. This means testing activity does not inflate production cost reporting, and simulation traffic can carry different rates. Billing periods are anchored to when events actually occurred rather than when they were recorded, so backfilled events land in the correct historical billing period instead of the current month. Simulation metering covers run completions, wall-clock simulation time, evaluation results, and LLM token consumption for scenario generation, evaluation judging, simulated caller turns, and auxiliary LLM calls such as greeting pre-generation and sim recommendation requests. Each metering event uses a deterministic identifier derived from the run and event type, so retries and backfills are idempotent - duplicate emissions do not result in double-billing.

You define personas (who the simulated user is), scenarios (what situation they are in), and success criteria (what the agent should do). The platform runs these conversations automatically and scores the results. Personas and scenarios are managed through Agent Forge or the API.
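The idempotent metering described above can be pictured with a name-based UUID. This is a minimal sketch under stated assumptions: the namespace value, the `run_id:event_type` key format, and the `Meter` class are all hypothetical, and the platform's actual identifier scheme is not documented here.

```python
import uuid

# Hypothetical namespace; any fixed UUID works for the illustration.
METERING_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000001")


def metering_event_id(run_id: str, event_type: str) -> str:
    """Derive the same identifier every time for the same run and event type."""
    return str(uuid.uuid5(METERING_NAMESPACE, f"{run_id}:{event_type}"))


class Meter:
    """In-memory stand-in for a billing meter that ignores duplicate emissions."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.billed: list[tuple[str, str]] = []

    def record(self, run_id: str, event_type: str, traffic_class: str = "simulation") -> bool:
        event_id = metering_event_id(run_id, event_type)
        if event_id in self._seen:
            return False  # retry or backfill of an event that was already billed
        self._seen.add(event_id)
        self.billed.append((event_id, traffic_class))
        return True
```

Because the identifier depends only on the run and event type, emitting the same event twice produces the same ID and the second emission is dropped.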

Simulated callers are automatically grounded in your workspace's real-world data. The platform resolves scheduling facts - locations, appointment types, visit reasons, providers, transfer targets, and escalation rules - from your service configuration, context graph, and connected clinical data sources. Simulated callers use only these facts when interacting with your agent, so test conversations reflect actual workspace data rather than invented details. When no facts are resolved for a category, simulated callers acknowledge uncertainty instead of fabricating information. This grounding ensures that simulation results are representative of real production conversations.

Simulation sessions use the same incremental session state engine as production text sessions and voice calls. All three session types - text, voice, and simulation - persist interaction state through a shared engine, so session data is stored, retrieved, and cleaned up identically regardless of channel. Each step appends only new interaction log entries to the session store. The agent's navigation position, visited states, and message history are derived from the interaction log on load rather than stored in a separate record.

The session store writes per-session data (conversation turns and call intelligence) to a fast cache with automatic expiration and emits events to the durable analytics pipeline. The engine loads state exclusively from this fast cache on the live interaction path to keep per-request latency predictable. When cached call intelligence is unavailable - for example, after cache expiration - the platform falls back to the durable analytical store automatically, so historical conversations retain their intelligence data without manual intervention. Historical replay and backfill belong in one-time data repair jobs and analytical projections, not in request-path reads.

Each entry carries journal metadata that defines its absolute position in the session timeline, and the engine requires canonical journal metadata on every entry - entries missing a journal kind, schema version, or index are rejected rather than silently accepted. This strict validation ensures that returning sessions always replay with deterministic ordering and full interaction history. Events emitted to the durable analytics pipeline carry deterministic identifiers derived from the entry's journal position, so retries are idempotent - duplicate deliveries do not create duplicate downstream records. The event pipeline flushes delivery confirmation before caching, so a delivery failure is never masked by a successful cache write.

Call and conversation detail endpoints share a single set of read-path dependencies initialized at startup, so turn resolution behaves consistently regardless of which endpoint serves the request. Call intelligence artifacts follow the same tiered resolution pattern - the hot session cache is checked first, with automatic fallback to the durable analytical store. This applies to both individual call detail requests and batch conversation list queries, so intelligence fields (quality scores, completion reasons, duration, and analytics summaries) are served with consistent freshness regardless of access pattern.

The call list is driven by a unified view that joins call intelligence records with conversation records, so calls that have intelligence data but no conversation record still appear in the list. This includes Atlas voice calls, which emit a lightweight call intelligence envelope at call end to ensure list membership even before full post-call analysis is available. The envelope omits quality scores entirely so that envelope-only calls do not affect aggregate quality metrics. This means calls processed through the analytics pipeline are surfaced even if they were not tracked as a traditional conversation - no calls are silently dropped from the list view. Batch list queries resolve intelligence data for an entire page of conversations in a single lookup rather than issuing per-call queries, keeping list latency predictable as page sizes grow.

Text and voice conversations both resolve turns through the same session log pipeline - reading from the hot session store first and falling back to the durable analytical store for historical data. This two-tier resolution applies to both text and voice conversation detail endpoints, so turns are always available even after the hot cache expires. This unified turn resolution ensures consistent turn format, timestamps, and role mapping across all conversation channels. Every call that appears in the call list is guaranteed to resolve through the call detail endpoint - even calls that exist only as call intelligence records without a corresponding conversation, and historical simulation or playground calls whose turn-level data was not preserved in current durable stores. For these calls, the detail response includes all available metadata and scoring fields with an empty turns array, rather than returning a not-found error.

The platform validates all session persistence dependencies at startup - if any required backing service is unavailable, the engine fails immediately rather than silently dropping writes during conversations. When turn persistence fails during a conversation - for example, due to a transient storage error or timeout during session log sync - the platform surfaces the failure explicitly rather than returning a success response with silently lost data. Non-streaming text interactions return an error status indicating that conversation history may be incomplete, and streaming interactions emit an error event on the stream. This ensures that callers are always aware when turn data has not been durably persisted, so they can retry or take corrective action.

This fail-fast behavior extends to all hot-path dependencies, including the active-session cache and the event pipeline. The same principle applies to conversation routing: text session routing and conversation directory components require their hot-path cache to be present and healthy. Missing or unavailable cache infrastructure is a startup error, not a condition that silently degrades routing onto a fallback path. This ensures a single, predictable failure mode rather than split behavior across stores. In deployed environments, services refuse to start when critical dependencies are misconfigured or unreachable, ensuring that degraded infrastructure is caught at deploy time rather than during live conversations.

This unified persistence model means simulation results are stored and retrieved identically to production sessions, forked sessions receive independent state copies that cannot interfere with each other, and session cleanup removes all persisted data cleanly.
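To make the journal validation and log-derived state concrete, the sketch below replays a list of entries into a session state. The entry layout (`journal`, `state`, `text` keys) and the two kind names are assumptions chosen for illustration; only the three required journal fields come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_JOURNAL_FIELDS = ("kind", "schema_version", "index")


@dataclass
class SessionState:
    current_state: Optional[str]
    visited_states: list
    messages: list


def replay(entries: list) -> SessionState:
    """Validate every entry, then derive state by replaying in journal order."""
    for entry in entries:
        journal = entry.get("journal") or {}
        missing = [f for f in REQUIRED_JOURNAL_FIELDS if journal.get(f) is None]
        if missing:
            # Strict validation: reject rather than silently accept.
            raise ValueError(f"entry rejected, missing journal fields: {missing}")

    visited, messages = [], []
    for entry in sorted(entries, key=lambda e: e["journal"]["index"]):
        kind = entry["journal"]["kind"]
        if kind == "state_transition":  # hypothetical kind name
            visited.append(entry["state"])
        elif kind == "message":  # hypothetical kind name
            messages.append(entry["text"])
    return SessionState(visited[-1] if visited else None, visited, messages)
```

Nothing here is stored separately: the current state, visited states, and message history all fall out of the ordered log, which is why a missing index has to be an error rather than a guess.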

Simulation sessions generate call intelligence data that flows through the same session store and event pipeline used for production calls. When a simulation run completes successfully, the platform emits one call intelligence event per session, carrying the full conversation transcript, tool call details, state traversal history, and scoring data. The simulation artifact uses the same validated schema as voice and text sessions, so all channels produce structurally identical payloads. These events are ingested by the same metric evaluation pipeline that handles production calls, so custom metrics, quality scores, and analytics run against simulation results without any additional configuration. The unified data path ensures that what you measure in testing matches exactly what you measure in production. Playground-originated calls are explicitly excluded from metric projection, so interactive testing in the Developer Console does not affect production or simulation metrics.

Errored runs skip call intelligence emission because session data is incomplete and metric evaluation would produce misleading results. Only cleanly completed runs feed into the metric pipeline.

When a session completes and the durable event is emitted successfully but the hot cache write fails, metric projection still proceeds. The platform distinguishes between durable event emission failures (which skip metric projection entirely) and cache-only failures (which allow metric projection to continue). This ensures that transient cache issues do not silently drop analytics data for completed sessions.
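The rules in the last two paragraphs reduce to a small decision function. This is a sketch of the described behavior, not the platform's code; the function and parameter names are hypothetical.

```python
def should_project_metrics(run_errored: bool, durable_emit_ok: bool, cache_write_ok: bool) -> bool:
    """Decide whether a finished session feeds the metric pipeline."""
    if run_errored:
        return False  # errored runs skip call intelligence emission entirely
    if not durable_emit_ok:
        return False  # durable emission failure skips metric projection
    # A cache-only failure does not block projection, so cache_write_ok
    # is deliberately not consulted here.
    return True
```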

Simulation sessions that fail before producing any transcript - for example, due to a timeout or internal error during the first turn - are automatically scored as failures and the run is marked with an error indicating how many sessions failed before generating conversation data. This prevents zero-turn sessions from being silently treated as successful completions and ensures that runs surface meaningful diagnostics when the agent or its dependencies are unavailable.

When a simulation run completes, the platform computes on-the-fly metric values for any active metrics that use AI-based evaluation before the metric evaluation loop begins polling. This means metric results for AI-evaluated metrics are available in the hot store immediately when the eval checks run, rather than requiring the downstream batch pipeline to process them first. The platform builds the same prompt and applies the same parsing rules as the batch pipeline, so on-the-fly values match what the batch would produce. Each model call is metered as simulation usage, consistent with other simulation LLM costs. If on-the-fly computation fails for a particular metric or session, that metric falls through to pending status rather than erroring the run - the batch pipeline will eventually produce the value.

Because metric evaluation happens asynchronously after a run completes, metric results may not be available immediately. The platform tracks metric availability for each simulation run, session, and benchmark. Every run and session exposes a metric status that indicates whether metric results are pending, available, or unavailable, along with a count of results produced so far and a timestamp of the last availability check. This lets dashboards and API consumers show accurate progress indicators (e.g. "metrics processing" or "3 metric results available") without polling the metric pipeline directly. Benchmark results aggregate metric availability across all constituent runs, so you can see at a glance whether the full benchmark has metric coverage or is still waiting for evaluation to complete.
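A client consuming these fields might roll run-level availability up to a benchmark and render a progress label as sketched below. The three status values come from the text above; the roll-up rule (pending if any run is pending, available only if every run is available) and all names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class MetricStatus(str, Enum):
    PENDING = "pending"
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"


@dataclass
class MetricAvailability:
    status: MetricStatus
    result_count: int


def aggregate(runs: list) -> MetricAvailability:
    """Roll run-level availability up to a benchmark (assumed rule, see lead-in)."""
    total = sum(r.result_count for r in runs)
    if any(r.status is MetricStatus.PENDING for r in runs):
        status = MetricStatus.PENDING
    elif runs and all(r.status is MetricStatus.AVAILABLE for r in runs):
        status = MetricStatus.AVAILABLE
    else:
        status = MetricStatus.UNAVAILABLE
    return MetricAvailability(status, total)


def progress_label(availability: MetricAvailability) -> str:
    """Render a short label for a dashboard."""
    if availability.status is MetricStatus.PENDING:
        return "metrics processing"
    if availability.status is MetricStatus.AVAILABLE:
        return f"{availability.result_count} metric results available"
    return "metrics unavailable"
```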

Scenario generation streams results progressively - as each scenario is generated, it is persisted to the run immediately. UI clients polling the run see scenarios appear one by one rather than waiting for the entire batch to finish. This is especially useful for large scenario sets, where generation can take over a minute. The run also reports the target scenario count so clients can render progress (e.g. "3 of 10 scenarios generated") without waiting for the full batch.

Generated scenarios are automatically saved as durable simulation cases. Each case captures the persona, scenario content (instructions, initial message, temperament), patient bindings, and evaluation criteria used during generation.

Each case can define evaluation criteria (evals) that are executed automatically when a simulation run completes. Evals come in two types: assertions and metric checks. Assertions validate conversation outcomes directly - for example, checking whether a specific phrase appeared in the transcript, whether a particular tool was called, whether the conversation ended in an expected state, or through an AI judge that evaluates the transcript against a natural-language criterion. Metric checks compare observed metric values against configured expectations (exact match, numeric range, or string containment). Eval results are computed per-run and include a status (passed, failed, pending, skipped, or error), an optional numeric score, and a rationale explaining the outcome. The GET run detail endpoint returns eval results inline alongside the run data, with summary counts for total, passed, failed, and errored evals.

Saved cases can be browsed, edited, and re-run independently of the original simulation run. Cases created by the scenario generator include metadata indicating their source, so you can distinguish hand-authored cases from machine-generated ones. If a run is replayed, previously persisted cases are reused rather than duplicated.

The case library is accessible through the API, with support for creating cases in bulk, listing cases with tag and service filters, fetching individual cases by ID, updating case fields, and deleting cases. All case endpoints guarantee that scenario instructions are present and non-empty in responses - legacy cases that predate the current schema are backfilled automatically, falling back to the case description when original scenario content is unavailable. Bulk creation supports seeding up to 100 cases per request, enabling automated import from external test management systems or CI pipelines. Each case carries a metadata object that can record how it was created, so you can distinguish hand-authored cases from those seeded by automation or generated by the scenario generator. The list endpoint includes a total count alongside paginated results, so clients can render accurate pagination controls without issuing a separate count request.
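The three metric check expectations (exact match, numeric range, string containment) can be sketched as a single function. The expectation dictionary shape and its keys (`type`, `value`, `min`, `max`, `substring`) are hypothetical and do not reflect the actual case schema; the returned strings reuse the eval statuses listed above.

```python
from typing import Any


def check_metric(observed: Any, expectation: dict) -> str:
    """Compare an observed metric value against one configured expectation."""
    if observed is None:
        return "pending"  # the metric value has not been produced yet
    kind = expectation.get("type")
    if kind == "exact":
        ok = observed == expectation["value"]
    elif kind == "range":
        ok = expectation["min"] <= observed <= expectation["max"]
    elif kind == "contains":
        ok = expectation["substring"] in str(observed)
    else:
        return "error"  # unknown expectation type
    return "passed" if ok else "failed"
```

For example, `check_metric(0.92, {"type": "range", "min": 0.8, "max": 1.0})` returns `"passed"`, while the same expectation with an observed value of `None` returns `"pending"`.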

The Developer Console provides a Case Library page for browsing and searching saved cases, with filters for service and suite. The Case Library and Suites pages are currently available as a private preview feature - workspaces must be enrolled in the private preview program to access these pages and their associated navigation links.

The console loads cases incrementally with pagination, displaying loaded and total case counts in the table header and offering a "Load more cases" button to fetch additional pages. When a suite filter is active, the console loads only the cases belonging to that suite - suites with explicit case lists load those cases directly, suites with required tags fetch matching cases using tag filters, and hybrid suites that combine both load from both sources.

Each case row in the library can be expanded inline to show the full scenario, persona, eval criteria, service, labels, patient binding, and opening message without leaving the list view. Cases can also be opened in a dedicated detail view showing persona, scenario content (including temperament), patient bindings, evaluation criteria, and metadata. The case detail view displays the case source (how it was created), case ID, workspace ID, service ID, and created-by fields, and organizes grounding data, evaluation criteria, and metadata into structured key-value sections rather than raw data views. Evaluation criteria are rendered as individual cards showing type, key, expected values, weight, and parameters. The case library displays which suites each case belongs to, derived from suite membership. Cases with a saved service can be run directly from the library with a single click.

Cases are also organized into suites. Suites are first-class resources that define a reusable collection of simulation cases for batch execution. Each suite has a name, description, an explicit list of case IDs, optional required tags that dynamically match additional cases, and its own tags and metadata. Suites do not bind to a specific service - the service is supplied at run time by the benchmark request or by each saved case's own configuration. You can create, list, get, update, and delete suites through the API.

The Suites page in the Developer Console lists all suites with case counts, descriptions, and timestamps, loading suite metadata directly from the API without requiring the full case library. Suite filtering on the Case Library page loads suites from the platform API rather than deriving them from case tags, so the suite list is always consistent with the suites visible on the Suites page. When a suite filter is active, the service filter is disabled since suite membership already determines which cases are shown.

An entire suite can be executed as a batch benchmark run from the console or the API, with results appearing in the standard simulation runs list. The simulation runs list groups suite runs into collapsible rows that show aggregate status, session and turn totals, and the services involved. Expanding a suite run row reveals the individual child runs. A run source filter lets you narrow the list to suite runs, case runs, or spot checks. The run detail page displays outcome summary cards for session scores, eval results, and run status at the top of the page.
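Suite membership combines an explicit case list with tag-matched cases. The sketch below shows one way that resolution could work; it assumes a case matches only when it carries every required tag, which the text above does not specify, and the function and argument names are hypothetical.

```python
def resolve_suite_cases(case_ids: list, required_tags: list, library: dict) -> list:
    """Combine explicit case IDs with tag-matched cases, without duplicates.

    `library` maps case ID -> set of tags. Assumption: a case matches when it
    carries all of the suite's required tags.
    """
    resolved = [cid for cid in case_ids if cid in library]
    if required_tags:
        needed = set(required_tags)
        for cid, tags in library.items():
            if cid not in resolved and needed <= set(tags):
                resolved.append(cid)
    return resolved
```

A hybrid suite with `case_ids=["c3"]` and `required_tags=["smoke"]` would then yield `c3` followed by every case tagged `smoke`.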

Simulation sessions inject caller emotion data only when the session is configured for voice modality. Text and web simulation sessions do not apply acoustic emotion signals, matching the behavior of production text channels where acoustic emotion data is not available. This ensures that simulated text conversations produce the same empathy and engagement behavior as real production text sessions, rather than artificially elevated empathy responses caused by emotion data that would not exist in a real text interaction.

Simulation bridge batch runs use lazy session initialization by default. When a batch creates many sessions at once, each session is created as a lightweight shell without booting the agent engine or generating a greeting. The engine initializes on the first step request for each session, which collapses the upfront cost of session creation across large batches. The caller ID provided at session creation is preserved and used when the engine boots, so caller context resolution works correctly even though initialization is deferred.
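Lazy initialization can be pictured as a session shell that holds the caller ID and boots the engine on the first step. This is an illustrative sketch only: `LazySession` and the `boot_engine` callable are hypothetical stand-ins, not platform classes.

```python
from typing import Callable, Optional


class LazySession:
    """Lightweight shell: nothing is booted until the first step arrives."""

    def __init__(self, caller_id: str, boot_engine: Callable[[str], Callable[[str], str]]) -> None:
        self.caller_id = caller_id  # preserved so context resolution works later
        self._boot_engine = boot_engine
        self._engine: Optional[Callable[[str], str]] = None

    @property
    def initialized(self) -> bool:
        return self._engine is not None

    def step(self, message: str) -> str:
        if self._engine is None:
            # Deferred cost: the engine boots once, using the stored caller ID.
            self._engine = self._boot_engine(self.caller_id)
        return self._engine(message)
```

Creating a thousand such shells costs almost nothing up front, which also shows why outbound mode cannot use this path: an outbound session needs the engine running at creation time to produce the opening message.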

Session execution starts per scenario, as each scenario lands from the streaming generator, rather than waiting for all scenarios to finish generating before running any sessions. With a typical generation stream lasting around a minute, the first session begins within seconds of the first scenario arriving, which shortens perceived run time for large scenario sets. If generation fails partway through, sessions for scenarios that already arrived continue to run, and their results are preserved on the run record alongside the partial-generation failure.

Simulations also support benchmark runs - batch execution of saved cases selected by suite or by tag, with aggregated scoring and capability-level breakdowns. Benchmarks can be triggered by referencing a suite ID (which resolves the suite's explicit case IDs and required tags into a combined case set) or by specifying required tags directly. Suite ID and required tags are mutually exclusive on a single benchmark request. Benchmarks can include up to 200 cases per run, enabling large regression suites to execute as a single batch.

Suites can also be run directly through a dedicated suite run endpoint that enforces suite-based execution without requiring tag parameters. Each suite run is assigned a durable identifier that groups all constituent case runs together. You can list past suite runs for any suite and retrieve aggregate results for a specific suite run, including status breakdowns, session and turn totals, case coverage, and metric availability. Suite run results use the same aggregation and metric enrichment as benchmark results, so the scoring view is consistent whether you triggered execution via the suite run endpoint or a benchmark request.

Benchmark execution uses queued bridge scheduling - all case runs are prepared and validated up front, and the platform returns stable run IDs immediately. The actual bridge executions are then dispatched as a single batch with a concurrency cap, preventing a burst of simultaneous bridge tasks from overwhelming the system. This means benchmark callers get fast responses with run IDs they can poll, while the platform controls the execution fan-out internally.

Each case in a benchmark can carry its own patient entity ID in its grounding, so different cases can target different test patients within the same benchmark. When an explicit entity ID is provided on the benchmark request it applies to all cases; when omitted, each case falls back to its own grounding-level patient entity ID if one is present. This per-case resolution makes multi-patient benchmarks possible without splitting them into separate runs.

Benchmarks answer the question: **does the agent handle an entire test suite correctly?**
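The benchmark request rules and the per-case entity fallback can be sketched in a few lines. The 200-case limit and the mutual exclusivity of suite ID and required tags come from the text above; the function names, the `patient_entity_id` grounding key, and the error messages are assumptions for illustration.

```python
from typing import Optional

MAX_BENCHMARK_CASES = 200


def validate_benchmark_request(suite_id: Optional[str], required_tags: Optional[list], case_count: int) -> None:
    """Raise ValueError when a request violates the documented constraints."""
    if suite_id and required_tags:
        raise ValueError("suite ID and required tags are mutually exclusive")
    if case_count > MAX_BENCHMARK_CASES:
        raise ValueError(f"benchmarks can include at most {MAX_BENCHMARK_CASES} cases")


def resolve_entity_id(request_entity_id: Optional[str], case_grounding: dict) -> Optional[str]:
    """An explicit request-level ID wins; otherwise use the case's own grounding."""
    if request_entity_id:
        return request_entity_id
    return case_grounding.get("patient_entity_id")  # hypothetical grounding key
```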

All simulation operations are gated by role-based permissions. Read operations - such as listing runs, viewing coverage graphs, querying session turns, and retrieving benchmark results - require the Service view permission. Write operations - such as creating runs, stepping sessions, executing bridge requests, running benchmarks, and deleting coverage graphs - require the Service update permission. API keys whose role does not carry the required permission receive a 403 response. This ensures that viewer and operator roles cannot inadvertently trigger simulation runs or modify coverage data.
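The read/write split can be sketched as a lookup from operation to required permission. The operation names and the `service:view` / `service:update` strings below are hypothetical labels for the Service view and Service update permissions; treating unlisted operations as writes is a conservative assumption for the example, not documented behavior.

```python
# Hypothetical operation names, grouped as the paragraph above describes.
READ_OPERATIONS = {"list_runs", "view_coverage_graph", "query_session_turns", "get_benchmark_results"}
WRITE_OPERATIONS = {"create_run", "step_session", "execute_bridge", "run_benchmark", "delete_coverage_graph"}


def authorize(operation: str, granted_permissions: set) -> int:
    """Return an HTTP-style status: 200 when permitted, 403 otherwise."""
    if operation in READ_OPERATIONS:
        required = "service:view"
    else:
        # Writes, and anything unrecognized, require the stronger permission.
        required = "service:update"
    return 200 if required in granted_permissions else 403
```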

The Developer Console enforces the same permission model in the UI. Simulation write actions - running a saved case, executing a suite, and starting a new simulation run - are disabled for users whose workspace role does not carry the Service update permission (viewer, operator, and member roles). Buttons are visually disabled with an explanatory message, and the forms prevent submission. If a permission change occurs between page load and action submission, the console surfaces the 403 error from the API rather than showing a generic failure message. Only admin and owner roles can trigger simulation runs from the console.

Simulations answer the question: **does the agent handle this situation correctly?**

{% content-ref url="/pages/ge01EpnrtOpPWn6hpB7M" %}
[Simulations](/testing/testing/simulations.md)
{% endcontent-ref %}

## Metrics

Metrics measure the quality of agent conversations across dimensions that matter to your organization. You configure metrics for safety, clinical accuracy, empathy, goal completion, and any other dimension relevant to your use case. Metrics can be evaluated automatically after every session, during simulation runs, or through manual human review.

The Eval Summary dashboard provides a consolidated view of simulation and eval metric scores across your workspace - broken down by metric, over time, and by categorical outcome. See [Intelligence and Analytics](/intelligence-and-analytics/intelligence.md) for details on built-in dashboards.

Metrics answer the question: **how well is the agent performing?**

{% content-ref url="/pages/R2FZnfpyXlTJjCCoPYwH" %}
[Metrics and Quality](/testing/testing/metrics.md)
{% endcontent-ref %}

## Voice Simulation

Voice simulation (VoiceSim) evaluates how changes to voice configuration parameters affect call quality. VoiceSim runs configurations across scenarios covering normal conversations, crisis situations, barge-ins, silence, and speech recognition failures, then scores results to identify optimal settings.

VoiceSim answers the question: **what voice configuration works best for this scenario?**

{% content-ref url="/pages/KaEECAS9Ciko9oisktO5" %}
[Voice Simulation](/testing/testing/voice-simulation.md)
{% endcontent-ref %}

## Agent Readiness

The Agent Readiness dashboard provides a structured view of how ready an agent is for production deployment. It evaluates agents against a tiered readiness rubric - basic, intermediate, and advanced - across categories like task completion, coverage, safety, and communication quality.

Each criterion is evaluated automatically from simulation run data, coverage graph state, and session history. Criteria show pass, fail, or not-yet-measured status. Every criterion in a tier must pass before the agent advances to the next readiness level (levels run from 1 through 5). The dashboard surfaces the specific sessions, untested states, or quality gaps behind each result, so teams can identify exactly what needs improvement before going live.

The readiness rubric enforces minimum evidence thresholds - for example, sustained pass rate criteria require a minimum number of completed simulation runs before they are evaluated. This prevents premature pass/fail judgments on insufficient data.
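
The evidence-threshold idea can be illustrated with a short Python sketch. The class and function names and the numeric thresholds are hypothetical, chosen for illustration; the real rubric's criteria and values are not documented here. What the sketch does take from the text is the three statuses, the minimum-run gate, and the rule that a tier passes only when all of its criteria pass.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_YET_MEASURED = "not_yet_measured"


@dataclass
class PassRateCriterion:
    # Threshold values are illustrative, not the platform's rubric.
    min_runs: int
    min_pass_rate: float

    def evaluate(self, completed_runs: int, passed_runs: int) -> Status:
        # Too little evidence: report "not yet measured" instead of judging.
        # max(..., 1) also avoids dividing by zero when there are no runs.
        if completed_runs < max(self.min_runs, 1):
            return Status.NOT_YET_MEASURED
        rate = passed_runs / completed_runs
        return Status.PASS if rate >= self.min_pass_rate else Status.FAIL


def tier_passes(statuses: list[Status]) -> bool:
    """A tier passes only when it has criteria and every one of them passes."""
    return bool(statuses) and all(s is Status.PASS for s in statuses)
```

Treating "not yet measured" as distinct from "fail" keeps a new agent with five perfect runs from being scored as either ready or broken.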

## Conversation Quality Check

The `forge quality check` command scans production conversations against behavioral detectors that catch agent issues automated metrics may miss: stuck loops where the agent repeats itself, character degeneration, repetitive patterns, incoherent output, and phantom success (the agent claims a tool call succeeded when it actually failed). Quality checks query conversation data directly and report findings with severity, timestamps, and optional message excerpts. See [Agent Forge CLI](/reference/agent-forge.md#conversation-quality-check) for usage.

## Text Conversation Smoke Tests

Agent Forge includes smoke test commands for verifying text conversation endpoints. Use `forge platform conversation create` to create a conversation, then `forge platform conversation send-message` to send turns through the REST API and display the agent's response. Agent Forge also includes a WebSocket smoke-test command, `forge platform conversation text-ws-smoke`, which opens the session WebSocket, sends one message, and requires an agent response. See [Agent Forge CLI](/reference/agent-forge.md#text-conversation-smoke-tests) for usage.

The Developer Console playground provides interactive equivalents of these smoke tests. The unified playground lets you test agents through three modes - a browser-based voice call with live audio, a turn-by-turn text conversation, and a real-time streaming text conversation over WebSocket - all from a single interface. The streaming mode shows messages in real time with automatic reconnection. See [Text Sessions](/channels/text-sessions.md) for details on the underlying channels.

## Tool Testing

The tool testing playground lets you execute individual context graph tools - world tools, skills, and integrations - in isolation without making a phone call or starting a text session. This is useful during development when you need to verify that a tool behaves correctly before wiring it into a full conversation flow.

### Test Call Error Reporting

When a test call cannot start because of a configuration problem - such as a missing agent version, an unpublished context graph, or a service that does not exist - the platform rejects the connection with a typed error code and a human-readable explanation. This gives developers immediate feedback about what to fix instead of silently starting a degraded session. Error codes are stable identifiers that frontends can use to display targeted banners or dialogs. Production calls are not affected by this behavior.

### Call Timeline

The call playback timeline provides a structured representation of a call's conversation flow, including turns, segments, and duration. This timeline powers the call detail visualization in the Developer Console and is available both as part of the full call detail response and as a standalone resource for timeline-only consumers such as embedded playback components.

The call detail view in the Developer Console includes a multi-track timeline that visualizes what happened during a call. The timeline organizes segments by actor role - Caller, Agent, Operator, Tools, and System - so you can see each participant's activity in a dedicated horizontal lane.

Each track displays colored blocks representing segments like speech, tool calls, state transitions, silence, and barge-in events. Block colors communicate meaning: caller blocks reflect conversational tone, agent blocks distinguish greetings from filler and interrupted speech, tool blocks indicate success or failure, and system blocks show infrastructure events like state transitions and silence gaps.

When a recording is available, a playhead tracks the current position across all lanes, and you can click or use keyboard controls to seek to any point in the call. The Caller and Agent tracks remain visible even when no segments are present for those actors, so the waveform visualization and playhead are always accessible.

### Turn Timeline

The turn timeline is a structured representation of everything that happened during a call, broken into segments that show who spoke, when, and what occurred. Each segment carries actor semantics - it identifies the participant responsible (agent, caller, operator, tool, or system) along with their role in the conversation. This makes it possible to filter, group, and visualize call activity by participant rather than just by time.

Segments are organized into tracks that correspond to the actors in the conversation. For example, caller speech appears on the caller track, agent responses on the agent track, and tool invocations on the tool track. System-level events like state transitions and processing gaps appear on a dedicated system track. This track structure supports multi-lane timeline visualizations in the Developer Console and through the API.
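
A consumer of the timeline might group segments into lanes along these lines. This is a minimal sketch: the `Segment` fields and the `group_by_track` helper are hypothetical and do not reflect the actual API payload. Only the five actor tracks come from the description above.

```python
from dataclasses import dataclass

# The five actor tracks named in the text.
TRACKS = ("caller", "agent", "operator", "tool", "system")


@dataclass
class Segment:
    # Hypothetical segment shape; real timeline payloads may differ.
    actor: str
    kind: str
    start_ms: int
    end_ms: int


def group_by_track(segments: list[Segment]) -> dict[str, list[Segment]]:
    """Return one time-ordered lane per actor, keeping empty lanes."""
    lanes: dict[str, list[Segment]] = {track: [] for track in TRACKS}
    for segment in sorted(segments, key=lambda s: s.start_ms):
        # setdefault tolerates an actor outside the known tracks.
        lanes.setdefault(segment.actor, []).append(segment)
    return lanes
```

Pre-seeding every lane means a renderer can always draw the caller and agent rows, even for a call with no segments on them.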

Actor information is inferred automatically for segments produced by older pipeline versions, so the timeline is consistent regardless of when the call was made.

The voice playground includes a turn-by-turn timeline that provides structured inspection of every conversation turn during a live or completed session. The timeline displays a master-detail layout: a compact turn list on the left with color-coded event summaries, and a detail panel on the right showing the selected action, state transitions, tool calls (with expandable input/output), latency breakdown, and caller emotion data.

The timeline supports bidirectional linking with the conversation transcript. Clicking an agent message in the transcript selects the corresponding turn in the timeline. Clicking a turn in the timeline scrolls and highlights the corresponding message in the transcript. When a state is clicked in the context graph panel, the timeline detail updates to show available actions for that state.

### Caller ID in Playground Sessions

Both voice and text playgrounds support setting a simulated caller phone number before starting a session. The caller identity is forwarded to the engine so it can run patient resolution, letting you test caller-specific behavior (such as greeting a known patient by name or loading their clinical context) without making a real phone call or connecting to a live telephony provider.

### Channel-Appropriate Output

Text simulation sessions automatically use the web channel profile. This means the agent omits voice-specific markup (such as TTS pronunciation hints and vocal annotations) that would otherwise appear as literal text in the chat interface. Voice playground sessions use a separate connection path and apply the voice channel profile. You do not need to configure this - the platform selects the correct profile based on which playground or simulation path you use.

### Caller ID in Simulation Commands

The Agent Forge CLI simulation commands (`session-create`, `smoke-test`, and `bridge`) accept a `--caller-id` flag to set a simulated caller phone number in E.164 format. When provided, the agent resolves the number as a known caller, so the session starts with full patient context. This is useful for testing caller-specific flows - greeting a known patient by name, loading clinical history, or verifying identity-dependent routing - without making a real phone call. Omit the flag to simulate an unknown caller. See [Agent Forge CLI](/reference/agent-forge.md#simulation-caller-and-entity-context) for usage.
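
If you script these commands, it can help to check the number before passing it along. The sketch below validates E.164 shape (a leading `+`, then up to 15 digits with a non-zero first digit) and builds only the flag fragment. The helper names are hypothetical, and the sketch deliberately stops short of assembling a full command line, since the remaining arguments are covered in the CLI reference.

```python
import re
from typing import Optional

# E.164: a leading "+", then up to 15 digits, the first of which is not 0.
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(number: str) -> bool:
    """Return True when the string has the shape of an E.164 number."""
    return E164_PATTERN.fullmatch(number) is not None


def caller_id_args(number: Optional[str]) -> list[str]:
    """Build the --caller-id fragment; no flag simulates an unknown caller."""
    if number is None:
        return []
    if not is_e164(number):
        raise ValueError(f"not an E.164 number: {number!r}")
    return ["--caller-id", number]
```

Passing the result as list items to a subprocess call, rather than interpolating into a shell string, avoids quoting problems with the leading `+`.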

In the playgrounds, the caller number is saved per mode (voice and text are stored independently) and restored on your next visit. Leaving the field blank omits the caller identity, which causes the engine to use its default caller with no patient match.

Tool testing runs against the live world model but with safety guardrails:

* **Source isolation** - All writes from tool tests are tagged with a dedicated test source, which is excluded from outbound sync. No real EHR writes, SMS deliveries, or external side effects happen.
* **Surface delivery blocking** - Surfaces created during tool tests are blocked from delivery to patients.
* **Dry run mode** - Write tools can be executed in dry run mode, which simulates the operation and reports what would have happened without persisting anything.
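
To make the guardrails concrete, here is a toy model of source isolation and dry run mode. Everything in it is hypothetical (the `tool-test` tag value, the in-memory `store`, and both function names); it is meant only to show how tagging writes and filtering on that tag keeps test data out of outbound sync, not how the platform implements it.

```python
from dataclasses import dataclass, field

TEST_SOURCE = "tool-test"  # hypothetical tag value


@dataclass
class WriteResult:
    persisted: bool
    source: str
    blocked_side_effects: list[str] = field(default_factory=list)


def run_write_tool(store: dict, key: str, value: str, dry_run: bool) -> WriteResult:
    """Execute a write under test guardrails against an in-memory store."""
    if dry_run:
        # Report what would have happened without touching the store.
        return WriteResult(False, TEST_SOURCE, [f"write {key}"])
    # Real test writes persist, but carry the dedicated test source tag.
    store[key] = {"value": value, "source": TEST_SOURCE}
    return WriteResult(True, TEST_SOURCE, ["outbound sync"])


def outbound_sync_batch(store: dict) -> list[str]:
    """Keys eligible for outbound sync: anything not written by a tool test."""
    return [k for k, rec in store.items() if rec["source"] != TEST_SOURCE]
```

In this model a dry run leaves the store untouched, while a normal test write is visible locally but never appears in the sync batch.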

Simulation and playground sessions also produce visible entity state in the world model. Events from these sessions are included in entity state projections so that testing workflows reflect the same data the agent would see in production. Other analytical pipelines (metrics, encounter detection, gap detection) continue to filter these events out, keeping production analytics clean.

For each tool, the playground resolves the full tool definition (input schema, description, tier, write/read classification) and lets you provide custom input parameters and an optional entity context. After execution, you see the result, execution duration, any sub-tool calls that were made, and a list of blocked side effects.

Tool testing answers the question: **does this tool do what I expect with this input?**

{% hint style="info" %}
**Permissions**: Tool testing requires admin or owner access to the workspace. This is a developer workflow, not a production testing mechanism - use simulations for pre-deployment validation.
{% endhint %}

## Drift Detection

Drift detection tracks agent performance over time and alerts you when quality degrades. The platform monitors metric trends across conversation cohorts and compares actual behavior distributions against expected baselines. When drift is detected, it can trigger alerts, block promotions, or initiate automatic rollbacks.
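
The two comparisons mentioned above, metric trends and behavior distributions, can each be illustrated with a simple check. The functions and the default tolerance below are hypothetical stand-ins: the platform's actual statistics are not described on this page, so a mean-drop test and total variation distance are used here purely as familiar examples.

```python
from collections import Counter
from statistics import mean


def metric_dropped(baseline: list[float], recent: list[float], max_drop: float = 0.05) -> bool:
    """Flag drift when the recent mean falls more than max_drop below baseline."""
    if not baseline or not recent:
        raise ValueError("both cohorts need at least one score")
    return mean(baseline) - mean(recent) > max_drop


def outcome_shift(expected: dict[str, float], observed: list[str]) -> float:
    """Total variation distance between expected and observed outcome shares."""
    if not observed:
        raise ValueError("observed outcomes must not be empty")
    counts = Counter(observed)
    total = len(observed)
    labels = set(expected) | set(counts)
    return 0.5 * sum(
        abs(expected.get(label, 0.0) - counts.get(label, 0) / total)
        for label in labels
    )
```

A shift of 0 means the observed outcome mix matches the baseline exactly, and 1 means the two share no outcomes; where to set an alert threshold between those is a policy decision.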

Drift detection answers the question: **is the agent getting worse?**

{% content-ref url="/pages/ogceEl19CRp2g6SKFaRb" %}
[Drift Detection](/testing/testing/drift-detection.md)
{% endcontent-ref %}

***

## How the Pillars Work Together

Simulations, metrics, and drift detection form a continuous loop:

1. **Before deployment**, simulations verify that the agent handles target scenarios correctly and meets metric thresholds.
2. **In production**, metrics evaluate every conversation to track ongoing quality.
3. **Over time**, drift detection watches for degradation and triggers re-evaluation when performance shifts.

When drift is detected, you update your simulations and re-verify before promoting changes. This creates a feedback cycle where testing improves alongside the agent.

{% hint style="info" %}
**For Developers**: See the [REST API reference](https://docs.amigo.ai/developer-guide/core-api/metrics) and [Simulations reference](https://docs.amigo.ai/developer-guide/core-api/simulations/) for endpoint details, request/response schemas, and SDK code examples.
{% endhint %}

