# Testing Overview

The platform provides a testing and evaluation framework for verifying agent behavior before deployment, measuring quality in production, and detecting degradation over time.

## Testing Philosophy

Healthcare workflows are long and multi-step. A single patient interaction might span 20 or more steps across scheduling, insurance verification, clinical documentation, EHR writeback, and outbound follow-up calls. These workflows touch multiple external systems, each with its own availability characteristics, rate limits, and failure modes.

Repeatedly testing these workflows against live systems is impractical: external dependencies are unreliable, test cycles become slow, and results are flaky. Developers should not need access to a live EHR or a working telephony stack to verify that their agent logic is correct.

The platform addresses this through three principles:

1. **Freeze the world model at a known state.** Simulations run against a snapshot of a fixed patient population with known data. The agent sees the same world every time, so test results are deterministic and reproducible.
2. **Test agent logic in isolation.** Complex multi-step reasoning, branching, escalation rules, and context graph transitions can all be validated without calling external systems. The simulation framework exercises the agent's decision-making, not the availability of downstream infrastructure.
3. **Validate integration boundaries independently.** EHR writes, FHIR calls, outbound calls, and other external interactions have contract-based guarantees at their integration boundaries. These contracts define expected request shapes and response structures, and can be validated separately from the agent logic that produces them.

This gives developers fast, reliable test cycles. You can run a full regression suite in minutes without provisioning test environments, managing credentials for external systems, or coordinating with IT teams that control EHR access.
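For example, the third principle can be exercised with an ordinary unit test. The sketch below assumes contracts are expressed as JSON Schema and uses an illustrative appointment-write payload; it is not the platform's actual contract format.

```python
# Minimal sketch of contract-based validation at an integration boundary.
# Assumes contracts are expressed as JSON Schema; the schema and payload
# below are illustrative, not the platform's actual contract format.
from jsonschema import validate, ValidationError

# Hypothetical contract for an EHR appointment-write request.
APPOINTMENT_WRITE_CONTRACT = {
    "type": "object",
    "required": ["patient_id", "start", "end", "reason"],
    "properties": {
        "patient_id": {"type": "string"},
        "start": {"type": "string", "format": "date-time"},
        "end": {"type": "string", "format": "date-time"},
        "reason": {"type": "string", "maxLength": 200},
    },
    "additionalProperties": False,
}

def test_appointment_write_matches_contract():
    # Request produced by the agent logic under test -- no live EHR involved.
    request = {
        "patient_id": "patient-123",
        "start": "2025-03-01T09:00:00Z",
        "end": "2025-03-01T09:30:00Z",
        "reason": "Annual wellness visit",
    }
    try:
        validate(instance=request, schema=APPOINTMENT_WRITE_CONTRACT)
    except ValidationError as err:
        raise AssertionError(f"Request violates integration contract: {err.message}")
```

Because the contract lives at the boundary, the same check can run in CI against simulated agent output and against the integration's own test suite, without either side needing the other to be online.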

{% @mermaid/diagram content="flowchart LR
S\["Simulations\n(pre-deployment)"] --> M\["Metrics\n(per-interaction scoring)"]
M --> D\["Drift Detection\n(trend analysis)"]
D -->|Degradation found| S
M -->|Quality data| Dash\["Dashboards +\nAlerts"]" %}

## Simulations

Simulations let you test your agent against synthetic users in controlled scenarios. You define personas (who the user is), scenarios (what situation they are in), and success criteria (what the agent should do). The platform runs these conversations automatically and scores the results.

Simulations answer the question: **does the agent handle this situation correctly?**
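The sketch below shows the shape of these three ingredients. The field names are hypothetical, not the platform's actual simulation schema; see the Simulations reference for that.

```python
# Illustrative shape of a simulation definition: a persona (who the synthetic
# user is), a scenario (what situation they are in), and success criteria
# (what the agent should do). Field names are hypothetical placeholders.
simulation = {
    "persona": {
        "name": "Maria",
        "age": 67,
        "traits": ["hard of hearing", "prefers morning appointments"],
    },
    "scenario": "Needs to reschedule a cardiology follow-up after a conflict",
    "success_criteria": [
        "Agent offers at least two alternative morning slots",
        "Agent confirms the new appointment back to the patient",
        "Agent escalates to a human if no slot is available within 14 days",
    ],
}
```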

{% content-ref url="testing/simulations" %}
[simulations](https://docs.amigo.ai/testing/testing/simulations)
{% endcontent-ref %}

## Metrics

Metrics measure the quality of agent conversations across dimensions that matter to your organization. You configure metrics for safety, clinical accuracy, empathy, goal completion, and any other dimension relevant to your use case. Metrics can be evaluated automatically after every session, during simulation runs, or through manual human review.

Metrics answer the question: **how well is the agent performing?**
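As a rough illustration, a metric pairs a quality dimension with a scoring rubric and an evaluation mode. The structure below is hypothetical; the Metrics reference describes the actual configuration.

```python
# Illustrative metric definitions across two dimensions. The structure is a
# hypothetical sketch, not the platform's real configuration format.
metrics = [
    {
        "name": "clinical_accuracy",
        "rubric": "Medication names, dosages, and instructions match the care plan",
        "scale": [1, 5],
        "evaluation": "automatic",       # scored after every session
    },
    {
        "name": "empathy",
        "rubric": "Agent acknowledges patient concerns before problem-solving",
        "scale": [1, 5],
        "evaluation": "manual_review",   # sampled for human review
    },
]
```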

{% content-ref url="testing/metrics" %}
[metrics](https://docs.amigo.ai/testing/testing/metrics)
{% endcontent-ref %}

## Voice Simulation

Voice simulation (VoiceSim) evaluates how changes to voice configuration parameters affect call quality. VoiceSim runs configurations across scenarios covering normal conversations, crisis situations, barge-ins, silence, and speech recognition failures, then scores results to identify optimal settings.

VoiceSim answers the question: **what voice configuration works best for this scenario?**
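Conceptually, this is a parameter sweep: each candidate configuration is run against each scenario and the results are scored. The sketch below illustrates the idea; the parameter names, scenario labels, and `run_voice_simulation()` helper are placeholders, not the real VoiceSim interface.

```python
# Sketch of the kind of sweep VoiceSim performs: run every candidate voice
# configuration against every scenario, then compare mean scores.
from itertools import product

configs = [
    {"interruption_sensitivity": s, "silence_timeout_s": t}
    for s, t in product([0.3, 0.6], [2.0, 4.0])
]
scenarios = ["normal_conversation", "crisis", "barge_in", "long_silence", "asr_failure"]

def run_voice_simulation(config, scenario):
    # Placeholder: in practice the platform runs the call and scores it.
    return 0.0

best_config, best_score = max(
    (
        (config, sum(run_voice_simulation(config, s) for s in scenarios) / len(scenarios))
        for config in configs
    ),
    key=lambda pair: pair[1],
)
print("Best configuration:", best_config, "mean score:", best_score)
```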

{% content-ref url="testing/voice-simulation" %}
[voice-simulation](https://docs.amigo.ai/testing/testing/voice-simulation)
{% endcontent-ref %}

## Conversation Quality Check

The `forge quality check` command scans production conversations against behavioral detectors that catch agent issues automated metrics may miss: stuck loops where the agent repeats itself, character degeneration, repetitive patterns, incoherent output, and phantom success (the agent claims a tool call succeeded when it actually failed). Quality checks query conversation data directly and report findings with severity, timestamps, and optional message excerpts. See [Agent Forge CLI](https://docs.amigo.ai/reference/agent-forge#conversation-quality-check) for usage.
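To make the idea concrete, the sketch below shows a simplified heuristic for one such detector: a stuck loop where the agent sends near-identical messages in a row. It is illustrative only, not the implementation behind `forge quality check`.

```python
# Illustrative heuristic for one behavior quality checks look for: a stuck
# loop of near-identical agent messages. Simplified sketch, not the real detector.
from difflib import SequenceMatcher

def detect_stuck_loop(agent_messages, similarity=0.9, window=3):
    """Flag runs of `window` consecutive agent messages that are nearly identical."""
    findings = []
    for i in range(len(agent_messages) - window + 1):
        run = agent_messages[i : i + window]
        if all(
            SequenceMatcher(None, run[0], other).ratio() >= similarity
            for other in run[1:]
        ):
            findings.append({"start_index": i, "severity": "high", "excerpt": run[0][:80]})
    return findings
```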

## Tool Testing

The tool testing playground lets you execute individual context graph tools - world tools, skills, and integrations - in isolation without making a phone call or starting a text session. This is useful during development when you need to verify that a tool behaves correctly before wiring it into a full conversation flow.

Tool testing runs against the live world model but with safety guardrails:

* **Source isolation** - All writes from tool tests are tagged with a dedicated test source, which is excluded from outbound sync. No real EHR writes, SMS deliveries, or external side effects happen.
* **Surface delivery blocking** - Surfaces created during tool tests are blocked from delivery to patients.
* **Dry run mode** - Write tools can be executed in dry run mode, which simulates the operation and reports what would have happened without persisting anything.

For each tool, the playground resolves the full tool definition (input schema, description, tier, write/read classification) and lets you provide custom input parameters and an optional entity context. After execution, you see the result, execution duration, any sub-tool calls that were made, and a list of blocked side effects.

Tool testing answers the question: **does this tool do what I expect with this input?**
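A tool test might look roughly like the request below. The endpoint path, payload fields, and response keys are hypothetical placeholders; consult the API reference for the real tool-testing interface.

```python
# Hedged sketch of exercising a single tool in isolation. URL, fields, and
# response keys are hypothetical placeholders, not the platform's actual API.
import requests

response = requests.post(
    "https://api.example.com/v1/tool-tests",        # placeholder URL
    headers={"Authorization": "Bearer <token>"},
    json={
        "tool": "ehr.write_appointment",             # tool under test
        "input": {"patient_id": "patient-123", "start": "2025-03-01T09:00:00Z"},
        "entity_context": {"organization": "clinic-42"},
        "dry_run": True,                              # simulate the write, persist nothing
    },
    timeout=30,
)
result = response.json()
print(result["result"], result["duration_ms"], result["blocked_side_effects"])
```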

{% hint style="info" %}
**Permissions**: Tool testing requires admin or owner access to the workspace. This is a developer workflow, not a production testing mechanism - use simulations for pre-deployment validation.
{% endhint %}

## Drift Detection

Drift detection tracks agent performance over time and alerts you when quality degrades. The platform monitors metric trends across conversation cohorts and compares actual behavior distributions against expected baselines. When drift is detected, it can trigger alerts, block promotions, or initiate automatic rollbacks.

Drift detection answers the question: **is the agent getting worse?**
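One simple way to picture this: compare the recent distribution of a metric against a frozen baseline with a statistical test. The sketch below uses a two-sample Kolmogorov-Smirnov test with illustrative data and thresholds; it is not the platform's actual drift algorithm.

```python
# Sketch of flagging drift by comparing a recent metric cohort against a
# frozen baseline. Data and threshold are illustrative only.
from scipy.stats import ks_2samp

baseline_scores = [4.6, 4.8, 4.5, 4.7, 4.9, 4.6, 4.8]   # e.g., clinical_accuracy at launch
recent_scores = [4.1, 4.0, 4.3, 3.9, 4.2, 4.1, 4.0]     # same metric, latest cohort

statistic, p_value = ks_2samp(baseline_scores, recent_scores)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={statistic:.2f}); trigger re-evaluation")
```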

{% content-ref url="testing/drift-detection" %}
[drift-detection](https://docs.amigo.ai/testing/testing/drift-detection)
{% endcontent-ref %}

***

## How the Pillars Work Together

Simulations, metrics, and drift detection form a continuous loop:

1. **Before deployment**, simulations verify that the agent handles target scenarios correctly and meets metric thresholds.
2. **In production**, metrics evaluate every conversation to track ongoing quality.
3. **Over time**, drift detection watches for degradation and triggers re-evaluation when performance shifts.

When drift is detected, you update your simulations and re-verify before promoting changes. This creates a feedback cycle where testing improves alongside the agent.

{% hint style="info" %}
**For Developers**: See the [REST API reference](https://docs.amigo.ai/developer-guide/core-api/metrics) and [Simulations reference](https://docs.amigo.ai/developer-guide/core-api/simulations/) for endpoint details, request/response schemas, and SDK code examples.
{% endhint %}
