# Clinical Verification

Before an AI agent interacts with patients, you need to verify that it behaves correctly in clinical scenarios. This guide covers how to use Amigo's testing framework to verify clinical accuracy, establish quality gates, and build human review into your deployment process.

## Why Clinical Verification Matters

Healthcare AI operates under constraints that most other domains do not have. A missed safety escalation can harm a patient. A scope-of-practice violation can create liability. A subtle bias in how the agent handles certain populations can produce inequitable care.

Generic testing is not sufficient. You need to test against your specific clinical workflows, patient populations, and safety standards. A model that scores well on general medical knowledge benchmarks may still fail to follow your organization's escalation protocols correctly.

## Simulation with Medical Scenarios

Build simulations that reflect the clinical situations your agent will encounter. Start with your highest-risk workflows and expand from there.

### Designing Clinical Personas

Clinical personas should represent the diversity of your patient population, including the patients who are hardest to serve well:

* **Complex medication regimens**: Patients on multiple medications where interaction detection is critical
* **Cognitive limitations**: Elderly patients or those with cognitive impairment who may provide unreliable information
* **Communication barriers**: Patients with limited health literacy, non-native speakers, or patients who minimize symptoms
* **High acuity**: Patients presenting with symptoms that require urgent escalation

Each persona tests a specific capability. A persona with cognitive impairment tests whether the agent detects confusion and adapts its communication. A persona who downplays symptoms tests whether the agent probes further when clinical indicators suggest concern.

### Building Clinical Scenarios

Pair personas with scenarios that test the boundaries of safe agent behavior:

* **Routine follow-up that reveals a problem**: A standard post-discharge check-in where the patient casually mentions a new symptom that requires escalation
* **Conflicting information**: The patient's self-report contradicts EHR data, and the agent must handle the discrepancy appropriately
* **Scope boundary**: The patient asks a question that falls outside the agent's defined scope of practice
* **Emotional distress**: The patient becomes upset, anxious, or frustrated during a clinical interaction

{% hint style="warning" %}
Do not only test happy paths. The scenarios that matter most for clinical safety are the ones where things go sideways: the patient who lies about taking their medication, the patient whose symptoms escalate mid-conversation, or the patient who insists on advice the agent should not give.
{% endhint %}

## Metric-Based Quality Gates

Define metrics that serve as hard gates on deployment. These metrics must be met before any agent configuration reaches production.

### Safety Metrics (Hard Gates)

| Metric                       | Target | Notes                                                               |
| ---------------------------- | ------ | ------------------------------------------------------------------- |
| Escalation accuracy          | 100%   | Agent correctly identifies situations requiring clinical escalation |
| Scope-of-practice adherence  | 100%   | Agent never provides advice outside its defined boundaries          |
| Privacy protocol compliance  | 100%   | Agent follows all PHI handling requirements                         |
| Medical information accuracy | 99.5%+ | Factual correctness of clinical information provided                |
| Risk disclosure completeness | 99%+   | Agent discloses relevant risks when appropriate                     |

Safety metrics are non-negotiable. A single failure in escalation accuracy means the configuration does not deploy. There is no acceptable tradeoff between safety and other dimensions.

### Quality Metrics (Improvement Targets)

| Metric                | Target | Notes                                                        |
| --------------------- | ------ | ------------------------------------------------------------ |
| Explanation clarity   | 90%+   | Information presented in language appropriate to the patient |
| Empathy score         | 85%+   | Agent demonstrates appropriate emotional support             |
| Response completeness | 90%+   | Agent fully addresses the patient's question or concern      |
| Goal completion       | 85%+   | Agent accomplishes the intended purpose of the interaction   |

Quality metrics guide improvement but do not block deployment on their own. A configuration that passes all safety gates but falls slightly below an empathy target may still be appropriate to deploy while improvement work continues.

## Human Review Workflows

Automated metrics do not catch everything. Human review adds a layer of clinical judgment that automated evaluation cannot replicate.

### When to Use Human Review

* **Initial deployment**: Have clinical staff review a meaningful sample of conversations before and during early production use.
* **After configuration changes**: Review conversations from the first few days after any update to agent behavior, context graphs, or dynamic behaviors.
* **Flagged conversations**: Configure the platform to flag conversations where metric scores are borderline or where the agent's confidence was low. Route these for human review.
* **Ongoing sampling**: Regularly review a random sample of production conversations to catch issues that metrics and drift detection may miss.

### Structuring Reviews

Provide reviewers with clear rubrics aligned to your metrics. Reviewers should assess:

* Did the agent stay within its scope of practice?
* Were escalation decisions appropriate?
* Was the clinical information accurate and complete?
* Was the communication appropriate for the patient's situation?
* Were there any missed opportunities or concerns?

Compare human review scores against automated metric scores. If they diverge, update your automated metrics to better capture what human reviewers are catching.

## Putting It Together

Clinical verification is not a one-time event. It is an ongoing process that runs in parallel with deployment:

1. **Before deployment**: Run clinical simulation suites. Pass all safety gates. Complete initial human review.
2. **During early deployment**: Monitor metrics daily. Review flagged conversations. Expand simulation coverage based on production patterns.
3. **In steady state**: Track metrics across cohorts. Detect drift. Update simulations when clinical workflows or guidelines change. Maintain ongoing human review sampling.

{% hint style="info" %}
For detailed guidance on phased deployment with quality gates at each stage, see [Simulations](https://docs.amigo.ai/testing/testing/simulations). For the testing framework details, see [Testing and Evaluation](https://docs.amigo.ai/testing/testing).
{% endhint %}
