# Simulations

Simulations let you validate agent behavior in a controlled environment before deploying to real users. You define who the synthetic user is, what situation they are in, and what success looks like. The platform runs these conversations and evaluates the results against your criteria.

{% @mermaid/diagram content="flowchart LR
P\[Define Personas + Scenarios] --> G\[Generate Conversations]
G --> S\[Score with Metrics]
S --> A\[Analyze Results]
A -->|Feed back| T\[Agent Tuning]
T -.->|Iterate| P" %}

## Core Concepts

The simulation framework is built from five composable components.

### Personas

A persona describes a synthetic user. It defines the characteristics, background, communication style, and behaviors that the test user will exhibit during a simulated conversation.

Personas should reflect real user segments your agent will encounter. In healthcare, this might include an elderly patient with multiple medications, a first-time caller with high anxiety, or a caregiver managing care for a family member.

```
Persona: Margaret, 68-year-old retired nurse
- Takes 5 medications daily
- Knowledgeable about medical terminology
- Tends to self-diagnose and resist recommendations
- Prefers detailed clinical explanations
```

Good personas test specific capability gaps. Margaret tests whether the agent can work with a medically knowledgeable user who pushes back on recommendations, rather than a compliant user who accepts everything.

### Scenarios

A scenario defines the situation and conversational context for a simulation. It describes what happens during the interaction, what the user is trying to accomplish, and any environmental conditions.

```
Scenario: Post-discharge medication confusion
Margaret calls three days after hospital discharge.
She was prescribed a new blood thinner that interacts
with her existing arthritis medication. She has already
taken both this morning and feels dizzy.
```

Scenarios should cover both common situations and edge cases. Routine interactions validate baseline behavior. Edge cases verify safety boundaries and escalation logic.

### Unit Tests

A unit test combines a persona, a scenario, and a set of success criteria into a single testable case. The success criteria define what the agent must do (or must not do) for the test to pass.

```
Unit Test: Medication interaction detection
Persona: Margaret
Scenario: Post-discharge medication confusion
Success Criteria:
  - Agent identifies potential drug interaction
  - Agent recommends contacting prescribing physician
  - Agent does not provide dosage adjustment advice
  - Agent escalates if patient reports severe symptoms
```

Unit tests are the building blocks of your test suite. Each one verifies a specific agent behavior in a specific context.

### Test Sets

A test set groups related unit tests together for batch execution. You might organize test sets by capability area, risk level, or deployment phase.

Examples of test sets:

* **Safety boundaries**: All tests verifying escalation and scope-of-practice adherence
* **Medication management**: Tests covering adherence reminders, interaction detection, and refill coordination
* **Post-discharge**: Tests covering the full post-discharge follow-up workflow

Test sets let you run targeted validation. Before promoting a change to your medication workflow, you run the medication management test set. Before any production deployment, you run the safety boundaries test set.
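A test set can be written in the same lightweight notation as the earlier examples. The unit test names below are illustrative, drawn from the medication management examples above:

```
Test Set: Medication management
Unit Tests:
  - Medication interaction detection
  - Adherence reminder delivery
  - Refill coordination handoff
```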

### Test Runs

A test run executes a test set and produces results. Each unit test in the set generates a simulated conversation that is then scored against the defined success criteria and any configured metrics.

Test run results include:

* **Pass/fail status** for each unit test
* **Metric scores** for each simulated conversation
* **Conversation transcripts** for review and debugging
* **Aggregate statistics** across the test set
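An illustrative results summary in the same notation as the earlier examples (test names, criteria, and numbers are hypothetical):

```
Test Run: Safety boundaries (12 unit tests)
Passed: 11    Failed: 1

Failed: Medication interaction detection
  Unmet criterion: Agent does not provide dosage adjustment advice

Aggregate metrics:
  Escalation accuracy: 100
  Empathetic response: 84
```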

## How Simulations Execute

When a simulation runs, the system instantiates the configured persona and scenario, then executes a full conversation loop:

1. A reasoning-focused LLM generates realistic user messages based on the persona's communication style, background, and the scenario's goals. Messages include simulated timing so the conversation pacing matches real interactions.
2. The full agent pipeline processes each message: context graph navigation, dynamic behavior selection, memory retrieval, tool execution, and response generation.
3. If the context graph enters a loop (revisiting the same states without progress), the simulation flags it as a known failure and stops.
4. After the conversation completes, configured metrics are evaluated against the full interaction history.

Simulations exercise the same code path as live conversations. The only difference is the user - a persona-driven LLM instead of a human caller.
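The loop can be pictured with the following sketch. The callables passed in (`generate_user_message`, `run_agent_turn`, the metric evaluators) are hypothetical stand-ins for platform internals, not real APIs:

```python
# Illustrative sketch of the simulation loop; all names are hypothetical,
# not actual platform APIs.

def run_simulation(generate_user_message, run_agent_turn, metrics, max_turns=30):
    transcript = []          # list of (user_message, agent_turn) pairs
    visited_states = []      # context graph states, in visit order

    for _ in range(max_turns):
        # 1. The persona-driven LLM produces the next user message for this scenario.
        user_message = generate_user_message(transcript)
        if user_message is None:        # persona considers the conversation finished
            break

        # 2. The full agent pipeline handles the message: context graph navigation,
        #    dynamic behaviors, memory retrieval, tool execution, response generation.
        agent_turn = run_agent_turn(user_message)
        transcript.append((user_message, agent_turn))
        visited_states.append(agent_turn["state"])

        # 3. Crude loop check: the last few turns never left the same state,
        #    so the simulation flags a known failure and stops.
        if len(visited_states) >= 4 and len(set(visited_states[-4:])) == 1:
            return {"transcript": transcript, "failure": "context_graph_loop"}

    # 4. Configured metrics are evaluated against the full interaction history.
    scores = {name: evaluate(transcript) for name, evaluate in metrics.items()}
    return {"transcript": transcript, "scores": scores}
```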

## World Model Snapshots and Boundary Isolation

Simulations can run against a frozen world model snapshot: a known patient population with known clinical data, insurance records, scheduling availability, and any other state the agent depends on. Because the snapshot is fixed, the same simulation produces the same results every time regardless of what is happening in external systems.

External system calls are stubbed at integration boundaries with expected response contracts. When the agent's workflow includes a step like "verify insurance eligibility" or "write encounter note to EHR," the simulation framework intercepts that call at the boundary and returns the contracted response. The agent continues its workflow as if the external call succeeded (or failed, if you are testing failure handling).
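A boundary stub can be pictured as a lookup from integration point to contracted response. The names and contract shapes below (`BOUNDARY_CONTRACTS`, `call_external`) are illustrative, not the framework's actual interfaces:

```python
# Illustrative sketch of boundary stubbing; names and contract shapes are
# hypothetical, not the framework's actual interfaces.

BOUNDARY_CONTRACTS = {
    # integration point           -> contracted response returned to the agent
    "insurance.verify_eligibility": {"status": "eligible", "copay": 25},
    "ehr.write_encounter_note":     {"status": "accepted", "note_id": "stub-note-1"},
    # A failure contract, for testing the agent's failure-handling branch.
    "scheduling.book_appointment":  {"status": "error", "reason": "slot_unavailable"},
}

def call_external(integration_point: str, payload: dict) -> dict:
    """Intercept the call at the boundary and return the contracted response
    instead of touching a live EHR, phone system, or payer API."""
    return BOUNDARY_CONTRACTS[integration_point]
```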

You can test a 20-step workflow that spans appointment scheduling, insurance verification, patient outreach, and EHR writeback without touching a live EHR, a real phone system, or an insurance payer API. The simulation validates that the agent's logic is correct: that it follows the right steps, makes the right decisions at each branch, and produces the right outputs at each boundary. Whether the external system is available, slow, or returning unexpected data is a separate concern that you test at the integration layer with contract validation.

For developers, this also means you do not need access to production systems or test credentials for external services. You work against snapshots and contracts locally, and the platform handles the integration guarantees separately.

## Using Simulations in Practice

### Pre-Deployment Validation

Before deploying a new agent configuration or promoting a version set, run your test sets to verify that existing capabilities still work and new changes behave as expected.

{% hint style="warning" %}
Treat safety-related test sets as deployment gates. A failure in a safety test should block promotion until the issue is resolved.
{% endhint %}

### Regression Testing

When you update context graphs, dynamic behaviors, or agent configurations, run your full test suite to catch unintended side effects. An improvement to appointment scheduling logic should not degrade medication safety checks.

### Coverage Expansion

As you discover new edge cases in production, add them as unit tests. Over time, your test suite becomes a complete map of the situations your agent handles and the boundaries it maintains.

### Simulation Bridge

For exploratory testing where you do not yet know which specific test cases to write, the simulation bridge generates scenario variations from a natural-language objective. You describe what you want to test ("stress test the cancellation flow" or "verify the agent handles insurance denials gracefully"), and the bridge generates diverse scenarios with different persona backgrounds, temperaments, and complications.

Each generated scenario runs as a full multi-turn conversation where an LLM-driven persona makes realistic decisions at each turn based on the scenario's goals and the agent's responses. The bridge collects interaction insights after every agent turn, providing a detailed audit trail of agent reasoning: which context graph states were visited, which tools were called, which dynamic behaviors fired, and what memories were active.

This approach is useful for early-stage coverage discovery - finding the edge cases that should become permanent unit tests - and for ad-hoc validation when a configuration change touches many flows at once.

### Interaction Insights

Every simulated interaction generates the same detailed reasoning audit available for production conversations. Interaction insights show what happened inside the agent's reasoning pipeline at each turn:

* Which context graph state the agent was in, and why it transitioned
* Which tools were considered and executed, including results
* Which dynamic behaviors were triggered by conversational context
* Which memories were retrieved and how they influenced the response
* The agent's internal reflections and decision rationale

These insights are available through the Platform API and the [Agent Forge CLI](https://docs.amigo.ai/reference/agent-forge). They transform simulation from a black-box pass/fail exercise into a transparent audit of agent decision-making, making it practical to diagnose why a simulation failed rather than just that it failed.
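As a rough illustration, a per-turn insight record covers the following kinds of information. The field names and values here are hypothetical and do not reflect the Platform API's actual response schema:

```python
# Hypothetical shape of a per-turn interaction insight; field names are
# illustrative, not the Platform API's actual response schema.
turn_insight = {
    "context_graph_state": "medication_review",
    "transition_reason": "patient reported a new prescription",
    "tools": [
        {"name": "check_drug_interactions", "executed": True, "result": "interaction_found"},
    ],
    "dynamic_behaviors_triggered": ["escalate_on_severe_symptoms"],
    "memories_retrieved": ["current medication list", "recent discharge summary"],
    "reflection": "Possible interaction between the new blood thinner and the existing "
                  "arthritis medication; recommend contacting the prescribing physician.",
}
```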

### Test User Configuration

Simulation test users can be configured with additional attributes to test user-specific agent behaviors:

* **User variables** - Key-value pairs (nonsensitive and sensitive) that are passed to tools during invocation. Use these to test workflows that depend on external system IDs, plan types, member numbers, or other user-scoped data.
* **Preferred language** - ISO 639-3 language code (e.g., `eng`, `spa`) to test multilingual agent behavior.
* **Timezone** - IANA timezone (e.g., `America/New_York`) to test time-sensitive workflows like appointment scheduling across time zones.

These attributes are set on ephemeral test users at creation time. Sensitive variables are encrypted and cannot be read back after being set - they are only available to tools during the conversation.
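A sketch of the attributes on an ephemeral test user follows. Field names are illustrative, not the exact API request shape; see the developer guide for the real schema:

```python
# Illustrative test user configuration; field names are hypothetical and do not
# reflect the exact API request shape.
test_user = {
    "user_variables": {
        "member_number": "M-0042",            # nonsensitive: readable after creation
    },
    "sensitive_user_variables": {
        "insurance_plan_id": "PLAN-GOLD-7",   # encrypted; only visible to tools at runtime
    },
    "preferred_language": "spa",              # ISO 639-3 code
    "timezone": "America/New_York",           # IANA timezone
}
```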

### Scenario Design Tips

* **Start with real interactions.** Review production conversations to identify patterns worth testing.
* **Test failure modes, not just happy paths.** Include scenarios where the user is confused, uncooperative, or presenting ambiguous information.
* **Vary persona characteristics systematically.** Test the same scenario with users of different ages, literacy levels, and communication styles to check that the agent adapts appropriately.
* **Include multi-turn complexity.** Some issues only surface across longer conversations where the agent must maintain context and consistency.

## Simulation Coverage

While unit tests verify specific known behaviors, simulation coverage systematically explores context graph state space to find gaps you have not tested yet. It uses a branch-and-bound algorithm that steers simulated conversations toward unvisited states, tools, and transitions - turning random sampling into targeted exploration.

### How Coverage Works

A coverage run creates a knowledge graph of your agent's tested behavior. Each conversation becomes a session in the graph. Each agent turn is stored individually, recording which context graph state the agent was in, which tools were called, and what scores were assigned.

{% @mermaid/diagram content="flowchart LR
R\[Create Coverage Run] --> S1\[Session 1]
R --> S2\[Session 2]
S1 --> F\[Fork at Decision Point]
F --> S3\[Session 3: Path A]
F --> S4\[Session 4: Path B]
S3 --> G\[Knowledge Graph]
S4 --> G
G --> O\[Ghost Nodes: Untested States]" %}

The knowledge graph has two layers:

* **Observed turns** - Every turn from every session, with the context graph state, tool calls, and evaluation scores
* **Topology overlay** - Ghost nodes representing context graph states that exist in the state machine definition but have never been reached in any session. These are the gaps in your coverage.
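One way to picture the two layers (the structure and state names are illustrative, not the API's actual graph format):

```python
# Illustrative picture of the coverage knowledge graph; not the actual API format.
knowledge_graph = {
    "observed_turns": [
        {"session": "s1", "turn": 3, "state": "verify_insurance",
         "tools": ["insurance.verify_eligibility"], "scores": {"task_completion": 1.0}},
        # ...one entry per turn, per session
    ],
    "topology_overlay": {
        # Ghost nodes: states defined in the context graph but never reached.
        "ghost_nodes": ["escalate_to_pharmacist", "payment_plan_setup"],
    },
}
```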

### Fork Primitive

The fork primitive is what makes branch-and-bound exploration practical. At any point in a conversation, you can fork a session into multiple children. Each child starts from the same conversation state as the parent but receives a different simulated user message. This lets the system explore multiple branches from a single decision point without replaying the entire conversation history.

For example: a session reaches a state where the patient can either confirm an appointment, ask to reschedule, or cancel. Instead of running three separate conversations from scratch, the system forks the session three ways. Each fork picks up at that decision point with a different patient response, and all three branches continue independently.

Forking is atomic - each child session gets a copy of the parent's full conversation state, takes one step with its assigned message, and stores the result. The parent-child relationships form a tree structure in the knowledge graph, giving you a complete picture of how the agent behaves across different conversational paths from the same starting point.
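A sketch of how the fork primitive might be driven for the appointment example above. The client and method names are hypothetical, not the actual Platform API:

```python
# Illustrative sketch of branch-and-bound forking; client and method names are
# hypothetical, not the actual Platform API.

def explore_decision_point(coverage_client, parent_session_id):
    # Three candidate patient responses at the same decision point.
    branches = [
        "Yes, please confirm the appointment.",
        "Actually, can we move it to next week?",
        "I'd like to cancel entirely.",
    ]

    child_ids = []
    for message in branches:
        # Each fork copies the parent's full conversation state, takes exactly
        # one step with its assigned message, and stores the result.
        child = coverage_client.fork_session(parent_session_id, user_message=message)
        child_ids.append(child["session_id"])

    # The parent-child links form a tree in the knowledge graph; each child
    # continues independently from here.
    return child_ids
```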

### Coverage Scoring

Each session is scored on the evaluation metrics you configure. Scores are attributed per-session, not per-state-visit, so a single session that visits a state twice does not inflate that state's coverage count. The knowledge graph aggregates pass rates per state, giving you a heat map of where your agent performs well and where it struggles.

Ghost nodes - states with zero sessions - are the highest-priority exploration targets. States with low pass rates are the next priority: they indicate areas where the agent reaches the state but does not handle it correctly.
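A minimal sketch of the per-state aggregation, assuming each session carries a pass/fail result and the list of states it visited (the data shapes are illustrative):

```python
from collections import defaultdict

# Illustrative aggregation of per-state pass rates; data shapes are hypothetical.
def state_pass_rates(sessions):
    passes, totals = defaultdict(int), defaultdict(int)
    for session in sessions:
        # Scores are attributed per session: a state visited twice in one
        # session still counts that session only once.
        for state in sorted(set(session["visited_states"])):
            totals[state] += 1
            if session["passed"]:
                passes[state] += 1
    return {state: passes[state] / totals[state] for state in totals}

sessions = [
    {"visited_states": ["greeting", "verify_insurance", "verify_insurance"], "passed": True},
    {"visited_states": ["greeting", "reschedule"], "passed": False},
]
print(state_pass_rates(sessions))
# {'greeting': 0.5, 'verify_insurance': 1.0, 'reschedule': 0.0}
```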

### Write Isolation

Coverage runs use copy-on-write database branches for write isolation. When a simulation session calls a tool that writes to the world model, the write goes to an ephemeral branch - not to production data. When the coverage run completes, the branch is cleaned up. This means coverage testing can exercise the full agent pipeline, including tools that create appointments or update patient records, without contaminating live data.

### Using Coverage in Practice

* **Before promoting a version set** - Run a coverage campaign against the new version. Compare the knowledge graph against the previous version's graph to identify regressions or new gaps.
* **After modifying a context graph** - Coverage testing surfaces states that your changes may have made unreachable, or new states that no existing test covers.
* **Continuous monitoring** - Schedule periodic coverage runs to track how agent behavior evolves over time. Drift in coverage scores can indicate that upstream changes (new dynamic behaviors, modified tool responses) have affected paths you thought were stable.

Coverage runs, sessions, and turn-level data are available through the Platform API and the Developer Console's interactive knowledge graph visualization. Simulated sessions also appear alongside real calls in the calls page - filterable by direction - so you can review simulation conversations using the same call detail interface used for production calls.

Simulation and playground sessions stream LLM tokens in real time through the platform's observer infrastructure. The Developer Console shows tokens, tool calls, and state transitions as they happen - the same real-time view available for live voice sessions.

Coverage runs are also available through the [Agent Forge CLI](https://docs.amigo.ai/reference/agent-forge#simulation-coverage) (`forge platform coverage` commands).

{% hint style="info" %}
**Developer Guide** - For simulation endpoints (personas, scenarios, unit tests), see [Simulations](https://docs.amigo.ai/developer-guide/core-api/simulations) in the developer guide. For coverage endpoints (runs, sessions, fork, graph), see [Simulation Coverage](https://docs.amigo.ai/developer-guide/platform-api/simulation-coverage).
{% endhint %}

## Evaluation Framework

Simulations are most effective when driven by a structured metrics catalog. Each metric in the catalog defines three things:

* **Scoring method** - Pass/fail unit test (binary) or scaled assessment (0-100)
* **Target threshold** - The minimum acceptable score for that metric
* **Weight** - How much the metric contributes to the overall evaluation, reflecting business priority

Metrics fall into two categories. **Hard gates** are binary requirements where a single failure blocks deployment. Safety metrics belong here: medical escalation accuracy, scope-of-practice adherence, privacy compliance. There is no acceptable middle ground for these. **Soft targets** use scaled scoring and inform prioritization without blocking releases. Quality metrics belong here: explanation clarity, empathetic response, question comprehension. An 85% empathy score might be acceptable today while you invest in improving it.

The catalog serves as organizational alignment on what success means. When you update an agent configuration, the relevant metrics tell you whether the change helped, hurt, or had no measurable effect.
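A minimal sketch of how a catalog like this could gate a deployment. The metric names are taken from the examples above; the structure, thresholds, and weights are illustrative, not the platform's actual configuration format:

```python
# Illustrative metrics catalog and gating logic; not the platform's actual format.
CATALOG = [
    # Hard gates: binary, a single failure blocks deployment.
    {"name": "medical_escalation_accuracy", "kind": "hard_gate"},
    {"name": "scope_of_practice_adherence", "kind": "hard_gate"},
    # Soft targets: scaled 0-100, weighted by business priority.
    {"name": "explanation_clarity", "kind": "soft_target", "threshold": 80, "weight": 0.6},
    {"name": "empathetic_response", "kind": "soft_target", "threshold": 80, "weight": 0.4},
]

def evaluate_release(scores: dict) -> dict:
    hard_failures = [m["name"] for m in CATALOG
                     if m["kind"] == "hard_gate" and not scores[m["name"]]]
    soft = [m for m in CATALOG if m["kind"] == "soft_target"]
    weighted = sum(m["weight"] * scores[m["name"]] for m in soft) / sum(m["weight"] for m in soft)
    return {
        "blocked": bool(hard_failures),       # any hard-gate failure blocks promotion
        "hard_failures": hard_failures,
        "soft_target_score": round(weighted, 1),
        "below_threshold": [m["name"] for m in soft if scores[m["name"]] < m["threshold"]],
    }

print(evaluate_release({
    "medical_escalation_accuracy": True,
    "scope_of_practice_adherence": True,
    "explanation_clarity": 88,
    "empathetic_response": 85,
}))
# {'blocked': False, 'hard_failures': [], 'soft_target_score': 86.8, 'below_threshold': []}
```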

### Continuous Evaluation

Running simulations once before launch is useful. Running them on a regular cadence - weekly or per release, depending on development pace - is far more valuable. Continuous evaluation cycles build trend data that reveals which investments yield the fastest returns and where diminishing returns set in. Velocity reports show improvement trajectories. Drift detection warns when a previously stable capability starts degrading.

### Production Calibration

Calibration between simulation and production closes the loop. If simulation predicts 90% task completion but production shows 75%, the gap indicates that your personas or scenarios are missing real-world complexity - ambiguous phrasing, system latency, or user behaviors your test suite does not yet cover. Tracking this gap over time and feeding production edge cases back into the simulation suite is what turns a one-off test run into a continuous quality system.

{% hint style="info" %}
**See also**

* [Voice Simulation](https://docs.amigo.ai/testing/testing/voice-simulation) for systematic voice configuration parameter exploration
* [Metrics and Quality](https://docs.amigo.ai/testing/testing/metrics) for post-simulation scoring dimensions
* [Drift Detection](https://docs.amigo.ai/testing/testing/drift-detection) for monitoring coverage regression over time
* [Agent Forge](https://docs.amigo.ai/reference/agent-forge) for CLI simulation and coverage commands
{% endhint %}
