# Voice Agent

The Amigo voice agent powers real-time, emotionally intelligent voice conversations. It handles inbound and outbound phone calls - executing context graph logic (based on a Hierarchical State Machine architecture) with speech understanding, text-to-speech, tool execution, safety monitoring, and continuous emotional adaptation. Every call connects to the [world model](https://docs.amigo.ai/developer-guide/platform-api/platform-api/data-world-model), reads live patient context, writes clinical events with multi-stage verification, and adapts its behavior based on real-time vocal emotion analysis.

{% hint style="warning" %}
**Reliability target:** This system handles healthcare scheduling calls where callers may be in distress, pain, or crisis. Every design decision prioritizes graceful degradation - if any intelligence layer fails, the call continues with the next-best behavior, never silence.
{% endhint %}

{% hint style="info" %}
**Voice settings and Classic API differences** - Voice settings (tone, speed, keyterms, sensitive topics, post-call flags) are configured at the workspace level; see [Workspaces - Voice Settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings). Classic API offers [WebSocket voice streaming](https://docs.amigo.ai/developer-guide/classic-api/core-api/conversations/conversations-voice) for text-based apps; Platform API voice is phone-based with emotion detection, EHR context, and operator escalation.
{% endhint %}

## Audio Pipeline Architecture

Every voice call flows through a five-layer pipeline that transforms the caller's audio into emotionally adaptive agent speech - while simultaneously reading from and writing to the world model.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Input["1. Signal Capture"]
A["Caller Audio\n(telephony stream)"]
A --> B["Speech-to-Text\n(streaming, sub-300ms)"]
A --> C["Emotion Detection\n(prosody + burst + language)"]
end

subgraph Intel["2. Intelligence Layer"]
D["Emotional State\n(rolling 30s window,\nrecency-weighted)"]
E["Voice Context\n(per-turn, pure function)"]
C --> D --> E
end

subgraph Engine["3. Context Graph Engine"]
F["Navigator\n(select action + filler)"]
G["Engage LLM\n(generate response)"]
B --> F --> G
E -->|"emotional steering\n+ filler guidelines"| F
E -->|"emotional steering\n+ micro-behaviors"| G
end

subgraph Output["4. Audio Output"]
H["Text-to-Speech\n(emotion-adaptive,\nword-level timestamps)"]
G --> H
E -->|"emotion + speed\n+ volume"| H
end

subgraph Post["5. Post-Call Intelligence"]
I["Transcript Verification\n(batch re-transcription)"]
J["Quality Analysis\n(5-dimension scoring)"]
JJ["STT Keyword Feedback\n(self-improving loop)"]
J --> JJ
end

subgraph World["World Model"]
K["Patient Context\n(ambient injection)"]
L["Verified Clinical Events\n(confidence-gated)"]
end

K -.->|"ambient context\n(3 channels)"| G
G -.->|"tool results\n(confidence 0.3 → review → 0.7)"| L" %}

### Layer 1: Signal Capture

Two streams process the caller's audio in parallel - neither blocks the other. This dual-stream architecture is fundamental: speech recognition and emotion detection are completely independent, and a failure in one never impacts the other.

**Speech-to-Text** - Real-time streaming transcription with sub-300ms latency. Three layers of domain vocabulary boost recognition accuracy:

1. **Service-level keyterms** - Managed by workspace administrators, applied to all calls for that service
2. **Workspace voice settings keyterms** - API-configurable per workspace (see [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings))
3. **System defaults** - Engineering-level fallback vocabulary

All sources are merged and deduplicated per call. Configurable end-of-turn detection with tunable confidence thresholds determines when the caller has finished speaking - balancing responsiveness against cutting off mid-sentence.
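
As a sketch of the merge step - field names here are illustrative, not the actual configuration schema (see the voice settings docs for the real shape):

```python
def merge_keyterms(
    service_keyterms: list[str],
    workspace_keyterms: list[str],
    default_keyterms: list[str],
) -> list[str]:
    """Merge the three keyterm sources for one call, deduplicating
    case-insensitively while preserving first-seen order and casing."""
    merged: dict[str, str] = {}
    for term in service_keyterms + workspace_keyterms + default_keyterms:
        key = term.strip().lower()
        if key and key not in merged:
            merged[key] = term.strip()
    return list(merged.values())
```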

**Emotion Detection** - Parallel audio analysis with zero impact on the voice pipeline. Three concurrent models analyze every audio segment on a single persistent connection:

| Model           | Input                   | Output                                                          | Unique Capabilities                                                                                      |
| --------------- | ----------------------- | --------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| **Prosody**     | 2-second audio segments | 48 emotions from vocal tone, pitch, rhythm, timbre              | Real-time voice quality analysis                                                                         |
| **Vocal Burst** | Same audio segments     | 67 non-speech vocal types (laughs, sighs, cries, gasps, groans) | Captures sounds that transcription loses entirely                                                        |
| **Language**    | Final STT transcripts   | 53 emotions + 9-point sentiment + 6-category toxicity           | Detects sarcasm, tiredness, annoyance, disapproval, enthusiasm - 5 emotions unavailable from audio alone |

**Dual-payload multiplexing**: Audio segments request prosody + burst models; text transcripts request the language model. Responses are unambiguous - each contains only the models requested. This separation is architecturally important: the language model requires text input, not audio, and runs on STT output rather than raw audio - ensuring it analyzes what the caller *said*, not just how they *sounded*.

**Audio buffering**: 2-second segments with non-blocking queues (max 5 segments for audio, max 20 for text). The emotion pipeline never blocks the voice pipeline, and dropped segments gracefully degrade to slightly less precise emotion detection rather than failure.
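
A minimal sketch of the non-blocking hand-off, with hypothetical queue names; the point is that a full queue drops the segment rather than delaying the voice path:

```python
import queue

audio_segments: "queue.Queue[bytes]" = queue.Queue(maxsize=5)  # 2s audio segments
text_payloads: "queue.Queue[str]" = queue.Queue(maxsize=20)    # final STT transcripts

def offer(q: queue.Queue, item) -> bool:
    """Non-blocking enqueue: if the emotion pipeline falls behind, drop the
    item instead of back-pressuring the voice path."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # graceful degradation: slightly coarser emotion tracking
```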

**Circuit breaker protection**: Emotion detection is protected by a circuit breaker (2 failures trigger a 10-second recovery). If the emotion service is degraded, calls continue smoothly with workspace defaults - the circuit breaker prevents cascading latency from affecting the critical voice path.
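
A sketch of the breaker pattern with the thresholds quoted above (2 failures, 10-second recovery); class and method names are illustrative:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows one probe
    (half-open) after `recovery_s` seconds."""

    def __init__(self, threshold: int = 2, recovery_s: float = 10.0):
        self.threshold = threshold
        self.recovery_s = recovery_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_s:
            return True  # half-open: let one probe request through
        return False     # open: skip emotion detection, use workspace defaults

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```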

### Layer 2: Intelligence

The **Emotional State** maintains a rolling 30-second window (\~15 segments) of recent caller signals with **recency-weighted linear averaging** - the most recent signals have the highest influence. The agent responds to the caller's *current* emotional state, not an average of the whole call.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Inputs["Signal Sources"]
P["Prosody\n(48 emotions)"]
B["Bursts\n(67 vocal types)"]
L["Language\n(53 emotions +\nsentiment + toxicity)"]
BH["Behavioral\n(barge-ins, silences,\nshort responses)"]
end

subgraph State["Emotional State (Rolling 30s Window)"]
V["Valence: -1.0 to +1.0"]
AR["Arousal: 0.0 to 1.0"]
DOM["Dominant emotion + score"]
TR["Trend: improving/stable/deteriorating"]
COH["Coherence: prosody vs language"]
BUR["Burst events (last 5s)"]
PH["Call phase: early/mid/late"]
end

subgraph Output["Per-Turn Voice Context"]
TTS["TTS: emotion + speed + volume"]
FG["Filler: guidelines + suppression"]
ES["Emotional steering → system prompt"]
FE["Filler enabled/disabled"]
end

P --> State
B --> State
L --> State
BH --> State
State --> Output" %}

**Valence/Arousal computation**: Every emotion maps to a (valence, arousal) coordinate via a complete emotion-dimension mapping. Weighted sums across all detected emotions per segment, then recency-weighted across the rolling window, produce stable yet responsive emotional tracking.

**Trend detection**: Compares first-half vs second-half valence of the rolling window. A delta > 0.1 → **improving**; delta < -0.1 → **deteriorating**; otherwise **stable**. This powers the call-phase escalation system - deteriorating trends trigger increasingly urgent adaptation.
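
A sketch of both computations, assuming per-segment valence values already derived from the emotion-dimension mapping; the exact weighting scheme in production may differ:

```python
def weighted_valence(segments: list[float]) -> float:
    """Recency-weighted linear average over the rolling window (~15 segments):
    weight 1 for the oldest segment up to N for the newest."""
    if not segments:
        return 0.0
    weights = range(1, len(segments) + 1)
    return sum(w * v for w, v in zip(weights, segments)) / sum(weights)

def trend(segments: list[float]) -> str:
    """Compare first-half vs second-half mean valence of the window."""
    if len(segments) < 2:
        return "stable"
    half = len(segments) // 2
    first, second = segments[:half], segments[half:]
    delta = sum(second) / len(second) - sum(first) / len(first)
    if delta > 0.1:
        return "improving"
    if delta < -0.1:
        return "deteriorating"
    return "stable"
```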

**Coherence (prosody vs language agreement)**: Measures whether what the caller *says* matches how they *sound*:

| Condition                                                | Coherence      | Agent Response                          |
| -------------------------------------------------------- | -------------- | --------------------------------------- |
| Same valence sign (both positive or both negative)       | High (0.7-1.0) | Normal adaptation                       |
| Opposite valence (words say fine, voice says distressed) | Low (0.0-0.3)  | **Trust the vocal tone over the words** |
| One signal neutral                                       | Mild (0.8)     | Use the available signal                |

**When coherence < 0.4**: The caller may be masking their true state. The agent's emotional steering instruction shifts: *"Respond to how they sound, not what they claim."* This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy.
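
A sketch of the scoring logic implied by the table - the representative values within each band and the neutral threshold are assumptions:

```python
def coherence(prosody_valence: float, language_valence: float,
              neutral_band: float = 0.1) -> float:
    """Prosody-vs-language agreement per the table above. Exact scores within
    each band are assumed; the bands themselves come from the table."""
    p_neutral = abs(prosody_valence) < neutral_band
    l_neutral = abs(language_valence) < neutral_band
    if p_neutral or l_neutral:
        return 0.8                      # one signal neutral -> mild
    if prosody_valence * language_valence > 0:
        return 1.0                      # same valence sign -> high agreement
    return 0.2                          # opposite sign -> trust the vocal tone
```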

**Behavioral signal tracking**: Updated in real-time from the session and turn controller:

| Signal                  | Detection                       | Threshold | Meaning                              |
| ----------------------- | ------------------------------- | --------- | ------------------------------------ |
| `barge_in_count`        | Caller interrupts agent speech  | ≥ 2       | Frustration - agent talking too much |
| `short_response_streak` | Consecutive responses ≤ 4 words | ≥ 3       | Disengagement - caller withdrawing   |
| `silence_gap_count`     | Gaps ≥ 5 seconds                | ≥ 2       | Confusion, hesitation, or distress   |

{% hint style="info" %}
**Semantic barge-in detection** - Barge-in detection uses semantic confirmation - it requires actual recognized words from the STT engine (not just voice activity detection). This filters false triggers from coughs, breathing, echo, and background noise. Minimum speech duration is 0.5 seconds with recognized words, with a 1.0-second fallback for delayed word recognition.
{% endhint %}

These behavioral signals are injected into the system prompt alongside emotional steering - the LLM receives a complete picture of both *how the caller sounds* and *how they're behaving*.

### Layer 3: Context Graph Engine

Each turn processes through a **two-stage LLM pipeline**:

1. **Navigator** - Selects the next action or exit condition from the current context graph state. Also generates a filler phrase to cover processing latency and determines whether to trigger [audio verification](#audio-verification) for structured data capture. Uses structured output validation with automatic retry (up to 3 attempts) and fallback to the first valid action.
2. **Engage LLM** - Generates the caller-facing response, informed by the selected action, full conversation history with per-message emotion annotations (`[VOICE: EmotionName, valence=V.VVV]`), audio correction results, emotional steering context, ambient patient context, and available tools.

**Emotion reaches the LLM via two independent paths:**

| Path                        | Scope              | What It Contains                                                                                                                |
| --------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------- |
| **Per-message annotations** | Every user message | Inline `[VOICE: Anxiety, valence=-0.312]` - the LLM sees the emotional trajectory across the full conversation                  |
| **Session-level steering**  | System prompt      | Dominant emotion + trend, quadrant-specific adaptation instructions, behavioral signals, call-phase urgency, coherence warnings |
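
For illustration, the per-message annotation could be rendered as follows; prefix placement is an assumption, since the source only specifies the tag format itself:

```python
def annotate_message(text: str, emotion: str, valence: float) -> str:
    """Attach the inline [VOICE: EmotionName, valence=V.VVV] annotation
    shown above to a user message."""
    return f"[VOICE: {emotion}, valence={valence:.3f}] {text}"

# annotate_message("I guess that works.", "Anxiety", -0.312)
# -> "[VOICE: Anxiety, valence=-0.312] I guess that works."
```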

**Communication micro-behaviors**: The engage template contains hardcoded guidelines that instruct the LLM on micro-level conversational behaviors that are always active - not gated by emotion:

| Behavior                    | Description                                                                              |
| --------------------------- | ---------------------------------------------------------------------------------------- |
| **Speech rhythm mirroring** | If the caller speaks in short bursts, respond concisely; if conversational, match warmth |
| **Emotional name usage**    | Use the caller's name at moments of emotional significance, not mechanically             |
| **Pause injection**         | When delivering difficult information, pause naturally before the key detail             |
| **Pace inversion**          | When the caller is rushing, slow the pace with longer sentences and gentle transitions   |
| **Completion inference**    | When the caller trails off mid-sentence, acknowledge what they were trying to say        |
| **Emotion concealment**     | Never explicitly mention that the system can detect emotions                             |
| **Natural laughter**        | Contextual laughter available for naturally warm moments - used sparingly                |

### Layer 4: Audio Output

The engage LLM's text streams to the TTS engine for speech synthesis with **per-turn dynamic controls**:

* **Emotion** - Derived from the [voice tone priority chain](#voice-tone-priority-chain)
* **Speed** - From workspace [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings)
* **Volume** - From workspace voice settings

**Word-level timestamps** are collected for every generated word - start time and end time - enabling transcript-to-audio scrubbing in the call playback UI. This is critical for the review queue workflow where operators need to jump to specific moments in a call.
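
As a sketch, scrubbing reduces to a binary search over sorted word start times; the parallel-array layout and names are illustrative:

```python
import bisect

def word_at(time_s: float, starts: list[float], words: list[str]) -> str | None:
    """Given parallel arrays of word start times (sorted ascending) and words,
    find the word being spoken at `time_s` - the core lookup behind
    transcript-to-audio scrubbing."""
    i = bisect.bisect_right(starts, time_s) - 1
    return words[i] if i >= 0 else None
```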

### Layer 5: Post-Call Intelligence

Two optional analyses run after every call (controlled via [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings)):

**Transcript verification** - Re-transcribes the full call audio with a high-accuracy batch model and computes Word Error Rate (WER) against the real-time transcript. Produces `verified_transcript`, `verified_words`, and `transcript_accuracy` - enabling quality comparisons between the real-time and batch transcription.

**Quality analysis** - Listens to the full stereo recording (caller + agent) and scores on 5 dimensions (1-5 each):

| Dimension                | What It Measures                              |
| ------------------------ | --------------------------------------------- |
| **Task Completion**      | Did the agent achieve the caller's goal?      |
| **Information Accuracy** | Was the information provided correct?         |
| **Conversation Flow**    | Was the conversation natural and smooth?      |
| **Error Recovery**       | How well did the agent recover from mistakes? |
| **Caller Experience**    | How did the caller feel at the end?           |

#### Call Intelligence Persistence

Alongside the LLM-based quality analysis, the voice agent computes a structured intelligence summary from in-memory session state at call end. This runs synchronously during session cleanup (before the session is torn down) and captures operational telemetry that the async quality analysis cannot see.

Each call intelligence record contains:

| Field                  | Type   | Description                                                                                            |
| ---------------------- | ------ | ------------------------------------------------------------------------------------------------------ |
| `quality_score`        | float  | Rule-based composite score (0-100), penalty-based                                                      |
| `emotion_summary`      | object | `dominant_emotion`, `average_valence`, arousal, peak negative, shifts, final trend                     |
| `risk_summary`         | object | Composite risk score, level, contributing signals with weights                                         |
| `latency_summary`      | object | Engine response time (avg/p50/p95), audio TTFB (avg/p50/p95), silence ratio                            |
| `conversation_summary` | object | `turn_count`, `states_visited_count`, `unique_states`, `loop_count`, barge-in count, completion reason |
| `tool_summary`         | object | Total calls, success/failure counts, failure rate, per-tool breakdown                                  |
| `safety_summary`       | object | `match_count` (safety rule matches), `actions` taken                                                   |
| `operator_summary`     | object | `escalated` (boolean), operator connect time, resolution                                               |
| `completion_reason`    | string | Why the call ended (hangup, terminal state, silence, etc.)                                             |
| `final_state`          | string | Last context graph state at call end                                                                   |

**Quality score penalties:**

| Signal        | Threshold               | Penalty    |
| ------------- | ----------------------- | ---------- |
| High latency  | p95 audio TTFB > 1000ms | -5 to -15  |
| Silence       | Silence ratio > 0.2     | -10 to -20 |
| Barge-ins     | > 2                     | -5 to -15  |
| Agent loops   | > 0 revisited states    | -10 to -20 |
| Escalation    | Any                     | -10        |
| Tool failures | Failure rate > 5%       | -5 to -15  |

The computation is pure (no I/O, no external calls) - all data comes from in-memory session state. If the write fails, the error is logged but does not affect the caller or post-call processing.
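
A sketch of the penalty computation from the table above - linear scaling within each penalty range, and the scaling upper bounds, are assumptions:

```python
def quality_score(p95_ttfb_ms: float, silence_ratio: float, barge_ins: int,
                  loops: int, escalated: bool, tool_failure_rate: float) -> float:
    """Rule-based composite score (0-100) per the penalty table above."""
    def scale(x: float, lo: float, hi: float, p_min: float, p_max: float) -> float:
        # Linear interpolation of the penalty within its stated range.
        frac = min(max((x - lo) / (hi - lo), 0.0), 1.0)
        return p_min + frac * (p_max - p_min)

    score = 100.0
    if p95_ttfb_ms > 1000:
        score -= scale(p95_ttfb_ms, 1000, 3000, 5, 15)
    if silence_ratio > 0.2:
        score -= scale(silence_ratio, 0.2, 0.6, 10, 20)
    if barge_ins > 2:
        score -= scale(barge_ins, 2, 8, 5, 15)
    if loops > 0:
        score -= scale(loops, 0, 5, 10, 20)
    if escalated:
        score -= 10
    if tool_failure_rate > 0.05:
        score -= scale(tool_failure_rate, 0.05, 0.5, 5, 15)
    return max(score, 0.0)
```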

#### Call Intelligence Endpoints

Two endpoints expose intelligence data for completed and active calls:

**`GET /calls/{call_id}/intelligence`** - Full intelligence profile for a completed call.

Joins persisted call intelligence summaries with per-turn data reconstructed from conversation history:

| Response Field                     | Source                  | Description                                                                             |
| ---------------------------------- | ----------------------- | --------------------------------------------------------------------------------------- |
| `quality_score`                    | Persisted summary       | Composite 0-100 score                                                                   |
| `emotion_trajectory`               | Per-turn reconstruction | `EmotionTurnPoint[]` - turn number, timestamp, emotion, valence                         |
| `risk_timeline`                    | Per-turn reconstruction | `RiskTurnPoint[]` - turn number, timestamp, risk score, state                           |
| `latency_profile`                  | Both                    | `LatencyProfile` - per-turn waterfall (engine/nav/render/audio TTFB ms) + summary stats |
| `tool_performance`                 | Per-turn reconstruction | `ToolPerformanceItem[]` - per-tool invocations, success/fail, avg ms                    |
| `conversation_quality`             | Per-turn reconstruction | Loop events (turn + state) and barge-in events (turn + interrupted text)                |
| `*_summary` fields                 | Persisted summary       | Full summaries (emotion, risk, latency, conversation, tool, safety, operator)           |
| `completion_reason`, `final_state` | Persisted summary       | Why the call ended and the last context graph state                                     |

Returns 404 if the call or intelligence data is not found.

**`GET /calls/active/intelligence`** - Active calls with live intelligence overlay.

Enriches the active call listing with per-turn intelligence from cached snapshots:

| Response Field       | Source     | Description                           |
| -------------------- | ---------- | ------------------------------------- |
| `current_emotion`    | Live cache | Current detected emotion              |
| `current_valence`    | Live cache | Current emotional valence             |
| `current_risk_score` | Live cache | Current composite risk score          |
| `risk_trend`         | Live cache | `rising`, `stable`, or `falling`      |
| `turn_count`         | Live cache | Number of turns completed             |
| `escalation_active`  | Live cache | Whether operator escalation is active |
| `current_state`      | Live cache | Current context graph state           |

Supports `workspace_id` query parameter for filtering.

#### Live Intelligence Pipeline

The voice agent writes a compact intelligence snapshot after each caller speech turn. The snapshot includes current emotion, risk score, turn count, escalation status, and current state.

Intelligence data is refreshed alongside the active call heartbeat. If the session ends or is lost, the live data expires automatically.

The active intelligence endpoint reads all live intelligence for active calls in a single operation for efficient dashboard polling.

**Self-improving feedback loop**: Quality analysis also produces `stt_suggestions` - words the STT misheard, formatted as recognition keywords for future calls. This creates a closed loop:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A["Quality analysis\nfinds STT errors"] --> B["Suggests keywords"]
B --> C["Keywords added to\nworkspace voice settings"]
C --> D["Future calls\nbetter recognition"]
D -->|"Next call"| A" %}

## How Calls Work

Every call runs inside a **conference architecture** - a multi-party audio bridge that enables the caller, AI agent, and optionally a human [operator](https://docs.amigo.ai/developer-guide/platform-api/platform-api/operators) to all participate simultaneously.

### Inbound Call Flow (Instant Greeting)

The system eliminates dead air at call start through **parallel pre-warming** - the engine, greeting, and agent connection all initialize while the phone is still ringing.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
participant Caller
participant Tel as Telephony
participant Agent as Voice Agent

Caller->>Tel: Dials phone number
Tel->>Agent: Webhook (T=0)
Note over Agent: Resolve: phone → workspace → service → version set

par Pre-warm during ring time (parallel)
    Agent->>Agent: Initialize engine + load context graph + load tools
    Agent->>Agent: Generate greeting text via LLM
    Agent->>Agent: Resolve caller → patient context from world model
    Agent->>Tel: Create agent conference leg (via conference name)
    Tel->>Agent: Agent WebSocket connects
    Note over Agent: Agent fully ready - greeting cached
end

Note over Tel: Phone rings...

Caller->>Tel: Picks up (T=Xs)
Tel->>Agent: Caller joins conference
Agent-->>Caller: ✅ Instant greeting (~200-300ms)

loop Conversation Turns
    Caller->>Agent: Speech audio (bidirectional stream)
    par Signal Processing
        Agent->>Agent: STT (transcript + end-of-turn)
        Agent->>Agent: Emotion (prosody + burst + language)
    end
    Agent->>Agent: Navigator → Filler → Engage LLM → TTS
    Agent-->>Caller: Emotionally adaptive audio response
end

Note over Agent: Call ends (terminal state, silence, or hangup)
Note over Agent: Persist call record + emotional summary
Note over Agent: Post-call analysis (background)" %}

**Key insight**: The telephony conference API accepts friendly names, not just IDs. Because the conference name is known at webhook time, the agent leg can be created immediately - the conference itself is created on demand when the agent joins. This means the agent can be fully connected and waiting *before the caller even picks up*.

**Timeline comparison:**

| Phase                  | Without Pre-warm      | With Pre-warm                        |
| ---------------------- | --------------------- | ------------------------------------ |
| Webhook → Engine ready | After pickup (+1-3s)  | During ring (hidden)                 |
| Agent leg creation     | After pickup (+200ms) | During ring (hidden)                 |
| WebSocket connection   | After pickup (+200ms) | During ring (hidden)                 |
| Greeting generation    | After pickup (+500ms) | During ring (hidden)                 |
| **Total dead air**     | **\~1200ms**          | **\~200-300ms** (TTS streaming only) |

**Safety guarantees:**

* Caller hangs up during ring → cache entry expires (30s TTL), resources cleaned up lazily
* WebSocket lands on different pod → cache miss, standard initialization (no degradation)
* Pre-warm exceeds timeout → TwiML returned anyway, standard initialization on pickup
* Session capacity is NOT consumed during pre-warm (no active session yet)

{% hint style="info" %}
**Pre-warm** is best-effort. If initialization takes longer than expected, the system falls back to standard initialization - no degradation in call quality, just a slightly longer time to first greeting.
{% endhint %}

### Outbound Call Flow

Outbound calls are **world-model-native** - scheduled as `outbound_task` entities via the `schedule_outbound_call` tool during inbound calls, then dispatched by the [connector runner](https://docs.amigo.ai/developer-guide/platform-api/platform-api/connector-runner) when they become due.

**Five business logic patterns** can produce outbound tasks:

| Pattern                           | Description                              | Example                                                        |
| --------------------------------- | ---------------------------------------- | -------------------------------------------------------------- |
| **Scheduled**                     | Decision made, execution deferred        | "I'll call you back tomorrow at 2pm"                           |
| **Event-Reactive**                | Trigger → evaluate → maybe act           | New lab result → is it critical? → call patient                |
| **Continuous Monitoring**         | Periodic population sweep                | Patients with no contact in 30 days                            |
| **Conversational Follow-Through** | Track preconditions from agent promises  | "I'll call after the doctor reviews" → pending on doctor event |
| **Orchestrated Campaign**         | Achieve outcome for population over time | "Get all 200 patients to complete annual wellness by Q4"       |

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Producers["Task Producers"]
A["Voice agent\n(mid-call promise)"]
B["Autonomous agent\n(panel review)"]
C["Connector runner\n(reactive rules)"]
D["Dashboard\n(manual schedule)"]
end

subgraph WM["World Model"]
E["outbound_task\nentity"]
end

subgraph Dispatch["Dispatch Loop"]
F{"Due?\nBusiness hours?\nRetry budget?"}
G["Build rich context\nfrom patient projection"]
H["Dispatch call"]
end

Producers --> E
E --> F
F -->|Yes| G --> H
F -->|No| I["Wait for\nnext window"]
H --> J["Voice agent\nexecutes with\nfull patient context"]" %}

Each outbound task carries: patient reference, reason, goal, priority (1-10), business-hours window (timezone-aware), retry config (max attempts with configurable backoff), and rich context from the patient's world model projection. The dispatch loop enriches the system prompt so the agent starts the call with full patient knowledge - **the agent never needs to "look up" the patient**.
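
For illustration, an outbound task might carry a payload shaped like this - the field names are hypothetical, not the actual entity schema:

```python
# Hypothetical outbound_task payload - illustrative field names only.
# See the world model documentation for the real entity shape.
outbound_task = {
    "patient_ref": "patient:8f3a...",      # world-model entity reference
    "reason": "follow_up_lab_results",
    "goal": "Confirm the patient received results and schedule a follow-up",
    "priority": 7,                         # 1-10
    "business_hours": {
        "timezone": "America/Chicago",     # timezone-aware calling window
        "window": {"start": "09:00", "end": "17:00"},
    },
    "retry": {"max_attempts": 3, "backoff_minutes": [15, 60, 240]},
    "context": {},  # enriched from the patient's world model projection at dispatch
}
```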

**Outbound pre-warm**: Outbound calls use the same parallel pre-warming as inbound calls. During the dialing/ringing phase (typically 5-15 seconds), the engine initializes, loads the context graph, resolves patient context, and generates the greeting. When the patient answers, the engine and greeting are already cached - the patient hears an instant greeting instead of several seconds of silence. Pre-warm is best-effort: if initialization takes longer than the ring time, the system falls back to standard cold initialization.

### Conference Architecture

<details>

<summary>Conference architecture - telephony details</summary>

The conference architecture supports multiple simultaneous audio participants with independent per-participant streams:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Conference["Telephony Conference"]
C["Caller\n(PSTN - phone network)"]
A["AI Agent\n(WebSocket stream)"]
O["Operator\n(PSTN or WebRTC)\n[optional]"]
end

subgraph STT["Per-Participant Speech-to-Text"]
CS["Caller STT\n(speaker attribution)"]
OS["Operator STT\n(human transcript capture)"]
AS["Agent STT\n(turn processing)"]
end

C --> CS
O --> OS
A --> AS

subgraph Resolution["Speaker Resolution"]
R["Priority: Operator > Caller > Default"]
end

CS --> Resolution
OS --> Resolution" %}

| Participant  | Role                              | Audio Transport         | STT                              |
| ------------ | --------------------------------- | ----------------------- | -------------------------------- |
| **Caller**   | Person who called or was called   | PSTN                    | Dedicated per-participant stream |
| **Agent**    | AI voice agent                    | Bidirectional WebSocket | Main session STT                 |
| **Operator** | Human monitor/takeover (optional) | PSTN or browser WebRTC  | Dedicated per-participant stream |

**Three-party speaker resolution**: When multiple parties are on the call, speaker attribution uses a priority chain: operator STT → caller STT → default (caller). Every turn in the call record carries `speaker_id` and `speaker_role` for accurate attribution in the transcript.

</details>

## Context Graph Engine

The voice agent executes a **Hierarchical State Machine** loaded from the service's version set. Each call gets its own engine instance with an in-memory state database for zero-latency state tracking, flushed to persistent storage after the call ends.

### State Types

| State Type          | Purpose                                                            | LLM Call?            |
| ------------------- | ------------------------------------------------------------------ | -------------------- |
| **ActionState**     | Agent performs actions and evaluates exit conditions to transition | Yes - Engage LLM     |
| **DecisionState**   | Agent evaluates conditions and chooses a transition                | Yes - Navigator only |
| **ReflectionState** | Agent reasons deeply over a problem with optional tool calls       | Yes - deep reasoning |
| **ToolCallState**   | Enforces execution of a designated tool before transitioning       | No - automatic       |
| **RecallState**     | Retrieves information from memory before transitioning             | No - automatic       |
| **AnnotationState** | Injects an inner thought and transitions immediately               | No - automatic       |

### Per-Turn Flow

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
participant Caller
participant STT as Speech-to-Text
participant Emo as Emotion Detection
participant Nav as Navigator
participant Engage as Engage LLM
participant TTS as Text-to-Speech
participant Tools as Tool Executor

Caller->>STT: Speech audio
par Signal Processing
    STT->>Nav: Transcript + end-of-turn
    Caller->>Emo: Same audio (parallel)
    Emo->>Nav: Emotional state update
end

Nav->>Nav: Select action + generate filler
Nav->>TTS: Filler phrase (immediate)
TTS-->>Caller: Filler audio plays

Nav->>Engage: Action + emotional steering + patient context
Engage->>TTS: Response text (streaming)
TTS-->>Caller: Response audio (emotion-adaptive)

opt Tool calls in response
    Engage->>Tools: Dispatch tool (async)
    Tools-->>Engage: Result → continuation turn
end" %}

The navigator handles multi-state traversal automatically - decision states, annotation states, and recall states are resolved without user interaction before landing on an action state for the engage LLM.

**Navigator resilience**: Structured output validation with automatic retry (up to 3 total attempts). If all retries are exhausted, the navigator falls back to the first valid action or exit. Filler text from earlier attempts is preserved across retries (first-wins) - the caller never hears silence even during recovery.

### Action State Extensions

Action states support three optional extension fields for asynchronous workflows:

| Field                   | Type   | Description                                                                                                                                       |
| ----------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `wait_for`              | string | Pause navigation until an async condition clears. Values: `surface_submission`, `human_approval`.                                                 |
| `channel_overrides`     | object | Per-channel overrides keyed by channel kind (`voice`, `sms`). Each override can set `objective`, `action_guidelines`, and `suppress_filler`.      |
| `surface_spec_template` | object | Surface spec auto-created on state entry. Uses the same field schema as `POST /surfaces`. The entity ID defaults to the session's primary entity. |

**Wait conditions**: When the navigator returns a `waiting_for` value, the engine skips context graph navigation on subsequent turns. The engage prompt includes a `WAITING_FOR_CONDITION` block that constrains the agent to empathetic small-talk until the condition clears. For voice sessions, clearance comes via the real-time event stream. For text sessions, the session blocks on a dedicated event listener.

**Channel overrides**: The `channel_kind` is set on the engine session (`voice` for calls, `sms` for text sessions). Prompt rendering merges the channel-specific objective and guidelines into the engage prompt. `suppress_filler: true` nulls all filler text for text channels where conversational fillers are unnecessary.

**Surface templates**: On state entry, if the new state has a `surface_spec_template`, the engine creates the surface via the platform API and tracks the `surface_id` in the session's active surface set. This enables deterministic surface creation as part of the context graph design rather than relying on agent tool calls.

### Terminal State & Auto-Hangup

When the context graph reaches its terminal state (an `ActionState` with one action and zero exits), the agent speaks its goodbye and automatically ends the call:

1. Navigator lands on terminal state → `is_terminal = true`
2. Agent speaks the goodbye response
3. Waits for TTS to finish + grace period (audio buffer flush)
4. Terminates the call via telephony API

**Silence detection**: When the caller goes silent, the silence monitor fires check-ins at increasing intervals (10s → 20s → 40s). After 3 unanswered check-ins, the agent says a brief goodbye and auto-disconnects.
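
A sketch of the escalation loop, using a hypothetical `agent` interface (`speak`, `heard_speech`, `hangup`):

```python
import asyncio

CHECK_IN_DELAYS = (10, 20, 40)  # seconds - the interval doubles each time

async def silence_monitor(agent, max_checkins: int = 3):
    """Check in at growing intervals during caller silence; after 3 unanswered
    check-ins, say a brief goodbye and disconnect (per the behavior above)."""
    for delay in CHECK_IN_DELAYS[:max_checkins]:
        await asyncio.sleep(delay)
        if agent.heard_speech():   # caller spoke - the monitor resets elsewhere
            return
        await agent.speak("Are you still there?")
    await agent.speak("It seems we got disconnected. Goodbye for now.")
    await agent.hangup()
```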

**Session shutdown contract**: Every code path that stops the session must also stop the audio speaker - otherwise the speaker blocks indefinitely. This is enforced across all shutdown triggers: hangup, STT failure, WebSocket disconnect, and terminal state.

## World Model Integration

The voice agent connects to the workspace's [world model](https://docs.amigo.ai/developer-guide/platform-api/platform-api/data-world-model) through three data channels - this architecture is informed by the [Liquid World Model thesis](https://docs.amigo.ai/developer-guide/platform-api/data-world-model#design-thesis) where the distinction between data infrastructure and intelligence dissolves.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph WM["World Model (Event-Sourced)"]
EV["Events\n(immutable, confidence-scored)"]
EN["Entities\n(projected state + embeddings)"]
EG["Entity Graph\n(relationships)"]
EV -->|"projection"| EN
EN --- EG
end

subgraph Channels["Three Data Channels"]
direction TB
A["🔵 Ambient (pushed)\nPatient state in system prompt\nLocation context\nRelated entities"]
B["🟢 Queried (pulled)\nSlot search, patient lookup\nSemantic search"]
C["🟡 Extracted (captured)\nInsurance details from speech\nContact info from conversation"]
end

subgraph LLM["LLM Context"]
CTX["System prompt + conversation history\n+ tool results + emotional steering"]
end

EN -->|"At session start +\nmid-call refresh"| A
B <-->|"Tool calls\n↔ results"| EN
C -->|"Transcript extraction\n(confidence 0.7)"| EV
A --> CTX
B --> CTX
C -.->|"implicit capture"| EV" %}

### Channel 1: Ambient (Pushed)

Data the LLM should always have without asking. Injected into the system prompt at session start and refreshed as the conversation evolves:

* **Patient demographics** - name, DOB, MRN, phone, email, address
* **Clinical context** - active conditions, medications, allergies (filtered to text-only for LLM consumption)
* **Upcoming appointments** - with patient entity references for cross-referencing
* **Insurance coverage** - active plans and subscriber info
* **Location context** - clinic details, available appointment types, hours (resolved from the inbound phone number)

**Design principle: ambient over queried.** If the LLM will almost certainly need this data, push it into context. Don't make it ask. A voice agent that already has the patient's insurance in context doesn't need to dispatch a tool call to look it up.

### Channel 2: Queried (Pulled)

Data that can't be ambient because the search space is too large. The agent calls [built-in clinical tools](#built-in-clinical-tools) to retrieve specific information.

**Key simplification**: Queried tools return human-readable results, not database internals. Slot search returns doctor names and times, not template IDs and slot UUIDs. When the agent says "book the 1:45 with Dr. Jones," the system resolves scheduling internals from cached slot data. **The LLM never touches scheduling internals.**
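
As a sketch, resolution is a lookup against the cached slot data from the earlier search - the field names are assumed:

```python
def resolve_slot(choice_doctor: str, choice_time: str,
                 cached_slots: list[dict]) -> dict | None:
    """When the agent says 'book the 1:45 with Dr. Jones', resolve the
    human-readable choice back to the cached slot record that carries the
    scheduling internals the LLM never sees."""
    for slot in cached_slots:
        if slot["doctor_name"] == choice_doctor and slot["time"] == choice_time:
            return slot  # carries template_id, slot_uuid, etc.
    return None

# slots = [{"doctor_name": "Dr. Jones", "time": "1:45 PM",
#           "template_id": "tmpl_91", "slot_uuid": "a1b2..."}]
# resolve_slot("Dr. Jones", "1:45 PM", slots)
```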

### Channel 3: Extracted (Captured)

Structured data mentioned in conversation - insurance details, contact information, preferences - is automatically captured and written to the world model without requiring explicit tool calls. This eliminates the mode switch where the LLM stops being a conversationalist and becomes a database operator. **The conversation IS the data entry.**

Extracted data is written with moderate confidence (below verified threshold) - the LLM can still use explicit write tools for high-stakes data where precision matters. Extraction is a complement, not a replacement.

### Multi-Stage Verification

All data written by the voice agent during calls starts at a low confidence level and must pass through a verification pipeline before syncing to external systems. This is the trust architecture for autonomous agents acting on noisy phone audio.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A["Voice agent writes\nclinical data\n(low confidence)"] --> B["Call ends"]
B --> CL["Call Classifier"]
CL -->|"Junk (prank/ad/bot)"| REJ1["❌ Rejected\n(confidence → 0)"]
CL -->|"Real call"| J1["Per-Event LLM Judge"]

J1 -->|"Valid"| AP["✅ Auto-approved\n(confidence → verified)"]
J1 -->|"Correctable"| COR["Auto-correct\n(name casing, dates, phones)"]
COR --> AP
J1 -->|"Uncertain"| FLAG["⚠️ Flagged\n→ Review queue"]

AP --> J2["Session Coherence Check"]
J2 -->|"Coherent"| SYNC["✅ Sync-eligible\n(confidence upgraded)"]
J2 -->|"Contradictions"| FLAG

FLAG --> HR["Human Reviewer"]
HR -->|"Approve"| SYNC2["✅ Human-approved\n(high confidence)"]
HR -->|"Correct"| NEW["New event\n(supersedes original)"]
HR -->|"Reject"| REJ2["❌ Rejected"]
NEW --> SYNC2

SYNC --> OUT["Connector runner\nsyncs to external system\n(confidence gate ≥ verified)"]
SYNC2 --> OUT" %}

**Three-stage automated review:**

| Stage                 | What It Checks                                                         | Actions                                                      |
| --------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Call Classifier**   | Is this a real clinical call or junk? (prank, ad, bot, silence)        | Real → continue; Junk → reject all session events            |
| **Per-Event Judge**   | Cross-references each event against transcript + existing entity state | Approve, auto-correct (formatting), or flag for human review |
| **Session Coherence** | Do all events tell a coherent story? Contradictions? Missing data?     | Upgrade confidence if coherent, flag if contradictions found |

**Why three stages, not one**: Per-event review catches data-level errors (wrong phone format, impossible DOB, name doesn't match transcript). Session-level review catches narrative-level errors (contradictions between events, or insurance discussed with no coverage event recorded). These are different kinds of errors requiring different analysis approaches.

### Patient Safety Isolation

A **write scope** is enforced per session - write tools can only target the patient identified in the current call. This prevents cross-patient data errors. Write tools are also **deduplicated** - identical calls within the same session return cached results rather than creating duplicate records (30-second TTL, successful results only - errors are always retryable).
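
A sketch of the dedup rule - keying on tool name plus arguments, caching successes only, with the 30-second TTL quoted above; names are illustrative:

```python
import hashlib
import json
import time

_cache: dict[str, tuple[float, object]] = {}
TTL_S = 30.0

def dedup_key(tool: str, args: dict) -> str:
    """Stable key for an identical tool call within a session."""
    blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def call_with_dedup(tool: str, args: dict, execute):
    key = dedup_key(tool, args)
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < TTL_S:
        return hit[1]                         # identical call -> cached result
    result = execute(tool, args)              # exceptions propagate uncached,
    _cache[key] = (time.monotonic(), result)  # so failed calls stay retryable
    return result
```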

## Emotional Adaptation

The voice agent adapts across **four independent output channels simultaneously** based on real-time caller emotion. Each row in the matrix below is a detected situation; columns show how each output channel responds. All adaptation is automatic - workspace managers control only the baseline via [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings).

### Valence-Arousal Model

Every detected emotion maps to a two-dimensional (valence, arousal) coordinate. The system tracks these coordinates across a rolling window to build a stable yet responsive picture of the caller's emotional state:

```
        High Arousal (1.0)
             │
    ANGER ───┼─── EXCITEMENT
  Frustration│    Joy
  Fear       │    Enthusiasm
             │
  ───────────┼───────────── Valence
  Negative   │    Positive
  (-1.0)     │    (+1.0)
             │
   SADNESS ──┼─── CONTENTMENT
  Disappointment  Relief
  Boredom    │    Gratitude
             │
        Low Arousal (0.0)
```

| Quadrant                                           | Agent Strategy | Voice Tone     | LLM Behavior                                                                         |
| -------------------------------------------------- | -------------- | -------------- | ------------------------------------------------------------------------------------ |
| **High-arousal negative** (anger, frustration)     | De-escalate    | `calm`         | Direct, concise, acknowledge frustration, skip pleasantries, match urgency           |
| **Low-arousal negative** (sadness, disappointment) | Comfort        | `sympathetic`  | Warm, patient, gentle language, give extra space, do not rush                        |
| **High-arousal positive** (excitement, joy)        | Match energy   | `enthusiastic` | Enthusiastic language, keep momentum, match positive energy                          |
| **Low-arousal positive** (contentment, relief)     | Maintain       | `content`      | Warm and steady, reinforce positive outcome, conversational                          |
| **Confusion** (high confidence)                    | Clarify        | `calm`         | Simplify explanations, break into small pieces, check understanding, offer to repeat |
| **Anxiety** (high confidence)                      | Reassure       | `sympathetic`  | Calm and reassuring, provide clear next steps, avoid uncertainty                     |
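
A sketch of the quadrant classification; the 0.5 arousal split between high and low is an assumption (the source defines arousal on a 0-1 scale but does not state the boundary):

```python
def quadrant_strategy(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) coordinate to the strategy table above."""
    if valence < 0:
        return "de-escalate" if arousal >= 0.5 else "comfort"
    return "match_energy" if arousal >= 0.5 else "maintain"
```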

### Voice Tone Priority Chain

The agent's voice tone is determined by a six-level priority chain - each layer fires only if the previous returned no signal:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TD
A{"1. Recent vocal burst?\n(laugh, sigh, cry)\n - last 5s, score ≥ 0.5"} -->|Yes| A1["Burst-derived tone\n(highest priority)"]
A -->|No| B{"2. Prosody emotion?\n(rolling 30s window)\n - score ≥ 0.25"}
B -->|Yes| B1["Prosody-derived tone"]
B -->|No| C{"3. Sensitive topic?\n(action matches\nsensitive_topics list)"}
C -->|Yes| C1["Preemptive sympathetic\n(before distress shows)"]
C -->|No| D{"4. Previous turn\nhad strong tone?"}
D -->|Yes| D1["Tone momentum\n(keeps continuity)"]
D -->|No| E{"5. Workspace tone\nconfigured?"}
E -->|Yes| E1["Workspace baseline"]
E -->|No| F["6. System default"]" %}

**Why this ordering matters:**

* **Bursts are the highest-priority signal** because they capture the most immediate emotional state. A caller who just laughed should hear warmth *immediately* - not the rolling average of the last 30 seconds. Burst detection (within last 5 seconds, confidence ≥ 0.5) overrides everything.
* **Tone momentum** (layer 4) prevents jarring voice tone changes. When the current emotional signal is weak (score < 0.25) or doesn't map to a tone, the previous turn's tone persists. Only a strong contradictory signal changes the tone - making the voice feel continuous across the conversation:

```
Turn 1: Anxiety detected (score 0.72)  → "sympathetic"  → stored as momentum
Turn 2: Calmness detected (score 0.30) → unmapped       → momentum returns "sympathetic"
Turn 3: Joy detected (score 0.65)      → "enthusiastic" → stored as new momentum
```

* **Proactive topic sensitivity** (layer 3) fires *before the caller shows distress*. When the agent is about to discuss test results, billing, surgery, or other loaded topics, the voice tone preemptively shifts to sympathetic - even without an emotion signal.
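
The chain reduces to a short fall-through function. A sketch, with inputs assumed to be pre-resolved to `None` when a layer has no signal:

```python
def select_tone(burst_tone, prosody_tone, sensitive_topic: bool,
                previous_tone, workspace_tone, default: str = "neutral") -> str:
    """The six-level priority chain above: each layer fires only when the
    previous returned no signal. burst_tone is None unless a burst in the
    last 5s scored >= 0.5; prosody_tone is None unless the rolling-window
    score is >= 0.25."""
    if burst_tone:                 # 1. recent vocal burst
        return burst_tone
    if prosody_tone:               # 2. prosody emotion
        return prosody_tone
    if sensitive_topic:            # 3. preemptive sympathetic
        return "sympathetic"
    if previous_tone:              # 4. tone momentum
        return previous_tone
    if workspace_tone:             # 5. workspace baseline
        return workspace_tone
    return default                 # 6. system default
```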

### Emotion → Response Matrix

All four adaptation channels respond simultaneously to each caller state. The agent mirrors *empathy*, not the caller's emotion:

| Caller Emotion                     | Voice Tone     | Filler Style   | LLM Prompt Adaptation                           | Rationale                             |
| ---------------------------------- | -------------- | -------------- | ----------------------------------------------- | ------------------------------------- |
| **Anger, Annoyance, Contempt**     | `calm`         | **Suppressed** | Direct, concise, acknowledge frustration        | De-escalate - don't mirror aggression |
| **Anxiety, Fear, Distress**        | `sympathetic`  | Reassuring     | Calm, clear next steps, avoid uncertainty       | Reassure - steady presence            |
| **Sadness, Disappointment, Guilt** | `sympathetic`  | Warm           | Patient, supportive, don't rush                 | Warm empathy - give space             |
| **Confusion**                      | `calm`         | Simple         | Simplify, small pieces, check understanding     | Patient clarity                       |
| **Excitement, Joy, Enthusiasm**    | `enthusiastic` | Warm, matching | Match positive energy, keep momentum            | Mirror positive energy                |
| **Contentment, Relief, Gratitude** | `content`      | Warm           | Steady, reinforce outcome                       | Warm and grounding                    |
| **Interest, Concentration**        | `curious`      | Engaged        | Engaged tone, match intellectual focus          | Show interest                         |
| **Embarrassment, Doubt**           | `calm`         | Encouraging    | Non-judgmental, encouraging                     | Put at ease                           |
| **Boredom, Tiredness**             | `enthusiastic` | Concise        | Re-engage with energy, be efficient             | Re-energize                           |
| **Sarcasm**                        | `calm`         | Professional   | Respond to underlying concern, not surface tone | Stay professional                     |

**Burst-to-experience mapping**: 25 vocal burst types are mapped to specific agent tones and caller state interpretations:

| Burst Types       | Agent Tone             | Inferred Caller State    |
| ----------------- | ---------------------- | ------------------------ |
| Laugh, Giggle     | `enthusiastic`         | Amused                   |
| Sigh              | `sympathetic`          | Weary                    |
| Cry, Sob, Whimper | `sympathetic`          | Distressed               |
| Gasp              | `calm`                 | Alarmed                  |
| Groan, Ugh        | `sympathetic` / `calm` | Frustrated               |
| Growl, Tsk        | `calm`                 | Angry                    |
| Hmm, Mhm          | `calm`                 | Thinking / Acknowledging |
| Aww               | `sympathetic`          | Touched                  |

### Filler Speech

Fillers cover processing latency so the caller never hears silence. The system uses **principle-based guidance** - not hardcoded phrase lists - generating contextually appropriate fillers from emotional context, the current action, and the expected latency.

**Three-layer filler generation:**

| Layer                    | When                        | What It Controls                                                                                                                                     |
| ------------------------ | --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Latency adaptation**   | Always                      | Filler length matches expected processing time (2-4 words for normal latency, 3-5 words for audio verification processing)                           |
| **Emotional attunement** | When emotion data available | Emotional register matches the caller's state - not specific phrases, but principles like "gentle and reassuring" or "a verbal hand on the shoulder" |
| **Action context**       | Always                      | Current context graph action description injected so the filler hints at what the agent is about to do                                               |

**Per-action filler hints**: Context graph actions can include optional PM-configured filler suggestions. These are weak steering - emotion-adaptive principles always dominate. The LLM sees hints as suggestions to draw from, not commands.

**Suppression rule**: When `valence < -0.2 AND arousal > 0.4 AND emotion is NOT Anxiety/Fear/Distress` → fillers disabled entirely. Frustrated callers don't want acknowledgments - they want the answer. **Exception**: Anxious callers still receive reassuring fillers, because anxiety benefits from reassurance while frustration does not.
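
The rule as a predicate - a direct transcription of the condition above:

```python
ANXIOUS_EMOTIONS = {"Anxiety", "Fear", "Distress"}

def suppress_fillers(valence: float, arousal: float, emotion: str) -> bool:
    """Disable fillers for frustrated callers; anxious callers keep
    reassuring fillers, because anxiety benefits from reassurance."""
    return valence < -0.2 and arousal > 0.4 and emotion not in ANXIOUS_EMOTIONS
```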

### Call Phase Escalation

The system automatically increases urgency as calls extend with negative sentiment:

| Phase     | Duration | Condition           | Adaptation                                                                                 |
| --------- | -------- | ------------------- | ------------------------------------------------------------------------------------------ |
| **Early** | < 5 min  | Any                 | Standard emotional adaptation                                                              |
| **Mid**   | 5-10 min | Trend deteriorating | "Focus on resolution speed. Shorten responses."                                            |
| **Late**  | ≥ 10 min | Negative valence    | **URGENCY.** "Prioritize resolution. Be maximally concise. Escalate if unable to resolve." |

### Proactive Intelligence

The system detects emotionally sensitive topics from the current context graph action **before the caller shows distress**:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A["Current Action:\n'Discuss test results'"] --> B{"Matches\nsensitive_topics?"}
B -->|Yes| C["Preemptive shift to\nsympathetic tone\n(priority level 3\nin tone chain)"]
B -->|No| D["Normal tone\npriority chain"]" %}

`sensitive_topics` is configurable via [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings). Falls back to healthcare defaults: test results, diagnosis, billing, payment, insurance, denial, emergency, referral, specialist, surgery, procedure, medication.

This fires at priority level 3 in the TTS emotion chain - below burst and prosody (which have actual data about the caller's current state) but above tone momentum and workspace defaults.

### Coherence Detection

When what the caller *says* doesn't match how they *sound* (coherence < 0.4), the system shifts its steering: *"The caller's words suggest X but voice sounds Y. Trust the vocal tone over the words - respond to how they sound, not what they claim."*

This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy to the caller.

### Control Plane ↔ Adaptation

How each workspace voice setting interacts with the automatic emotion adaptation system:

| Voice Setting                   | What You Control                           | What the System Overrides         | Override Condition                              |
| ------------------------------- | ------------------------------------------ | --------------------------------- | ----------------------------------------------- |
| `tone`                          | Baseline voice emotion for neutral callers | Emotion-derived tone replaces it  | Any non-neutral emotion detected (score ≥ 0.25) |
| `speed`                         | Base speech rate                           | Never overridden                  | Your choice always respected                    |
| `volume`                        | Base volume                                | Never overridden                  | Your choice always respected                    |
| `voice_id`                      | Voice persona                              | Per-agent voice config overrides  | Agent version has voice config set              |
| `keyterms`                      | Domain vocabulary for STT boost            | Merged with service keyterms      | Always additive, never overridden               |
| `correction_categories`         | Domain hints for audio correction          | None                              | Used as additional context                      |
| `sensitive_topics`              | Topics for proactive tone softening        | Falls back to healthcare defaults | Preemptive, not reactive                        |
| `post_call_analysis_enabled`    | Quality scoring on/off                     | None                              | Full PM control                                 |
| `transcript_correction_enabled` | Re-verification on/off                     | None                              | Full PM control                                 |

**Key principle**: Workspace managers control the *baseline experience* and *domain knowledge*. The emotion intelligence system overrides the baseline *only when it detects a strong signal* - and always in the direction of more empathy, never less.

### Graceful Degradation

Every intelligence layer is best-effort with an explicit fallback. **A failed intelligence layer must never fail a call.**

| Layer                    | Failure Mode                       | Fallback                                                            | Impact                                                       |
| ------------------------ | ---------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Emotion connection**   | Auth error, billing, timeout       | Session continues without emotion detection                         | No emotional adaptation, workspace defaults used             |
| **Emotion segment**      | Processing error, connection close | Consecutive failure counter → disable after 5                       | Degrades gracefully to less data                             |
| **Emotion detection**    | Insufficient data (< 2 segments)   | No emotional steering, default fillers                              | First few seconds may lack adaptation                        |
| **Burst detection**      | No burst events                    | Falls through to prosody-derived emotion                            | Loses immediate reaction, uses rolling average               |
| **Language model**       | No language results                | Coherence defaults to 1.0 (agreement assumed)                       | Loses word-vs-tone disagreement detection                    |
| **Audio verification**   | Timeout or error                   | No corrections injected, call continues                             | Relies on raw STT only                                       |
| **Voice settings**       | Parse error                        | Defaults (filler on, emotion on)                                    | Baseline experience still works                              |
| **Post-call analysis**   | Any error                          | Logged, not raised (fire-and-forget)                                | Quality data missing, call unaffected                        |
| **TTS connection**       | Close/error mid-stream             | Auto-reconnect on next turn                                         | Brief silence, then recovery                                 |
| **STT connection**       | Connection loss                    | Exponential backoff reconnect (max 3 attempts)                      | Brief gap in transcription                                   |
| **Context graph engine** | Backend unavailable                | Falls back to static prompt mode (without context graph navigation) | Agent still converses, just without state machine navigation |
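
One way to read this table: every layer call is wrapped best-effort, so a failure degrades to a fallback value instead of propagating into the call. A minimal sketch, with illustrative names:

```python
import logging

logger = logging.getLogger("voice-agent")

def best_effort(layer_name: str, fn, fallback, *args, **kwargs):
    """Run an intelligence-layer call; on any failure, log and return the fallback."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:  # never let a layer failure propagate to the call
        logger.warning("layer %s failed (%s); using fallback", layer_name, exc)
        return fallback

# e.g. emotion detection degrades to "no steering" rather than failing the turn:
# steering = best_effort("emotion", detect_emotion, None, segment)
```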

## Tool Execution

Skills configured in the context graph execute **asynchronously** during calls - the agent acknowledges the action and continues speaking while tools run in the background. Results arrive as continuation turns.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A\["LLM returns\ntool call"] --> B\["Filler plays\nwhile tool runs"]
B --> C\["Tool executes\n(async, background)"]
C --> D\["Result arrives"]
D --> E\["Agent relays\nresult to caller"]" %}
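
The turn-level pattern, as a minimal asyncio sketch (the callables are placeholders for the real speaker and tool runner, not the production API):

```python
import asyncio

# Sketch of the async tool pattern: kick off the tool,
# keep talking, relay the result as a continuation turn.
async def run_tool_with_filler(execute_tool, play_filler, speak):
    task = asyncio.create_task(execute_tool())          # tool runs in the background
    await play_filler("Let me check that for you...")   # agent keeps speaking
    result = await task                                 # arrives as a continuation turn
    await speak(f"Okay - {result}")
```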

### Execution Tiers

Tool calls are routed through an execution tier system that matches the tool's complexity to the right execution model:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
TC\["Tool Call"] --> R{"Route by\nexecution tier"}
R -->|"T1: direct"| T1\["Direct Integration\n(single HTTP call)\n< 2 seconds"]
R -->|"T2: orchestrated"| T2\["LLM Agent\n(multi-turn reasoning\nwith tool access)\n2-30 seconds"]
R -->|"T3: autonomous"| T3\["Autonomous Agent\n(extended loop with\ncheckpointing + MCP tools)\n30s - 5 min"]
R -->|"Integration tool"| IT\["Integration Client\n(direct HTTP with\nOAuth2/WIF auth)"]
R -->|"Fallback"| FB\["Legacy execution"]" %}

| Tier   | Name         | Execution Model                                      | Latency  | Use Cases                                       |
| ------ | ------------ | ---------------------------------------------------- | -------- | ----------------------------------------------- |
| **T1** | Direct       | Single integration API call, no LLM                  | < 2s     | Patient lookup, allergy check, medication list  |
| **T2** | Orchestrated | Multi-turn LLM agent with tool access                | 2-30s    | Eligibility cascades, multi-step writes         |
| **T3** | Autonomous   | Extended agent loop with checkpointing and MCP tools | 30s-5min | Complex prior auth, cross-system reconciliation |

**T3 autonomous agents** use a full agent SDK with:

* **Custom MCP tools** injected per-task (world model tools, integration tools)
* **Session checkpointing** for pause/resume across retries
* **Cost caps** per task to prevent runaway execution
* **Isolated working directories** per task

**Write-tool deduplication**: All write tools are deduplicated within a session (30-second TTL). Identical tool calls return cached results. Only successful results are cached - errors are always retryable.
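
A minimal sketch of that dedup cache, assuming tool calls are keyed by name plus arguments (the real key derivation may differ):

```python
import time

# Sketch: cache successful write results for 30s, keyed by tool + args.
# Errors are never cached, so failed calls stay retryable.
class WriteDedupCache:
    TTL_S = 30.0

    def __init__(self):
        self._cache: dict[tuple, tuple[float, object]] = {}

    def get(self, tool: str, args: dict):
        key = (tool, tuple(sorted(args.items())))
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[0] < self.TTL_S:
            return hit[1]  # identical call within TTL: return cached result
        return None

    def put_success(self, tool: str, args: dict, result):
        key = (tool, tuple(sorted(args.items())))
        self._cache[key] = (time.monotonic(), result)
```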

### Built-in Clinical Tools

Healthcare workspaces get 13 built-in tools automatically - no integration configuration required:

**Read tools:**

| Tool                        | Purpose                                          | Key Feature                                                        |
| --------------------------- | ------------------------------------------------ | ------------------------------------------------------------------ |
| **Patient lookup**          | Search by DOB, name, phone, or MRN               | DOB preferred for accuracy                                         |
| **Slot search**             | Available appointment slots by location and date | Returns human-readable times + doctor names, caches slot internals |
| **Appointment lookup**      | Patient's existing appointments                  | Returns appointment references for cancel/confirm                  |
| **Semantic patient search** | Fuzzy, embedding-based patient matching          | Handles misspellings and partial information                       |
| **Semantic event search**   | Embedding-based search across clinical events    | Optionally scoped to a specific patient                            |

**Write tools:**

| Tool                       | Purpose                                         | Key Feature                                                         |
| -------------------------- | ----------------------------------------------- | ------------------------------------------------------------------- |
| **Patient create**         | Create patient with automatic deduplication     | Dedup by name + DOB                                                 |
| **Patient update**         | Update contact info (phone, email, address)     | Requires entity reference                                           |
| **Save patient**           | Create-or-update with dedup check               | Accepts natural field names and flexible date formats               |
| **Schedule appointment**   | Book from slot search results or explicit times | Accepts `slot_ref` from slot search - auto-resolves booking details |
| **Cancel appointment**     | Cancel by appointment reference                 | Writes cancellation event                                           |
| **Confirm appointment**    | Confirm a booked appointment                    | Writes confirmation event                                           |
| **Create insurance**       | Insurance record with carrier fuzzy-matching    | Supports policy holder info                                         |
| **Schedule outbound call** | Schedule a future callback                      | Creates `outbound_task` entity atomically                           |

All write tools pass through the [multi-stage verification pipeline](#multi-stage-verification) before data reaches external systems. All write tools enforce [patient safety isolation](#patient-safety-isolation).

### Call Forwarding

A built-in `forward_call` tool transfers the caller to a human. Two modes:

* **Static forwarding** - per-phone-number fallback, configured via [Phone Numbers](https://docs.amigo.ai/developer-guide/platform-api/platform-api/phone-numbers)
* **Location-based forwarding** - the agent selects from location phone numbers in the patient's context

The agent cannot specify arbitrary phone numbers - the destination always comes from the resolved config or location entity state. When the caller requests a human, the agent is required to invoke the tool - the actual transfer happens via the telephony system, not through words alone.

{% hint style="info" %}
**Deferred transfer** - Call transfers are deferred until the agent's goodbye message finishes playing. The transfer is cancellable by barge-in or operator join.
{% endhint %}

## Audio Verification

When the agent needs to capture structured data (names, dates, phone numbers, insurance IDs), it can trigger audio verification - sending the caller's raw audio, alongside the real-time transcript, to an AI model for correction.

This catches STT errors on structured data that streaming transcription commonly gets wrong: proper names, alphanumeric IDs, phone numbers, and dates.

**Domain-aware**: `correction_categories` from [voice settings](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings) are injected as domain hints. This tells the correction model: *"This workspace commonly handles medication names, insurance carriers. STT frequently gets these wrong. Pay extra attention."*

### Correction Output

Corrections are structured as field-level pairs showing what STT heard vs. the corrected value:

```
name: "Micah Adeline" → "Mika Adlin" (confidence: 9)
dob: "March 15 1990" → "1990-03-15" (confidence: 8)
```

### Correction Confidence

| Level                 | Score | Agent Behavior                                            |
| --------------------- | ----- | --------------------------------------------------------- |
| **Certain**           | 8-9   | Use corrected value directly without confirming           |
| **Likely**            | 5-7   | Confirm with caller ("I have \[value], is that correct?") |
| **Uncertain**         | 1-4   | Ask caller to spell out or repeat slowly                  |
| **Both models wrong** | -     | Audio quality is poor - ask for letter-by-letter spelling |

Observer events include the original STT value, the corrected value, and the numeric confidence - enabling frontend visualization of correction accuracy.

## Safety & Monitoring

### Conversation Monitor

An embedding-based safety detection system evaluates every turn against configured safety concepts using a two-stage pipeline:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A\["Caller transcript"] --> B\["Embed transcript"]
B --> C\["Cosine similarity\nvs all concept vectors\n(matrix multiply, <1ms)"]
C --> D{"Above\nthreshold?"}
D -->|Yes| E\["AI Judge\n(structured output:\naction + reasoning)"]
D -->|No| F\["No action"]
E --> G{"Decision"}
G -->|hard\_escalate| H\["Interrupt agent +\nimmediate escalation"]
G -->|soft\_escalate| I\["Escalate after\ncurrent turn completes"]
G -->|alert| J\["Log event only"]
G -->|ignore| F
D -->|"Standalone\n≥ 0.85"| H" %}

**Standalone fallback**: If semantic similarity exceeds a high threshold (default 0.85), escalation triggers immediately without waiting for the AI judge - providing a safety net even if the judge model is unavailable.
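
Stage 1 is cheap enough to run on every turn because it is a single matrix multiply. A sketch, assuming L2-normalized embeddings (names and shapes are illustrative):

```python
import numpy as np

# Sketch of stage 1: one matrix multiply scores a transcript embedding
# against every concept vector at once (embeddings assumed L2-normalized,
# so the dot product is cosine similarity).
def screen_transcript(transcript_vec: np.ndarray,
                      concept_matrix: np.ndarray,   # shape: (n_concepts, dim)
                      threshold: float,
                      standalone: float = 0.85):
    sims = concept_matrix @ transcript_vec          # similarity per concept, <1ms
    top = int(np.argmax(sims))
    if sims[top] >= standalone:
        return "hard_escalate", top                 # safety net: skip the judge
    if sims[top] >= threshold:
        return "judge", top                         # stage 2: AI judge decides
    return "no_action", None
```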

**Default safety concepts** (always active): suicidal ideation, self harm, domestic violence, adverse drug reaction, post-discharge red flag. Custom concepts can be added via the [Safety API](https://docs.amigo.ai/developer-guide/platform-api/platform-api/safety) with pre-computed embeddings.

### Auto-Escalation

When an escalation triggers, the system:

1. Writes an escalation event to the world model (dual-entity: both call and operator entities)
2. Notifies the [operator dashboard](https://docs.amigo.ai/developer-guide/platform-api/platform-api/operators)
3. For hard escalations - immediately suspends the AI agent pending human intervention

## Observer WebSocket

Monitor active calls in real time via a cross-pod WebSocket connection:

```
WS /voice-agent/observe/{call_sid}?token={api_key}
```

Requires a valid workspace API key. Any observer instance can monitor any active call in the workspace, regardless of which pod handles the call (events are distributed via pub/sub).

**Late-join replay**: Observers connecting mid-call receive a buffered replay of recent events before transitioning to the live stream. Events carry monotonic sequence numbers for ordering.
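
A minimal observer client, sketched with the third-party `websockets` library; the `type` and `seq` field names are assumptions for illustration:

```python
import asyncio
import json
import websockets  # third-party: pip install websockets

async def observe(call_sid: str, api_key: str):
    url = f"wss://<voice-agent-host>/voice-agent/observe/{call_sid}?token={api_key}"
    async with websockets.connect(url) as ws:
        async for frame in ws:          # buffered replay first, then live events
            event = json.loads(frame)
            if event.get("type") == "ping":
                continue                # 30s keepalive, no payload
            print(event.get("seq"), event.get("type"))

# asyncio.run(observe("CA1234...", "<api_key>"))
```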

### Event Types

| Event                 | Key Data                                                                             | Source           |
| --------------------- | ------------------------------------------------------------------------------------ | ---------------- |
| `session_start`       | `call_sid`, `service_id`, `workspace_id`, `initial_state`, `trace_id`                | Session init     |
| `session_info`        | Full call snapshot (sent on observer connect)                                        | Observer connect |
| `user_transcript`     | `transcript`, `emotion_label`, `emotion_valence`                                     | Turn controller  |
| `agent_transcript`    | `transcript`, `action`, `interrupted`                                                | Speaker          |
| `state_transition`    | `previous_state`, `next_state`                                                       | Turn controller  |
| `tool_call_started`   | `tool_name`, `call_id`, `input`                                                      | Turn controller  |
| `tool_call_completed` | `tool_name`, `duration_ms`, `output` (truncated), `succeeded`, `error_message`       | Turn controller  |
| `nav_timing`          | `nav_ms`, `render_ms`, `total_ms`, `input_tokens`, `output_tokens`, `model`, `state` | Turn controller  |
| `latency`             | `e2e_ttfb_ms`, `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms`, `continuation`   | Speaker          |
| `emotion`             | `dominant`, `valence`, `arousal`                                                     | Transport        |
| `session_end`         | `call_sid`, `duration_s`, `turns`, `completion_reason`, `final_state`                | Session shutdown |
| `injected_event`      | `message`, `sender`, `event_type`                                                    | Turn controller  |
| `ping`                | (empty)                                                                              | Keepalive (30s)  |

## Session Event Injection

External systems can inject events into active voice sessions. The agent processes injected events through its response generation (without context graph navigation) and speaks a natural response. This enables real-time interaction with live calls from EHR systems, operator dashboards, or any backend service.

### Injection Paths

| Path                        | Endpoint                                               | Auth                        | Use Case                                              |
| --------------------------- | ------------------------------------------------------ | --------------------------- | ----------------------------------------------------- |
| **Voice Agent HTTP**        | `POST /voice-agent/sessions/{call_sid}/event`          | Bearer token                | Direct injection from backend services                |
| **Platform API (general)**  | `POST /v1/{workspace_id}/sessions/{call_sid}/inject`   | API key                     | Frontend or third-party injection with workspace auth |
| **Platform API (operator)** | `POST /v1/{workspace_id}/operators/{id}/send-guidance` | API key (`Operator:Update`) | Operator-scoped guidance with identity tracking       |
| **WebSocket control**       | Text frame on `/test-call` or `/direct-stream`         | Session auth                | Developer playground and testing                      |

### Event Types

| Type             | Behavior                                               | Example                                        |
| ---------------- | ------------------------------------------------------ | ---------------------------------------------- |
| `external_event` | Queues behind current speech. Cancels silence monitor. | "Appointment confirmed for 2pm tomorrow"       |
| `guidance`       | Interrupts current speech and cancels silence monitor. | "Ask for their insurance ID before confirming" |

The distinction matters: external events carry factual information that can wait for the agent to finish speaking, while guidance carries instructions that are time-sensitive and should be acted on immediately.

### Request Format

```
POST /voice-agent/sessions/{call_sid}/event
Authorization: Bearer <api_key>
```

```json
{
  "message": "The patient's insurance has been verified",
  "sender": "ehr_system",
  "event_type": "external_event"
}
```

The `event_type` field accepts `"external_event"` or `"guidance"`. The `sender` field is recorded in the call transcript for attribution.

### Response

The endpoint returns a delivery status indicating whether the event was received by the session:

```json
{
  "status": "delivered",
  "call_sid": "CA1234..."
}
```

A `status` of `"queued_no_subscriber"` indicates the event was published but no active session was listening - this can happen during a brief window when a session is initializing or if the call has already ended.
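
A caller can treat `queued_no_subscriber` as retryable for a short window. A sketch with the `requests` library (the retry policy is illustrative, not prescribed):

```python
import time
import requests  # pip install requests

# Sketch: retry briefly when injection lands during session startup.
# Endpoint and payload are as documented above; the retry loop is illustrative.
def inject_event(base_url: str, call_sid: str, api_key: str, payload: dict) -> str:
    for _ in range(3):
        resp = requests.post(
            f"{base_url}/voice-agent/sessions/{call_sid}/event",
            headers={"Authorization": f"Bearer {api_key}"},
            json=payload,
            timeout=5,
        )
        status = resp.json()["status"]
        if status != "queued_no_subscriber":
            return status               # "delivered" (or another terminal status)
        time.sleep(1.0)                 # session may still be initializing
    return "queued_no_subscriber"       # call has likely already ended
```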

### Cross-Pod Architecture

HTTP injections publish to a per-session pub/sub channel (`va:inject:{call_sid}`). The session subscribes to this channel at startup and drains pending events before each STT poll. This means injection works regardless of which server pod is handling the call.

The subscription reconnects with exponential backoff (1s to 10s cap) if the pub/sub connection is interrupted. A transient infrastructure outage never kills a voice session.
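
A sketch of the subscriber side, assuming a Redis-style pub/sub backend (client choice and handler names are illustrative, not the production implementation):

```python
import asyncio
import redis.asyncio as redis  # assumption: a Redis-style pub/sub backend

async def drain_injections(r: redis.Redis, call_sid: str, handle_event):
    """Subscribe to the per-session channel; reconnect with capped backoff."""
    backoff = 1.0
    while True:
        try:
            pubsub = r.pubsub()
            await pubsub.subscribe(f"va:inject:{call_sid}")
            backoff = 1.0  # healthy connection: reset backoff
            async for msg in pubsub.listen():
                if msg["type"] == "message":
                    await handle_event(msg["data"])
        except Exception:
            await asyncio.sleep(backoff)        # transient outage: retry,
            backoff = min(backoff * 2, 10.0)    # never kill the session
```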

### Active Sessions

List currently active sessions via the platform API:

```
GET /v1/{workspace_id}/sessions/active
Authorization: Bearer <api_key>
```

Returns a real-time list of active sessions with call metadata. This endpoint proxies to the voice agent's distributed active call registry.

### WebSocket Control Channel

Test calls and direct streams accept text-frame control messages for injection and session control:

```jsonc
// Inject an external event
{"type": "inject_event", "message": "...", "sender": "..."}

// Inject operator guidance (interrupts speech)
{"type": "inject_guidance", "message": "..."}

// Force context refresh (reloads patient data)
{"type": "refresh_context"}

// Stop the session
{"type": "stop"}
```

### Test-Call Scenarios

The `/test-call` WebSocket endpoint supports scenario-based testing:

| Parameter                 | Default      | Description                                                                                |
| ------------------------- | ------------ | ------------------------------------------------------------------------------------------ |
| `scenario`                | `inbound`    | `inbound` (agent greets first), `outbound` (task context greeting), `silent` (no greeting) |
| `caller_id`               | `playground` | Simulated caller phone number                                                              |
| `outbound_task_entity_id` | -            | Entity ID for outbound task context (required for `outbound` scenario)                     |
| `system_prompt`           | -            | Freeform prompt override (takes precedence over scenario-derived prompts)                  |

```
WS /voice-agent/test-call?token={api_key}&scenario=outbound&outbound_task_entity_id=123&caller_id=+15551234567
```

## Call Record & Persistence

Every call produces a detailed record persisted to the database:

* **Turns** - Each turn carries a layered timing model (all fields in milliseconds):
  * **Layer 1 (STT)**: `user_speech_start_ms`, `user_speech_end_ms` - speech boundaries
  * **Layer 2 (Engine)**: `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms` - processing latency breakdown
  * **Layer 4 (TTS/Transport)**: `agent_speech_start_ms`, `agent_speech_end_ms` - when agent audio played
* **Tool calls** - Name, input, output, duration, success/failure
* **State transitions** - Full context graph navigation history
* **Emotional summary** - See [below](#emotional-summary)
* **Escalation history** - Full escalation lifecycle if operator joined
* **Config snapshot** - Version set, agent version, context graph version used

## Calls API

### Active Calls

```
GET /voice-agent/calls/active
Authorization: Bearer <api_key>
```

Lists all currently active calls across the workspace. Active call state is maintained in a distributed registry - any API pod can serve this request regardless of which pod handles the call.

### Call History

```
GET /voice-agent/calls?limit=20&continuation_token=0
Authorization: Bearer <api_key>
```
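
A pagination sketch with the `requests` library; the response field names (`calls`, `continuation_token`) are assumptions for illustration, not a documented schema:

```python
import requests

# Pagination sketch; response field names are assumptions, not a documented schema.
def iter_calls(base_url: str, api_key: str, limit: int = 20):
    headers = {"Authorization": f"Bearer {api_key}"}
    token = 0
    while token is not None:
        resp = requests.get(
            f"{base_url}/voice-agent/calls",
            params={"limit": limit, "continuation_token": token},
            headers=headers,
            timeout=10,
        )
        body = resp.json()
        yield from body["calls"]
        token = body.get("continuation_token")  # assumed None when exhausted
```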

### Call Detail

```
GET /voice-agent/calls/{call_id}
Authorization: Bearer <api_key>
```

Full call record including turns with timing model, tool calls, state transitions, emotional summary, escalation history, safety state, and config snapshot.

### Recordings

| Endpoint                                   | Description                                                              |
| ------------------------------------------ | ------------------------------------------------------------------------ |
| `GET /calls/{call_id}/recording/stereo`    | Stereo WAV (caller left channel, agent right channel)                    |
| `GET /calls/{call_id}/recording/waveform`  | Amplitude envelope for timeline visualization                            |
| `GET /calls/{call_id}/recording/{channel}` | Single channel WAV (`caller` or `agent`)                                 |
| `POST /calls/{call_id}/verify-transcript`  | Re-transcribe with high-accuracy batch model for ground-truth timestamps |

### Outbound Calls

```
POST /voice-agent/create_outbound_call
Authorization: Bearer <api_key>
```

### Outbound Text Sessions (SMS)

```
POST /voice-agent/create_outbound_text
Authorization: Bearer <api_key>
```

Creates an SMS-based conversation using the same context graph engine as voice calls. The text session sends a greeting, then conducts a multi-turn conversation over SMS.

| Parameter         | Type           | Required | Description                                          |
| ----------------- | -------------- | -------- | ---------------------------------------------------- |
| `phone_to`        | string (E.164) | Yes      | Patient phone number                                 |
| `phone_from`      | string (E.164) | Yes      | Agent phone number (must be configured in workspace) |
| `workspace_id`    | string         | Yes      | Workspace ID                                         |
| `service_id`      | string         | Yes      | Service (agent) to run                               |
| `entity_id`       | string         | No       | World model entity ID for patient context            |
| `surface_id`      | string         | No       | Surface ID to deliver inline in the conversation     |
| `idempotency_key` | string         | No       | Client-provided dedup key (cached 5 minutes)         |
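
A minimal request built from the parameter table above, sketched with Python's `requests` library (host and IDs are placeholders):

```python
import requests

# Minimal outbound-text request; field names come from the parameter table above.
resp = requests.post(
    "https://<voice-agent-host>/voice-agent/create_outbound_text",
    headers={"Authorization": "Bearer <api_key>"},
    json={
        "phone_to": "+15551234567",
        "phone_from": "+15557654321",
        "workspace_id": "<workspace_id>",
        "service_id": "<service_id>",
        "idempotency_key": "reminder-2024-06-01",  # optional: deduped for 5 minutes
    },
    timeout=10,
)
print(resp.json())  # session_id, status ("created" or "already_active"), conversation_id
```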

Returns `session_id`, `status` (`created` or `already_active`), and `conversation_id`. Rate limited to 20 per workspace per minute.

**Consent enforcement**: Returns `403 Forbidden` if the patient has opted out of SMS. Opt-out is tracked when patients text STOP, UNSUBSCRIBE, CANCEL, END, or QUIT to the agent's number.

**Inbound SMS**: When a patient texts the agent's phone number, a text session is automatically created if one is not already active for that phone pair.

## Emotional Summary

At call end, the system persists a complete emotional record available in the call detail response:

```json
{
  "dominant_emotion": "Anxiety",
  "average_valence": -0.312,
  "average_arousal": 0.654,
  "peak_negative_valence": -0.587,
  "peak_negative_emotion": "Fear",
  "emotional_shifts": 3,
  "final_trend": "improving",
  "segment_count": 42,
  "barge_in_count": 2,
  "short_response_streak": 0,
  "silence_gap_count": 1,
  "coherence": 0.72,
  "language_sentiment": 0.45,
  "burst_types": {"Sigh": 2, "Hmm": 3}
}
```

## Roadmap: Toward Deeper Empathy

The emotional intelligence system is actively evolving. These are areas where we're investing to push beyond current capabilities:

| Area                              | Where We Are Today                                                                                                             | Where We're Heading                                                                                                                     |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| **Prosodic rhythm**               | Text-level rhythm guidance (shorter sentences for urgency, gentle transitions when rushing)                                    | Audio-level prosodic planning - breath-like pauses, per-word speed variation, rhythm that matches the emotional weight of each sentence |
| **Emotional response time**       | Emotion applied on the next turn after detection (\~2-4s). Burst detection (laughs, sighs) provides faster sub-segment signals | Sub-second emotional adaptation - responding to a voice crack within the same conversational beat                                       |
| **Emotional memory across calls** | Each call persists a full emotional summary. Patient context injected from world model                                         | Cross-call emotional profiles - "this patient was anxious about test results last call" surfaced proactively in future calls            |
| **Mixed-emotion voice**           | Single emotion label per generation; text structure conveys nuance                                                             | Emotion blending - "warm concern with a hint of encouragement" expressed in a single sentence through TTS-level control                 |

## API Reference

* [Calls](https://docs.amigo.ai/api-reference/readme/platform/calls)
* [Recordings](https://docs.amigo.ai/api-reference/readme/platform/recordings)
