# Voice Agent

The Amigo voice agent powers real-time, emotionally intelligent voice conversations. It handles inbound and outbound phone calls, executing context graph logic (based on a Hierarchical State Machine architecture) with speech understanding, text-to-speech, tool execution, safety monitoring, and continuous emotional adaptation. Every call connects to the [world model](/developer-guide/platform-api/platform-api/data-world-model.md), reads live patient context, writes clinical events with multi-stage verification, and adapts its behavior based on real-time vocal emotion analysis.

{% hint style="warning" %}
**Reliability target.** This system handles healthcare scheduling calls where callers may be in distress, pain, or crisis. Every design decision prioritizes graceful degradation: if any intelligence layer fails, the call continues with the next-best behavior, never silence.
{% endhint %}

{% hint style="info" %}
**Voice settings and Classic API differences.** Voice settings (tone, speed, keyterms, sensitive topics, post-call flags) are configured at the workspace level; see [Workspaces, Voice Settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings). Classic API offers [WebSocket voice streaming](/developer-guide/classic-api/core-api/conversations/conversations-voice.md) for text-based apps; Platform API voice is phone-based with emotion detection, EHR context, and operator escalation.
{% endhint %}

## Audio Pipeline Architecture

Every voice call flows through a five-layer pipeline that transforms the caller's audio into emotionally adaptive agent speech, while simultaneously reading from and writing to the world model.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    subgraph Input["1. Signal Capture"]
        A["Caller Audio\n(telephony stream)"]
        A --> B["Speech-to-Text\n(streaming, sub-300ms)"]
        A --> C["Emotion Detection\n(prosody + burst + language)"]
    end

    subgraph Intel["2. Intelligence Layer"]
        D["Emotional State\n(rolling 30s window,\nrecency-weighted)"]
        E["Voice Context\n(per-turn, pure function)"]
        C --> D --> E
    end

    subgraph Engine["3. Context Graph Engine"]
        F["Navigator\n(select action + filler)"]
        G["Engage LLM\n(generate response)"]
        B --> F --> G
        E -->|"emotional steering\n+ filler guidelines"| F
        E -->|"emotional steering\n+ micro-behaviors"| G
    end

    subgraph Output["4. Audio Output"]
        H["Text-to-Speech\n(emotion-adaptive,\nword-level timestamps)"]
        G --> H
        E -->|"emotion + speed\n+ volume"| H
    end

    subgraph Post["5. Post-Call Intelligence"]
        I["Transcript Verification\n(batch re-transcription)"]
        J["Quality Analysis\n(5-dimension scoring)"]
        JJ["STT Keyword Feedback\n(self-improving loop)"]
        J --> JJ
    end

    subgraph World["World Model"]
        K["Patient Context\n(ambient injection)"]
        L["Verified Clinical Events\n(confidence-gated)"]
    end

    K -.->|"ambient context\n(3 channels)"| G
    G -.->|"tool results\n(confidence 0.3 → review → 0.7)"| L
```

### Layer 1: Signal Capture

Two parallel streams process the caller's audio simultaneously, and neither blocks the other. This dual-stream architecture is fundamental: speech recognition and emotion detection are completely independent. A failure in one never impacts the other.

**Speech-to-Text**: real-time streaming transcription with sub-300ms latency. Three layers of domain vocabulary boost recognition accuracy:

1. **Service-level keyterms**: managed by workspace administrators, applied to all calls for that service.
2. **Workspace voice settings keyterms**: API-configurable per workspace (see [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings)).
3. **System defaults**: engineering-level fallback vocabulary.

All sources are merged and deduplicated per call. Configurable end-of-turn detection with tunable confidence thresholds determines when the caller has finished speaking, balancing responsiveness against cutting off mid-sentence.
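The merge-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name, the case-insensitive matching, the source precedence, and the sample terms are all assumptions.

```python
def merge_keyterms(service_terms, workspace_terms, system_defaults):
    """Merge the three keyterm sources into one deduplicated list."""
    merged = []
    seen = set()
    # Assumption: earlier sources win when the same term appears twice.
    for source in (service_terms, workspace_terms, system_defaults):
        for term in source:
            cleaned = term.strip()
            key = cleaned.casefold()  # assumption: duplicates match case-insensitively
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


# Illustrative terms only.
keyterms = merge_keyterms(
    ["metformin", "Dr. Alvarez"],
    ["Metformin", "telehealth"],
    ["appointment", "telehealth"],
)
```

Preserving insertion order keeps the result deterministic, which makes per-call keyterm lists easier to compare when debugging recognition issues.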

**Emotion Detection**: parallel audio analysis with zero impact on the voice pipeline. Three models run concurrently; the first two analyze every audio segment, and the third analyzes final STT transcripts:

| Model           | Input                   | Output                                                          | Unique Capabilities                                                                                      |
| --------------- | ----------------------- | --------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| **Prosody**     | 2-second audio segments | 48 emotions from vocal tone, pitch, rhythm, timbre              | Real-time voice quality analysis                                                                         |
| **Vocal Burst** | Same audio segments     | 67 non-speech vocal types (laughs, sighs, cries, gasps, groans) | Captures sounds that transcription loses entirely                                                        |
| **Language**    | Final STT transcripts   | 53 emotions + 9-point sentiment + 6-category toxicity           | Detects sarcasm, tiredness, annoyance, disapproval, enthusiasm (5 emotions unavailable from audio alone) |

**Dual-payload multiplexing**: audio segments request prosody and burst models; text transcripts request the language model. Responses are unambiguous: each contains only the models requested. This separation is architecturally important: the language model requires text input, not audio, and runs on STT output rather than raw audio, ensuring it analyzes what the caller *said*, not just how they *sounded*.

**Audio buffering**: 2-second segments with non-blocking queues (max 5 segments for audio, max 20 for text). The emotion pipeline never blocks the voice pipeline, and dropped segments gracefully degrade to slightly less precise emotion detection rather than failure.
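A bounded, non-blocking queue of this kind might look like the sketch below, which uses Python's standard `queue` module. The `offer` helper is hypothetical, and dropping the newest segment when the queue is full is an assumption; the text does not say which segment is discarded.

```python
import queue

AUDIO_QUEUE_MAX = 5   # from the text: max 5 buffered audio segments
TEXT_QUEUE_MAX = 20   # from the text: max 20 buffered transcripts


def offer(q, item):
    """Enqueue without blocking; return False if the item was dropped."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Assumption: the newest item is dropped when the queue is full.
        return False


audio_queue = queue.Queue(maxsize=AUDIO_QUEUE_MAX)
results = [offer(audio_queue, f"segment-{i}") for i in range(7)]
```

Because `put_nowait` never waits for a consumer, the producer (the voice path) returns immediately whether or not the emotion pipeline is keeping up.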

**Circuit breaker protection**: emotion detection is protected by a circuit breaker (2 failures trigger a 10-second recovery period). If the emotion pipeline is degraded, calls continue smoothly with workspace defaults; the circuit breaker prevents cascading latency from affecting the critical voice path.
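The breaker behavior can be sketched as a small class. Only the threshold (2 failures) and the recovery period (10 seconds) come from the text; the class shape, the method names, counting consecutive failures, and closing the breaker fully after recovery are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then closes after a recovery period."""

    def __init__(self, failure_threshold=2, recovery_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self._clock = clock          # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """Return True if a call to the protected service may proceed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.recovery_seconds:
            # Recovery period elapsed: close the breaker and reset the count.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

While `allow()` returns `False`, a caller would skip emotion detection entirely and use workspace defaults, so a slow upstream never adds latency to the voice path.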

### Layer 2: Intelligence

The **Emotional State** maintains a rolling 30-second window (\~15 segments) of recent caller signals with **recency-weighted linear averaging**, so the most recent signals have the highest influence. The agent responds to the caller's *current* emotional state, not an average of the whole call.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    subgraph Inputs["Signal Sources"]
        P["Prosody\n(48 emotions)"]
        B["Bursts\n(67 vocal types)"]
        L["Language\n(53 emotions +\nsentiment + toxicity)"]
        BH["Behavioral\n(barge-ins, silences,\nshort responses)"]
    end

    subgraph State["Emotional State (Rolling 30s Window)"]
        V["Valence: -1.0 to +1.0"]
        AR["Arousal: 0.0 to 1.0"]
        DOM["Dominant emotion + score"]
        TR["Trend: improving/stable/deteriorating"]
        COH["Coherence: prosody vs language"]
        BUR["Burst events (last 5s)"]
        PH["Call phase: early/mid/late"]
    end

    subgraph Output["Per-Turn Voice Context"]
        TTS["TTS: emotion + speed + volume"]
        FG["Filler: guidelines + suppression"]
        ES["Emotional steering → system prompt"]
        FE["Filler enabled/disabled"]
    end

    P --> State
    B --> State
    L --> State
    BH --> State
    State --> Output
```

**Valence/Arousal computation**: every emotion maps to a (valence, arousal) coordinate via a complete emotion-dimension mapping. Weighted sums across all detected emotions per segment, then recency-weighted across the rolling window, produce stable yet responsive emotional tracking.
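As a rough sketch of the recency weighting across the window, the function below uses linear weights in which the oldest segment has weight 1 and the newest has weight n. The exact weighting scheme is not given in the text, so the weights, the function name, and the sample values are assumptions.

```python
def recency_weighted_average(values):
    """Linearly weighted mean: oldest value has weight 1, newest has weight n."""
    if not values:
        return 0.0  # assumption: an empty window is treated as neutral
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Per-segment valence, oldest first (illustrative values).
window = [0.4, 0.4, -0.6]
weighted = recency_weighted_average(window)
plain = sum(window) / len(window)
```

In this example the plain mean is still slightly positive, while the weighted mean is negative because the most recent segment dominates. That is the behavior the text describes: tracking the caller's current state rather than a whole-call average.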

**Trend detection**: compares first-half vs second-half valence of the rolling window. A delta > 0.1 → **improving**; delta < -0.1 → **deteriorating**; otherwise **stable**. This powers the call-phase escalation system: deteriorating trends trigger increasingly urgent adaptation.
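The half-window comparison can be sketched as follows. The ±0.1 thresholds come from the text; using a plain mean for each half, assigning the middle element of an odd-length window to the second half, and returning `stable` for very short windows are assumptions.

```python
def detect_trend(valences, threshold=0.1):
    """Classify a window of valences (oldest first) as improving, stable, or deteriorating."""
    if len(valences) < 2:
        return "stable"  # assumption: not enough data to compare halves
    mid = len(valences) // 2
    first_half, second_half = valences[:mid], valences[mid:]
    delta = sum(second_half) / len(second_half) - sum(first_half) / len(first_half)
    if delta > threshold:
        return "improving"
    if delta < -threshold:
        return "deteriorating"
    return "stable"
```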

**Coherence (prosody vs language agreement)**: measures whether what the caller *says* matches how they *sound*:

| Condition                                                | Coherence      | Agent Response                          |
| -------------------------------------------------------- | -------------- | --------------------------------------- |
| Same valence sign (both positive or both negative)       | High (0.7-1.0) | Normal adaptation                       |
| Opposite valence (words say fine, voice says distressed) | Low (0.0-0.3)  | **Trust the vocal tone over the words** |
| One signal neutral                                       | Mild (0.8)     | Use the available signal                |

**When coherence < 0.4**: the caller may be masking their true state. The agent's emotional steering instruction shifts: *"Respond to how they sound, not what they claim."* This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy.
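One way to picture the coherence rule is the sketch below. Only the three bands in the table and the 0.4 steering threshold come from the text. The neutral band, the interpolation inside each range, the treatment of two neutral signals, and both function names are assumptions chosen so the outputs land in the documented ranges.

```python
NEUTRAL_BAND = 0.05  # assumption: |valence| below this counts as neutral


def coherence(prosody_valence, language_valence):
    """Score agreement between vocal tone and word content (0.0 to 1.0)."""
    if abs(prosody_valence) < NEUTRAL_BAND or abs(language_valence) < NEUTRAL_BAND:
        return 0.8  # "one signal neutral" row; two neutral signals are treated the same
    gap = abs(prosody_valence - language_valence)
    if (prosody_valence > 0) == (language_valence > 0):
        return 1.0 - 0.3 * min(gap, 1.0)       # same sign: 0.7 to 1.0
    return 0.3 * (1.0 - min(gap, 2.0) / 2.0)   # opposite sign: 0.0 to 0.3


def steering_hint(score):
    """Return the extra steering instruction when coherence is low."""
    if score < 0.4:
        return "Respond to how they sound, not what they claim."
    return None
```

For example, a caller whose words score positive but whose voice scores negative produces a low coherence value, which adds the steering instruction to the system prompt.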

**Behavioral signal tracking**: updated in real time from the session and turn controller:

| Signal                  | Detection                       | Threshold | Meaning                              |
| ----------------------- | ------------------------------- | --------- | ------------------------------------ |
| `barge_in_count`        | Caller interrupts agent speech  | ≥ 2       | Frustration (agent talking too much) |
| `short_response_streak` | Consecutive responses ≤ 4 words | ≥ 3       | Disengagement (caller withdrawing)   |
| `silence_gap_count`     | Gaps ≥ 5 seconds                | ≥ 2       | Confusion, hesitation, or distress   |

{% hint style="info" %}
**Semantic barge-in detection.** Barge-in detection uses semantic confirmation: it requires actual recognized words from the STT engine (not just voice activity detection). This filters false triggers from coughs, breathing, echo, and background noise. Minimum speech duration is 0.5 seconds with recognized words, with a 1.0-second fallback for delayed word recognition.
{% endhint %}

These behavioral signals are injected into the system prompt alongside emotional steering, so the LLM receives a complete picture of both *how the caller sounds* and *how they're behaving*.
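The thresholds in the table above can be expressed as a small tracker. The counter names match the table; the class, the method names, and the flag labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BehavioralSignals:
    barge_in_count: int = 0
    short_response_streak: int = 0
    silence_gap_count: int = 0

    def record_caller_response(self, text):
        """Track consecutive caller responses of 4 words or fewer."""
        if len(text.split()) <= 4:
            self.short_response_streak += 1
        else:
            self.short_response_streak = 0

    def active_flags(self):
        """Return the interpretations whose thresholds are currently met."""
        flags = []
        if self.barge_in_count >= 2:
            flags.append("frustration")
        if self.short_response_streak >= 3:
            flags.append("disengagement")
        if self.silence_gap_count >= 2:
            flags.append("confusion_or_distress")
        return flags
```

A steering builder could then append one line per active flag to the system prompt alongside the emotional steering text.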

### Layer 3: Context Graph Engine

Each turn processes through a **two-stage LLM pipeline**:

1. **Navigator**: selects the next action or exit condition from the current context graph state. Also generates a filler phrase to cover processing latency and determines whether to trigger [audio verification](#audio-verification) for structured data capture. Uses structured output validation with automatic retry (up to 3 attempts) and fallback to first valid action.
2. **Engage LLM**: generates the caller-facing response, informed by the selected action, full conversation history with per-message emotion annotations (`[VOICE: EmotionName, valence=V.VVV]`), audio correction results, emotional steering context, ambient patient context, and available tools.
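The Navigator's validate-retry-fallback behavior can be sketched as below. The three-attempt limit and the fallback come from the text; the `propose` callable, the use of `ValueError` for malformed output, and the reading of "first valid action" as the first entry in the list of allowed actions are assumptions.

```python
def select_action(propose, valid_actions, max_attempts=3):
    """Ask the model for an action, retrying on invalid output, then fall back."""
    for _ in range(max_attempts):
        try:
            candidate = propose()  # hypothetical: wraps the structured LLM call
        except ValueError:
            continue  # assumption: malformed output surfaces as ValueError
        if candidate in valid_actions:
            return candidate
    return valid_actions[0]  # fallback: first valid action
```

The fallback guarantees the turn always produces an action, which matches the design goal that a failing intelligence layer never results in silence.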

**Emotion reaches the LLM via two independent paths:**

| Path                        | Scope              | What It Contains                                                                                                                |
| --------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------- |
| **Per-message annotations** | Every user message | Inline `[VOICE: Anxiety, valence=-0.312]`, so the LLM sees the emotional trajectory across the full conversation                |
| **Session-level steering**  | System prompt      | Dominant emotion + trend, quadrant-specific adaptation instructions, behavioral signals, call-phase urgency, coherence warnings |
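The per-message annotation format lends itself to a small formatter and parser. The tag layout follows the `[VOICE: EmotionName, valence=V.VVV]` pattern shown above; appending the tag after the message text, and both helper names, are assumptions.

```python
import re

VOICE_TAG = re.compile(
    r"\[VOICE: (?P<emotion>[^,\]]+), valence=(?P<valence>-?\d+\.\d{3})\]"
)


def annotate(message, emotion, valence):
    """Append a voice annotation with valence fixed to three decimals."""
    # Assumption: the tag is appended after the message text.
    return f"{message} [VOICE: {emotion}, valence={valence:.3f}]"


def parse_annotation(text):
    """Return (emotion, valence) from an annotated message, or None if absent."""
    match = VOICE_TAG.search(text)
    if match is None:
        return None
    return match.group("emotion"), float(match.group("valence"))
```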

**Communication micro-behaviors**: the engage template contains hardcoded guidelines that instruct the LLM on micro-level conversational behaviors that are always active, not gated by emotion:

| Behavior                    | Description                                                                              |
| --------------------------- | ---------------------------------------------------------------------------------------- |
| **Speech rhythm mirroring** | If the caller speaks in short bursts, respond concisely; if conversational, match warmth |
| **Emotional name usage**    | Use the caller's name at moments of emotional significance, not mechanically             |
| **Pause injection**         | When delivering difficult information, pause naturally before the key detail             |
| **Pace inversion**          | When the caller is rushing, slow the pace with longer sentences and gentle transitions   |
| **Completion inference**    | When the caller trails off mid-sentence, acknowledge what they were trying to say        |
| **Emotion concealment**     | Never explicitly mention that the system can detect emotions                             |
| **Natural laughter**        | Contextual laughter available for naturally warm moments, used sparingly                 |

### Layer 4: Audio Output

The engage LLM's text streams to the TTS engine for speech synthesis with **per-turn dynamic controls**:

* **Emotion**: derived from the [voice tone priority chain](#voice-tone-priority-chain).
* **Speed**: from workspace [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings).
* **Volume**: from workspace voice settings.

**Word-level timestamps** are collected for every generated word (start time and end time), enabling transcript-to-audio scrubbing in the call playback UI. This is critical for the review queue workflow where operators need to jump to specific moments in a call.

### Layer 5: Post-Call Intelligence

Two optional analyses run after each call when enabled in [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings):

**Transcript verification**: re-transcribes the full call audio with a high-accuracy batch model and computes Word Error Rate (WER) against the real-time transcript. Produces `verified_transcript`, `verified_words`, and `transcript_accuracy`, enabling quality comparisons between the real-time and batch transcription.
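Word Error Rate is conventionally the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The sketch below assumes the batch transcript is the reference and the real-time transcript is the hypothesis, with lowercasing and whitespace tokenization as the only normalization; how `transcript_accuracy` is derived from WER is not covered here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts: one substitution plus one insertion over 7 reference words.
wer = word_error_rate(
    "i need to refill my metformin prescription",
    "i need to refill my met forming prescription",
)
```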

**Quality analysis**: listens to the full stereo recording (caller and agent) and scores on 5 dimensions (1-5 each):

| Dimension                | What It Measures                              |
| ------------------------ | --------------------------------------------- |
| **Task Completion**      | Did the agent achieve the caller's goal?      |
| **Information Accuracy** | Was the information provided correct?         |
| **Conversation Flow**    | Was the conversation natural and smooth?      |
| **Error Recovery**       | How well did the agent recover from mistakes? |
| **Caller Experience**    | How did the caller feel at the end?           |

#### Call Intelligence Persistence

Alongside the LLM-based quality analysis, the voice agent computes a structured intelligence summary from in-memory session state at call end. This runs synchronously during session cleanup (before the session is torn down) and captures operational telemetry that the async quality analysis cannot see.

Each call intelligence record contains:

| Field                  | Type   | Description                                                                                            |
| ---------------------- | ------ | ------------------------------------------------------------------------------------------------------ |
| `quality_score`        | float  | Rule-based composite score (0-100), penalty-based                                                      |
| `emotion_summary`      | object | `dominant_emotion`, `average_valence`, arousal, peak negative, shifts, final trend                     |
| `risk_summary`         | object | Composite risk score, level, contributing signals with weights                                         |
| `latency_summary`      | object | Engine response time (avg/p50/p95), audio TTFB (avg/p50/p95), silence ratio                            |
| `conversation_summary` | object | `turn_count`, `states_visited_count`, `unique_states`, `loop_count`, barge-in count, completion reason |
| `tool_summary`         | object | Total calls, success/failure counts, failure rate, per-tool breakdown                                  |
| `safety_summary`       | object | `match_count` (safety rule matches), `actions` taken                                                   |
| `operator_summary`     | object | `escalated` (boolean), operator connect time, resolution                                               |
| `completion_reason`    | string | Why the call ended (hangup, terminal state, silence, etc.)                                             |
| `final_state`          | string | Last context graph state at call end                                                                   |

**Quality score penalties:**

| Signal        | Threshold               | Penalty    |
| ------------- | ----------------------- | ---------- |
| High latency  | p95 audio TTFB > 1000ms | -5 to -15  |
| Silence       | Silence ratio > 0.2     | -10 to -20 |
| Barge-ins     | > 2                     | -5 to -15  |
| Agent loops   | > 0 revisited states    | -10 to -20 |
| Escalation    | Any                     | -10        |
| Tool failures | Failure rate > 5%       | -5 to -15  |

The computation is pure (no I/O, no external calls); all data comes from in-memory session state. If the write fails, the error is logged but does not affect the caller or post-call processing.
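A penalty-based score of this shape might be computed as in the sketch below. The thresholds come from the table above, but the table gives penalty ranges without saying how a value within each range is chosen, so this sketch simply applies the minimum penalty for each triggered signal. The function name and parameter names are assumptions.

```python
def quality_score(p95_audio_ttfb_ms, silence_ratio, barge_in_count,
                  revisited_states, escalated, tool_failure_rate):
    """Start at 100 and subtract the minimum penalty for each triggered signal."""
    score = 100
    if p95_audio_ttfb_ms > 1000:
        score -= 5    # high latency: -5 to -15
    if silence_ratio > 0.2:
        score -= 10   # silence: -10 to -20
    if barge_in_count > 2:
        score -= 5    # barge-ins: -5 to -15
    if revisited_states > 0:
        score -= 10   # agent loops: -10 to -20
    if escalated:
        score -= 10   # escalation: fixed -10
    if tool_failure_rate > 0.05:
        score -= 5    # tool failures: -5 to -15
    return float(max(score, 0))
```

Because the function takes only plain values and performs no I/O, it mirrors the pure, in-memory computation described above.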

#### Call Intelligence Endpoints

Two endpoints expose intelligence data for completed and active calls:

**`GET /calls/{call_id}/intelligence`**. Full intelligence profile for a completed call.

Joins persisted call intelligence summaries with per-turn data reconstructed from conversation history:

| Response Field                     | Source                  | Description                                                                               |
| ---------------------------------- | ----------------------- | ----------------------------------------------------------------------------------------- |
| `quality_score`                    | Persisted summary       | Composite 0-100 score                                                                     |
| `emotion_trajectory`               | Per-turn reconstruction | `EmotionTurnPoint[]`: turn number, timestamp, emotion, valence                            |
| `risk_timeline`                    | Per-turn reconstruction | `RiskTurnPoint[]`: turn number, timestamp, risk score, state                              |
| `latency_profile`                  | Both                    | `LatencyProfile`: per-turn waterfall (engine/nav/render/audio TTFB ms) plus summary stats |
| `tool_performance`                 | Per-turn reconstruction | `ToolPerformanceItem[]`: per-tool invocations, success/fail, avg ms                       |
| `conversation_quality`             | Per-turn reconstruction | Loop events (turn + state) and barge-in events (turn + interrupted text)                  |
| `*_summary` fields                 | Persisted summary       | Full summaries (emotion, risk, latency, conversation, tool, safety, operator)             |
| `completion_reason`, `final_state` | Persisted summary       | Why the call ended and the last context graph state                                       |

Returns 404 if the call or intelligence data is not found.

**`GET /calls/active/intelligence`**. Active calls with live intelligence overlay.

Enriches the active call listing with per-turn intelligence from cached snapshots:

| Response Field       | Source     | Description                           |
| -------------------- | ---------- | ------------------------------------- |
| `current_emotion`    | Live cache | Current detected emotion              |
| `current_valence`    | Live cache | Current emotional valence             |
| `current_risk_score` | Live cache | Current composite risk score          |
| `risk_trend`         | Live cache | `rising`, `stable`, or `falling`      |
| `turn_count`         | Live cache | Number of turns completed             |
| `escalation_active`  | Live cache | Whether operator escalation is active |
| `current_state`      | Live cache | Current context graph state           |

Supports `workspace_id` query parameter for filtering.

#### Live Intelligence Pipeline

The voice agent writes a compact intelligence snapshot after each caller speech turn. The snapshot includes current emotion, risk score, turn count, escalation status, and current state.

Intelligence data is refreshed alongside the active call heartbeat. If the session ends or is lost, the live data expires automatically.

The active intelligence endpoint reads all live intelligence for active calls in a single operation for efficient dashboard polling.

**Self-improving feedback loop**: quality analysis also produces `stt_suggestions`, words the STT misheard, formatted as recognition keywords for future calls. This creates a closed loop:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    A["Quality analysis\nfinds STT errors"] --> B["Suggests keywords"]
    B --> C["Keywords added to\nworkspace voice settings"]
    C --> D["Future calls\nbetter recognition"]
    D -->|"Next call"| A
```

## How Calls Work

Every call runs inside a **conference architecture**, a multi-party audio bridge that lets the caller, AI agent, and optionally a human [operator](/developer-guide/platform-api/platform-api/operators.md) all participate simultaneously.

### Inbound Call Flow (Instant Greeting)

The system eliminates dead air at call start through **parallel pre-warming**: the engine, greeting, and agent connection all initialize while the phone is still ringing.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
    participant Caller
    participant Tel as Telephony
    participant Agent as Voice Agent

    Caller->>Tel: Dials phone number
    Tel->>Agent: Webhook (T=0)
    Note over Agent: Resolve: phone → workspace → service → version set

    par Pre-warm during ring time (parallel)
        Agent->>Agent: Initialize engine + load context graph + load tools
        Agent->>Agent: Generate greeting text via LLM
        Agent->>Agent: Resolve caller → patient context from world model
        Agent->>Tel: Create agent conference leg (via conference name)
        Tel->>Agent: Agent WebSocket connects
        Note over Agent: Agent fully ready - greeting cached
    end

    Note over Tel: Phone rings...

    Caller->>Tel: Picks up (T=Xs)
    Tel->>Agent: Caller joins conference
    Agent-->>Caller: ✅ Instant greeting (~200-300ms)

    loop Conversation Turns
        Caller->>Agent: Speech audio (bidirectional stream)
        par Signal Processing
            Agent->>Agent: STT (transcript + end-of-turn)
            Agent->>Agent: Emotion (prosody + burst + language)
        end
        Agent->>Agent: Navigator → Filler → Engage LLM → TTS
        Agent-->>Caller: Emotionally adaptive audio response
    end

    Note over Agent: Call ends (terminal state, silence, or hangup)
    Note over Agent: Persist call record + emotional summary
    Note over Agent: Post-call analysis (background)
```

**Key insight**: the telephony conference API accepts friendly names, not just IDs. The conference name is known at webhook time. The agent leg is created immediately; the conference is created on-demand when the agent joins. The agent can be fully connected and waiting *before the caller even picks up*.

**Timeline comparison:**

| Phase                  | Without Pre-warm      | With Pre-warm                        |
| ---------------------- | --------------------- | ------------------------------------ |
| Webhook → Engine ready | After pickup (+1-3s)  | During ring (hidden)                 |
| Agent leg creation     | After pickup (+200ms) | During ring (hidden)                 |
| WebSocket connection   | After pickup (+200ms) | During ring (hidden)                 |
| Greeting generation    | After pickup (+500ms) | During ring (hidden)                 |
| **Total dead air**     | **\~1200ms**          | **\~200-300ms** (TTS streaming only) |

**Safety guarantees:**

* Caller hangs up during ring → cache entry expires (30s TTL), resources cleaned up lazily.
* WebSocket lands on different pod → cache miss, standard initialization (no degradation).
* Pre-warm exceeds timeout → TwiML returned anyway, standard initialization on pickup.
* Session capacity is NOT consumed during pre-warm (no active session yet).

{% hint style="info" %}
**Pre-warm** is best-effort. If initialization takes longer than expected, the system falls back to standard initialization: no degradation in call quality, just a slightly longer time to first greeting.
{% endhint %}
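The best-effort, parallel nature of pre-warming can be sketched with `asyncio`. Everything here is illustrative: the step names, delays, and timeout values are assumptions, and `asyncio.sleep` stands in for real initialization work.

```python
import asyncio


async def prewarm(steps, timeout_s):
    """Run pre-warm coroutines in parallel; return None if they exceed the timeout."""
    try:
        return await asyncio.wait_for(asyncio.gather(*steps), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # caller falls back to standard initialization on pickup


async def step(name, delay_s):
    await asyncio.sleep(delay_s)  # stand-in for real initialization work
    return name


async def demo():
    fast = await prewarm([step("engine", 0.01), step("greeting", 0.02)], timeout_s=1.0)
    slow = await prewarm([step("engine", 0.5)], timeout_s=0.05)
    return fast, slow


fast_result, slow_result = asyncio.run(demo())
```

A `None` result models the cache-miss path: the webhook response is still returned on time, and initialization simply happens after pickup.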

### Outbound Call Flow

Outbound calls are **world-model-native**: they are scheduled as `outbound_task` entities (for example, via the `schedule_outbound_call` tool during an inbound call), then dispatched by the [connector runner](/developer-guide/platform-api/platform-api/connector-runner.md) when they become due.

**Five business logic patterns** can produce outbound tasks:

| Pattern                           | Description                              | Example                                                        |
| --------------------------------- | ---------------------------------------- | -------------------------------------------------------------- |
| **Scheduled**                     | Decision made, execution deferred        | "I'll call you back tomorrow at 2pm"                           |
| **Event-reactive**                | Trigger → evaluate → maybe act           | New lab result → is it critical? → call patient                |
| **Continuous monitoring**         | Periodic population sweep                | Patients with no contact in 30 days                            |
| **Conversational follow-through** | Track preconditions from agent promises  | "I'll call after the doctor reviews" → pending on doctor event |
| **Orchestrated campaign**         | Achieve outcome for population over time | "Get all 200 patients to complete annual wellness by Q4"       |

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    subgraph Producers["Task Producers"]
        A["Voice agent\n(mid-call promise)"]
        B["Autonomous agent\n(panel review)"]
        C["Connector runner\n(reactive rules)"]
        D["Dashboard\n(manual schedule)"]
    end

    subgraph WM["World Model"]
        E["outbound_task\nentity"]
    end

    subgraph Dispatch["Dispatch Loop"]
        F{"Due?\nBusiness hours?\nRetry budget?"}
        G["Build rich context\nfrom patient projection"]
        H["Dispatch call"]
    end

    Producers --> E
    E --> F
    F -->|Yes| G --> H
    F -->|No| I["Wait for\nnext window"]
    H --> J["Voice agent\nexecutes with\nfull patient context"]
```

Each outbound task carries a patient reference, reason, goal, priority (1-10), business-hours window (timezone-aware), retry config (max attempts with configurable backoff), and rich context from the patient's world model projection. The dispatch loop enriches the system prompt so the agent starts the call with full patient knowledge. **The agent never needs to "look up" the patient.**
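The dispatch gate (due, within business hours, retry budget remaining) can be sketched as below. The field names are hypothetical stand-ins for the attributes listed above, backoff and rich context are omitted for brevity, and a same-day business-hours window is assumed.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone, tzinfo


@dataclass
class OutboundTask:
    patient_ref: str
    reason: str
    goal: str
    priority: int          # 1-10
    due_at: datetime       # timezone-aware
    window_start: time     # business hours in the task's local timezone
    window_end: time
    tz: tzinfo
    max_attempts: int
    attempts: int = 0


def should_dispatch(task, now):
    """Return True if the task is due, has retry budget, and is inside business hours."""
    if now < task.due_at:
        return False
    if task.attempts >= task.max_attempts:
        return False
    local_time = now.astimezone(task.tz).time()
    return task.window_start <= local_time < task.window_end


# Illustrative task: fixed UTC-5 offset, 09:00-17:00 local window.
task = OutboundTask(
    patient_ref="patient-123", reason="follow-up", goal="confirm appointment",
    priority=5, due_at=datetime(2025, 1, 6, 14, 0, tzinfo=timezone.utc),
    window_start=time(9, 0), window_end=time(17, 0),
    tz=timezone(timedelta(hours=-5)), max_attempts=3,
)
```

Converting `now` into the task's timezone before comparing against the window is what makes the business-hours check timezone-aware.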

**Outbound pre-warm**: outbound calls use the same parallel pre-warming as inbound calls. During the dialing/ringing phase (typically 5-15 seconds), the engine initializes, loads the context graph, resolves patient context, and generates the greeting. When the patient answers, the engine and greeting are already cached, so the patient hears an instant greeting instead of several seconds of silence. Pre-warm is best-effort: if initialization takes longer than the ring time, the system falls back to standard cold initialization.

### Conference Architecture

<details>

<summary>Conference architecture: telephony details</summary>

The conference architecture supports multiple simultaneous audio participants with independent per-participant streams:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    subgraph Conference["Telephony Conference"]
        C["Caller\n(PSTN - phone network)"]
        A["AI Agent\n(WebSocket stream)"]
        O["Operator\n(PSTN or WebRTC)\n[optional]"]
    end

    subgraph STT["Per-Participant Speech-to-Text"]
        CS["Caller STT\n(speaker attribution)"]
        OS["Operator STT\n(human transcript capture)"]
        AS["Agent STT\n(turn processing)"]
    end

    C --> CS
    O --> OS
    A --> AS

    subgraph Resolution["Speaker Resolution"]
        R["Priority: Operator > Caller > Default"]
    end

    CS --> Resolution
    OS --> Resolution
```

| Participant  | Role                              | Audio Transport         | STT                              |
| ------------ | --------------------------------- | ----------------------- | -------------------------------- |
| **Caller**   | Person who called or was called   | PSTN                    | Dedicated per-participant stream |
| **Agent**    | AI voice agent                    | Bidirectional WebSocket | Main session STT                 |
| **Operator** | Human monitor/takeover (optional) | PSTN or browser WebRTC  | Dedicated per-participant stream |

**Three-party speaker resolution**: when multiple parties are on the call, speaker attribution uses a priority chain: operator STT → caller STT → default (caller). Every turn in the call record carries `speaker_id` and `speaker_role` for accurate attribution in the transcript.

</details>

## Context Graph Engine

The voice agent executes a **Hierarchical State Machine** loaded from the service's version set. Each call gets its own engine instance with an in-memory state database for zero-latency state tracking, flushed to persistent storage after the call ends.

### State Types

| State Type          | Purpose                                                            | LLM Call?            |
| ------------------- | ------------------------------------------------------------------ | -------------------- |
| **ActionState**     | Agent performs actions and evaluates exit conditions to transition | Yes (Engage LLM)     |
| **DecisionState**   | Agent evaluates conditions and chooses a transition                | Yes (Navigator only) |
| **ReflectionState** | Agent reasons deeply over a problem with optional tool calls       | Yes (deep reasoning) |
| **ToolCallState**   | Enforces execution of a designated tool before transitioning       | No (automatic)       |
| **RecallState**     | Retrieves information from memory before transitioning             | No (automatic)       |
| **AnnotationState** | Injects an inner thought and transitions immediately               | No (automatic)       |

### Per-Turn Flow

```mermaid
%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
    participant Caller
    participant STT as Speech-to-Text
    participant Emo as Emotion Detection
    participant Nav as Navigator
    participant Engage as Engage LLM
    participant TTS as Text-to-Speech
    participant Tools as Tool Executor

    Caller->>STT: Speech audio
    par Signal Processing
        STT->>Nav: Transcript + end-of-turn
        Caller->>Emo: Same audio (parallel)
        Emo->>Nav: Emotional state update
    end

    Nav->>Nav: Select action + generate filler
    Nav->>TTS: Filler phrase (immediate)
    TTS-->>Caller: Filler audio plays

    Nav->>Engage: Action + emotional steering + patient context
    Engage->>TTS: Response text (streaming)
    TTS-->>Caller: Response audio (emotion-adaptive)

    opt Tool calls in response
        Engage->>Tools: Dispatch tool (async)
        Tools-->>Engage: Result → continuation turn
    end
```

The navigator handles multi-state traversal automatically. Decision states, annotation states, and recall states are resolved without user interaction before landing on an action state for the engage LLM.

**Navigator resilience**: structured output validation with automatic retry (up to 3 total attempts). When all retries are exhausted, the engine falls back to the first valid action or exit. Filler text from earlier attempts is preserved across retries (first-wins), so the caller never hears silence even during recovery.
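
The retry behavior can be pictured with the sketch below. It assumes the navigator returns a dict with hypothetical `choice` and `filler` keys and that `valid_choices` is a non-empty list of the state's actions and exits; none of these names come from the platform API.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # "up to 3 total attempts" in the text above


def navigate_with_retry(
    call_navigator: Callable[[], dict],
    valid_choices: list[str],
) -> dict:
    """Retry until the navigator output validates, keeping the first filler seen."""
    filler: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        output = call_navigator()
        # First-wins: a filler from an earlier attempt is never replaced.
        if filler is None and output.get("filler"):
            filler = output["filler"]
        if output.get("choice") in valid_choices:
            return {"choice": output["choice"], "filler": filler, "fallback": False}
    # All attempts exhausted: fall back to the first valid action or exit.
    return {"choice": valid_choices[0], "filler": filler, "fallback": True}
```

Because the filler is captured on the first attempt that produces one, it can start playing while later attempts are still running.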

### Action State Extensions

Action states support three optional extension fields for asynchronous workflows:

| Field                   | Type   | Description                                                                                                                                                                                                       |
| ----------------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `wait_for`              | string | Pause navigation until an async condition clears. Values: `surface_submission`, `human_approval`.                                                                                                                 |
| `channel_overrides`     | object | Per-channel overrides keyed by channel kind (`voice`, `sms`). Each override can set `objective`, `action_guidelines`, and `progress` (a [progress hint](#tool-wait-progress-hints) for tool waits in that state). |
| `surface_spec_template` | object | Surface spec auto-created on state entry. Uses the same field schema as `POST /surfaces`. The entity ID defaults to the session's primary entity.                                                                 |

**Wait conditions**: when the navigator returns a `waiting_for` value, the engine skips context graph navigation on subsequent turns. The engage prompt includes a `WAITING_FOR_CONDITION` block that constrains the agent to empathetic small talk until the condition clears. For voice sessions, clearance comes via the real-time event stream. For text sessions, the session blocks on a dedicated event listener.

**Channel overrides**: the `channel_kind` is set on the engine session (`voice` for calls, `sms` for text sessions). Prompt rendering merges the channel-specific objective and guidelines into the engage prompt. For voice, the override can also carry a `progress` hint that shapes how the agent covers tool waits in that state (see [Tool-Wait Progress Hints](#tool-wait-progress-hints)).

**Surface templates**: on state entry, if the new state has a `surface_spec_template`, the engine creates the surface via the platform API and tracks the `surface_id` in the session's active surface set. This enables deterministic surface creation as part of the context graph design rather than relying on agent tool calls.

### Terminal State & Auto-Hangup

When the context graph reaches its terminal state (an `ActionState` with one action and zero exits), the agent speaks its goodbye and automatically ends the call:

1. Navigator lands on terminal state → `is_terminal = true`.
2. Agent speaks the goodbye response.
3. Waits for TTS to finish plus a grace period (audio buffer flush).
4. Terminates the call via telephony API.

**Silence detection**: when the caller goes silent, the silence monitor fires check-ins at increasing intervals (10s → 20s → 40s). After 3 unanswered check-ins, the agent says a brief goodbye and auto-disconnects.
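
A minimal sketch of the check-in schedule follows. `SilenceMonitor` and its methods are hypothetical names. The sketch assumes that any caller speech resets the counter, and it does not model how long the agent waits after the third check-in before saying goodbye, because the text does not specify it.

```python
class SilenceMonitor:
    """Tracks unanswered check-ins for one call."""

    INTERVALS_S = (10, 20, 40)  # increasing check-in intervals from the text

    def __init__(self) -> None:
        self.unanswered = 0

    def next_wait_s(self):
        """Seconds of silence before the next check-in, or None to say goodbye."""
        if self.unanswered >= len(self.INTERVALS_S):
            return None
        return self.INTERVALS_S[self.unanswered]

    def on_check_in_sent(self) -> None:
        self.unanswered += 1

    def on_caller_speech(self) -> None:
        # Assumption: hearing the caller clears the unanswered count.
        self.unanswered = 0
```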

**Session shutdown contract**: every code path that stops the session must also stop the audio speaker; otherwise the speaker blocks indefinitely. This is enforced across all shutdown triggers: hangup, STT failure, WebSocket disconnect, and terminal state.

## World Model Integration

The voice agent connects to the workspace's [world model](/developer-guide/platform-api/platform-api/data-world-model.md) through three data channels. This architecture is informed by the [Liquid World Model thesis](/developer-guide/platform-api/platform-api/data-world-model.md#design-thesis), in which the distinction between data infrastructure and intelligence dissolves.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    subgraph WM["World Model (Event-Sourced)"]
        EV["Events\n(immutable, confidence-scored)"]
        EN["Entities\n(projected state + embeddings)"]
        EG["Entity Graph\n(relationships)"]
        EV -->|"projection"| EN
        EN --- EG
    end

    subgraph Channels["Three Data Channels"]
        direction TB
        A["🔵 Ambient (pushed)\nPatient state in system prompt\nLocation context\nRelated entities"]
        B["🟢 Queried (pulled)\nSlot search, patient lookup\nSemantic search"]
        C["🟡 Extracted (captured)\nInsurance details from speech\nContact info from conversation"]
    end

    subgraph LLM["LLM Context"]
        CTX["System prompt + conversation history\n+ tool results + emotional steering"]
    end

    EN -->|"At session start +\nmid-call refresh"| A
    B <-->|"Tool calls\n↔ results"| EN
    C -->|"Transcript extraction\n(confidence 0.7)"| EV
    A --> CTX
    B --> CTX
    C -.->|"implicit capture"| EV
```

### Channel 1: Ambient (Pushed)

Data the LLM should always have without asking. Injected into the system prompt at session start and refreshed as the conversation evolves:

* **Patient demographics**: name, DOB, MRN, phone, email, address.
* **Clinical context**: active conditions, medications, allergies (filtered to text-only for LLM consumption).
* **Upcoming appointments**: with patient entity references for cross-referencing.
* **Insurance coverage**: active plans and subscriber info.
* **Location context**: clinic details, available appointment types, hours (resolved from the inbound phone number).

**Design principle: ambient over queried.** If the LLM will almost certainly need this data, push it into context. Don't make it ask. A voice agent that already has the patient's insurance in context doesn't need to dispatch a tool call to look it up.

### Channel 2: Queried (Pulled)

Data that can't be ambient because the search space is too large. The agent calls [built-in clinical tools](#built-in-clinical-tools) to retrieve specific information.

**Key simplification**: queried tools return human-readable results, not database internals. Slot search returns doctor names and times, not template IDs and slot UUIDs. When the agent says "book the 1:45 with Dr. Jones," the system resolves scheduling internals from cached slot data. **The LLM never touches scheduling internals.**
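
One way to picture this separation is a cache that hands the LLM a human-readable view and keeps the internals to itself. The sketch below is illustrative only: `CachedSlot`, `SlotCache`, and the field names are hypothetical, and matching on provider name plus start time is an assumption about how a spoken choice is resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedSlot:
    provider: str     # shown to the LLM
    start_time: str   # shown to the LLM, e.g. "1:45 PM"
    slot_id: str      # internal, never shown to the LLM
    template_id: str  # internal, never shown to the LLM


class SlotCache:
    def __init__(self) -> None:
        self._by_key = {}

    def store(self, slots):
        """Cache search results and return only the human-readable view."""
        for s in slots:
            self._by_key[(s.provider.lower(), s.start_time.lower())] = s
        return [{"provider": s.provider, "start_time": s.start_time} for s in slots]

    def resolve(self, provider: str, start_time: str):
        """Map the LLM's human-readable choice back to scheduling internals."""
        return self._by_key.get((provider.lower(), start_time.lower()))
```

The tool result passed to the LLM is the return value of `store`; the booking tool would call `resolve` to recover the identifiers.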

### Channel 3: Extracted (Captured)

Structured data mentioned in conversation (insurance details, contact information, preferences) is automatically captured and written to the world model without requiring explicit tool calls. This eliminates the mode switch where the LLM stops being a conversationalist and becomes a database operator. **The conversation IS the data entry.**

Extracted data is written with moderate confidence (below verified threshold). The LLM can still use explicit write tools for high-stakes data where precision matters. Extraction is a complement, not a replacement.

### Multi-Stage Verification

All data written by the voice agent during calls starts at a low confidence level and must pass through a verification pipeline before syncing to external systems. This is the trust architecture for autonomous agents acting on noisy phone audio.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    A["Voice agent writes\nclinical data\n(low confidence)"] --> B["Call ends"]
    B --> CL["Call Classifier"]
    CL -->|"Junk (prank/ad/bot)"| REJ1["❌ Rejected\n(confidence → 0)"]
    CL -->|"Real call"| J1["Per-Event LLM Judge"]

    J1 -->|"Valid"| AP["✅ Auto-approved\n(confidence → verified)"]
    J1 -->|"Correctable"| COR["Auto-correct\n(name casing, dates, phones)"]
    COR --> AP
    J1 -->|"Uncertain"| FLAG["⚠️ Flagged\n→ Review queue"]

    AP --> J2["Session Coherence Check"]
    J2 -->|"Coherent"| SYNC["✅ Sync-eligible\n(confidence upgraded)"]
    J2 -->|"Contradictions"| FLAG

    FLAG --> HR["Human Reviewer"]
    HR -->|"Approve"| SYNC2["✅ Human-approved\n(high confidence)"]
    HR -->|"Correct"| NEW["New event\n(supersedes original)"]
    HR -->|"Reject"| REJ2["❌ Rejected"]
    NEW --> SYNC2

    SYNC --> OUT["Connector runner\nsyncs to external system\n(confidence gate ≥ verified)"]
    SYNC2 --> OUT
```

**Three-stage automated review:**

| Stage                 | What It Checks                                                            | Actions                                                      |
| --------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Call Classifier**   | Is this a real clinical call or junk? (prank, ad, bot, silence)           | Real → continue; Junk → reject all session events            |
| **Per-Event Judge**   | Cross-references each event against transcript plus existing entity state | Approve, auto-correct (formatting), or flag for human review |
| **Session Coherence** | Do all events tell a coherent story? Contradictions? Missing data?        | Upgrade confidence if coherent, flag if contradictions found |

**Why three stages, not one**: each stage catches a different kind of error. The call classifier filters out junk calls before any individual event is reviewed. Per-event review catches data-level errors (wrong phone format, impossible DOB, name doesn't match transcript). Session-level review catches narrative-level errors (contradictions between events, discussed insurance but no coverage event recorded). These kinds of errors need different analysis approaches.
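
The ordering of the automated stages can be sketched as a single function. This is a simplified illustration, not the platform's implementation: `review_session`, the status strings, and the callback signatures are hypothetical, and flagging every approved event when the coherence check fails is an assumption.

```python
def review_session(is_junk, events, judge_event, is_coherent):
    """Run the three automated stages and return a status per event id."""
    # Stage 1: a junk call rejects every event in the session.
    if is_junk:
        return {e["id"]: "rejected" for e in events}

    # Stage 2: per-event judge. "valid" and "correctable" are both approved
    # (corrections are assumed to be applied by the judge itself).
    status = {}
    for e in events:
        verdict = judge_event(e)  # "valid", "correctable", or "uncertain"
        status[e["id"]] = "flagged" if verdict == "uncertain" else "approved"

    # Stage 3: session coherence over the approved events only.
    approved = [e for e in events if status[e["id"]] == "approved"]
    if approved:
        outcome = "sync_eligible" if is_coherent(approved) else "flagged"
        for e in approved:
            status[e["id"]] = outcome
    return status
```

Events that end as `flagged` would go to the human review queue; only `sync_eligible` events pass the connector's confidence gate.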

### Patient Safety Isolation

A **write scope** is enforced per session: write tools can only target the patient identified in the current call, which prevents cross-patient data errors. Write tools are also **deduplicated**: identical calls within the same session return cached results rather than creating duplicate records. The cache uses a 30-second TTL and stores successful results only, so errors are always retryable.
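
Both guarantees can be sketched in one wrapper. `ScopedWriteTools` and its parameters are hypothetical names; the sketch assumes tool arguments are JSON-serializable and that a failed write surfaces as an exception.

```python
import json
import time


class ScopedWriteTools:
    DEDUP_TTL_S = 30.0  # 30-second TTL from the text above

    def __init__(self, session_patient_id, clock=time.monotonic):
        self._patient_id = session_patient_id
        self._clock = clock
        self._cache = {}  # call key -> (expires_at, result)

    def call(self, tool_name, patient_id, args, execute):
        # Write scope: only the patient identified in this call may be targeted.
        if patient_id != self._patient_id:
            raise PermissionError("write targets a patient outside this session's scope")

        key = (tool_name, json.dumps(args, sort_keys=True))
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # identical call within the TTL: reuse the result

        # If execute raises, nothing is cached, so the call stays retryable.
        result = execute(args)
        self._cache[key] = (now + self.DEDUP_TTL_S, result)
        return result
```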

## Emotional Adaptation

The voice agent adapts across **four independent output channels simultaneously** based on real-time caller emotion. Each row in the Emotion → Response Matrix below is a detected situation; the columns show how each output channel responds. All adaptation is automatic; workspace managers control only the baseline via [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings).

### Valence-Arousal Model

Every detected emotion maps to a two-dimensional (valence, arousal) coordinate. The system tracks these coordinates across a rolling window to build a stable yet responsive picture of the caller's emotional state:

```
          High Arousal (1.0)
                   │
        ANGER ─────┼───── EXCITEMENT
    Frustration    │    Joy
    Fear           │    Enthusiasm
                   │
  ─────────────────┼───────────────── Valence
    Negative       │    Positive
    (-1.0)         │    (+1.0)
                   │
      SADNESS ─────┼───── CONTENTMENT
    Disappointment │    Relief
    Boredom        │    Gratitude
                   │
          Low Arousal (0.0)
```

| Quadrant                                           | Agent Strategy | Voice Tone     | LLM Behavior                                                                         |
| -------------------------------------------------- | -------------- | -------------- | ------------------------------------------------------------------------------------ |
| **High-arousal negative** (anger, frustration)     | De-escalate    | `calm`         | Direct, concise, acknowledge frustration, skip pleasantries, match urgency           |
| **Low-arousal negative** (sadness, disappointment) | Comfort        | `sympathetic`  | Warm, patient, gentle language, give extra space, do not rush                        |
| **High-arousal positive** (excitement, joy)        | Match energy   | `enthusiastic` | Enthusiastic language, keep momentum, match positive energy                          |
| **Low-arousal positive** (contentment, relief)     | Maintain       | `content`      | Warm and steady, reinforce positive outcome, conversational                          |
| **Confusion** (high confidence)                    | Clarify        | `calm`         | Simplify explanations, break into small pieces, check understanding, offer to repeat |
| **Anxiety** (high confidence)                      | Reassure       | `sympathetic`  | Calm and reassuring, provide clear next steps, avoid uncertainty                     |
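
The four quadrant rows can be sketched as a lookup on the rolling (valence, arousal) coordinate. The split points used here (valence 0.0, arousal 0.5) are assumed midpoints of the axes, not documented thresholds, and the Confusion and Anxiety rows are left out because their confidence threshold is not stated.

```python
def quadrant_strategy(valence: float, arousal: float) -> tuple[str, str]:
    """Map a (valence, arousal) point to (agent strategy, voice tone)."""
    high_arousal = arousal >= 0.5  # assumed midpoint of the 0.0 to 1.0 axis
    positive = valence >= 0.0      # assumed midpoint of the -1.0 to +1.0 axis
    if positive:
        return ("match energy", "enthusiastic") if high_arousal else ("maintain", "content")
    return ("de-escalate", "calm") if high_arousal else ("comfort", "sympathetic")
```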

### Voice Tone Priority Chain

The agent's voice tone is determined by a six-level priority chain. Each layer fires only if the previous one returned no signal:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TD
    A{"1. Recent vocal burst?\n(laugh, sigh, cry)\nlast 5s, score ≥ 0.5"} -->|Yes| A1["Burst-derived tone\n(highest priority)"]
    A -->|No| B{"2. Prosody emotion?\n(rolling 30s window)\nscore ≥ 0.25"}
    B -->|Yes| B1["Prosody-derived tone"]
    B -->|No| C{"3. Sensitive topic?\n(action matches\nsensitive_topics list)"}
    C -->|Yes| C1["Preemptive sympathetic\n(before distress shows)"]
    C -->|No| D{"4. Previous turn\nhad strong tone?"}
    D -->|Yes| D1["Tone momentum\n(keeps continuity)"]
    D -->|No| E{"5. Workspace tone\nconfigured?"}
    E -->|Yes| E1["Workspace baseline"]
    E -->|No| F["6. System default"]
```

**Why this ordering matters:**

* **Bursts are the highest-priority signal** because they capture the most immediate emotional state. A caller who just laughed should hear warmth *immediately*, not the rolling average of the last 30 seconds. Burst detection (within the last 5 seconds, confidence ≥ 0.5) overrides everything.
* **Tone momentum** (layer 4) prevents jarring voice tone changes. When the current emotional signal is weak (score < 0.25) or doesn't map to a tone, the previous turn's tone persists. Only a strong contradictory signal changes the tone, making the voice feel continuous across the conversation:

```
Turn 1: Anxiety detected (score 0.72) → "sympathetic" → stored as momentum
Turn 2: Calmness detected (score 0.30) → unmapped → momentum returns "sympathetic"
Turn 3: Joy detected (score 0.65)      → "enthusiastic" → stored as new momentum
```

* **Proactive topic sensitivity** (layer 3) fires *before the caller shows distress*. When the agent is about to discuss test results, billing, surgery, or other loaded topics, the voice tone preemptively shifts to sympathetic, even without an emotion signal.
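
The six layers can be sketched as a chain of early returns. `select_tone` and its parameters are hypothetical names; a tone of `None` stands for an emotion that does not map to a tone, and `SYSTEM_DEFAULT_TONE` is a placeholder because the text does not name the system default.

```python
from typing import Optional

BURST_WINDOW_S = 5.0
BURST_MIN_SCORE = 0.5
PROSODY_MIN_SCORE = 0.25
SYSTEM_DEFAULT_TONE = "neutral"  # placeholder, not a documented value


def select_tone(
    *,
    burst_tone: Optional[str] = None,
    burst_age_s: float = 0.0,
    burst_score: float = 0.0,
    prosody_tone: Optional[str] = None,
    prosody_score: float = 0.0,
    action_is_sensitive: bool = False,
    momentum_tone: Optional[str] = None,
    workspace_tone: Optional[str] = None,
) -> str:
    # 1. Recent vocal burst with a strong enough score.
    if burst_tone and burst_age_s <= BURST_WINDOW_S and burst_score >= BURST_MIN_SCORE:
        return burst_tone
    # 2. Prosody emotion from the rolling window.
    if prosody_tone and prosody_score >= PROSODY_MIN_SCORE:
        return prosody_tone
    # 3. Sensitive topic: shift preemptively to sympathetic.
    if action_is_sensitive:
        return "sympathetic"
    # 4. Tone momentum from the previous turn.
    if momentum_tone:
        return momentum_tone
    # 5. Workspace baseline, then 6. system default.
    return workspace_tone or SYSTEM_DEFAULT_TONE
```

Turn 2 of the momentum example above corresponds to `select_tone(prosody_score=0.30, momentum_tone="sympathetic")`: the detected emotion is unmapped, so layer 4 returns the stored tone.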

### Emotion → Response Matrix

All four adaptation channels respond simultaneously to each caller state. The agent mirrors *empathy*, not the caller's emotion:

| Caller Emotion                     | Voice Tone     | Filler Style   | LLM Prompt Adaptation                           | Rationale                                |
| ---------------------------------- | -------------- | -------------- | ----------------------------------------------- | ---------------------------------------- |
| **Anger, Annoyance, Contempt**     | `calm`         | **Suppressed** | Direct, concise, acknowledge frustration        | De-escalate without mirroring aggression |
| **Anxiety, Fear, Distress**        | `sympathetic`  | Reassuring     | Calm, clear next steps, avoid uncertainty       | Reassure with a steady presence          |
| **Sadness, Disappointment, Guilt** | `sympathetic`  | Warm           | Patient, supportive, don't rush                 | Warm empathy, give space                 |
| **Confusion**                      | `calm`         | Simple         | Simplify, small pieces, check understanding     | Patient clarity                          |
| **Excitement, Joy, Enthusiasm**    | `enthusiastic` | Warm, matching | Match positive energy, keep momentum            | Mirror positive energy                   |
| **Contentment, Relief, Gratitude** | `content`      | Warm           | Steady, reinforce outcome                       | Warm and grounding                       |
| **Interest, Concentration**        | `curious`      | Engaged        | Engaged tone, match intellectual focus          | Show interest                            |
| **Embarrassment, Doubt**           | `calm`         | Encouraging    | Non-judgmental, encouraging                     | Put at ease                              |
| **Boredom, Tiredness**             | `enthusiastic` | Concise        | Re-engage with energy, be efficient             | Re-energize                              |
| **Sarcasm**                        | `calm`         | Professional   | Respond to underlying concern, not surface tone | Stay professional                        |

**Burst-to-experience mapping**: 25 vocal burst types are mapped to specific agent tones and caller-state interpretations. The table shows representative groupings:

| Burst Types       | Agent Tone             | Inferred Caller State    |
| ----------------- | ---------------------- | ------------------------ |
| Laugh, Giggle     | `enthusiastic`         | Amused                   |
| Sigh              | `sympathetic`          | Weary                    |
| Cry, Sob, Whimper | `sympathetic`          | Distressed               |
| Gasp              | `calm`                 | Alarmed                  |
| Groan, Ugh        | `sympathetic` / `calm` | Frustrated               |
| Growl, Tsk        | `calm`                 | Angry                    |
| Hmm, Mhm          | `calm`                 | Thinking / Acknowledging |
| Aww               | `sympathetic`          | Touched                  |

### Filler Speech

Fillers cover processing latency so the caller never hears silence. The system uses **principle-based guidance** rather than hardcoded phrase lists, generating contextually appropriate fillers from emotional context, the current action, and the expected latency.

**Three-layer filler generation:**

| Layer                    | When                        | What It Controls                                                                                                                                     |
| ------------------------ | --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Latency adaptation**   | Always                      | Filler length matches expected processing time (2-4 words for normal latency, 3-5 words for audio verification processing)                           |
| **Emotional attunement** | When emotion data available | Emotional register matches the caller's state. Not specific phrases, but principles like "gentle and reassuring" or "a verbal hand on the shoulder". |
| **Action context**       | Always                      | Current context graph action description injected so the filler hints at what the agent is about to do                                               |

**Per-action filler hints**: context graph actions can include optional PM-configured filler suggestions. These are weak steering; emotion-adaptive principles always dominate. The LLM sees hints as suggestions to draw from, not commands.

**Suppression rule**: when `valence < -0.2 AND arousal > 0.4 AND emotion is NOT Anxiety/Fear/Distress`, fillers are disabled entirely. Frustrated callers don't want acknowledgments; they want the answer. **Exception**: anxious callers still receive reassuring fillers, because anxiety benefits from reassurance while frustration does not.
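
The rule translates directly into a predicate. The function name and the lowercase emotion labels are illustrative; the thresholds are the ones stated above.

```python
REASSURANCE_EMOTIONS = {"anxiety", "fear", "distress"}


def fillers_suppressed(valence: float, arousal: float, emotion: str) -> bool:
    """Return True when fillers should be disabled for this turn."""
    # Exception: anxious callers still receive reassuring fillers.
    if emotion.lower() in REASSURANCE_EMOTIONS:
        return False
    return valence < -0.2 and arousal > 0.4
```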

#### Tool-Wait Progress Hints

When a tool is running, the agent narrates the wait based on a `ProgressHint` declared on the state's channel override and/or on the individual tool binding. The hint describes the *shape* of the wait, not a phrase list: the engine composes the actual utterance from tool semantics, turn emotion, and conversation context.

| Field                 | Type                                                                                  | Description                                                                                                                                                                                                                                                                                                                                                                               |
| --------------------- | ------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `mode`                | `"auto"` \| `"silent"` \| `"backchannel"` \| `"verbal"`                               | How the agent covers the wait. `auto` lets the engine decide from class and latency; `silent` drops filler text; `backchannel` emits a single token ("Mm."); `verbal` produces a contextual filler.                                                                                                                                                                                       |
| `progress_class`      | `"lookup"` \| `"write"` \| `"external_call"` \| `"compute"` \| `"multi_step"` \| null | Semantic category of the work the tool is doing. Drives the templated language the engine picks for retries.                                                                                                                                                                                                                                                                              |
| `expected_latency_ms` | integer (0-60000) \| null                                                             | How long the tool is expected to take. Used to choose filler length and pacing.                                                                                                                                                                                                                                                                                                           |
| `initial_delay_ms`    | integer (0-30000) \| null                                                             | How long to wait after a tool starts before playing the first filler. When unset, the delay is calculated from `expected_latency_ms`. Set to `0` for tools where you want immediate filler playback.                                                                                                                                                                                      |
| `phrases`             | array of strings (1-10 items) \| null                                                 | Ordered deterministic filler phrases. When set, the engine plays `phrases[0]` on the first filler attempt, `phrases[1]` on the second, and so on, clamping to the last phrase for attempts beyond the list length. Each phrase must be at most 30 words. **Deterministic mode**: supersedes `custom_phrase`, class templates, and vocabulary - the engine plays exactly what you specify. |
| `custom_phrase`       | string (max 500 chars) \| null                                                        | Operator-authored filler utterance for the first filler attempt. In `auto` mode: requires `expected_latency_ms` >= 4000 and `progress_class` to be set. In `verbal` mode: always honored (word cap still applies). Ignored when `phrases` is set. Subsequent attempts use class templates.                                                                                                |

**Placement**: `progress` can be set on `channel_overrides[channel].progress` (state-level default for that channel) and on each entry of `action_tool_call_specs` or `exit_condition_tool_call_specs` (per tool). Per-tool fields merge field-wise over the channel override, so a state can declare the default wait shape once and individual tools only override fields that actually differ.
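
The field-wise merge can be sketched as below. `merge_progress` is a hypothetical helper, and treating a `null` per-tool field as "not set, inherit from the channel override" is an assumption about the merge semantics.

```python
from typing import Optional


def merge_progress(channel_hint: Optional[dict], tool_hint: Optional[dict]) -> dict:
    """Overlay per-tool progress fields on the state's channel-level default."""
    merged = dict(channel_hint or {})
    for field, value in (tool_hint or {}).items():
        if value is not None:  # assumed: null means "inherit"
            merged[field] = value
    return merged


state_default = {"mode": "verbal", "progress_class": "lookup", "expected_latency_ms": 2000}
slow_tool = {"expected_latency_ms": 8000, "progress_class": None}
effective = merge_progress(state_default, slow_tool)
```

Here `effective` keeps `mode` and `progress_class` from the state default and takes only `expected_latency_ms` from the tool.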

**Retry narration**: when a tool retries, the engine uses deterministic attempt-aware templates keyed on `progress_class` (acknowledgement on attempt 1, brief apology on attempt 2, a "still working" update from attempt 3 onward), so retry audio stays off the LLM hot path and latency stays bounded. When `phrases` is set, the engine skips templates entirely and plays the operator's phrases in order.
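
A sketch of attempt-aware selection follows. The template wording in `TEMPLATES` is placeholder text, since the engine's actual templates are not given here, and `filler_for_attempt` is a hypothetical helper that treats the attempt number as 1-based.

```python
from typing import Optional, Sequence

# Placeholder wording: (acknowledgement, brief apology, "still working").
TEMPLATES = {
    "lookup": ("Let me look that up.", "Sorry, one more moment.", "Still looking."),
}


def filler_for_attempt(
    attempt: int,
    phrases: Optional[Sequence[str]] = None,
    progress_class: Optional[str] = None,
) -> Optional[str]:
    """Pick the filler for a 1-based attempt number."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if phrases:
        # Deterministic mode: play phrases in order, clamping to the last one.
        return phrases[min(attempt, len(phrases)) - 1]
    templates = TEMPLATES.get(progress_class)
    if templates is None:
        return None  # no class template available in this sketch
    # Attempts 3 and later all use the "still working" template.
    return templates[min(attempt, 3) - 1]
```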

**Deterministic vs non-deterministic**: use `phrases` when you need exact control over what the agent says during tool waits (demo flows, regulated disclosures). Use `custom_phrase` + `progress_class` when you want to steer the first filler but let the engine handle retries with contextual templates.

### Call Phase Escalation

The system automatically increases urgency when a call runs long and the caller's sentiment is negative:

| Phase     | Duration | Condition           | Adaptation                                                                                 |
| --------- | -------- | ------------------- | ------------------------------------------------------------------------------------------ |
| **Early** | < 5 min  | Any                 | Standard emotional adaptation                                                              |
| **Mid**   | 5-10 min | Trend deteriorating | "Focus on resolution speed. Shorten responses."                                            |
| **Late**  | ≥ 10 min | Negative valence    | **URGENCY.** "Prioritize resolution. Be maximally concise. Escalate if unable to resolve." |
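
The table can be read as the following decision function. The name and return labels are illustrative, and returning `"standard"` when a row's condition is not met (for example, a 12-minute call with positive valence) is an assumption, since the table does not cover those cases.

```python
def call_phase_steering(duration_min: float, valence: float, trend_deteriorating: bool) -> str:
    """Return which phase adaptation applies: "standard", "mid", or "late"."""
    if duration_min >= 10 and valence < 0:
        return "late"  # urgency: prioritize resolution, be maximally concise
    if 5 <= duration_min < 10 and trend_deteriorating:
        return "mid"   # focus on resolution speed, shorten responses
    return "standard"  # standard emotional adaptation
```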

### Proactive Intelligence

The system detects emotionally sensitive topics from the current context graph action **before the caller shows distress**:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    A["Current Action:\n'Discuss test results'"] --> B{"Matches\nsensitive_topics?"}
    B -->|Yes| C["Preemptive shift to\nsympathetic tone\n(priority level 3\nin tone chain)"]
    B -->|No| D["Normal tone\npriority chain"]
```

`sensitive_topics` is configurable via [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings). When it is not configured, the system falls back to healthcare defaults: test results, diagnosis, billing, payment, insurance, denial, emergency, referral, specialist, surgery, procedure, medication.

This fires at priority level 3 in the voice tone priority chain, below burst and prosody (which have actual data about the caller's current state) but above tone momentum and workspace defaults.
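
A minimal sketch of the topic check follows. The default list is the one given above; case-insensitive substring matching against the action description is an assumption about how "matches" is evaluated, as is treating an empty list as unconfigured.

```python
from typing import Optional, Sequence

DEFAULT_SENSITIVE_TOPICS = (
    "test results", "diagnosis", "billing", "payment", "insurance", "denial",
    "emergency", "referral", "specialist", "surgery", "procedure", "medication",
)


def is_sensitive_action(
    action_description: str,
    sensitive_topics: Optional[Sequence[str]] = None,
) -> bool:
    """Return True when the current action mentions a sensitive topic."""
    topics = sensitive_topics or DEFAULT_SENSITIVE_TOPICS
    text = action_description.lower()
    return any(topic.lower() in text for topic in topics)
```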

### Coherence Detection

When what the caller *says* doesn't match how they *sound* (coherence < 0.4), the system shifts its steering: *"The caller's words suggest X but voice sounds Y. Trust the vocal tone over the words, respond to how they sound, not what they claim."*

This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy to the caller.
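
The steering step can be sketched as a function that returns the extra prompt text, or nothing when words and voice agree. The function name and parameters are hypothetical; the 0.4 threshold and the wording come from the text above, and a missing score is treated as 1.0 (agreement assumed), matching the language-model fallback in the Graceful Degradation table.

```python
from typing import Optional

COHERENCE_THRESHOLD = 0.4


def coherence_steering(
    coherence: Optional[float],
    words_emotion: str,
    voice_emotion: str,
) -> Optional[str]:
    """Return steering text for the system prompt, or None if none is needed."""
    if coherence is None:
        coherence = 1.0  # no language results: assume words and voice agree
    if coherence >= COHERENCE_THRESHOLD:
        return None
    return (
        f"The caller's words suggest {words_emotion} but voice sounds {voice_emotion}. "
        "Trust the vocal tone over the words, respond to how they sound, "
        "not what they claim."
    )
```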

### Control Plane ↔ Adaptation

How each workspace voice setting interacts with the automatic emotion adaptation system:

| Voice Setting                   | What You Control                           | What the System Overrides         | Override Condition                              |
| ------------------------------- | ------------------------------------------ | --------------------------------- | ----------------------------------------------- |
| `tone`                          | Baseline voice emotion for neutral callers | Emotion-derived tone replaces it  | Any non-neutral emotion detected (score ≥ 0.25) |
| `speed`                         | Base speech rate                           | Never overridden                  | Your choice is always respected                 |
| `volume`                        | Base volume                                | Never overridden                  | Your choice is always respected                 |
| `voice_id`                      | Voice persona                              | Per-agent voice config overrides  | Agent version has voice config set              |
| `keyterms`                      | Domain vocabulary for STT boost            | Merged with service keyterms      | Always additive, never overridden               |
| `correction_categories`         | Domain hints for audio correction          | None                              | Used as additional context                      |
| `sensitive_topics`              | Topics for proactive tone softening        | Falls back to healthcare defaults | Preemptive, not reactive                        |
| `post_call_analysis_enabled`    | Quality scoring on/off                     | None                              | Full PM control                                 |
| `transcript_correction_enabled` | Re-verification on/off                     | None                              | Full PM control                                 |

**Key principle**: workspace managers control the *baseline experience* and *domain knowledge*. The emotion intelligence system overrides the baseline *only when it detects a strong signal*, and always in the direction of more empathy, never less.
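
The `tone` row can be read as a small resolution rule. The sketch below is illustrative only: the `DetectedEmotion` type, the `EMOTION_TO_TONE` mapping, and the tone names are hypothetical, and only the 0.25 threshold comes from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# From the table above: emotion-derived tone applies at score >= 0.25.
EMOTION_THRESHOLD = 0.25

# Hypothetical mapping from a detected emotion to a softer TTS tone.
EMOTION_TO_TONE = {"anxious": "calm", "sad": "warm", "angry": "steady"}


@dataclass
class DetectedEmotion:
    label: str    # "neutral" means there is nothing to adapt to
    score: float  # detection strength in [0, 1]


def effective_tone(baseline_tone: str, emotion: Optional[DetectedEmotion]) -> str:
    """Return the workspace baseline unless a non-neutral emotion clears the threshold."""
    if emotion is None or emotion.label == "neutral":
        return baseline_tone
    if emotion.score < EMOTION_THRESHOLD:
        return baseline_tone
    # Unknown labels fall back to the baseline rather than guessing a tone.
    return EMOTION_TO_TONE.get(emotion.label, baseline_tone)
```

Settings such as `speed` and `volume` would bypass a function like this entirely, since the table lists them as never overridden.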

### Graceful Degradation

Every intelligence layer is best-effort with an explicit fallback. **A failed intelligence layer must never fail a call.**

| Layer                    | Failure Mode                       | Fallback                                                            | Impact                                                       |
| ------------------------ | ---------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Emotion connection**   | Auth error, billing, timeout       | Session continues without emotion detection                         | No emotional adaptation, workspace defaults used             |
| **Emotion segment**      | Processing error, connection close | Consecutive failure counter → disable after 5                       | Degrades gracefully to less data                             |
| **Emotion detection**    | Insufficient data (< 2 segments)   | No emotional steering, default fillers                              | First few seconds may lack adaptation                        |
| **Burst detection**      | No burst events                    | Falls through to prosody-derived emotion                            | Loses immediate reaction, uses rolling average               |
| **Language model**       | No language results                | Coherence defaults to 1.0 (agreement assumed)                       | Loses word-vs-tone disagreement detection                    |
| **Audio verification**   | Timeout or error                   | No corrections injected, call continues                             | Relies on raw STT only                                       |
| **Voice settings**       | Parse error                        | Defaults (filler on, emotion on)                                    | Baseline experience still works                              |
| **Post-call analysis**   | Any error                          | Logged, not raised (fire-and-forget)                                | Quality data missing, call unaffected                        |
| **TTS connection**       | Close/error mid-stream             | Auto-reconnect on next turn                                         | Brief silence, then recovery                                 |
| **STT connection**       | Connection loss                    | Exponential backoff reconnect (max 3 attempts)                      | Brief gap in transcription                                   |
| **Context graph engine** | Backend unavailable                | Falls back to static prompt mode (without context graph navigation) | Agent still converses, just without state machine navigation |
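
As an illustration of the "best-effort with explicit fallback" pattern, here is a minimal sketch of the emotion segment row: failures are swallowed, counted, and the layer switches itself off after five in a row. The class and method names are hypothetical, and treating "after 5" as "on the fifth consecutive failure" is an assumption.

```python
from typing import Any, Callable, Optional


class SegmentFailureGuard:
    """Wraps a best-effort analyzer so its failures never reach the call loop."""

    def __init__(self, max_consecutive_failures: int = 5) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.enabled = True

    def run(self, analyze: Callable[[Any], Any], segment: Any) -> Optional[Any]:
        # Once disabled, skip analysis entirely; the call continues without it.
        if not self.enabled:
            return None
        try:
            result = analyze(segment)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.enabled = False
            return None  # fallback: no emotion data for this segment
        # Any success resets the streak, so only consecutive failures count.
        self.consecutive_failures = 0
        return result
```

A caller that receives `None` simply proceeds with workspace defaults, which matches the "Impact" column: less data, never a failed call.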

## Tool Execution

Skills configured in the context graph execute **asynchronously** during calls. The agent acknowledges the action and continues speaking while tools run in the background. Results arrive as continuation turns.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    A["LLM returns\ntool call"] --> B["Filler plays\nwhile tool runs"]
    B --> C["Tool executes\n(async, background)"]
    C --> D["Result arrives"]
    D --> E["Agent relays\nresult to caller"]
```

### Execution Tiers

Tool calls are routed through an execution tier system that matches the tool's complexity to the right execution model:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    TC["Tool Call"] --> R{"Route by\nexecution tier"}
    R -->|"T1: direct"| T1["Direct Integration\n(single HTTP call)\n< 2 seconds"]
    R -->|"T2: orchestrated"| T2["LLM Agent\n(multi-turn reasoning\nwith tool access)\n2-30 seconds"]
    R -->|"T3: autonomous"| T3["Autonomous Agent\n(extended loop with\ncheckpointing + MCP tools)\n30s - 5 min"]
    R -->|"Integration tool"| IT["Integration Client\n(direct HTTP with\nOAuth2/WIF auth)"]
    R -->|"Fallback"| FB["Legacy execution"]
```

| Tier   | Name         | Execution Model                                      | Latency  | Use Cases                                       |
| ------ | ------------ | ---------------------------------------------------- | -------- | ----------------------------------------------- |
| **T1** | Direct       | Single integration API call, no LLM                  | < 2s     | Patient lookup, allergy check, medication list  |
| **T2** | Orchestrated | Multi-turn LLM agent with tool access                | 2-30s    | Eligibility cascades, multi-step writes         |
| **T3** | Autonomous   | Extended agent loop with checkpointing and MCP tools | 30s-5min | Complex prior auth, cross-system reconciliation |

**T3 autonomous agents** use a full agent SDK with:

* **Custom MCP tools** injected per-task (world model tools, integration tools).
* **Session checkpointing** for pause/resume across retries.
* **Cost caps** per task to prevent runaway execution.
* **Isolated working directories** per task.

**Write-tool deduplication**: all write tools are deduplicated within a session (30-second TTL). Identical tool calls return cached results. Only successful results are cached; errors are always retryable.
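
The deduplication rule can be sketched as a small TTL cache. This is not the platform's implementation: the class name, the cache key (tool name plus canonical JSON of the arguments), and the injectable clock are assumptions made for illustration. Only the 30-second TTL and the "cache successes only" rule come from the text above.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


class WriteToolDeduper:
    """Per-session cache that returns the prior result for identical write calls."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def call(self, tool_name: str, args: Dict[str, Any],
             execute: Callable[[Dict[str, Any]], Any]) -> Any:
        # Assumed notion of "identical": same tool and same arguments.
        key = tool_name + ":" + json.dumps(args, sort_keys=True)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # If execute raises, nothing is cached, so the call stays retryable.
        result = execute(args)
        self._cache[key] = (now, result)
        return result
```

Keeping one instance per session mirrors the "within a session" scope: two different calls never share cached results.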

### Built-in Clinical Tools

Healthcare workspaces get 13 built-in tools automatically, with no integration configuration required:

**Read tools:**

| Tool                        | Purpose                                          | Key Feature                                                        |
| --------------------------- | ------------------------------------------------ | ------------------------------------------------------------------ |
| **Patient lookup**          | Search by DOB, name, phone, or MRN               | DOB preferred for accuracy                                         |
| **Slot search**             | Available appointment slots by location and date | Returns human-readable times + doctor names, caches slot internals |
| **Appointment lookup**      | Patient's existing appointments                  | Returns appointment references for cancel/confirm                  |
| **Semantic patient search** | Fuzzy, embedding-based patient matching          | Handles misspellings and partial information                       |
| **Semantic event search**   | Embedding-based search across clinical events    | Optionally scoped to a specific patient                            |

**Write tools:**

| Tool                       | Purpose                                         | Key Feature                                                           |
| -------------------------- | ----------------------------------------------- | --------------------------------------------------------------------- |
| **Patient create**         | Create patient with automatic deduplication     | Dedup by name + DOB                                                   |
| **Patient update**         | Update contact info (phone, email, address)     | Requires entity reference                                             |
| **Save patient**           | Create-or-update with dedup check               | Accepts natural field names and flexible date formats                 |
| **Schedule appointment**   | Book from slot search results or explicit times | Accepts `slot_ref` from slot search and auto-resolves booking details |
| **Cancel appointment**     | Cancel by appointment reference                 | Writes cancellation event                                             |
| **Confirm appointment**    | Confirm a booked appointment                    | Writes confirmation event                                             |
| **Create insurance**       | Insurance record with carrier fuzzy-matching    | Supports policy holder info                                           |
| **Schedule outbound call** | Schedule a future callback                      | Creates `outbound_task` entity atomically                             |

All write tools pass through the [multi-stage verification pipeline](#multi-stage-verification) before data reaches external systems. All write tools enforce [patient safety isolation](#patient-safety-isolation).

### Call Forwarding

A built-in `forward_call` tool transfers the caller to a human. Two modes:

* **Static forwarding**: workspace-configured fallback for the source number.
* **Location-based forwarding**: the agent selects from location phone numbers in the patient's context.

The agent cannot specify arbitrary phone numbers; the destination always comes from the resolved config or location entity state. When the caller requests a human, the agent must invoke the tool: the transfer is performed by the telephony system, so telling the caller that a transfer is happening does not, by itself, move the call.

{% hint style="info" %}
**Deferred transfer.** Call transfers are deferred until the agent's goodbye message finishes playing. The transfer is cancellable by barge-in or operator join.
{% endhint %}

## Audio Verification

When the agent needs to capture structured data (names, dates, phone numbers, insurance IDs), it can trigger audio verification, sending the caller's raw audio for AI-powered correction alongside the real-time transcript.

This catches STT errors on structured data that streaming transcription commonly gets wrong: proper names, alphanumeric IDs, phone numbers, and dates.

**Domain-aware**: `correction_categories` from [voice settings](/developer-guide/platform-api/platform-api/workspaces.md#voice-settings) are injected as domain hints. This tells the correction model: *"This workspace commonly handles medication names and insurance carriers. STT frequently gets these wrong. Pay extra attention."*

### Correction Output

Corrections are structured as field-level pairs showing what STT heard versus the corrected value:

```
name: "Micah Adeline" → "Mika Adlin" (confidence: 9)
dob: "March 15 1990" → "1990-03-15" (confidence: 8)
```

### Correction Confidence

| Level                 | Score | Agent Behavior                                            |
| --------------------- | ----- | --------------------------------------------------------- |
| **Certain**           | 8-9   | Use corrected value directly without confirming           |
| **Likely**            | 5-7   | Confirm with caller ("I have \[value], is that correct?") |
| **Uncertain**         | 1-4   | Ask caller to spell out or repeat slowly                  |
| **Both models wrong** | -     | Audio quality is poor; ask for letter-by-letter spelling  |

Observer events include the original STT value, the corrected value, and the numeric confidence, enabling frontend visualization of correction accuracy.
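
The confidence table maps directly onto a small decision function. The sketch below is a hedged illustration: the function name and the returned action strings are hypothetical, and using `None` to represent the "both models wrong" row is an assumption.

```python
from typing import Optional


def correction_action(confidence: Optional[int]) -> str:
    """Map a correction confidence score to the agent behavior in the table above."""
    if confidence is None:
        # Assumed encoding of "both models wrong": no usable score at all.
        return "spell_letter_by_letter"
    if 8 <= confidence <= 9:
        return "use_directly"
    if 5 <= confidence <= 7:
        return "confirm_with_caller"
    if 1 <= confidence <= 4:
        return "ask_to_repeat"
    # Scores outside the documented 1-9 range are rejected rather than guessed at.
    raise ValueError(f"confidence out of documented range: {confidence}")
```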

## Safety & Monitoring

### Conversation Monitor

An embedding-based safety detection system evaluates every turn against configured safety concepts using a two-stage pipeline:

```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
    A["Caller transcript"] --> B["Embed transcript"]
    B --> C["Cosine similarity\nvs all concept vectors\n(matrix multiply, <1ms)"]
    C --> D{"Above\nthreshold?"}
    D -->|Yes| E["AI Judge\n(structured output:\naction + reasoning)"]
    D -->|No| F["No action"]
    E --> G{"Decision"}
    G -->|hard_escalate| H["Interrupt agent +\nimmediate escalation"]
    G -->|soft_escalate| I["Escalate after\ncurrent turn completes"]
    G -->|alert| J["Log event only"]
    G -->|ignore| F
    D -->|"Standalone\n≥ 0.85"| H
```

**Standalone fallback**: if semantic similarity exceeds a high threshold (default 0.85), escalation triggers immediately without waiting for the AI judge, providing a safety net even if the judge model is unavailable.
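
The two-stage flow, including the standalone fallback, can be sketched in plain Python. Everything except the 0.85 default is an assumption: the function names, the `judge` callback signature, and the 0.6 judge threshold are hypothetical, and a production system would compare against all concept vectors with one matrix multiply rather than a loop.

```python
import math
from typing import Callable, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def evaluate_turn(
    turn_vector: List[float],
    concept_vectors: Dict[str, List[float]],
    judge: Callable[[str, float], str],
    judge_threshold: float = 0.6,        # hypothetical value
    standalone_threshold: float = 0.85,  # default quoted above
) -> str:
    """Return hard_escalate, soft_escalate, alert, ignore, or no_action."""
    best_concept, best_similarity = "", -1.0
    for name, vector in concept_vectors.items():
        similarity = cosine(turn_vector, vector)
        if similarity > best_similarity:
            best_concept, best_similarity = name, similarity
    # Standalone fallback: escalate without consulting the judge at all.
    if best_similarity >= standalone_threshold:
        return "hard_escalate"
    if best_similarity >= judge_threshold:
        return judge(best_concept, best_similarity)
    return "no_action"
```

Checking the standalone threshold before calling the judge is what makes the safety net independent of judge availability.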

**Default safety concepts** (always active): suicidal ideation, self harm, domestic violence, adverse drug reaction, post-discharge red flag. Custom concepts can be added via the [Safety API](/developer-guide/platform-api/platform-api/safety.md) with pre-computed embeddings.

### Auto-Escalation

When an escalation triggers, the system:

1. Writes an escalation event to the world model (dual-entity: both call and operator entities).
2. Notifies the [operator dashboard](/developer-guide/platform-api/platform-api/operators.md).
3. For hard escalations, immediately suspends the AI agent pending human intervention.

## Observer WebSocket

Monitor active calls in real time via a cross-pod WebSocket connection:

```
WS /agent/observe/{call_sid}?token={api_key}
```

Requires a valid workspace API key. Any observer instance can monitor any active call in the workspace, regardless of which pod handles the call (events are distributed via pub/sub).

**Late-join replay**: observers connecting mid-call receive a buffered replay of recent events before transitioning to the live stream. Events carry monotonic sequence numbers for ordering.
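
A minimal sketch of how late-join replay with monotonic sequence numbers might work. The class name, the `seq` field name, and the buffer size are hypothetical; the source states only that recent events are buffered and that events carry sequence numbers.

```python
from collections import deque
from itertools import count
from typing import Any, Deque, Dict, List


class ObserverEventBuffer:
    """Keeps the most recent events so late-joining observers can catch up."""

    def __init__(self, max_events: int = 200) -> None:  # size is an assumption
        self._events: Deque[Dict[str, Any]] = deque(maxlen=max_events)
        self._next_seq = count(1)

    def publish(self, event_type: str, data: Dict[str, Any]) -> Dict[str, Any]:
        event = {"seq": next(self._next_seq), "type": event_type, "data": data}
        self._events.append(event)  # oldest events fall off when the buffer is full
        return event

    def replay(self, after_seq: int = 0) -> List[Dict[str, Any]]:
        """Events newer than after_seq, in order; a new observer passes 0."""
        return [event for event in self._events if event["seq"] > after_seq]
```

On the observer side, remembering the highest sequence number seen lets a client drop duplicates if the replay and the live stream overlap.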

### Event Types

| Event                 | Key Data                                                                             | Source           |
| --------------------- | ------------------------------------------------------------------------------------ | ---------------- |
| `session_start`       | `call_sid`, `service_id`, `workspace_id`, `initial_state`, `trace_id`                | Session init     |
| `session_info`        | Full call snapshot (sent on observer connect)                                        | Observer connect |
| `user_transcript`     | `transcript`, `emotion_label`, `emotion_valence`                                     | Turn controller  |
| `agent_transcript`    | `transcript`, `action`, `interrupted`                                                | Speaker          |
| `state_transition`    | `previous_state`, `next_state`                                                       | Turn controller  |
| `tool_call_started`   | `tool_name`, `call_id`, `input`                                                      | Turn controller  |
| `tool_call_completed` | `tool_name`, `duration_ms`, `output` (truncated), `succeeded`, `error_message`       | Turn controller  |
| `nav_timing`          | `nav_ms`, `render_ms`, `total_ms`, `input_tokens`, `output_tokens`, `model`, `state` | Turn controller  |
| `latency`             | `e2e_ttfb_ms`, `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms`, `continuation`   | Speaker          |
| `emotion`             | `dominant`, `valence`, `arousal`                                                     | Transport        |
| `session_end`         | `call_sid`, `duration_s`, `turns`, `completion_reason`, `final_state`                | Session shutdown |
| `injected_event`      | `message`, `sender`, `event_type`                                                    | Turn controller  |
| `ping`                | (empty)                                                                              | Keepalive (30s)  |

## Session Event Injection

External systems can inject events into active voice sessions. The agent processes injected events through its response generation (without context graph navigation) and speaks a natural response. This enables real-time interaction with live calls from EHR systems, operator dashboards, or any backend service.

### Injection Paths

| Path                        | Endpoint                                               | Auth                        | Use Case                                              |
| --------------------------- | ------------------------------------------------------ | --------------------------- | ----------------------------------------------------- |
| **Voice Agent HTTP**        | `POST /agent/sessions/{call_sid}/event`                | Bearer token                | Direct injection from backend services                |
| **Platform API (general)**  | `POST /v1/{workspace_id}/sessions/{call_sid}/inject`   | API key                     | Frontend or third-party injection with workspace auth |
| **Platform API (operator)** | `POST /v1/{workspace_id}/operators/{id}/send-guidance` | API key (`Operator:Update`) | Operator-scoped guidance with identity tracking       |
| **WebSocket control**       | Text frame on `/test-call` or `/direct-stream`         | Session auth                | Developer playground and testing                      |

### Event Types

| Type             | Behavior                                               | Example                                        |
| ---------------- | ------------------------------------------------------ | ---------------------------------------------- |
| `external_event` | Queues behind current speech. Cancels silence monitor. | "Appointment confirmed for 2pm tomorrow"       |
| `guidance`       | Interrupts current speech and cancels silence monitor. | "Ask for their insurance ID before confirming" |

The distinction matters: external events carry factual information that can wait for the agent to finish speaking, while guidance carries instructions that are time-sensitive and should be acted on immediately.

### Request Format

```
POST /agent/sessions/{call_sid}/event
Authorization: Bearer <api_key>
```

```json
{
  "message": "The patient's insurance has been verified",
  "sender": "ehr_system",
  "event_type": "external_event"
}
```

The `event_type` field accepts `"external_event"` or `"guidance"`. The `sender` field is recorded in the call transcript for attribution.

### Response

The endpoint returns delivery status indicating whether the event was received by the session:

```json
{
  "status": "delivered",
  "call_sid": "CA1234..."
}
```

A `status` of `"queued_no_subscriber"` indicates the event was published but no active session was listening. This can happen during a brief window when a session is initializing or if the call has already ended.
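
The request and response handling can be sketched with the standard library alone. The helper names and the example base URL are hypothetical, and the `Content-Type` header is an assumption based on the JSON body. The function only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

VALID_EVENT_TYPES = ("external_event", "guidance")


def build_inject_request(base_url: str, call_sid: str, api_key: str,
                         message: str, sender: str,
                         event_type: str = "external_event") -> urllib.request.Request:
    """Build (but do not send) the POST for the Voice Agent HTTP injection path."""
    if event_type not in VALID_EVENT_TYPES:
        raise ValueError(f"event_type must be one of {VALID_EVENT_TYPES}")
    body = json.dumps(
        {"message": message, "sender": sender, "event_type": event_type}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/agent/sessions/{call_sid}/event",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed
        },
    )


def was_delivered(response_body: dict) -> bool:
    """True only for "delivered"; "queued_no_subscriber" means nobody was listening."""
    return response_body.get("status") == "delivered"
```

Sending is then a matter of `urllib.request.urlopen(request)` and parsing the JSON body. A caller that gets `queued_no_subscriber` may want to check whether the call is still active before retrying.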

### Cross-Pod Architecture

HTTP injections publish to a per-session pub/sub channel (`va:inject:{call_sid}`). The session subscribes to this channel at startup and drains pending events before each STT poll. Injection works regardless of which server pod is handling the call.

The subscription reconnects with exponential backoff (1s to 10s cap) if the pub/sub connection is interrupted. A transient infrastructure outage never kills a voice session.
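
The reconnect schedule can be illustrated with a short generator. The doubling factor is an assumption; the source gives only the 1-second start and the 10-second cap.

```python
from itertools import islice
from typing import Iterator, List


def backoff_delays(initial: float = 1.0, cap: float = 10.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically and then stay at the cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def first_delays(n: int) -> List[float]:
    """Convenience helper for inspecting the first n delays."""
    return list(islice(backoff_delays(), n))
```

Because the generator never terminates, a subscriber using it keeps retrying for as long as the session lives, which is consistent with an outage never ending the call.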

### Active Sessions

List currently active sessions via the platform API:

```
GET /v1/{workspace_id}/sessions/active
Authorization: Bearer <api_key>
```

Returns a real-time list of active sessions with call metadata. This endpoint proxies to the voice agent's distributed active call registry.

### WebSocket Control Channel

Test calls and direct streams accept text-frame control messages for injection and session control:

```jsonc
// Inject an external event
{"type": "inject_event", "message": "...", "sender": "..."}

// Inject operator guidance (interrupts speech)
{"type": "inject_guidance", "message": "..."}

// Force context refresh (reloads patient data)
{"type": "refresh_context"}

// Stop the session
{"type": "stop"}
```

### Test-Call Scenarios

The `/test-call` WebSocket endpoint supports scenario-based testing:

| Parameter                 | Default      | Description                                                                                |
| ------------------------- | ------------ | ------------------------------------------------------------------------------------------ |
| `scenario`                | `inbound`    | `inbound` (agent greets first), `outbound` (task context greeting), `silent` (no greeting) |
| `caller_id`               | `playground` | Simulated caller phone number                                                              |
| `outbound_task_entity_id` | -            | Entity ID for outbound task context (required for `outbound` scenario)                     |
| `system_prompt`           | -            | Freeform prompt override (takes precedence over scenario-derived prompts)                  |

```
WS /agent/test-call?token={api_key}&scenario=outbound&outbound_task_entity_id=123&caller_id=+15551234567
```
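
When building this URL in code, it is usually safer to percent-encode the query string, because a raw `+` in a query is commonly decoded as a space. The sketch below is an assumption-laden helper: the function name, the `wss://` scheme, and the host are hypothetical, while the parameter names and the `outbound` requirement come from the table above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

SCENARIOS = ("inbound", "outbound", "silent")


def build_test_call_url(host: str, api_key: str, scenario: str = "inbound",
                        caller_id: str = "playground",
                        outbound_task_entity_id: Optional[str] = None,
                        system_prompt: Optional[str] = None) -> str:
    if scenario not in SCENARIOS:
        raise ValueError(f"scenario must be one of {SCENARIOS}")
    if scenario == "outbound" and outbound_task_entity_id is None:
        raise ValueError("outbound scenario requires outbound_task_entity_id")
    params = {"token": api_key, "scenario": scenario, "caller_id": caller_id}
    if outbound_task_entity_id is not None:
        params["outbound_task_entity_id"] = outbound_task_entity_id
    if system_prompt is not None:
        params["system_prompt"] = system_prompt
    # urlencode percent-encodes "+" as %2B so an E.164 number survives decoding.
    return f"wss://{host}/agent/test-call?{urlencode(params)}"


def caller_id_from_url(url: str) -> str:
    """Decode the caller_id back out of a built URL (useful in tests)."""
    return parse_qs(urlsplit(url).query)["caller_id"][0]
```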

## Call Record & Persistence

Every call produces a detailed record persisted to the database:

* **Turns**: each turn carries a 5-layer timing model (all fields in milliseconds):
  * **Layer 1 (STT)**: `user_speech_start_ms`, `user_speech_end_ms` (speech boundaries).
  * **Layer 2 (Engine)**: `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms` (processing latency breakdown).
  * **Layer 4 (TTS/Transport)**: `agent_speech_start_ms`, `agent_speech_end_ms` (when agent audio played).
* **Tool calls**: name, input, output, duration, success/failure.
* **State transitions**: full context graph navigation history.
* **Emotional summary**: see [below](#emotional-summary).
* **Escalation history**: full escalation lifecycle if operator joined.
* **Config snapshot**: version set, agent version, context graph version used.

## Calls API

### Active Calls

```
GET /agent/calls/active
Authorization: Bearer <api_key>
```

Lists all currently active calls across the workspace. Active call state is maintained in a distributed registry, so any API pod can serve this request regardless of which pod handles the call.

### Call History

```
GET /agent/calls?limit=20&continuation_token=0
Authorization: Bearer <api_key>
```

### Call Detail

```
GET /agent/calls/{call_id}
Authorization: Bearer <api_key>
```

Returns the full call record, including turns with the timing model, tool calls, state transitions, emotional summary, escalation history, safety state, and config snapshot.

### Recordings

| Endpoint                                   | Description                                                              |
| ------------------------------------------ | ------------------------------------------------------------------------ |
| `GET /calls/{call_id}/recording/stereo`    | Stereo WAV (caller left channel, agent right channel)                    |
| `GET /calls/{call_id}/recording/waveform`  | Amplitude envelope for timeline visualization                            |
| `GET /calls/{call_id}/recording/{channel}` | Single channel WAV (`caller` or `agent`)                                 |
| `POST /calls/{call_id}/verify-transcript`  | Re-transcribe with high-accuracy batch model for ground-truth timestamps |

### Outbound Calls

```
POST /agent/create_outbound_call
Authorization: Bearer <api_key>
```

## Text Sessions

### WebSocket Connection

For bidirectional real-time text conversations, connect via WebSocket at the session connect endpoint. This is useful for chat interfaces that need persistent connections with immediate message delivery in both directions.

**Connection URL:** `wss://{api-host}/v1/{workspace_id}/sessions/connect`

**Query parameters:**

| Parameter         | Type          | Required | Description                                |
| ----------------- | ------------- | -------- | ------------------------------------------ |
| `service_id`      | string (UUID) | Yes      | The service to connect to                  |
| `entity_id`       | string (UUID) | Yes      | The entity (patient/user) for this session |
| `conversation_id` | string (UUID) | No       | Resume an existing conversation            |
| `tool_events`     | boolean       | No       | Emit tool call events (default: true)      |

**Authentication:** Pass your API key or JWT via the WebSocket subprotocol header:

```
Sec-WebSocket-Protocol: auth, <your-api-key-or-jwt>
```

Query-parameter tokens are not supported on this endpoint, which prevents credentials from leaking into server logs.
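
A hedged sketch of assembling the handshake inputs: the URL carries only the documented query parameters, and the credential travels as the second subprotocol entry. The function name and host are hypothetical, and encoding `tool_events` as the string `false` is an assumption. Most WebSocket clients accept a list of subprotocols and serialize it into the `Sec-WebSocket-Protocol` header.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode


def session_connect_handshake(api_host: str, workspace_id: str, service_id: str,
                              entity_id: str, credential: str,
                              conversation_id: Optional[str] = None,
                              tool_events: bool = True) -> Tuple[str, List[str]]:
    """Return (url, subprotocols) for the session connect endpoint."""
    params = {"service_id": service_id, "entity_id": entity_id}
    if conversation_id is not None:
        params["conversation_id"] = conversation_id
    if not tool_events:
        params["tool_events"] = "false"  # assumed boolean encoding; default is true
    url = f"wss://{api_host}/v1/{workspace_id}/sessions/connect?{urlencode(params)}"
    # The credential goes in the subprotocol list, never in the URL.
    return url, ["auth", credential]
```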

**Close codes:**

| Code | Meaning                                    |
| ---- | ------------------------------------------ |
| 4001 | Missing or invalid parameters              |
| 4403 | Authentication failed or service not found |
| 4408 | Session timed out                          |
| 4500 | Upstream connection failed                 |
| 4502 | Upstream connection rejected               |
| 4503 | Agent service unavailable                  |

Messages are exchanged as JSON text frames in both directions. The server sends the same event types used by the SSE streaming endpoint (`token`, `tool_call_started`, `tool_call_completed`, `thinking`, `message`, `done`, `error`).

### REST Streaming (SMS, WhatsApp, and WebSocket)

Text sessions run the same context graph engine as voice calls over three channels: SMS, WhatsApp, and WebSocket. All three share the same signal-driven session actor model. Inbound messages are routed through a shared producer that resolves the patient to a workspace and service, finds or creates the session actor, and pushes the message signal to the actor's queue.

```mermaid
flowchart LR
    sms["SMS\nWebhook"] -->|LPUSH| queue["Signal Queue"]
    wa["WhatsApp\nWebhook"] -->|LPUSH| queue
    ws["WebSocket\n/agent/text-stream"] -->|LPUSH| queue
    rest["REST\nPOST /conversations/messages"] -->|proxy| queue
    queue --> actor["TextOrchestrator\nCut > Navigate > Engage"]
    actor --> transport["Channel Transport\n(SMS, WhatsApp, or WS frames)"]
    actor -->|freeze/thaw| store["Conversation Store"]
```

#### WebSocket Text Chat

```
WS /agent/text-stream?token={api_key}&workspace_id={id}&service_id={id}
```

Real-time bidirectional text chat over a persistent WebSocket connection. The same TextOrchestrator actor that powers SMS and WhatsApp sessions runs behind the WebSocket, so the agent uses the same context graphs, tools, and safety rules.

| Query Parameter   | Type   | Required | Description                                          |
| ----------------- | ------ | -------- | ---------------------------------------------------- |
| `token`           | string | Yes      | JWT or API key                                       |
| `workspace_id`    | string | Yes      | Workspace ID                                         |
| `service_id`      | string | Yes      | Service (agent) to run                               |
| `conversation_id` | string | No       | Resume an existing conversation (thaws frozen state) |
| `entity_id`       | string | No       | Patient entity ID for world model context            |

**Client-to-server messages:**

| Type      | Payload                              | Description          |
| --------- | ------------------------------------ | -------------------- |
| `message` | `{"type": "message", "text": "..."}` | Send a user message  |
| `stop`    | `{"type": "stop"}`                   | End the conversation |

**Server-to-client messages:**

| Type              | Payload                                                                      | Description            |
| ----------------- | ---------------------------------------------------------------------------- | ---------------------- |
| `session_started` | `{"type": "session_started", "session_id": "...", "conversation_id": "..."}` | Connection established |
| `message`         | `{"type": "message", "text": "..."}`                                         | Agent response         |
| `typing`          | `{"type": "typing"}`                                                         | Agent is composing     |
| `error`           | `{"type": "error", "message": "..."}`                                        | Error occurred         |
| `session_ended`   | `{"type": "session_ended", "reason": "..."}`                                 | Conversation ended     |

**Conversation persistence:** when the WebSocket disconnects, the conversation freezes: turns are compressed into a natural-language plan and saved alongside recent verbatim turns. Reconnecting with the same `conversation_id` thaws the conversation: the plan and turns are loaded and injected into the reasoning engine. The agent resumes with full context and does not re-send a greeting.
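
On the client side, resuming depends on keeping the `conversation_id` from the `session_started` frame. The sketch below is a hypothetical client-state helper that dispatches on the server-to-client message types listed above. The class and attribute names are illustrative, and ignoring unknown frame types is an assumption.

```python
import json
from typing import Dict, List, Optional


class TextStreamClientState:
    """Tracks what a /agent/text-stream client needs in order to render and resume."""

    def __init__(self) -> None:
        self.conversation_id: Optional[str] = None
        self.agent_messages: List[str] = []
        self.errors: List[str] = []
        self.typing = False
        self.ended_reason: Optional[str] = None

    def handle_frame(self, raw: str) -> None:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "session_started":
            self.conversation_id = frame["conversation_id"]
        elif kind == "typing":
            self.typing = True
        elif kind == "message":
            self.typing = False
            self.agent_messages.append(frame["text"])
        elif kind == "error":
            self.errors.append(frame["message"])
        elif kind == "session_ended":
            self.ended_reason = frame["reason"]
        # Unknown frame types are ignored.

    def resume_params(self) -> Dict[str, str]:
        """Extra query parameters to add when reconnecting to thaw the conversation."""
        if self.conversation_id is None:
            return {}
        return {"conversation_id": self.conversation_id}
```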

**Residency policy:** 1-hour max duration, 5-minute idle timeout, compress-on-freeze enabled.

**Close codes:**

| Code   | Meaning                                                        |
| ------ | -------------------------------------------------------------- |
| `1000` | Normal close                                                   |
| `4001` | Missing required params or invalid token                       |
| `4003` | Workspace mismatch                                             |
| `4200` | Engine initialization failed (e.g., unpublished agent version) |

#### REST Conversations

```
POST /v1/{workspace_id}/conversations/messages
Authorization: Bearer <api_key>
```

Synchronous single-message interaction: send one message and receive the agent's response in the same HTTP call. Conversations persist between requests via `conversation_id`.

| Field             | Type   | Required | Description                                      |
| ----------------- | ------ | -------- | ------------------------------------------------ |
| `service_id`      | string | Yes      | Agent service ID                                 |
| `message`         | string | Yes      | User message (max 10,000 characters)             |
| `conversation_id` | string | No       | Resume existing conversation. Omit to start new. |
| `entity_id`       | string | No       | Patient entity ID for context                    |

**Response:**

| Field             | Type   | Description                                             |
| ----------------- | ------ | ------------------------------------------------------- |
| `conversation_id` | string | Conversation ID (use this to continue the conversation) |
| `messages`        | array  | Agent response messages, each with `role` and `text`    |
| `status`          | string | `active`, `completed`, or `error`                       |

For real-time streaming with typing indicators, use the WebSocket endpoint above.
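
A minimal sketch of the request body and of threading `conversation_id` through successive calls. The helper names are hypothetical, and stopping when `status` is anything other than `active` is an assumption about how a client might reasonably behave.

```python
from typing import Any, Dict, Optional

MAX_MESSAGE_CHARS = 10_000  # limit from the request table above


def build_conversation_body(service_id: str, message: str,
                            conversation_id: Optional[str] = None,
                            entity_id: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the JSON body; omit conversation_id to start a new conversation."""
    if len(message) > MAX_MESSAGE_CHARS:
        raise ValueError(f"message exceeds {MAX_MESSAGE_CHARS} characters")
    body: Dict[str, Any] = {"service_id": service_id, "message": message}
    if conversation_id is not None:
        body["conversation_id"] = conversation_id
    if entity_id is not None:
        body["entity_id"] = entity_id
    return body


def next_conversation_id(response: Dict[str, Any]) -> Optional[str]:
    """Return the ID to reuse on the next request, or None if the conversation is over."""
    if response.get("status") == "active":
        return response["conversation_id"]
    return None
```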

#### Outbound SMS

```
POST /agent/create_outbound_text
Authorization: Bearer <api_key>
```

Creates an SMS-based conversation. The text session sends a greeting, then conducts a multi-turn conversation over SMS.

| Parameter         | Type           | Required | Description                                          |
| ----------------- | -------------- | -------- | ---------------------------------------------------- |
| `phone_to`        | string (E.164) | Yes      | Patient phone number                                 |
| `phone_from`      | string (E.164) | Yes      | Agent phone number (must be configured in workspace) |
| `workspace_id`    | string         | Yes      | Workspace ID                                         |
| `service_id`      | string         | Yes      | Service (agent) to run                               |
| `entity_id`       | string         | No       | World model entity ID for patient context            |
| `surface_id`      | string         | No       | Surface ID to deliver inline in the conversation     |
| `idempotency_key` | string         | No       | Client-provided dedup key (cached 5 minutes)         |

Returns `session_id`, `status` (`created` or `already_active`), and `conversation_id`. Rate limited to 20 requests per workspace per minute.

**Consent enforcement**: returns `403 Forbidden` if the patient has opted out of SMS. Opt-out is tracked when patients text STOP, UNSUBSCRIBE, CANCEL, END, or QUIT to the agent's number.
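
Opt-out tracking can be pictured as a keyword check on inbound texts feeding a consent lookup on outbound sends. This is a sketch, not the platform's logic: the names are hypothetical, and the matching rule (the whole message, trimmed and case-insensitive) is an assumption. Only the keyword list comes from the text above.

```python
from typing import Set

OPT_OUT_KEYWORDS = frozenset({"STOP", "UNSUBSCRIBE", "CANCEL", "END", "QUIT"})


def is_opt_out(text: str) -> bool:
    """Assumed rule: the entire message, trimmed and upper-cased, is a keyword."""
    return text.strip().upper() in OPT_OUT_KEYWORDS


class SmsConsentRegistry:
    """In-memory stand-in for wherever opt-out state is actually stored."""

    def __init__(self) -> None:
        self._opted_out: Set[str] = set()

    def record_inbound(self, phone_number: str, text: str) -> None:
        if is_opt_out(text):
            self._opted_out.add(phone_number)

    def can_send(self, phone_number: str) -> bool:
        # A False here corresponds to the 403 Forbidden described above.
        return phone_number not in self._opted_out
```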

**Inbound SMS**: when a patient texts the agent's phone number, a text session is automatically created if one is not already active for that phone pair.

#### Inbound WhatsApp

WhatsApp text sessions are created automatically when a patient messages the agent's WhatsApp number. The platform validates the inbound message signature, parses the payload, resolves the patient's phone number to a workspace and service via a phone number mapping, and routes the message to a session actor. Phone numbers are normalized to E.164 format so international numbers are handled correctly regardless of how the messaging provider formats them.

Session continuity is keyed on the patient's phone number. If an active session exists for that phone number, the message is routed to the existing session. If not, a new session is created with full patient context loaded from the world model. WhatsApp sessions have a longer default session window than SMS, reflecting the platform's 24-hour messaging reply policy.
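The routing behavior described above can be pictured with a small in-memory sketch. Everything here is illustrative: `normalize_e164` and `SessionRegistry` are hypothetical names, the normalization rules are a simplified assumption rather than the platform's actual logic, and real sessions also load patient context and expire after the session window.

```python
import re


def normalize_e164(raw):
    """Best-effort E.164 normalization (simplified, illustrative only)."""
    raw = raw.strip()
    if raw.lower().startswith("whatsapp:"):  # some providers prefix the channel
        raw = raw[len("whatsapp:"):].strip()
    has_plus = raw.startswith("+")
    digits = re.sub(r"\D", "", raw)          # drop spaces, dashes, parentheses
    if not has_plus and digits.startswith("00"):
        digits = digits[2:]                  # "00" international call prefix
    if not digits:
        raise ValueError("no digits in phone number")
    # Assumption: the number already carries its country code.
    return "+" + digits


class SessionRegistry:
    """Maps a normalized patient phone number to its active session."""

    def __init__(self):
        self._active = {}

    def route(self, raw_phone):
        """Return (session_id, created) for an inbound message."""
        phone = normalize_e164(raw_phone)
        if phone in self._active:
            return self._active[phone], False   # reuse the existing session
        session_id = f"session-{len(self._active) + 1}"
        self._active[phone] = session_id        # new session for this number
        return session_id, True
```

Two messages that arrive as `whatsapp:+1 (415) 555-0100` and `+14155550100` resolve to the same key, so the second one is routed to the session the first one created.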

Delivery status events (delivered, read, rejected) are tracked separately from text content and do not create conversation turns.

### WhatsApp Voice Notes

```
POST /v1/{workspace_id}/services/{service_id}/voice-turn
Authorization: Bearer <api_key>
Content-Type: multipart/form-data
```

Handles voice note conversations on WhatsApp (and any channel that sends completed audio recordings rather than real-time streams). The platform transcribes the audio, runs the same reasoning engine pipeline as voice calls and text sessions, synthesizes the agent's spoken reply, and returns it as OGG Opus audio.

| Parameter      | Type           | Required | Description                                                                                     |
| -------------- | -------------- | -------- | ----------------------------------------------------------------------------------------------- |
| `audio`        | file           | Yes      | Audio recording (any format the STT engine accepts). Max 20 MB.                                 |
| `phone_number` | string (E.164) | Yes      | Caller phone number. Session continuity is keyed on `(workspace_id, service_id, phone_number)`. |

**Response**: `200 OK` with `audio/ogg` body (OGG Opus). `204 No Content` if the agent has nothing to say. `409 Conflict` if a turn for this phone number is already in progress.

Sessions are server-managed, so no session ID is required. The platform maintains conversation state across turns keyed on the caller's phone number, which lets patients switch between text and voice notes in the same thread without losing context.
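A client needs to distinguish the three documented outcomes. The sketch below maps a status code to a small result object; `VoiceTurnResult` and `interpret_voice_turn` are hypothetical names, and treating `409 Conflict` as "retry later" is an assumed client policy rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceTurnResult:
    audio: Optional[bytes]  # OGG Opus bytes when the agent replied
    retry: bool             # True when a turn is already in progress


def interpret_voice_turn(status: int, content_type: str, body: bytes) -> VoiceTurnResult:
    """Map a voice-turn HTTP response onto a client-side result."""
    if status == 200:
        if not content_type.startswith("audio/ogg"):
            raise ValueError(f"expected audio/ogg, got {content_type!r}")
        return VoiceTurnResult(audio=body, retry=False)
    if status == 204:
        # The agent has nothing to say; there is no audio to play.
        return VoiceTurnResult(audio=None, retry=False)
    if status == 409:
        # Another turn for this phone number is still being processed.
        return VoiceTurnResult(audio=None, retry=True)
    raise RuntimeError(f"unexpected voice-turn status: {status}")
```

Keeping the interpretation separate from the HTTP call makes it straightforward to unit-test without a network connection.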

### Desktop Sessions

```
POST /v1/{workspace_id}/desktop-sessions
Authorization: Bearer <api_key>
```

Proxies RDP desktop automation sessions so that clients can drive them without direct access to cloud infrastructure. The platform forwards session lifecycle operations to the desktop sidecar and handles authentication, workspace isolation, and audit logging.

| Endpoint                                | Description                                                    |
| --------------------------------------- | -------------------------------------------------------------- |
| `POST /desktop-sessions`                | Create a session (one active per sidecar, admin role required) |
| `GET /desktop-sessions/{id}/screenshot` | Capture current screen                                         |
| `POST /desktop-sessions/{id}/action`    | Send mouse/keyboard action                                     |
| `GET /desktop-sessions/{id}/status`     | Check session state                                            |
| `DELETE /desktop-sessions/{id}`         | Disconnect session                                             |

All operations emit PHI-access audit events and are rate-limited. A distributed lock prevents concurrent session creation across API pods.

## Agent Trace & Debugging

Three endpoints provide deep inspection into agent reasoning and call quality.

### Execution Trace

```
GET /agent/calls/{call_id}/trace
```

Per-turn execution log showing the full sequence of agent decisions. Each turn includes:

| Field           | Type           | Description                                                                            |
| --------------- | -------------- | -------------------------------------------------------------------------------------- |
| `action`        | string         | The action taken (speak, navigate, tool call, escalate)                                |
| `signal_kind`   | string         | What triggered this turn (caller speech, tool result, state transition, barge-in)      |
| `effect_kind`   | string         | What the agent did (spoke, invoked tool, navigated, escalated)                         |
| `tools`         | array          | Tool calls with name, parameters, result, and duration                                 |
| `state`         | string         | Context graph state at this turn                                                       |
| `emotion`       | object         | Detected caller emotion at this turn                                                   |
| `inner_thought` | string or null | Agent's internal reasoning (when available)                                            |
| `latency_ms`    | integer        | Response latency for this turn                                                         |
| `barge_in`      | object or null | Barge-in details if the agent was interrupted (interrupted text, discarded utterances) |
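Once fetched, a trace is easy to reduce to a few headline numbers. The sketch below assumes the response has already been parsed into a list of per-turn dictionaries carrying the fields in the table; the exact envelope and the literal values of `effect_kind` are not specified on this page, and `summarize_trace` is a hypothetical helper.

```python
from collections import Counter


def summarize_trace(turns):
    """Reduce a list of per-turn trace dicts to headline metrics."""
    slowest = max(turns, key=lambda t: t["latency_ms"]) if turns else None
    return {
        "turn_count": len(turns),
        # barge_in is an object when the agent was interrupted, else null.
        "barge_ins": sum(1 for t in turns if t.get("barge_in") is not None),
        "effects": dict(Counter(t["effect_kind"] for t in turns)),
        "tool_calls": sum(len(t.get("tools") or []) for t in turns),
        "max_latency_ms": slowest["latency_ms"] if slowest else 0,
        # Context graph state where the slowest turn occurred.
        "slowest_state": slowest["state"] if slowest else None,
    }
```

Knowing which context graph state produced the slowest turn narrows down where to look in the prompt trace.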

### Prompt Trace

```
GET /agent/calls/{call_id}/prompts
```

Full LLM prompts for each turn. Each entry includes the system prompt, conversation history, tool definitions, and the model's response. Useful for debugging unexpected agent behavior by seeing exactly what the model received and generated.

### Call Trace Analysis

```
GET /agent/calls/{call_id}/trace-analysis
```

Audio-native intelligence analysis computed asynchronously after call completion.

| Field                  | Type   | Description                                                  |
| ---------------------- | ------ | ------------------------------------------------------------ |
| `status`               | string | `completed`, `pending`, or `unavailable`                     |
| `emotional_arc`        | object | How the caller's emotional state evolved throughout the call |
| `key_moments`          | array  | Critical decision points with causal attribution             |
| `counterfactuals`      | array  | What would have happened with different agent decisions      |
| `coaching`             | array  | Specific recommendations for configuration improvement       |
| `interaction_dynamics` | object | Turn-taking patterns, engagement metrics, flow analysis      |

Returns `status: pending` while analysis is in progress and `status: unavailable` if analysis is not configured for the workspace.
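Because the analysis is computed asynchronously, clients typically poll. A minimal polling sketch follows; `wait_for_trace_analysis` is a hypothetical helper, `fetch` stands in for whatever function performs the `GET` and returns the parsed JSON, and the interval and attempt limit are arbitrary defaults, not platform recommendations.

```python
import time


def wait_for_trace_analysis(fetch, call_id, interval_s=2.0,
                            max_attempts=30, sleep=time.sleep):
    """Poll until the analysis completes.

    Returns the analysis dict, or None when analysis is unavailable
    for the workspace. Raises TimeoutError if it stays pending.
    """
    for attempt in range(max_attempts):
        result = fetch(call_id)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "unavailable":
            return None  # not configured; polling again will not help
        if status != "pending":
            raise ValueError(f"unexpected status: {status!r}")
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"trace analysis for {call_id} still pending")
```

Passing `fetch` and `sleep` as parameters keeps the loop independent of any HTTP library and makes it testable without real delays.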

Equivalent trace and prompt endpoints are available for simulation sessions at `/internal/simulations/sessions/{session_id}/trace` and `/internal/simulations/sessions/{session_id}/prompts`.

## Emotional Summary

At call end, the system persists a complete emotional record available in the call detail response:

```json
{
  "dominant_emotion": "Anxiety",
  "average_valence": -0.312,
  "average_arousal": 0.654,
  "peak_negative_valence": -0.587,
  "peak_negative_emotion": "Fear",
  "emotional_shifts": 3,
  "final_trend": "improving",
  "segment_count": 42,
  "barge_in_count": 2,
  "short_response_streak": 0,
  "silence_gap_count": 1,
  "coherence": 0.72,
  "language_sentiment": 0.45,
  "burst_types": {"Sigh": 2, "Hmm": 3}
}
```
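One way to consume this record is to flag calls for human review. The sketch below is illustrative only: `review_reasons` is a hypothetical helper, and the thresholds (`valence_floor`, `max_barge_ins`) are arbitrary assumptions, not values recommended by the platform.

```python
def review_reasons(summary, valence_floor=-0.5, max_barge_ins=3):
    """Return a list of reasons a call may deserve human review."""
    reasons = []
    if summary["peak_negative_valence"] <= valence_floor:
        reasons.append(
            f"peak negative valence {summary['peak_negative_valence']} "
            f"({summary['peak_negative_emotion']})"
        )
    # A negative call that does not end on an improving trend is a concern.
    if summary["average_valence"] < 0 and summary["final_trend"] != "improving":
        reasons.append("negative call without an improving final trend")
    if summary["barge_in_count"] > max_barge_ins:
        reasons.append(f"{summary['barge_in_count']} barge-ins")
    return reasons
```

Applied to the sample record above, only the first check fires, because the peak negative valence of -0.587 falls below the assumed floor of -0.5 while the call ends on an improving trend with two barge-ins.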

## Roadmap: Toward Deeper Empathy

The emotional intelligence system is actively evolving. These are areas where we are investing to push beyond current capabilities:

| Area                              | Where We Are Today                                                                                                              | Where We're Heading                                                                                                                    |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| **Prosodic rhythm**               | Text-level rhythm guidance (shorter sentences for urgency, gentle transitions when rushing)                                     | Audio-level prosodic planning: breath-like pauses, per-word speed variation, rhythm that matches the emotional weight of each sentence |
| **Emotional response time**       | Emotion applied on the next turn after detection (\~2-4s). Burst detection (laughs, sighs) provides faster sub-segment signals. | Sub-second emotional adaptation, responding to a voice crack within the same conversational beat                                       |
| **Emotional memory across calls** | Each call persists a full emotional summary. Patient context is injected from the world model.                                  | Cross-call emotional profiles: "this patient was anxious about test results last call" surfaced proactively in future calls            |
| **Mixed-emotion voice**           | Single emotion label per generation; text structure conveys nuance                                                              | Emotion blending: "warm concern with a hint of encouragement" expressed in a single sentence through TTS-level control                 |

## API Reference

* [Calls](https://docs.amigo.ai/api-reference/readme/platform/calls)
* [Recordings](https://docs.amigo.ai/api-reference/readme/platform/recordings)

