# How It Works

This page walks through a complete interaction lifecycle using a voice call as the example, since voice is the most complex modality. Every step maps to a real system component.

The platform's [reasoning engine](https://docs.amigo.ai/agent/reasoning-engine) is modality-independent - it processes typed signals (utterances, emotion, tool results) and emits effects (respond, execute tool, escalate). Voice, SMS, and simulation are modality adapters that convert channel-specific I/O into these signals and execute the resulting effects. The call lifecycle below shows how the voice adapter feeds signals to the engine and delivers effects as speech. Text conversations follow the same reasoning pipeline but skip audio processing and deliver effects as messages.
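
In TypeScript terms, that contract might look like the sketch below. The `Signal`, `Effect`, and `ModalityAdapter` names are illustrative assumptions, not the platform's actual API:

```typescript
// Hypothetical shapes for the engine's signal/effect contract.
// Names are illustrative; the platform's actual types may differ.

type Signal =
  | { kind: "utterance"; text: string; timestamp: number }
  | { kind: "emotion"; label: string; intensity: number }
  | { kind: "tool_result"; toolId: string; payload: unknown };

type Effect =
  | { kind: "respond"; text: string }
  | { kind: "execute_tool"; toolId: string; args: Record<string, unknown> }
  | { kind: "escalate"; reason: string };

// The reasoning engine is modality-independent: signals in, effects out.
interface ReasoningEngine {
  process(signal: Signal): Promise<Effect[]>;
}

// Each channel implements the same adapter contract. Voice converts audio
// frames into signals and speaks effects; SMS maps messages both ways.
interface ModalityAdapter {
  // Convert channel-specific input (audio frame, inbound SMS) into signals.
  toSignals(channelInput: unknown): Signal[];
  // Deliver an engine effect on this channel (TTS audio, outbound SMS).
  deliver(effect: Effect): Promise<void>;
}
```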

## Call Lifecycle

<figure><img src="https://3635224444-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvcLyiHRcwv7g83p6vxAd%2Fuploads%2Fgit-blob-1bbbd086a17409fe19a10419b8e53c307f3491cd%2Fhow-it-works-sequence-blue.svg?alt=media" alt="Voice call lifecycle: pre-call, greeting, utterance loop, escalation, post-call"><figcaption></figcaption></figure>

## Phase by Phase

The phases below describe the voice modality in detail. For text (SMS) interactions, the same reasoning and tool execution phases apply - but audio capture, STT, emotion detection from prosody, filler speech, and TTS are replaced by message-based I/O. The agent engine processes the same context graphs, executes the same tools, and applies the same safety rules regardless of channel.

### 1. Instant Greeting (During Ring Time)

When a call comes in, the system does not wait for the call to connect. During the ring time, it:

* Creates a conference call with an agent leg already connected
* Resolves the caller's identity from their phone number against the world model
* Loads patient context (demographics, upcoming appointments, recent encounters) into the agent's system prompt
* Loads the context graph (the state machine that defines the call flow)

By the time the patient says "hello," the agent is fully loaded and responds immediately. This conference-first architecture means there is never dead air at the start of a call.
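
A minimal sketch of the ring-time setup, assuming hypothetical helpers (`createConference`, `resolveCaller`, `loadPatientContext`, `loadContextGraph`). The point is that conference setup and identity resolution run concurrently, and context loading starts the moment the identity resolves:

```typescript
// Hypothetical helpers; names are illustrative, not the platform's API.
declare function createConference(callId: string): Promise<{ agentLeg: string }>;
declare function resolveCaller(phone: string): Promise<{ patientId: string }>;
declare function loadPatientContext(patientId: string): Promise<string>;
declare function loadContextGraph(flow: string): Promise<object>;

// Everything that can happen during ring time happens in parallel,
// so the agent is ready before the patient says "hello".
async function prepareCall(callId: string, callerPhone: string) {
  const [conference, caller] = await Promise.all([
    createConference(callId),   // agent leg already connected
    resolveCaller(callerPhone), // identity from phone number, via world model
  ]);
  const [patientContext, graph] = await Promise.all([
    loadPatientContext(caller.patientId), // demographics, appointments, encounters
    loadContextGraph("scheduling"),       // state machine defining the call flow
  ]);
  return { conference, caller, patientContext, graph };
}
```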

### 2. Parallel Audio Processing

Every audio frame from the caller is processed by two independent systems simultaneously:

**Speech-to-Text** converts audio to text with sub-300ms latency. Domain-specific vocabulary boosting (medical terms, local provider names, insurance plans) improves recognition accuracy. End-of-turn detection determines when the caller has finished speaking.

**Emotion Detection** analyzes vocal prosody, burst patterns, and language content through concurrent models. Each caller's signals are normalized to their own vocal baseline (not population averages), and interpreted in the context of the conversation state - whether the call is progressing, stuck, or in escalation. Results feed into a rolling window that tracks emotional state across the conversation, weighted toward recent signals.

These two streams never block each other. If emotion detection fails, speech processing continues unaffected.
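
A sketch of that isolation property, with hypothetical `transcribe` and `detectEmotion` functions. The exponential-decay weighting in the rolling window is an assumption for illustration, not the documented scheme:

```typescript
// Hypothetical per-frame processors; names are illustrative.
declare function transcribe(frame: ArrayBuffer): Promise<string>;
declare function detectEmotion(frame: ArrayBuffer): Promise<number>; // e.g. arousal score

// Rolling emotional state, weighted toward recent signals via exponential
// decay (an assumed weighting scheme, used here only to show the idea).
class EmotionWindow {
  private state = 0;
  constructor(private alpha = 0.3) {}
  push(score: number) {
    this.state = this.alpha * score + (1 - this.alpha) * this.state;
  }
  current(): number {
    return this.state;
  }
}

async function processFrame(frame: ArrayBuffer, window: EmotionWindow) {
  // Launch both streams concurrently; neither awaits the other.
  const sttPromise = transcribe(frame);
  const emotionPromise = detectEmotion(frame)
    .then((score) => window.push(score))
    .catch(() => {
      // Emotion detection failure is non-fatal: speech continues unaffected.
    });

  const text = await sttPromise; // the STT result never waits on emotion
  void emotionPromise;           // runs (or fails) independently
  return text;
}
```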

### 3. Context Graph Navigation

The context graph is a hierarchical state machine that defines what the agent should accomplish at each point in the call. A navigation step evaluates the current transcript, emotional state, and conversation history to select the next action.

This is not a fixed script. The context graph defines goals and constraints. The agent determines how to achieve them based on the live conversation. If a patient brings up insurance while the agent is in a scheduling flow, the state machine can handle the transition.

The navigation step also selects filler phrases ("Let me check that for you") that keep the conversation flowing while the system processes the next response.
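
One way to picture a navigation step, using hypothetical `GraphNode` and `NavigationDecision` shapes; the toy routing rule stands in for the engine's actual evaluation:

```typescript
// Hypothetical context-graph types; the real schema may differ.
interface GraphNode {
  id: string;
  goal: string;          // what the agent should accomplish at this node
  constraints: string[]; // hard rules (e.g. verify identity first)
  transitions: string[]; // ids of nodes reachable from this one
}

interface NavigationInput {
  currentNode: GraphNode;
  transcript: string[];   // recent turns
  emotionalState: number; // rolling window score
}

interface NavigationDecision {
  nextNodeId: string;    // may transition (e.g. scheduling -> insurance)
  fillerPhrase?: string; // spoken while the next response is prepared
}

// Toy example of flexible routing: a patient raising insurance mid-flow
// moves to the insurance node if the graph allows that transition.
function navigate(input: NavigationInput): NavigationDecision {
  const lastTurn = input.transcript.at(-1) ?? "";
  if (/insurance/i.test(lastTurn) && input.currentNode.transitions.includes("insurance")) {
    return { nextNodeId: "insurance", fillerPhrase: "Let me check that for you." };
  }
  return { nextNodeId: input.currentNode.id };
}
```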

### 4. Tool Execution

External healthcare systems range from fast APIs to slow endpoints with rate limits to web portals with no API at all. The tiered tool execution system decouples the agent's ability to help the patient from whatever the external system supports:

| Tier           | Latency  | Example                                  | Why This Tier Exists                                                                |
| -------------- | -------- | ---------------------------------------- | ----------------------------------------------------------------------------------- |
| Direct         | Under 2s | Patient lookup, slot search              | Fast API available - no reason to make the patient wait                             |
| Orchestrated   | 2-30s    | Appointment booking, insurance check     | Multi-step workflow against a responsive API - filler speech covers the wait        |
| Autonomous     | 30s-5min | Prior authorization, referral processing | Slow external system or complex multi-system workflow - runs in background          |
| Browser        | 1-10min  | Portal login, form submission            | No API exists - browser automation navigates the web portal directly                |
| Computer Use   | 1-10min  | Legacy desktop app, RDP/VNC sessions     | No API and no web portal - full desktop automation via screenshots and input events |
| Approval-gated | Variable | Prescription changes, clinical orders    | High-stakes action that requires human sign-off before execution                    |

Higher tiers keep the caller informed with natural status updates rather than silence. The world model absorbs throughput mismatches: if the agent needs to book an appointment but the EHR can't handle the write immediately, the intent is captured as an event and the connector runner delivers it when the external system is ready.
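
A sketch of tier dispatch and the intent-capture path, with hypothetical names (`runTool`, `speakStatusUpdate`, `captureIntentEvent` are assumptions):

```typescript
// Hypothetical tier routing; names and behavior are illustrative.
type ToolTier =
  | "direct" | "orchestrated" | "autonomous"
  | "browser" | "computer_use" | "approval_gated";

interface ToolCall {
  name: string;
  tier: ToolTier;
  args: Record<string, unknown>;
}

declare function runTool(call: ToolCall): Promise<unknown>;
declare function speakStatusUpdate(text: string): Promise<void>;
declare function captureIntentEvent(call: ToolCall): Promise<void>; // world-model write

async function executeTool(call: ToolCall): Promise<unknown | undefined> {
  switch (call.tier) {
    case "direct":
      // Fast API: run it inline, no filler needed.
      return runTool(call);
    case "orchestrated":
      // Responsive but multi-step: cover the wait with filler speech.
      await speakStatusUpdate("One moment while I pull that up.");
      return runTool(call);
    default:
      // Slow, portal-based, or approval-gated work: capture the intent as
      // an event; the connector runner delivers it when the external system
      // is ready, and the caller gets periodic status updates meanwhile.
      await captureIntentEvent(call);
      await speakStatusUpdate("I've started that for you; it may take a few minutes.");
      return undefined;
  }
}
```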

### 5. Response Generation and TTS

The response generation step produces the agent's reply using the full context: patient data, conversation history, tool results, emotional state, and the current context graph action. Emotion detection results directly influence the response through micro-behaviors (pacing, word choice, acknowledgment phrases).

Text-to-speech converts the response to audio with emotion-adaptive delivery. If the caller sounds rushed, the agent speeds up. If they sound confused, it slows down and simplifies. Word-level timestamps enable precise barge-in detection so the agent stops speaking when the caller starts.
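
A sketch of barge-in handling built on word-level timestamps; the `Playback` shape and helper names are assumptions. When the caller starts speaking, playback stops and the system can compute exactly which words were actually delivered:

```typescript
// Hypothetical TTS playback with word-level timing; names are illustrative.
interface TimedWord {
  word: string;
  startMs: number; // offset from the start of the utterance
}

interface Playback {
  words: TimedWord[];
  startedAt: number; // epoch ms when audio playback began
  stop(): void;
}

// On caller speech onset, stop playback and compute what was delivered,
// so the conversation state reflects only the words the caller heard.
function handleBargeIn(playback: Playback, speechOnsetAt: number): string {
  playback.stop();
  const elapsedMs = speechOnsetAt - playback.startedAt;
  const heard = playback.words.filter((w) => w.startMs <= elapsedMs);
  return heard.map((w) => w.word).join(" ");
}
```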

### 6. Operator Escalation

When a situation exceeds the agent's scope (clinical judgment calls, upset callers requesting a human, safety triggers), the system escalates to a human operator.

The architecture is conference-first: the patient, agent, and operator are all in the same conference call. The operator joins the existing call rather than receiving a transfer. This means:

* No dropped calls during handoff
* The operator hears the agent's context summary before taking over
* The agent can remain on the line to assist the operator with lookups
* The transition is seamless from the patient's perspective
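
A sketch of that handoff with a hypothetical conference API; the muting detail is one possible arrangement, not documented behavior:

```typescript
// Hypothetical conference API; names are illustrative.
interface Conference {
  id: string;
  addLeg(participant: string): Promise<void>; // joins the live call
  muteLeg(participant: string): Promise<void>;
}

declare function summarizeCallSoFar(conferenceId: string): Promise<string>;
declare function whisperToOperator(operator: string, text: string): Promise<void>;

async function escalate(conf: Conference, operator: string) {
  // The operator joins the existing conference - no transfer, no dropped call.
  await conf.addLeg(operator);
  // The operator hears the agent's context summary before taking over.
  const summary = await summarizeCallSoFar(conf.id);
  await whisperToOperator(operator, summary);
  // The agent stays on the line to assist with lookups; muting it toward
  // the patient is one possible arrangement (an assumption in this sketch).
  await conf.muteLeg("agent");
}
```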

### 7. Post-Call Processing

Data captured during a live phone call is inherently uncertain. A patient might misspeak, the STT might mishear, or the agent might misinterpret. The verification pipeline catches these errors before they reach a system of record.

After the call ends:

1. **Events** written during the call (at voice confidence) enter the automated review pipeline
2. **Call classifier** filters junk calls before review runs
3. **Per-event review** cross-references extracted data against the transcript
4. **Session coherence check** validates narrative consistency across all events from the call
5. **Verified events** (promoted confidence) become eligible for EHR sync
6. **Flagged items** route to the operator review queue for human decision
7. **Post-call summary** is generated and stored as an event on the call entity

The connector runner handles the final step: syncing verified data back to the EHR through the appropriate adapter.
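
A sketch of the pipeline's shape, with hypothetical stage functions that mirror the steps above:

```typescript
// Hypothetical pipeline stages; names mirror the steps above but are illustrative.
interface CallEvent { id: string; confidence: "voice" | "verified"; data: unknown }

declare function isJunkCall(callId: string): Promise<boolean>;
declare function reviewEvent(e: CallEvent, transcript: string): Promise<boolean>;
declare function checkSessionCoherence(events: CallEvent[]): Promise<Set<string>>; // ids of inconsistent events
declare function queueForOperatorReview(e: CallEvent): Promise<void>;
declare function syncToEhr(e: CallEvent): Promise<void>; // via the connector runner

async function postCall(callId: string, events: CallEvent[], transcript: string) {
  if (await isJunkCall(callId)) return; // classifier filters junk before review runs

  const flaggedIds = await checkSessionCoherence(events); // narrative consistency
  for (const e of events) {
    const passesReview = await reviewEvent(e, transcript); // cross-reference transcript
    if (passesReview && !flaggedIds.has(e.id)) {
      e.confidence = "verified";       // promoted confidence
      await syncToEhr(e);              // eligible for EHR sync
    } else {
      await queueForOperatorReview(e); // human decision
    }
  }
  // (Post-call summary generation is omitted from this sketch.)
}
```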
