# Phone

Phone is the most complex channel the platform supports. A phone call has hard real-time constraints that text does not: the caller expects a response within a second, silence feels broken, and emotional tone matters as much as the words. This page covers only what makes phone different from other channels. The reasoning engine, context graphs, tools, and safety rules are the same across all channels and are documented in [Agent Architecture](/agent/reasoning-engine.md).

## Conference-First Architecture

Every call runs as a multi-party conference with at least two participants: the caller and the AI agent. This design is intentional. A conference call, rather than a point-to-point connection, means a human [operator](/channels/operators.md) can join the same call at any time as a third participant without transferring, reconnecting, or interrupting the conversation.

The agent leg is created during ring time, before the call is answered, so the agent is already connected and ready when the call begins. There is no dead air and no "please hold while we connect you" delay. The caller hears a greeting within the first moment of the call.

When an operator joins, they enter the same conference. They can listen silently or take over the conversation. The caller experiences a single continuous call regardless of how many participants are involved behind the scenes.

```mermaid
flowchart LR
    C[Caller] <-->|Audio| Conf[Conference]
    A[AI Agent] <-->|Audio| Conf
    O[Operator] -.->|Joins on\nescalation| Conf
```
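
The join-without-transfer property can be illustrated with a small model. This is a sketch only: the `Conference` class, its methods, and the mode names are hypothetical and are not the platform's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class Conference:
    """Hypothetical model of a call conference (illustration only)."""
    participants: dict[str, str] = field(default_factory=dict)  # name -> mode

    def join(self, name: str, mode: str = "speaking") -> None:
        # Joining adds a leg. Existing legs are untouched, so nobody reconnects.
        self.participants[name] = mode

    def take_over(self, name: str) -> None:
        # A silent listener starts speaking on the same conference.
        self.participants[name] = "speaking"


conf = Conference()
conf.join("agent")                        # agent leg is created during ring time
conf.join("caller")
conf.join("operator", mode="listening")   # operator joins on escalation
conf.take_over("operator")
```

Because the operator is added as another leg, the caller and agent entries never change, which mirrors the single continuous call the caller experiences.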

## The Voice Pipeline

Every voice call flows through five layers that convert caller audio into a spoken response. Layers 1-2 and 5 are voice-specific. Layers 3-4 are the modality-independent [reasoning engine](/agent/reasoning-engine.md) that also powers text and simulation.

| Layer                          | What It Does                                                                                                                                                                                                                                                                                                                                                   | Scope  |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| **1. Audio Capture**           | Captures the caller's audio stream from the telephony layer. Sends it to two parallel processors: speech-to-text and emotion detection. Neither blocks the other.                                                                                                                                                                                              | Voice  |
| **2. Speech-to-Text**          | Converts audio to text using streaming transcription with domain-specific vocabulary boosting. Determines when the caller has finished speaking. Produces utterance and emotion signals.                                                                                                                                                                       | Voice  |
| **3. Intelligence**            | Maintains a rolling emotional profile of the caller normalized to their own vocal baseline. Combines transcript text, emotional state, conversation context, and patient data from the world model into a complete picture of the current moment.                                                                                                              | Engine |
| **4. Navigation and Response** | The context graph engine selects the right action, chooses the appropriate vocal emotion for the response, generates text, and produces filler speech to cover processing time. A unified voice timeline orchestrator coordinates all filler emission, empathy pauses, and tool progress updates through a signal-driven model rather than independent timers. | Engine |
| **5. Text-to-Speech**          | Converts the generated text into spoken audio using the emotion, pace, and emphasis selected in Layer 4. Each utterance carries its own TTS parameters (emotion and speed), so fillers and responses can use different vocal qualities without interference. Streams audio back to the caller through the conference.                                        | Voice  |
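
The non-blocking fan-out in Layer 1 can be sketched with two bounded queues. Everything here is an assumption made for illustration: the `fan_out` helper, the queue sizes, and especially the choice to drop a frame when one consumer is full, which is just one possible way to keep a slow consumer from stalling the other.

```python
import queue


def fan_out(frame, consumers):
    """Offer one audio frame to every consumer without waiting on any of them."""
    for q in consumers:
        try:
            q.put_nowait(frame)
        except queue.Full:
            pass  # assumed policy: drop for this consumer only; others still get it


stt_frames = queue.Queue(maxsize=10)
emotion_frames = queue.Queue(maxsize=1)  # deliberately tiny to simulate a slow consumer

for frame in (b"frame-1", b"frame-2", b"frame-3"):
    fan_out(frame, [stt_frames, emotion_frames])
```

In this run the speech-to-text queue receives every frame even though the emotion queue is saturated after the first one.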

If the primary reasoning model times out during Layer 4, the engine automatically retries with a fallback model. The caller hears filler speech during the retry, so the failover is transparent. This prevents silence or dropped responses during traffic spikes or transient provider issues.
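
A minimal sketch of timeout-then-fallback, assuming the models are exposed as async callables. The function names and the timeout value are hypothetical; the platform's real retry policy is not documented on this page.

```python
import asyncio


async def generate_with_fallback(prompt, primary, fallback, timeout_s):
    """Try the primary model; on timeout, retry once with the fallback."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Filler speech would keep playing here while the retry runs.
        return await fallback(prompt)


async def slow_primary(prompt):  # stand-in for a stalled provider
    await asyncio.sleep(1.0)
    return "primary: " + prompt


async def fast_fallback(prompt):  # stand-in for the fallback model
    return "fallback: " + prompt


result = asyncio.run(
    generate_with_fallback("hello", slow_primary, fast_fallback, timeout_s=0.05)
)
```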

After the call ends, a post-call pipeline re-transcribes the full recording at higher accuracy, scores the interaction across quality dimensions, and feeds accuracy data back into the transcription system.

## Patient Context Injection

When a call connects, the agent resolves the caller's identity from their phone number, loads their full patient context from the world model (demographics, appointments, conditions, recent encounters), and injects it into the reasoning engine's context. This context refreshes during the call. After any tool writes new data, the patient context reloads automatically so the agent always reasons from the latest state.
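
The load-then-reload behavior can be sketched as follows. The `CallSession` class, the dict-based world model, and the `writes` flag are assumptions made for illustration, not the platform's API.

```python
class CallSession:
    """Hypothetical session wrapper; `world_model` is a dict keyed by phone number."""

    def __init__(self, world_model, caller_number):
        self.world_model = world_model
        self.caller_number = caller_number
        self.context = self._load()  # resolve identity from the phone number

    def _load(self):
        # Shallow copy so the session holds a snapshot, not a live reference.
        return dict(self.world_model.get(self.caller_number, {}))

    def run_tool(self, tool, writes):
        result = tool(self.world_model, self.caller_number)
        if writes:
            self.context = self._load()  # reload after any tool write
        return result


def book_appointment(store, number):
    record = store[number]
    record["appointments"] = record["appointments"] + ["follow-up visit"]
    return "booked"


world_model = {"+12025550100": {"name": "Sample Patient", "appointments": []}}
session = CallSession(world_model, "+12025550100")
before = session.context["appointments"]
session.run_tool(book_appointment, writes=True)
after = session.context["appointments"]
```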

## Session Event Injection

External systems can inject events into active voice sessions in real time. The agent processes injected events through its response generation and speaks a natural response.

| Type               | Behavior                     | Use Case                                                            |
| ------------------ | ---------------------------- | ------------------------------------------------------------------- |
| **External event** | Queues behind current speech | EHR notifications, appointment confirmations, system status updates |
| **Guidance**       | Interrupts current speech    | Operator steering, real-time instructions to the agent              |

Events can be injected through multiple paths: an HTTP endpoint, a WebSocket control channel, or through the platform API. The platform API also provides a dedicated [operator guidance](/channels/operators.md) endpoint so operators can send guidance scoped to their identity and permissions.

The injection architecture supports multi-instance deployments. Injection works regardless of which server instance is handling the call, and connections reconnect automatically through transient infrastructure issues.
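
The difference between the two event types comes down to queueing versus interruption. The sketch below models only that distinction. The `VoiceSession` class and the `external_event` and `guidance` string labels are hypothetical, and plain strings stand in for the responses the agent would actually generate.

```python
from collections import deque


class VoiceSession:
    """Hypothetical model of speech ordering in an active call."""

    def __init__(self):
        self.speaking = None      # utterance currently being spoken
        self.queue = deque()      # utterances waiting their turn
        self.interrupted = []     # utterances cut off by guidance

    def speak(self, text):
        if self.speaking is None:
            self.speaking = text
        else:
            self.queue.append(text)

    def inject(self, kind, text):
        if kind == "external_event":
            self.speak(text)                      # queues behind current speech
        elif kind == "guidance":
            if self.speaking is not None:
                self.interrupted.append(self.speaking)
            self.speaking = text                  # interrupts current speech
        else:
            raise ValueError(f"unknown event type: {kind}")


session = VoiceSession()
session.speak("Let me check the schedule.")
session.inject("external_event", "Appointment confirmed.")
session.inject("guidance", "Offer the earlier slot.")
```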

## Operational Impact

For a healthcare organization handling patient calls at volume, voice agents replace or augment the front desk phone experience. Patients call the same number they always have. The agent handles scheduling, insurance verification, prescription refill requests, and general inquiries. When the conversation requires a human, an operator joins the live call.

Every call writes structured events to the [world model](/data/world-model.md). Data extracted from conversations flows through confidence gates before reaching the EHR. Nothing is written to a system of record without verification.
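
A confidence gate can be pictured as a simple partition of extracted values. This sketch assumes each extraction carries a numeric confidence score and that a single threshold applies; the field names and the threshold value are illustrative, and the platform's actual gating logic may differ.

```python
def apply_confidence_gate(extractions, threshold):
    """Split extractions into those cleared for write-back and those held for review."""
    verified = [e for e in extractions if e["confidence"] >= threshold]
    held = [e for e in extractions if e["confidence"] < threshold]
    return verified, held


extractions = [
    {"field": "phone", "value": "+12025550100", "confidence": 0.98},
    {"field": "allergy", "value": "penicillin", "confidence": 0.61},
]
verified, held = apply_confidence_gate(extractions, threshold=0.9)
```

Only the `verified` list would move on toward the EHR; the `held` items stay in the world model until they are confirmed.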

## Learn More

{% content-ref url="/pages/diNa7vR8bPyfaSkKqKHe" %}
[Audio Pipeline](/channels/voice/audio-pipeline.md)
{% endcontent-ref %}

{% content-ref url="/pages/aYU0aijmZffitKIRNaTq" %}
[Emotion Detection](/channels/voice/emotion-detection.md)
{% endcontent-ref %}

{% content-ref url="/pages/nRAlEc8vAwxClF3hhQFs" %}
[Call Recordings](/channels/voice/recordings.md)
{% endcontent-ref %}

## Phone Numbers

Phone numbers are the entry point for inbound calls. Each number is provisioned through the platform and routed to a specific service.

### Provisioning and Lifecycle

Numbers follow a four-step lifecycle: **search** for available numbers by area code or region, **purchase** through the platform, **assign** to a service, and **release** when no longer needed. Multiple numbers can route to the same service - for example, different clinic locations sharing one scheduling agent.
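
The lifecycle reads naturally as a small state machine. The state names and the allowed transitions below are assumptions inferred from the four steps above (for example, that a purchased number can be released without ever being assigned), not a documented API.

```python
from enum import Enum


class NumberState(Enum):
    AVAILABLE = "available"   # found by search
    PURCHASED = "purchased"
    ASSIGNED = "assigned"
    RELEASED = "released"


# Assumed transitions; the platform may allow others (such as reassignment).
TRANSITIONS = {
    NumberState.AVAILABLE: {NumberState.PURCHASED},
    NumberState.PURCHASED: {NumberState.ASSIGNED, NumberState.RELEASED},
    NumberState.ASSIGNED: {NumberState.RELEASED},
    NumberState.RELEASED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = NumberState.AVAILABLE
for step in (NumberState.PURCHASED, NumberState.ASSIGNED, NumberState.RELEASED):
    state = advance(state, step)
```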

Three number types are supported:

| Type          | Use Case                                                  |
| ------------- | --------------------------------------------------------- |
| **Local**     | Geographic numbers tied to a specific area code or region |
| **Mobile**    | Mobile numbers, common in markets outside North America   |
| **Toll-free** | Free-to-caller numbers for national reach                 |

Number type availability varies by country. The platform validates which types are available for a given country during search and only returns purchasable results.

### Channel Management

Phone numbers can be provisioned with verified caller identity to reduce call rejection and spam flagging by carriers. The platform automates the compliance pipeline for each number: business profile verification, caller ID authentication (STIR/SHAKEN attestation), and display name registration.

Regulatory requirements are country-conditional. US numbers require a full compliance stack - business profile, STIR/SHAKEN attestation, and CNAM registration - before provisioning is allowed. Canadian numbers require a business profile and CNAM registration. For all other countries, the platform discovers the required regulatory bundles automatically based on the country, number type, and business type, then gates provisioning on bundle approval. This means provisioning a number in any supported country follows the same flow - the platform resolves what compliance artifacts are needed and blocks the purchase until they are in place.
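
The country-conditional gate can be sketched as a set difference between what is required and what is approved. The artifact identifiers and function names are hypothetical, and the `discovered_bundles` argument stands in for whatever the platform's automatic bundle discovery returns for other countries.

```python
# Requirements stated above for the US and Canada; identifiers are illustrative.
FIXED_REQUIREMENTS = {
    "US": {"business_profile", "stir_shaken", "cnam"},
    "CA": {"business_profile", "cnam"},
}


def missing_artifacts(country, approved, discovered_bundles=frozenset()):
    """Return the compliance artifacts still blocking provisioning."""
    required = FIXED_REQUIREMENTS.get(country, set(discovered_bundles))
    return required - set(approved)


def can_provision(country, approved, discovered_bundles=frozenset()):
    return not missing_artifacts(country, approved, discovered_bundles)


us_missing = missing_artifacts("US", {"business_profile"})
ca_ok = can_provision("CA", {"business_profile", "cnam"})
other_ok = can_provision("DE", set(), discovered_bundles={"example_bundle"})
```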

For SMS-capable numbers, US messaging compliance adds a separate layer. Sending application-to-person (A2P) messages from local numbers requires brand registration and carrier approval. The platform automates the A2P compliance pipeline - submitting business profiles, brand registrations, and messaging campaign approvals - so workspaces can send SMS without navigating carrier compliance processes manually.

Numbers that require address verification (common for local or toll-free numbers in some countries) are handled automatically. The platform validates that the business address on file matches the country requirements for the number being purchased and attaches the address during provisioning.

The compliance pipeline is transactional - if any step fails, previous steps are rolled back automatically. Once the initial business profile is approved, subsequent trust products are submitted via webhook-driven progression without manual intervention. Verified numbers display the organization's business name on caller ID, which increases answer rates for outbound calls and reduces the chance of the number being flagged as spam.
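
Rollback on failure is commonly implemented by pairing each step with a compensating action and undoing completed steps in reverse order. The sketch below shows that general pattern under assumed step names; it is not the platform's implementation.

```python
def run_pipeline(steps):
    """Run (name, do, undo) steps in order; undo completed steps if one fails."""
    completed = []
    try:
        for name, do, undo in steps:
            do()
            completed.append((name, undo))
    except Exception:
        for _name, undo in reversed(completed):
            undo()  # compensate in reverse order
        return False
    return True


log = []


def fail():
    raise RuntimeError("registration rejected")  # simulated failure


steps = [
    ("business_profile", lambda: log.append("submit profile"),
     lambda: log.append("withdraw profile")),
    ("stir_shaken", lambda: log.append("submit attestation"),
     lambda: log.append("withdraw attestation")),
    ("cnam", fail, lambda: log.append("withdraw cnam")),
]
ok = run_pipeline(steps)
```

The failed step itself is never compensated because it never completed; only the two earlier submissions are withdrawn.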

### Routing

When a call arrives on a provisioned number, the platform routes it to the service associated with that number. The service association determines:

* Which context graph governs the conversation flow
* Which voice settings (tone, speed, key terms) apply
* Which world model workspace provides patient context
* Which escalation rules and safety monitors are active

Each phone number routes to exactly one service. This keeps the mapping simple - if you need to know what a number does, look at its service assignment.
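
The number-to-service mapping is many-to-one: several numbers may share a service, but each number resolves to exactly one. A minimal sketch, with hypothetical service fields and fictional phone numbers:

```python
# Hypothetical service configuration; field names are illustrative.
SERVICES = {
    "scheduling": {
        "context_graph": "scheduling-graph",
        "workspace": "clinic-workspace",
    },
}

# Two clinic locations sharing one scheduling agent.
NUMBER_TO_SERVICE = {
    "+12025550100": "scheduling",
    "+12025550101": "scheduling",
}


def route(number):
    """Resolve an inbound number to its single service assignment."""
    service_id = NUMBER_TO_SERVICE[number]  # KeyError if the number is not provisioned
    return service_id, SERVICES[service_id]


first = route("+12025550100")
second = route("+12025550101")
```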

{% hint style="info" %}
**Developer Guide** - For phone number API endpoints and voice agent integration details, see the [Developer Guide](https://docs.amigo.ai/developer-guide/platform-api/platform-api/voice-agent).
{% endhint %}

