Voice Agent

Real-time phone-based AI agent with conference-first architecture, emotion detection, and operator escalation.

The Amigo voice agent handles inbound and outbound phone calls for healthcare organizations. It answers the phone, greets the caller, navigates structured conversation flows, and speaks with an adaptive voice that responds to the caller's emotional state in real time.

Conference-First Architecture

Every call runs as a multi-party conference with at least two participants: the caller and the AI agent. This design is intentional. A conference call, rather than a point-to-point connection, means a human operator can join the same call at any time as a third participant without transferring, reconnecting, or interrupting the conversation.

The agent leg is created during ring time, before the caller picks up. This means the agent is already connected and ready when the call begins. There is no dead air, no "please hold while we connect you" delay. The caller hears a greeting within the first moment of the call.

When an operator joins, they enter the same conference. They can listen silently or take over the conversation. The caller experiences a single continuous call regardless of how many participants are involved behind the scenes.
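The conference model can be sketched as a small data structure. This is an illustrative sketch only: the class names, the `OperatorMode` values, and the choice to mute the agent when an operator takes over are assumptions made for the example, not the platform's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    CALLER = "caller"
    AGENT = "agent"
    OPERATOR = "operator"


class OperatorMode(Enum):
    LISTEN = "listen"        # operator hears the call but stays muted
    TAKE_OVER = "take_over"  # operator speaks to the caller


@dataclass
class Participant:
    role: Role
    muted: bool = False


@dataclass
class Conference:
    participants: List[Participant] = field(default_factory=list)

    def create_agent_leg(self) -> None:
        # Happens during ring time, so the agent is present before the caller.
        self.participants.append(Participant(Role.AGENT))

    def caller_connects(self) -> None:
        self.participants.append(Participant(Role.CALLER))

    def operator_joins(self, mode: OperatorMode) -> Participant:
        # The operator is added to the same conference; nothing is transferred.
        operator = Participant(Role.OPERATOR, muted=(mode is OperatorMode.LISTEN))
        self.participants.append(operator)
        if mode is OperatorMode.TAKE_OVER:
            # Assumption for this sketch: the agent goes quiet on takeover.
            for participant in self.participants:
                if participant.role is Role.AGENT:
                    participant.muted = True
        return operator


call = Conference()
call.create_agent_leg()
call.caller_connects()
operator = call.operator_joins(OperatorMode.LISTEN)
print([p.role.value for p in call.participants])
```

Because the operator is only ever appended to an existing participant list, the caller's leg is never torn down or re-created, which is the property the conference-first design relies on.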

The Five-Layer Pipeline

Every voice call flows through five layers that transform caller audio into an adaptive spoken response.

| Layer | What It Does |
| --- | --- |
| 1. Audio Capture | Captures the caller's audio stream from the telephony layer. Sends it to two parallel processors: speech-to-text and emotion detection. Neither blocks the other. |
| 2. Speech-to-Text | Converts audio to text using streaming transcription with domain-specific vocabulary boosting. Determines when the caller has finished speaking. |
| 3. Intelligence | Maintains a rolling emotional profile of the caller. Combines transcript text, emotional state, and patient context from the world model into a complete picture of the current moment. |
| 4. Navigation and Response | The context graph engine selects the right action, generates a response, and produces filler speech to cover processing time. |
| 5. Text-to-Speech | Converts the generated text into spoken audio with emotion-appropriate tone, pace, and emphasis. Streams audio back to the caller through the conference. |

After the call ends, a post-call pipeline re-transcribes the full recording at higher accuracy, scores the interaction across quality dimensions, and feeds keyword accuracy data back into the STT system.
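The parallel fan-out in layer 1 and the merge in layer 3 can be illustrated with `asyncio.gather`. This is a minimal sketch under stated assumptions: `transcribe`, `detect_emotion`, and `build_moment` are hypothetical stand-ins that return canned values, not the real speech-to-text or emotion services.

```python
import asyncio
from typing import Dict


async def transcribe(chunk: bytes) -> str:
    # Hypothetical stand-in for a streaming speech-to-text call.
    await asyncio.sleep(0.01)
    return f"<{len(chunk)} bytes of speech>"


async def detect_emotion(chunk: bytes) -> str:
    # Hypothetical stand-in for the emotion detector.
    await asyncio.sleep(0.01)
    return "neutral"


async def build_moment(chunk: bytes, patient_context: Dict[str, str]) -> Dict[str, object]:
    # Layer 1: send the same audio to both processors concurrently,
    # so neither one waits for the other.
    text, emotion = await asyncio.gather(transcribe(chunk), detect_emotion(chunk))
    # Layer 3: combine transcript, emotional state, and patient context.
    return {"text": text, "emotion": emotion, "patient": patient_context}


moment = asyncio.run(build_moment(b"\x00" * 320, {"name": "Example Patient"}))
print(moment)
```

Layers 4 and 5 would consume the combined `moment` dictionary; they are left out here to keep the fan-out pattern visible.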

Patient Context Injection

When a call connects, the voice agent resolves the caller's identity from their phone number, loads their full patient context from the world model (demographics, appointments, conditions, recent encounters), and injects it into the agent's system prompt. This context refreshes during the call: after any tool writes new data, the patient context reloads automatically so the agent always reasons from the latest state.
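The load, inject, and refresh cycle might look like the sketch below. Everything here is hypothetical: the in-memory `WORLD_MODEL` dictionary, the sample record, the prompt wording, and the function names exist only to show the order of operations.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory stand-in for the world model, keyed by phone number.
# The record is made-up sample data.
WORLD_MODEL: Dict[str, dict] = {
    "+15550100": {"name": "Example Patient", "appointments": ["annual checkup"]},
}

BASE_PROMPT = "You are a phone agent for a healthcare organization."


def load_patient_context(phone: str) -> Optional[dict]:
    # Resolve the caller's identity from the phone number.
    return WORLD_MODEL.get(phone)


def build_system_prompt(phone: str) -> str:
    context = load_patient_context(phone)
    if context is None:
        return BASE_PROMPT + "\nThe caller is not a known patient."
    lines = [f"{key}: {value}" for key, value in context.items()]
    return BASE_PROMPT + "\nPatient context:\n" + "\n".join(lines)


def run_tool_and_refresh(phone: str, tool: Callable[[dict], None]) -> str:
    # The tool writes new data, then the prompt is rebuilt from the latest state.
    tool(WORLD_MODEL[phone])
    return build_system_prompt(phone)


def book_follow_up(record: dict) -> None:
    record["appointments"].append("follow-up visit")


prompt_at_connect = build_system_prompt("+15550100")
prompt_after_tool = run_tool_and_refresh("+15550100", book_follow_up)
```

Rebuilding the prompt from the source of truth after every write, rather than patching the prompt in place, keeps the agent's view and the stored record from drifting apart.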

Session Event Injection

External systems can inject events into active voice sessions in real time. The agent processes injected events through its response generation (without navigating the context graph state machine) and speaks a natural response.

Two event types are supported:

| Type | Behavior | Use Case |
| --- | --- | --- |
| External event | Queues behind current speech, cancels silence monitor | EHR notifications, appointment confirmations, system status updates |
| Guidance | Interrupts current speech, cancels silence monitor | Operator steering, real-time instructions to the agent |
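The difference between the two event types comes down to how each interacts with speech that is already in progress. The sketch below models that with a small queue. The `Session` class, its field names, and the `"external_event"` / `"guidance"` strings are assumptions for illustration, as is the choice to place guidance at the front of the pending queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class Session:
    speaking: Optional[str] = None                        # utterance being spoken now
    pending: Deque[str] = field(default_factory=deque)    # events awaiting a response
    silence_monitor_active: bool = True

    def inject(self, kind: str, text: str) -> None:
        if kind not in ("external_event", "guidance"):
            raise ValueError(f"unknown event type: {kind}")
        # Both event types cancel the silence monitor.
        self.silence_monitor_active = False
        if kind == "guidance":
            # Guidance interrupts whatever the agent is currently saying.
            self.speaking = None
            self.pending.appendleft(text)
        else:
            # External events wait until current speech has finished.
            self.pending.append(text)


session = Session(speaking="Let me check that for you.")
session.inject("external_event", "Appointment confirmed by EHR")
speech_after_event = session.speaking
session.inject("guidance", "Offer the earlier time slot")
```

After these two calls, the external event has left the agent's current sentence untouched, while the guidance message has cleared it and moved to the head of the queue.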

Events can be injected through three paths: an HTTP endpoint on the voice agent, a WebSocket control channel (for test calls and direct streams), or through the platform API which proxies to the voice agent. The platform API also provides a dedicated operator guidance endpoint so operators can send guidance scoped to their identity and permissions.

The injection architecture is cross-pod. HTTP injections publish to a per-session pub/sub channel, and the session subscribes to that channel at startup. This means injection works regardless of which server pod is handling the call. The subscription reconnects automatically if the connection is interrupted, so a transient infrastructure issue does not end the voice session.
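The per-session channel pattern can be demonstrated with an in-memory broker. This is a sketch only: `Broker` stands in for whatever shared pub/sub service the deployment uses, the `voice-session:<id>` channel naming is invented for the example, and automatic reconnection is not modeled.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, List


class Broker:
    """In-memory stand-in for a shared pub/sub service (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[asyncio.Queue]] = defaultdict(list)

    def subscribe(self, channel: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    def publish(self, channel: str, message: str) -> int:
        # Deliver to every subscriber of this channel; return how many received it.
        for queue in self._subscribers[channel]:
            queue.put_nowait(message)
        return len(self._subscribers[channel])


def channel_for(session_id: str) -> str:
    # Hypothetical naming scheme: one channel per voice session.
    return f"voice-session:{session_id}"


async def demo() -> List[str]:
    broker = Broker()
    # Pod A: the voice session subscribes to its own channel at startup.
    inbox = broker.subscribe(channel_for("abc123"))
    # Pod B: an HTTP handler receives an injection and publishes it.
    broker.publish(channel_for("abc123"), "appointment confirmed")
    # A message for a different session never reaches this inbox.
    broker.publish(channel_for("other"), "not for this session")
    received = [await inbox.get()]
    assert inbox.empty()
    return received


received = asyncio.run(demo())
print(received)
```

The publisher only needs the session ID, not the address of the pod that owns the call, which is what makes the injection path independent of call placement.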

What This Means for Operations

For a healthcare organization with significant call volume, the voice agent replaces or augments the front desk phone experience. Patients call the same number they always have. The agent handles scheduling, insurance verification, prescription refill requests, and general inquiries. When the conversation requires a human, an operator joins the live call.

Every call writes structured events to the world model. Clinical data extracted from conversations flows through confidence gates before reaching the EHR. Nothing is written to a system of record without verification.
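One way to picture a confidence gate is as a threshold filter placed in front of the system of record. The sketch below assumes a single numeric confidence score per extracted value and a fixed threshold. Both are illustrative assumptions, as are the `Extraction` type, the `EHR_THRESHOLD` value, and the idea that held items are routed to review.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extraction:
    name: str          # which clinical field was extracted
    value: str
    confidence: float  # 0.0 to 1.0, assumed to come from the extraction step


EHR_THRESHOLD = 0.9  # hypothetical value chosen for this example


def gate(
    extractions: List[Extraction], threshold: float = EHR_THRESHOLD
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into those cleared for the EHR and those held back."""
    approved: List[Extraction] = []
    held: List[Extraction] = []
    for item in extractions:
        if item.confidence >= threshold:
            approved.append(item)
        else:
            held.append(item)  # assumption: held items go to human review
    return approved, held


approved, held = gate([
    Extraction("preferred_pharmacy", "Main Street Pharmacy", 0.97),
    Extraction("medication_dose", "10 mg", 0.62),
])
```

Here the low-confidence dose is held back instead of being written, so the write path fails closed.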

Learn More

- Audio Pipeline
- Emotion Detection
- Operators and Escalation
- Phone Number Management
- Risk Scoring and Silence Management
- Clinical Tools

Developer Guide: For API endpoints and integration details, see the Voice Agent Developer Guide.
