Voice Agent

Real-time phone-based AI agent with conference-first architecture, emotion detection, and operator escalation.

The Amigo voice agent handles inbound and outbound phone calls for healthcare organizations. It answers the phone, greets the caller, navigates structured conversation flows, and speaks with an adaptive voice that responds to the caller's emotional state in real time.

Conference-First Architecture

Every call runs as a multi-party conference with at least two participants: the caller and the AI agent. This design is intentional. A conference call, rather than a point-to-point connection, means a human operator can join the same call at any time as a third participant without transferring, reconnecting, or interrupting the conversation.

The agent leg is created during ring time, before the caller picks up, so the agent is already connected and ready when the call begins. There is no dead air and no "please hold while we connect you" delay. The caller hears a greeting as soon as the call starts.

When an operator joins, they enter the same conference. They can listen silently or take over the conversation. The caller experiences a single continuous call regardless of how many participants are involved behind the scenes.
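
The join order described above can be sketched as a small in-memory model. This is an illustration only: `Conference`, `Participant`, `Role`, and `start_call` are hypothetical names, not part of the Amigo API, and real telephony legs carry far more state.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    CALLER = "caller"
    AGENT = "agent"
    OPERATOR = "operator"


@dataclass
class Participant:
    role: Role
    muted: bool = False  # a muted operator is listening silently


@dataclass
class Conference:
    """Hypothetical model of one call; not the Amigo API."""
    participants: list[Participant] = field(default_factory=list)

    def join(self, role: Role, muted: bool = False) -> Participant:
        participant = Participant(role, muted)
        self.participants.append(participant)
        return participant

    def roles(self) -> list[Role]:
        return [p.role for p in self.participants]


def start_call() -> Conference:
    conference = Conference()
    # The agent leg is created during ring time, before the caller answers.
    conference.join(Role.AGENT)
    # The caller joins on pickup; the agent is already present.
    conference.join(Role.CALLER)
    return conference


call = start_call()
# An operator joins the same conference, listening silently at first.
operator = call.join(Role.OPERATOR, muted=True)
# Taking over is a state change on the existing leg, not a transfer.
operator.muted = False
```

The point of the sketch is that escalation only adds a participant to an existing object. Nothing about the caller's leg changes, which is why the caller experiences one continuous call.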

The Five-Layer Pipeline

Every voice call flows through five layers that transform caller audio into an adaptive spoken response.

| Layer | What It Does |
| --- | --- |
| 1. Audio Capture | Captures the caller's audio stream from the telephony layer. Sends it to two parallel processors: speech-to-text and emotion detection. Neither blocks the other. |
| 2. Speech-to-Text | Converts audio to text using streaming transcription with domain-specific vocabulary boosting. Determines when the caller has finished speaking. |
| 3. Intelligence | Maintains a rolling emotional profile of the caller. Combines transcript text, emotional state, and patient context from the world model into a complete picture of the current moment. |
| 4. Navigation and Response | The context graph engine selects the right action, generates a response, and produces filler speech to cover processing time. |
| 5. Text-to-Speech | Converts the generated text into spoken audio with emotion-appropriate tone, pace, and emphasis. Streams audio back to the caller through the conference. |
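
The parallel fan-out in the Audio Capture layer can be illustrated with a minimal `asyncio` sketch. It assumes each processor reads from its own queue; `transcribe`, `detect_emotion`, and `capture` are placeholder names, and the string results stand in for real transcription and emotion output.

```python
import asyncio


async def transcribe(frames: asyncio.Queue, out: list) -> None:
    # Stand-in for streaming speech-to-text.
    while (frame := await frames.get()) is not None:
        out.append(f"text:{frame}")


async def detect_emotion(frames: asyncio.Queue, out: list) -> None:
    # Stand-in for emotion detection; deliberately slower than transcription.
    while (frame := await frames.get()) is not None:
        await asyncio.sleep(0.01)
        out.append(f"emotion:{frame}")


async def capture(audio_frames: list) -> tuple:
    stt_queue, emotion_queue = asyncio.Queue(), asyncio.Queue()
    transcripts, emotions = [], []
    consumers = [
        asyncio.create_task(transcribe(stt_queue, transcripts)),
        asyncio.create_task(detect_emotion(emotion_queue, emotions)),
    ]
    for frame in audio_frames + [None]:  # None marks end of stream
        # Each processor has its own queue, so a slow one never stalls the other.
        stt_queue.put_nowait(frame)
        emotion_queue.put_nowait(frame)
    await asyncio.gather(*consumers)
    return transcripts, emotions


transcripts, emotions = asyncio.run(capture(["f1", "f2"]))
```

Because every frame is copied to both queues, the slower emotion consumer finishes later without delaying the transcript, which is the property "neither blocks the other" describes.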

After the call ends, a post-call pipeline re-transcribes the full recording at higher accuracy, scores the interaction across quality dimensions, and feeds keyword accuracy data back into the STT system.
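
One way to picture the keyword accuracy feedback is a per-keyword recall score that compares the live transcript against the higher-accuracy re-transcription. The metric and the `keyword_recall` function below are assumptions made for illustration; the document does not specify how accuracy is actually measured, and the sketch handles single-word keywords only.

```python
def keyword_recall(live: str, reference: str, keywords: list[str]) -> dict[str, float]:
    """Fraction of each keyword's reference occurrences also found in the live transcript."""
    live_words = live.lower().split()
    reference_words = reference.lower().split()
    recall = {}
    for keyword in keywords:
        expected = reference_words.count(keyword.lower())
        if expected == 0:
            continue  # keyword was not spoken on this call
        heard = min(live_words.count(keyword.lower()), expected)
        recall[keyword] = heard / expected
    return recall


# Illustrative transcripts: the live pass misheard one medication name.
scores = keyword_recall(
    live="refill my metformin and listen a pill",
    reference="refill my metformin and lisinopril",
    keywords=["metformin", "lisinopril", "atorvastatin"],
)
```

Under this assumed metric, a keyword with low recall would be a candidate for stronger vocabulary boosting on later calls.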

Patient Context Injection

When a call connects, the voice agent resolves the caller's identity from their phone number, loads their full patient context from the world model (demographics, appointments, conditions, recent encounters), and injects it into the agent's system prompt. The context also refreshes during the call: after any tool writes new data, it reloads automatically so the agent always reasons from the latest state.
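
The load-then-refresh cycle can be sketched as follows. `WorldModel`, `AgentSession`, `refresh_context`, and `run_tool` are hypothetical names, the prompt format is invented, and the patient record is made-up sample data; the actual interfaces are covered in the developer guide.

```python
from dataclasses import dataclass, field


@dataclass
class WorldModel:
    """Hypothetical stand-in for the world model, keyed by phone number."""
    patients: dict = field(default_factory=dict)

    def load_context(self, phone: str) -> dict:
        return dict(self.patients.get(phone, {}))


@dataclass
class AgentSession:
    phone: str
    world: WorldModel
    system_prompt: str = ""

    def refresh_context(self) -> None:
        context = self.world.load_context(self.phone)
        lines = [f"{key}: {value}" for key, value in sorted(context.items())]
        self.system_prompt = "Patient context:\n" + "\n".join(lines)

    def run_tool(self, updates: dict) -> None:
        # A tool writes new data to the world model...
        self.world.patients.setdefault(self.phone, {}).update(updates)
        # ...and the context reloads so the next turn sees the latest state.
        self.refresh_context()


world = WorldModel({"+15550100": {"name": "Ana Diaz", "next_appointment": "none"}})
session = AgentSession("+15550100", world)
session.refresh_context()  # on call connect
session.run_tool({"next_appointment": "2025-03-04 09:30"})
```

Tying the reload to the tool call, rather than to a timer, means the prompt can never lag behind a write the agent itself just made.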

What This Means for Operations

For a healthcare organization handling high call volume, the voice agent replaces or augments the front desk phone experience. Patients call the same number they always have. The agent handles scheduling, insurance verification, prescription refill requests, and general inquiries. When the conversation requires a human, an operator joins the live call.

Every call writes structured events to the world model. Clinical data extracted from conversations flows through confidence gates before reaching the EHR. Nothing is written to a system of record without verification.
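
A confidence gate can be pictured as a filter between extraction and the system of record. The sketch below assumes a single numeric threshold, with anything below it held for human review. The threshold value, the `ExtractedFact` shape, and the `gate` function are all illustrative assumptions; the actual gate criteria and verification steps are not described here.

```python
from dataclasses import dataclass

EHR_THRESHOLD = 0.9  # assumed cutoff; real gate values are not documented here


@dataclass
class ExtractedFact:
    field_name: str
    value: str
    confidence: float


def gate(facts: list[ExtractedFact], threshold: float = EHR_THRESHOLD):
    """Split facts into those cleared for the EHR and those held for review."""
    cleared = [f for f in facts if f.confidence >= threshold]
    held = [f for f in facts if f.confidence < threshold]
    return cleared, held


# Made-up sample extractions from a call.
facts = [
    ExtractedFact("medication", "lisinopril 10 mg", 0.97),
    ExtractedFact("allergy", "penicillin", 0.62),
]
cleared, held = gate(facts)
```

Held items are not discarded in this sketch. They stay out of the EHR until something else, such as a human reviewer, verifies them.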

Learn More

Audio Pipeline
Emotion Detection
Operators and Escalation
Phone Number Management
Risk Scoring and Silence Management
Clinical Tools

Developer Guide: For API endpoints and integration details, see the Voice Agent Developer Guide.
