Voice Agent
Real-time voice pipeline with emotion detection, context graph engine, tool execution, and post-call analysis.
The Amigo voice agent powers real-time, emotionally intelligent voice conversations. It handles inbound and outbound phone calls, executing context graph (Hierarchical State Machine, HSM) logic with speech understanding, text-to-speech, tool execution, safety monitoring, and continuous emotional adaptation. Every call connects to the world model, reads live patient context, writes clinical events with multi-stage verification, and adapts its behavior based on real-time vocal emotion analysis.
Reliability target: This system handles healthcare scheduling calls where callers may be in distress, pain, or crisis. Every design decision prioritizes graceful degradation - if any intelligence layer fails, the call continues with the next-best behavior, never silence.
Voice settings and Classic API differences - Voice settings (tone, speed, keyterms, sensitive topics, post-call flags) are configured at the workspace level; see Workspaces - Voice Settings. Classic API offers WebSocket voice streaming for text-based apps; Platform API voice is phone-based with emotion detection, EHR context, and operator escalation.
Audio Pipeline Architecture
Every voice call flows through a five-layer pipeline that transforms the caller's audio into emotionally adaptive agent speech - while simultaneously reading from and writing to the world model.
Layer 1: Signal Capture
Two parallel streams process the caller's audio simultaneously - neither blocks the other. This dual-stream architecture is fundamental: speech recognition and emotion detection are completely independent. A failure in one never impacts the other.
Speech-to-Text - Real-time streaming transcription with sub-300ms latency. Three layers of domain vocabulary boost recognition accuracy:
Service-level keyterms - Managed by workspace administrators, applied to all calls for that service
Workspace voice settings keyterms - API-configurable per workspace (see voice settings)
System defaults - Engineering-level fallback vocabulary
All sources are merged and deduplicated per call. Configurable end-of-turn detection with tunable confidence thresholds determines when the caller has finished speaking - balancing responsiveness against cutting off mid-sentence.
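The merge-and-deduplicate step described above can be sketched as follows. This is an illustrative sketch, not the production code; the function name and source ordering are assumptions (service keyterms first, then workspace settings, then system defaults):

```python
def merge_keyterms(service_terms, workspace_terms, system_defaults):
    """Merge the three keyterm sources in priority order, deduplicating
    case-insensitively while preserving first-seen casing.
    (Hypothetical helper; illustrates the documented merge behavior.)"""
    merged, seen = [], set()
    for source in (service_terms, workspace_terms, system_defaults):
        for term in source:
            key = term.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(term.strip())
    return merged

# Duplicates across sources collapse; earlier sources win on casing.
terms = merge_keyterms(
    ["metformin", "Dr. Alvarez"],   # service-level keyterms
    ["Metformin", "copay"],         # workspace voice settings
    ["copay", "MRN"],               # system defaults
)
```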
Emotion Detection - Parallel audio analysis with zero impact on the voice pipeline. Three concurrent models analyze every audio segment on a single persistent connection:
| Model | Input | Output | Notes |
|---|---|---|---|
| Prosody | 2-second audio segments | 48 emotions from vocal tone, pitch, rhythm, timbre | Real-time voice quality analysis |
| Vocal Burst | Same audio segments | 67 non-speech vocal types (laughs, sighs, cries, gasps, groans) | Captures sounds that transcription loses entirely |
| Language | Final STT transcripts | 53 emotions + 9-point sentiment + 6-category toxicity | Detects sarcasm, tiredness, annoyance, disapproval, enthusiasm - 5 emotions unavailable from audio alone |
Dual-payload multiplexing: Audio segments request prosody + burst models; text transcripts request the language model. Responses are unambiguous - each contains only the models requested. This separation is architecturally important: the language model requires text input, not audio, and runs on STT output rather than raw audio - ensuring it analyzes what the caller said, not just how they sounded.
Audio buffering: 2-second segments with non-blocking queues (max 5 segments for audio, max 20 for text). The emotion pipeline never blocks the voice pipeline, and dropped segments gracefully degrade to slightly less precise emotion detection rather than failure.
Circuit breaker protection: Emotion detection is protected by a circuit breaker (2 failures triggers 10-second recovery). If the emotion service is degraded, calls continue smoothly with workspace defaults - the circuit breaker prevents cascading latency from affecting the critical voice path.
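The circuit breaker's contract (2 consecutive failures open the breaker; a 10-second window before retry) can be sketched as a small state machine. This is a minimal illustration of the documented behavior, not the production class; names and the injectable clock are assumptions:

```python
import time

class EmotionCircuitBreaker:
    """After `failure_threshold` consecutive failures, skip the emotion
    path for `recovery_seconds` so a degraded emotion service cannot
    add latency to the critical voice path. (Illustrative sketch.)"""

    def __init__(self, failure_threshold=2, recovery_seconds=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True                      # closed: normal operation
        if self.clock() - self.opened_at >= self.recovery_seconds:
            self.opened_at = None            # half-open: probe again
            self.failures = 0
            return True
        return False                         # open: use workspace defaults

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()    # trip the breaker
```

While the breaker is open, the call simply proceeds with workspace-default tone and no emotional steering.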
Layer 2: Intelligence
The Emotional State maintains a rolling 30-second window (~15 segments) of recent caller signals with recency-weighted linear averaging - the most recent signals have the highest influence. The agent responds to the caller's current emotional state, not an average of the whole call.
Valence/Arousal computation: Every emotion maps to a (valence, arousal) coordinate via a complete emotion-dimension mapping. Weighted sums across all detected emotions per segment, then recency-weighted across the rolling window, produce stable yet responsive emotional tracking.
Trend detection: Compares first-half vs second-half valence of the rolling window. A delta > 0.1 → improving; delta < -0.1 → deteriorating; otherwise stable. This powers the call-phase escalation system - deteriorating trends trigger increasingly urgent adaptation.
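The recency weighting and half-window trend comparison above reduce to a few lines. A sketch under the stated assumptions (linear weights 1..n with the newest segment heaviest, and the documented ±0.1 trend thresholds):

```python
def recency_weighted_valence(valences):
    """Linear recency weighting over the rolling window:
    weight 1..n, newest segment has the highest influence."""
    weights = range(1, len(valences) + 1)
    return sum(w * v for w, v in zip(weights, valences)) / sum(weights)

def trend(valences):
    """Compare first-half vs second-half mean valence of the window."""
    mid = len(valences) // 2
    first, second = valences[:mid], valences[mid:]
    delta = sum(second) / len(second) - sum(first) / len(first)
    if delta > 0.1:
        return "improving"
    if delta < -0.1:
        return "deteriorating"
    return "stable"
```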
Coherence (prosody vs language agreement): Measures whether what the caller says matches how they sound:
| Condition | Coherence | Behavior |
|---|---|---|
| Same valence sign (both positive or both negative) | High (0.7–1.0) | Normal adaptation |
| Opposite valence (words say fine, voice says distressed) | Low (0.0–0.3) | Trust the vocal tone over the words |
| One signal neutral | Mild (0.8) | Use the available signal |
When coherence < 0.4: The caller may be masking their true state. The agent's emotional steering instruction shifts: "Respond to how they sound, not what they claim." This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy.
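One way to realize the coherence bands above is a piecewise score over the two valence signals. This is a hedged sketch: the neutral band and interpolation factors are assumptions chosen so the output lands in the documented ranges, not the production formula:

```python
def coherence(prosody_valence, language_valence, neutral_band=0.1):
    """Score prosody-vs-language agreement into the documented bands:
    same sign -> high (0.7-1.0), one neutral -> 0.8, opposite -> low (0.0-0.3).
    (Illustrative; exact production scoring may differ.)"""
    p_neutral = abs(prosody_valence) < neutral_band
    l_neutral = abs(language_valence) < neutral_band
    if p_neutral or l_neutral:
        return 0.8                                   # one signal neutral: mild
    gap = abs(prosody_valence - language_valence)
    if (prosody_valence > 0) == (language_valence > 0):
        return 1.0 - gap * 0.3                       # same sign: 0.7-1.0
    return max(0.0, 0.3 - gap * 0.1)                 # opposite sign: 0.0-0.3

# coherence(-0.6, 0.5) falls below 0.4 -> steer toward the vocal tone
```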
Behavioral signal tracking: Updated in real-time from the session and turn controller:
| Signal | Meaning | Threshold | Interpretation |
|---|---|---|---|
| barge_in_count | Caller interrupts agent speech | ≥ 2 | Frustration - agent talking too much |
| short_response_streak | Consecutive responses ≤ 4 words | ≥ 3 | Disengagement - caller withdrawing |
| silence_gap_count | Gaps ≥ 5 seconds | ≥ 2 | Confusion, hesitation, or distress |
Semantic barge-in detection - Barge-in detection uses semantic confirmation - it requires actual recognized words from the STT engine (not just voice activity detection). This filters false triggers from coughs, breathing, echo, and background noise. Minimum speech duration is 0.5 seconds with recognized words, with a 1.0-second fallback for delayed word recognition.
These behavioral signals are injected into the system prompt alongside emotional steering - the LLM receives a complete picture of both how the caller sounds and how they're behaving.
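Applying the thresholds above to produce prompt-injectable flags is a direct mapping. A minimal sketch (the flag wording is a hypothetical stand-in for the actual steering text):

```python
def behavioral_flags(barge_in_count, short_response_streak, silence_gap_count):
    """Map the documented behavioral thresholds to steering flags
    for the system prompt. (Illustrative; flag text is assumed.)"""
    flags = []
    if barge_in_count >= 2:
        flags.append("caller is interrupting - shorten your responses")
    if short_response_streak >= 3:
        flags.append("caller is disengaging - re-engage and check in")
    if silence_gap_count >= 2:
        flags.append("long silences - caller may be confused or distressed")
    return flags
```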
Layer 3: Context Graph Engine
Each turn processes through a two-stage LLM pipeline:
Navigator - Selects the next action or exit condition from the current context graph state. Also generates a filler phrase to cover processing latency and determines whether to trigger audio verification for structured data capture. Uses structured output validation with automatic retry (up to 3 attempts) and fallback to first valid action.
Engage LLM - Generates the caller-facing response, informed by the selected action, full conversation history with per-message emotion annotations ([VOICE: EmotionName, valence=V.VVV]), audio correction results, emotional steering context, ambient patient context, and available tools.
Emotion reaches the LLM via two independent paths:
| Path | Delivery | Content |
|---|---|---|
| Per-message annotations | Every user message | Inline [VOICE: Anxiety, valence=-0.312] - the LLM sees the emotional trajectory across the full conversation |
| Session-level steering | System prompt | Dominant emotion + trend, quadrant-specific adaptation instructions, behavioral signals, call-phase urgency, coherence warnings |
Communication micro-behaviors: The engage template contains hardcoded guidelines that instruct the LLM on micro-level conversational behaviors that are always active - not gated by emotion:
| Behavior | Guideline |
|---|---|
| Speech rhythm mirroring | If the caller speaks in short bursts, respond concisely; if conversational, match warmth |
| Emotional name usage | Use the caller's name at moments of emotional significance, not mechanically |
| Pause injection | When delivering difficult information, pause naturally before the key detail |
| Pace inversion | When the caller is rushing, slow the pace with longer sentences and gentle transitions |
| Completion inference | When the caller trails off mid-sentence, acknowledge what they were trying to say |
| Emotion concealment | Never explicitly mention that the system can detect emotions |
| Natural laughter | Contextual laughter available for naturally warm moments - used sparingly |
Layer 4: Audio Output
The engage LLM's text streams to the TTS engine for speech synthesis with per-turn dynamic controls:
Emotion - Derived from the voice tone priority chain
Speed - From workspace voice settings
Volume - From workspace voice settings
Word-level timestamps are collected for every generated word - start time and end time - enabling transcript-to-audio scrubbing in the call playback UI. This is critical for the review queue workflow where operators need to jump to specific moments in a call.
Layer 5: Post-Call Intelligence
Two optional analyses run after every call (controlled via voice settings):
Transcript verification - Re-transcribes the full call audio with a high-accuracy batch model and computes Word Error Rate (WER) against the real-time transcript. Produces verified_transcript, verified_words, and transcript_accuracy - enabling quality comparisons between the real-time and batch transcription.
Quality analysis - Listens to the full stereo recording (caller + agent) and scores on 5 dimensions (1–5 each):
| Dimension | Question |
|---|---|
| Task Completion | Did the agent achieve the caller's goal? |
| Information Accuracy | Was the information provided correct? |
| Conversation Flow | Was the conversation natural and smooth? |
| Error Recovery | How well did the agent recover from mistakes? |
| Caller Experience | How did the caller feel at the end? |
Self-improving feedback loop: Quality analysis also produces stt_suggestions - words the STT misheard, formatted as recognition keywords for future calls. This creates a closed loop: words misrecognized on today's calls become recognition keyterms that improve transcription on tomorrow's.
How Calls Work
Every call runs inside a conference architecture - a multi-party audio bridge that enables the caller, AI agent, and optionally a human operator to all participate simultaneously.
Inbound Call Flow (Instant Greeting)
The system eliminates dead air at call start through parallel pre-warming - the engine, greeting, and agent connection all initialize while the phone is still ringing.
Key insight: The telephony conference API accepts friendly names, not just IDs. The conference name is known at webhook time. The agent leg is created immediately - the conference is created on-demand when the agent joins. This means the agent can be fully connected and waiting before the caller even picks up.
Timeline comparison:
| Step | Standard | Pre-warmed |
|---|---|---|
| Webhook → Engine ready | After pickup (+1–3s) | During ring (hidden) |
| Agent leg creation | After pickup (+200ms) | During ring (hidden) |
| WebSocket connection | After pickup (+200ms) | During ring (hidden) |
| Greeting generation | After pickup (+500ms) | During ring (hidden) |
| Total dead air | ~1200ms | ~200–300ms (TTS streaming only) |
Safety guarantees:
Caller hangs up during ring → cache entry expires (30s TTL), resources cleaned up lazily
WebSocket lands on different pod → cache miss, standard initialization (no degradation)
Pre-warm exceeds timeout → TwiML returned anyway, standard initialization on pickup
Session capacity is NOT consumed during pre-warm (no active session yet)
Pre-warm is best-effort. If initialization takes longer than expected, the system falls back to standard initialization - no degradation in call quality, just a slightly longer time to first greeting.
Outbound Call Flow
Outbound calls are world-model-native - scheduled as outbound_task entities via the schedule_outbound_call tool during inbound calls, then dispatched by the connector runner when they become due.
Five business logic patterns can produce outbound tasks:
| Pattern | Shape | Example |
|---|---|---|
| Scheduled | Decision made, execution deferred | "I'll call you back tomorrow at 2pm" |
| Event-Reactive | Trigger → evaluate → maybe act | New lab result → is it critical? → call patient |
| Continuous Monitoring | Periodic population sweep | Patients with no contact in 30 days |
| Conversational Follow-Through | Track preconditions from agent promises | "I'll call after the doctor reviews" → pending on doctor event |
| Orchestrated Campaign | Achieve outcome for population over time | "Get all 200 patients to complete annual wellness by Q4" |
Each outbound task carries: patient reference, reason, goal, priority (1–10), business-hours window (timezone-aware), retry config (max attempts with configurable backoff), and rich context from the patient's world model projection. The dispatch loop enriches the system prompt so the agent starts the call with full patient knowledge - the agent never needs to "look up" the patient.
Conference Architecture
Conference architecture - telephony details
The conference architecture supports multiple simultaneous audio participants with independent per-participant streams:
| Participant | Role | Transport | STT |
|---|---|---|---|
| Caller | Person who called or was called | PSTN | Dedicated per-participant stream |
| Agent | AI voice agent | Bidirectional WebSocket | Main session STT |
| Operator | Human monitor/takeover (optional) | PSTN or browser WebRTC | Dedicated per-participant stream |
Three-party speaker resolution: When multiple parties are on the call, speaker attribution uses a priority chain: operator STT → caller STT → default (caller). Every turn in the call record carries speaker_id and speaker_role for accurate attribution in the transcript.
Context Graph (HSM) Engine
The voice agent executes a Hierarchical State Machine (HSM) loaded from the service's version set. Each call gets its own engine instance with an in-memory state database for zero-latency state tracking, flushed to persistent storage after the call ends.
State Types
| State type | Behavior | LLM involvement |
|---|---|---|
| ActionState | Agent performs actions and evaluates exit conditions to transition | Yes - Engage LLM |
| DecisionState | Agent evaluates conditions and chooses a transition | Yes - Navigator only |
| ReflectionState | Agent reasons deeply over a problem with optional tool calls | Yes - deep reasoning |
| ToolCallState | Enforces execution of a designated tool before transitioning | No - automatic |
| RecallState | Retrieves information from memory before transitioning | No - automatic |
| AnnotationState | Injects an inner thought and transitions immediately | No - automatic |
Per-Turn Flow
The navigator handles multi-state traversal automatically - decision states, annotation states, and recall states are resolved without user interaction before landing on an action state for the engage LLM.
Navigator resilience: Structured output validation with automatic retry (up to 3 total attempts). On all retries exhausted, falls back to the first valid action or exit. Filler text from earlier attempts is preserved across retries (first-wins) - the caller never hears silence even during recovery.
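The retry-with-first-wins-filler contract can be sketched as a small wrapper. This is a hedged illustration: `attempt_navigation` is a hypothetical callable standing in for one structured-output attempt, and the fallback sentinel is an assumption:

```python
def navigate_with_retry(attempt_navigation, max_attempts=3):
    """Up to 3 attempts; filler from the earliest attempt is preserved
    (first-wins) so the caller never hears silence during recovery.
    On exhaustion, fall back to the first valid action.
    (Illustrative sketch of the documented contract.)"""
    filler = None
    for _ in range(max_attempts):
        try:
            action, candidate_filler = attempt_navigation()
            # First-wins: keep the earliest filler even on a later success.
            return action, filler or candidate_filler
        except ValueError as exc:                 # invalid structured output
            partial = getattr(exc, "filler", None)
            if filler is None and partial:
                filler = partial
    return "FALLBACK_FIRST_VALID_ACTION", filler  # assumed fallback sentinel
```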
Terminal State & Auto-Hangup
When the context graph reaches its terminal state (an ActionState with one action and zero exits), the agent speaks its goodbye and automatically ends the call:
1. Navigator lands on terminal state → is_terminal = true
2. Agent speaks the goodbye response
3. Waits for TTS to finish + grace period (audio buffer flush)
4. Terminates the call via telephony API
Silence detection: When the caller goes silent, the silence monitor fires check-ins at increasing intervals (10s → 20s → 40s). After 3 unanswered check-ins, the agent says a brief goodbye and auto-disconnects.
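The doubling check-in schedule (10s → 20s → 40s, then goodbye after 3 unanswered check-ins) is simple to express. A sketch, with the function name assumed:

```python
def silence_checkin_schedule(base=10.0, max_checkins=3):
    """Doubling intervals for silence check-ins; after the last unanswered
    check-in the agent says a brief goodbye and disconnects."""
    return [base * (2 ** i) for i in range(max_checkins)]

# silence_checkin_schedule() → [10.0, 20.0, 40.0]
```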
Session shutdown contract: Every code path that stops the session must also stop the audio speaker - otherwise the speaker blocks indefinitely. This is enforced across all shutdown triggers: hangup, STT failure, WebSocket disconnect, and terminal state.
World Model Integration
The voice agent connects to the workspace's world model through three data channels - this architecture is informed by the Liquid World Model thesis where the distinction between data infrastructure and intelligence dissolves.
Channel 1: Ambient (Pushed)
Data the LLM should always have without asking. Injected into the system prompt at session start and refreshed as the conversation evolves:
Patient demographics - name, DOB, MRN, phone, email, address
Clinical context - active conditions, medications, allergies (filtered to text-only for LLM consumption)
Upcoming appointments - with patient entity references for cross-referencing
Insurance coverage - active plans and subscriber info
Location context - clinic details, available appointment types, hours (resolved from the inbound phone number)
Design principle: ambient over queried. If the LLM will almost certainly need this data, push it into context. Don't make it ask. A voice agent that already has the patient's insurance in context doesn't need to dispatch a tool call to look it up.
Channel 2: Queried (Pulled)
Data that can't be ambient because the search space is too large. The agent calls built-in clinical tools to retrieve specific information.
Key simplification: Queried tools return human-readable results, not database internals. Slot search returns doctor names and times, not template IDs and slot UUIDs. When the agent says "book the 1:45 with Dr. Jones," the system resolves scheduling internals from cached slot data. The LLM never touches scheduling internals.
Channel 3: Extracted (Captured)
Structured data mentioned in conversation - insurance details, contact information, preferences - is automatically captured and written to the world model without requiring explicit tool calls. This eliminates the mode switch where the LLM stops being a conversationalist and becomes a database operator. The conversation IS the data entry.
Extracted data is written with moderate confidence (below verified threshold) - the LLM can still use explicit write tools for high-stakes data where precision matters. Extraction is a complement, not a replacement.
Multi-Stage Verification
All data written by the voice agent during calls starts at a low confidence level and must pass through a verification pipeline before syncing to external systems. This is the trust architecture for autonomous agents acting on noisy phone audio.
Three-stage automated review:
| Stage | Question | Outcome |
|---|---|---|
| Call Classifier | Is this a real clinical call or junk? (prank, ad, bot, silence) | Real → continue; Junk → reject all session events |
| Per-Event Judge | Cross-references each event against transcript + existing entity state | Approve, auto-correct (formatting), or flag for human review |
| Session Coherence | Do all events tell a coherent story? Contradictions? Missing data? | Upgrade confidence if coherent, flag if contradictions found |
Why three stages, not one: Per-event review catches data-level errors (wrong phone format, impossible DOB, name doesn't match transcript). Session-level review catches narrative-level errors (contradictions between events, discussed insurance but no coverage event recorded). These are fundamentally different kinds of errors requiring different analysis approaches.
Patient Safety Isolation
A write scope is enforced per session - write tools can only target the patient identified in the current call. This prevents cross-patient data errors. Write tools are also deduplicated - identical calls within the same session return cached results rather than creating duplicate records (30-second TTL, successful results only - errors are always retryable).
Emotional Adaptation
The voice agent adapts across four independent output channels simultaneously based on real-time caller emotion. Each row in the matrix below is a detected situation; columns show how each output channel responds. All adaptation is automatic - workspace managers control only the baseline via voice settings.
Valence–Arousal Model
Every detected emotion maps to a two-dimensional (valence, arousal) coordinate. The system tracks these coordinates across a rolling window to build a stable yet responsive picture of the caller's emotional state:
| Caller state | Strategy | Voice tone | Response guidance |
|---|---|---|---|
| High-arousal negative (anger, frustration) | De-escalate | calm | Direct, concise, acknowledge frustration, skip pleasantries, match urgency |
| Low-arousal negative (sadness, disappointment) | Comfort | sympathetic | Warm, patient, gentle language, give extra space, do not rush |
| High-arousal positive (excitement, joy) | Match energy | enthusiastic | Enthusiastic language, keep momentum, match positive energy |
| Low-arousal positive (contentment, relief) | Maintain | content | Warm and steady, reinforce positive outcome, conversational |
| Confusion (high confidence) | Clarify | calm | Simplify explanations, break into small pieces, check understanding, offer to repeat |
| Anxiety (high confidence) | Reassure | sympathetic | Calm and reassuring, provide clear next steps, avoid uncertainty |
Voice Tone Priority Chain
The agent's voice tone is determined by a six-level priority chain - each layer fires only if the previous returned no signal:
Why this ordering matters:
Bursts are the highest-priority signal because they capture the most immediate emotional state. A caller who just laughed should hear warmth immediately - not the rolling average of the last 30 seconds. Burst detection (within last 5 seconds, confidence ≥ 0.5) overrides everything.
Tone momentum (layer 4) prevents jarring voice tone changes. When the current emotional signal is weak (score < 0.25) or doesn't map to a tone, the previous turn's tone persists. Only a strong contradictory signal changes the tone - making the voice feel continuous across the conversation:
Proactive topic sensitivity (layer 3) fires before the caller shows distress. When the agent is about to discuss test results, billing, surgery, or other loaded topics, the voice tone preemptively shifts to sympathetic - even without an emotion signal.
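The six-level chain reduces to a fall-through function. A hedged sketch: input shapes, field names, and the final system fallback are assumptions; the documented levels are burst (within 5s, confidence ≥ 0.5), emotion-derived tone (score ≥ 0.25), sensitive-topic softening, tone momentum, and the workspace default:

```python
def resolve_tts_tone(burst, emotion, upcoming_topics, previous_tone,
                     workspace_tone, sensitive_topics, now):
    """Each layer fires only if the previous returned no signal.
    (Illustrative sketch; dict shapes are hypothetical.)"""
    # 1. Recent vocal burst overrides everything - most immediate signal.
    if burst and now - burst["at"] <= 5.0 and burst["confidence"] >= 0.5:
        return burst["tone"]
    # 2. Emotion-derived tone from the rolling window, if the signal is strong.
    if emotion and emotion["score"] >= 0.25 and emotion.get("tone"):
        return emotion["tone"]
    # 3. Proactive sensitive-topic softening - fires before distress appears.
    if any(topic in sensitive_topics for topic in upcoming_topics):
        return "sympathetic"
    # 4. Tone momentum: weak or unmappable signal keeps last turn's tone.
    if previous_tone:
        return previous_tone
    # 5. Workspace baseline tone from voice settings.
    if workspace_tone:
        return workspace_tone
    # 6. System fallback (assumed).
    return "calm"
```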
Emotion → Response Matrix
All four adaptation channels respond simultaneously to each caller state. The agent mirrors empathy, not the caller's emotion:
| Caller emotions | Voice tone | Fillers | Response guidance | Strategy |
|---|---|---|---|---|
| Anger, Annoyance, Contempt | calm | Suppressed | Direct, concise, acknowledge frustration | De-escalate - don't mirror aggression |
| Anxiety, Fear, Distress | sympathetic | Reassuring | Calm, clear next steps, avoid uncertainty | Reassure - steady presence |
| Sadness, Disappointment, Guilt | sympathetic | Warm | Patient, supportive, don't rush | Warm empathy - give space |
| Confusion | calm | Simple | Simplify, small pieces, check understanding | Patient clarity |
| Excitement, Joy, Enthusiasm | enthusiastic | Warm, matching | Match positive energy, keep momentum | Mirror positive energy |
| Contentment, Relief, Gratitude | content | Warm | Steady, reinforce outcome | Warm and grounding |
| Interest, Concentration | curious | Engaged | Engaged tone, match intellectual focus | Show interest |
| Embarrassment, Doubt | calm | Encouraging | Non-judgmental, encouraging | Put at ease |
| Boredom, Tiredness | enthusiastic | Concise | Re-engage with energy, be efficient | Re-energize |
| Sarcasm | calm | Professional | Respond to underlying concern, not surface tone | Stay professional |
Burst-to-experience mapping: 25 vocal burst types are mapped to specific agent tones and caller state interpretations:
| Burst | Agent tone | Interpreted caller state |
|---|---|---|
| Laugh, Giggle | enthusiastic | Amused |
| Sigh | sympathetic | Weary |
| Cry, Sob, Whimper | sympathetic | Distressed |
| Gasp | calm | Alarmed |
| Groan, Ugh | sympathetic / calm | Frustrated |
| Growl, Tsk | calm | Angry |
| Hmm, Mhm | calm | Thinking / Acknowledging |
| Aww | sympathetic | Touched |
Filler Speech
Fillers cover processing latency so the caller never hears silence. The system uses principle-based guidance - not hardcoded phrase lists - generating contextually appropriate fillers from emotional context, the current action, and the expected latency.
Three-layer filler generation:
| Layer | When active | Behavior |
|---|---|---|
| Latency adaptation | Always | Filler length matches expected processing time (2-4 words for normal latency, 3-5 words for audio verification processing) |
| Emotional attunement | When emotion data available | Emotional register matches the caller's state - not specific phrases, but principles like "gentle and reassuring" or "a verbal hand on the shoulder" |
| Action context | Always | Current context graph action description injected so the filler hints at what the agent is about to do |
Per-action filler hints: Context graph actions can include optional PM-configured filler suggestions. These are weak steering - emotion-adaptive principles always dominate. The LLM sees hints as suggestions to draw from, not commands.
Suppression rule: When valence < -0.2 AND arousal > 0.4 AND emotion is NOT Anxiety/Fear/Distress → fillers disabled entirely. Frustrated callers don't want acknowledgments - they want the answer. Exception: Anxious callers still receive reassuring fillers, because anxiety benefits from reassurance while frustration does not.
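The suppression rule above is a single predicate. A direct sketch of the documented condition (function name assumed):

```python
def fillers_enabled(valence, arousal, dominant_emotion):
    """Disable fillers for frustrated callers (negative valence, high
    arousal), but keep reassuring fillers for anxious callers."""
    anxious = dominant_emotion in {"Anxiety", "Fear", "Distress"}
    if valence < -0.2 and arousal > 0.4 and not anxious:
        return False   # frustrated: skip acknowledgments, get to the answer
    return True
```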
Call Phase Escalation
The system automatically increases urgency as calls extend with negative sentiment:
| Phase | Time | Condition | Steering |
|---|---|---|---|
| Early | < 5 min | Any | Standard emotional adaptation |
| Mid | 5–10 min | Trend deteriorating | "Focus on resolution speed. Shorten responses." |
| Late | ≥ 10 min | Negative valence | URGENCY. "Prioritize resolution. Be maximally concise. Escalate if unable to resolve." |
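The escalation policy is a duration-plus-sentiment lookup. A sketch under the documented thresholds (function name and abbreviated steering strings are assumptions):

```python
def phase_steering(minutes_elapsed, trend, valence):
    """Map call duration and sentiment to escalating steering text.
    (Illustrative; thresholds from the call-phase table.)"""
    if minutes_elapsed >= 10 and valence < 0:
        return ("URGENCY: prioritize resolution, be maximally concise, "
                "escalate if unable to resolve")
    if minutes_elapsed >= 5 and trend == "deteriorating":
        return "Focus on resolution speed. Shorten responses."
    return "Standard emotional adaptation"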
Proactive Intelligence
The system detects emotionally sensitive topics from the current context graph action before the caller shows distress:
sensitive_topics is configurable via voice settings. Falls back to healthcare defaults: test results, diagnosis, billing, payment, insurance, denial, emergency, referral, specialist, surgery, procedure, medication.
This fires at priority level 3 in the TTS emotion chain - below burst and prosody (which have actual data about the caller's current state) but above tone momentum and workspace defaults.
Coherence Detection
When what the caller says doesn't match how they sound (coherence < 0.4), the system shifts its steering: "The caller's words suggest X but voice sounds Y. Trust the vocal tone over the words - respond to how they sound, not what they claim."
This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy to the caller.
Control Plane ↔ Adaptation
How each workspace voice setting interacts with the automatic emotion adaptation system:
| Setting | Purpose | Emotion-system interaction | When it applies |
|---|---|---|---|
| tone | Baseline voice emotion for neutral callers | Emotion-derived tone replaces it | Any non-neutral emotion detected (score ≥ 0.25) |
| speed | Base speech rate | Never overridden | Your choice always respected |
| volume | Base volume | Never overridden | Your choice always respected |
| voice_id | Voice persona | Per-agent voice config overrides | Agent version has voice config set |
| keyterms | Domain vocabulary for STT boost | Merged with service keyterms | Always additive, never overridden |
| correction_categories | Domain hints for audio correction | None | Used as additional context |
| sensitive_topics | Topics for proactive tone softening | Falls back to healthcare defaults | Preemptive, not reactive |
| post_call_analysis_enabled | Quality scoring on/off | None | Full PM control |
| transcript_correction_enabled | Re-verification on/off | None | Full PM control |
Key principle: Workspace managers control the baseline experience and domain knowledge. The emotion intelligence system overrides the baseline only when it detects a strong signal - and always in the direction of more empathy, never less.
Graceful Degradation
Every intelligence layer is best-effort with an explicit fallback. A failed intelligence layer must never fail a call.
| Layer | Failure mode | Fallback | Impact |
|---|---|---|---|
| Emotion connection | Auth error, billing, timeout | Session continues without emotion detection | No emotional adaptation, workspace defaults used |
| Emotion segment | Processing error, connection close | Consecutive failure counter → disable after 5 | Degrades gracefully to less data |
| Emotion detection | Insufficient data (< 2 segments) | No emotional steering, default fillers | First few seconds may lack adaptation |
| Burst detection | No burst events | Falls through to prosody-derived emotion | Loses immediate reaction, uses rolling average |
| Language model | No language results | Coherence defaults to 1.0 (agreement assumed) | Loses word-vs-tone disagreement detection |
| Audio verification | Timeout or error | No corrections injected, call continues | Relies on raw STT only |
| Voice settings | Parse error | Defaults (filler on, emotion on) | Baseline experience still works |
| Post-call analysis | Any error | Logged, not raised (fire-and-forget) | Quality data missing, call unaffected |
| TTS connection | Close/error mid-stream | Auto-reconnect on next turn | Brief silence, then recovery |
| STT connection | Connection loss | Exponential backoff reconnect (max 3 attempts) | Brief gap in transcription |
| Context graph engine | Backend unavailable | Falls back to static prompt mode (no HSM) | Agent still converses, just without state machine navigation |
Tool Execution
Skills configured in the context graph execute asynchronously during calls - the agent acknowledges the action and continues speaking while tools run in the background. Results arrive as continuation turns.
Execution Tiers
Tool calls are routed through an execution tier system that matches the tool's complexity to the right execution model:
| Tier | Mode | Execution model | Latency | Examples |
|---|---|---|---|---|
| T1 | Direct | Single integration API call, no LLM | < 2s | Patient lookup, allergy check, medication list |
| T2 | Orchestrated | Multi-turn LLM agent with tool access | 2–30s | Eligibility cascades, multi-step writes |
| T3 | Autonomous | Extended agent loop with checkpointing and MCP tools | 30s–5min | Complex prior auth, cross-system reconciliation |
T3 autonomous agents use a full agent SDK with:
Custom MCP tools injected per-task (world model tools, integration tools)
Session checkpointing for pause/resume across retries
Cost caps per task to prevent runaway execution
Isolated working directories per task
Write-tool deduplication: All write tools are deduplicated within a session (30-second TTL). Identical tool calls return cached results. Only successful results are cached - errors are always retryable.
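The dedup rule (identical calls within 30s return cached results; only successes are cached so errors stay retryable) can be sketched as a session-scoped cache. An illustrative class; names, the JSON-based key, and the `{"ok": ...}` result shape are assumptions:

```python
import json
import time

class WriteToolDeduper:
    """Session-scoped write-tool dedup: identical calls within the TTL
    return the cached result; failed writes are never cached, so they
    remain retryable. (Illustrative sketch.)"""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self.cache = ttl, clock, {}

    def _key(self, tool, args):
        # Canonical key: tool name + sorted-JSON arguments.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def call(self, tool, args, execute):
        key = self._key(tool, args)
        hit = self.cache.get(key)
        if hit and self.clock() - hit[0] < self.ttl:
            return hit[1]                    # duplicate within TTL: cached result
        result = execute(tool, args)
        if result.get("ok"):                 # cache successful results only
            self.cache[key] = (self.clock(), result)
        return result
```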
Built-in Clinical Tools
Healthcare workspaces get 13 built-in tools automatically - no integration configuration required:
Read tools:
| Tool | Description | Notes |
|---|---|---|
| Patient lookup | Search by DOB, name, phone, or MRN | DOB preferred for accuracy |
| Slot search | Available appointment slots by location and date | Returns human-readable times + doctor names, caches slot internals |
| Appointment lookup | Patient's existing appointments | Returns appointment references for cancel/confirm |
| Semantic patient search | Fuzzy, embedding-based patient matching | Handles misspellings and partial information |
| Semantic event search | Embedding-based search across clinical events | Optionally scoped to a specific patient |
Write tools:
| Tool | Description | Notes |
|---|---|---|
| Patient create | Create patient with automatic deduplication | Dedup by name + DOB |
| Patient update | Update contact info (phone, email, address) | Requires entity reference |
| Save patient | Create-or-update with dedup check | Accepts natural field names and flexible date formats |
| Schedule appointment | Book from slot search results or explicit times | Accepts slot_ref from slot search - auto-resolves booking details |
| Cancel appointment | Cancel by appointment reference | Writes cancellation event |
| Confirm appointment | Confirm a booked appointment | Writes confirmation event |
| Create insurance | Insurance record with carrier fuzzy-matching | Supports policy holder info |
| Schedule outbound call | Schedule a future callback | Creates outbound_task entity atomically |
All write tools pass through the multi-stage verification pipeline before data reaches external systems. All write tools enforce patient safety isolation.
Call Forwarding
A built-in forward_call tool transfers the caller to a human. Two modes:
Static forwarding - per-phone-number fallback, configured via Phone Numbers
Location-based forwarding - the agent selects from location phone numbers in the patient's context
The agent cannot specify arbitrary phone numbers - the destination always comes from the resolved config or location entity state. When the caller requests a human, the agent is required to invoke the tool - the actual transfer happens via the telephony system, not through words alone.
Deferred transfer - Call transfers are deferred until the agent's goodbye message finishes playing. The transfer is cancellable by barge-in or operator join.
Audio Verification
When the agent needs to capture structured data (names, dates, phone numbers, insurance IDs), it can trigger audio verification - sending the caller's raw audio for AI-powered correction alongside the real-time transcript.
This catches STT errors on structured data that streaming transcription commonly gets wrong: proper names, alphanumeric IDs, phone numbers, and dates.
Domain-aware: correction_categories from voice settings are injected as domain hints. This tells the correction model: "This workspace commonly handles medication names, insurance carriers. STT frequently gets these wrong. Pay extra attention."
Correction Output
Corrections are structured as field-level pairs showing what STT heard vs. the corrected value:
Correction Confidence

| Confidence | Score | Agent behavior |
| --- | --- | --- |
| Certain | 8–9 | Use corrected value directly without confirming |
| Likely | 5–7 | Confirm with caller ("I have [value], is that correct?") |
| Uncertain | 1–4 | Ask caller to spell out or repeat slowly |
| Both models wrong | - | Audio quality is poor; ask for letter-by-letter spelling |
Observer events include the original STT value, the corrected value, and the numeric confidence - enabling frontend visualization of correction accuracy.
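The confidence-to-behavior mapping above is a simple threshold ladder. A minimal sketch, assuming a hypothetical `correction_action` helper and representing "both models wrong" as `None`:

```python
def correction_action(score):
    """Map an audio-verification confidence score (1-9) to the agent
    behavior from the correction-confidence table; None means both
    models disagreed (poor audio)."""
    if score is None:
        return "spell_letter_by_letter"   # both models wrong
    if score >= 8:
        return "use_directly"             # Certain
    if score >= 5:
        return "confirm_with_caller"      # Likely
    return "ask_to_spell_or_repeat"       # Uncertain
```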
Safety & Monitoring
Conversation Monitor
An embedding-based safety detection system evaluates every turn against configured safety concepts using a two-stage pipeline:
Standalone fallback: If semantic similarity exceeds a high threshold (default 0.85), escalation triggers immediately without waiting for the AI judge - providing a safety net even if the judge model is unavailable.
Default safety concepts (always active): suicidal ideation, self harm, domestic violence, adverse drug reaction, post-discharge red flag. Custom concepts can be added via the Safety API with pre-computed embeddings.
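The two-stage shape of the monitor can be sketched as follows: cosine similarity against pre-computed concept embeddings, with the documented 0.85 standalone-fallback threshold escalating immediately, and a mid-band score deferring to the AI judge. The lower `JUDGE_BAND` bound and the function names are assumptions for illustration:

```python
import math

HARD_THRESHOLD = 0.85   # standalone-fallback threshold from the docs
JUDGE_BAND = 0.60       # assumed lower bound for invoking the AI judge

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def evaluate_turn(turn_embedding, concept_embeddings, judge=None):
    """Stage 1: embedding similarity against every configured concept.
    A score >= HARD_THRESHOLD escalates even if the judge model is
    unavailable; a mid-band score asks the judge to decide."""
    best = max(cosine(turn_embedding, c) for c in concept_embeddings.values())
    if best >= HARD_THRESHOLD:
        return "escalate"                      # standalone fallback
    if best >= JUDGE_BAND and judge is not None:
        return "escalate" if judge(best) else "pass"
    return "pass"
```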
Auto-Escalation
When an escalation triggers, the system:
Writes an escalation event to the world model (dual-entity: both call and operator entities)
Notifies the operator dashboard
For hard escalations - immediately suspends the AI agent pending human intervention
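The three steps above can be sketched as one handler. All interfaces here (`write_event`, `notify`, `suspend`, the `severity` field) are hypothetical stand-ins for illustration, not the platform's actual API:

```python
def handle_escalation(world_model, dashboard, agent, escalation):
    """Sketch of the escalation flow: dual-entity world-model write,
    operator notification, and agent suspension on hard escalations."""
    event = {"type": "escalation", **escalation}
    world_model.write_event(entity="call", event=event)       # dual-entity
    world_model.write_event(entity="operator", event=event)   # write
    dashboard.notify(event)
    if escalation.get("severity") == "hard":
        agent.suspend()   # AI paused pending human intervention
```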
Observer WebSocket
Monitor active calls in real time via a cross-pod WebSocket connection:
Requires a valid workspace API key. Any observer instance can monitor any active call in the workspace, regardless of which pod handles the call (events are distributed via pub/sub).
Late-join replay: Observers connecting mid-call receive a buffered replay of recent events before transitioning to the live stream. Events carry monotonic sequence numbers for ordering.
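Because events carry monotonic sequence numbers, a late-joining observer can stitch the buffered replay onto the live stream and drop any event that spans the boundary. A minimal client-side sketch (the `seq` field name is an assumption):

```python
def merge_replay_and_live(replay, live):
    """Yield the buffered replay followed by the live stream, skipping
    any event whose sequence number was already seen -- duplicates can
    occur where the replay buffer and live stream overlap."""
    last_seq = -1
    for event in list(replay) + list(live):
        if event["seq"] <= last_seq:
            continue              # duplicate across the replay/live boundary
        last_seq = event["seq"]
        yield event
```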
Event Types
| Event | Payload | Source |
| --- | --- | --- |
| `session_start` | call_sid, service_id, workspace_id, initial_state, trace_id | Session init |
| `session_info` | Full call snapshot (sent on observer connect) | Observer connect |
| `user_transcript` | transcript, emotion_label, emotion_valence | Turn controller |
| `agent_transcript` | transcript, action, interrupted | Speaker |
| `state_transition` | previous_state, next_state | Turn controller |
| `tool_call_started` | tool_name, call_id, input | Turn controller |
| `tool_call_completed` | tool_name, duration_ms, output (truncated), succeeded, error_message | Turn controller |
| `nav_timing` | nav_ms, render_ms, total_ms, input_tokens, output_tokens, model, state | Turn controller |
| `latency` | e2e_ttfb_ms, engine_ms, nav_ms, render_ms, audio_ttfb_ms, continuation | Speaker |
| `emotion` | dominant, valence, arousal | Transport |
| `session_end` | call_sid, duration_s, turns, completion_reason, final_state | Session shutdown |
| `ping` | (empty) | Keepalive (30s) |
Call Record & Persistence
Every call produces a detailed record persisted to the database:
Turns - Each turn carries a 5-layer timing model (all fields in milliseconds):
Layer 1 (STT): `user_speech_start_ms`, `user_speech_end_ms` - speech boundaries
Layer 2 (Engine): `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms` - processing latency breakdown
Layer 4 (TTS/Transport): `agent_speech_start_ms`, `agent_speech_end_ms` - when agent audio played
Tool calls - Name, input, output, duration, success/failure
State transitions - Full HSM navigation history
Emotional summary - See below
Escalation history - Full escalation lifecycle if operator joined
Config snapshot - Version set, agent version, HSM version used
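The per-turn timing fields can be combined into a few headline figures. A sketch, assuming a turn is a dict keyed by the field names above; the derived metric names and the breakdown sum are illustrative, not part of the record schema:

```python
def turn_latency(turn):
    """Derive headline latencies (all ms) from a turn's timing fields:
    how long the caller spoke, the silence gap before the agent replied,
    and the sum of the engine's sub-stage timings."""
    return {
        "user_speech_ms": turn["user_speech_end_ms"] - turn["user_speech_start_ms"],
        "response_gap_ms": turn["agent_speech_start_ms"] - turn["user_speech_end_ms"],
        "engine_breakdown_ms": turn["nav_ms"] + turn["render_ms"] + turn["audio_ttfb_ms"],
    }
```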
Calls API
Active Calls
Lists all currently active calls across the workspace. Active call state is maintained in a distributed registry - any API pod can serve this request regardless of which pod handles the call.
Call History
Call Detail
Full call record including turns with timing model, tool calls, state transitions, emotional summary, escalation history, safety state, and config snapshot.
Recordings
GET /calls/{call_id}/recording/stereo
Stereo WAV (caller left channel, agent right channel)
GET /calls/{call_id}/recording/waveform
Amplitude envelope for timeline visualization
GET /calls/{call_id}/recording/{channel}
Single channel WAV (caller or agent)
POST /calls/{call_id}/verify-transcript
Re-transcribe with high-accuracy batch model for ground-truth timestamps
Outbound Calls
Emotional Summary
At call end, the system persists a complete emotional record available in the call detail response:
Roadmap: Toward Deeper Empathy
The emotional intelligence system is actively evolving. These are areas where we're investing to push beyond current capabilities:
| Area | Today | Where we're headed |
| --- | --- | --- |
| Prosodic rhythm | Text-level rhythm guidance (shorter sentences for urgency, gentle transitions when rushing) | Audio-level prosodic planning: breath-like pauses, per-word speed variation, rhythm that matches the emotional weight of each sentence |
| Emotional response time | Emotion applied on the next turn after detection (~2-4s); burst detection (laughs, sighs) provides faster sub-segment signals | Sub-second emotional adaptation: responding to a voice crack within the same conversational beat |
| Emotional memory across calls | Each call persists a full emotional summary; patient context injected from the world model | Cross-call emotional profiles: "this patient was anxious about test results last call" surfaced proactively in future calls |
| Mixed-emotion voice | Single emotion label per generation; text structure conveys nuance | Emotion blending: "warm concern with a hint of encouragement" expressed in a single sentence through TTS-level control |
API Reference