Voice Agent
Real-time voice pipeline with emotion detection, context graph engine, tool execution, and post-call analysis.
The Amigo voice agent powers real-time, emotionally intelligent voice conversations. It handles inbound and outbound phone calls, executing context graph (Hierarchical State Machine, HSM) logic with speech understanding, text-to-speech, tool execution, safety monitoring, and continuous emotional adaptation. Every call connects to the world model, reads live patient context, writes clinical events with multi-stage verification, and adapts its behavior based on real-time vocal emotion analysis.
Reliability target: This system handles healthcare scheduling calls where callers may be in distress, pain, or crisis. Every design decision prioritizes graceful degradation - if any intelligence layer fails, the call continues with the next-best behavior, never silence.
Voice settings and Classic API differences - Voice settings (tone, speed, keyterms, sensitive topics, post-call flags) are configured at the workspace level; see Workspaces - Voice Settings. Classic API offers WebSocket voice streaming for text-based apps; Platform API voice is phone-based with emotion detection, EHR context, and operator escalation.
Audio Pipeline Architecture
Every voice call flows through a five-layer pipeline that transforms the caller's audio into emotionally adaptive agent speech - while simultaneously reading from and writing to the world model.
Layer 1: Signal Capture
Two parallel streams process the caller's audio simultaneously - neither blocks the other. This dual-stream architecture is fundamental: speech recognition and emotion detection are completely independent. A failure in one never impacts the other.
Speech-to-Text - Real-time streaming transcription with sub-300ms latency. Three layers of domain vocabulary boost recognition accuracy:
Service-level keyterms - Managed by workspace administrators, applied to all calls for that service
Workspace voice settings keyterms - API-configurable per workspace (see voice settings)
System defaults - Engineering-level fallback vocabulary
All sources are merged and deduplicated per call. Configurable end-of-turn detection with tunable confidence thresholds determines when the caller has finished speaking - balancing responsiveness against cutting off mid-sentence.
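The merge-and-deduplicate step described above can be sketched as follows. This is an illustrative sketch, not the production code; the function name and source ordering are assumptions (service keyterms first, then workspace settings, then system defaults):

```python
def merge_keyterms(service_terms, workspace_terms, system_defaults):
    """Merge the three keyterm sources in priority order, deduplicating
    case-insensitively while preserving first-seen casing.
    (Hypothetical helper; illustrates the documented merge behavior.)"""
    merged, seen = [], set()
    for source in (service_terms, workspace_terms, system_defaults):
        for term in source:
            key = term.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(term.strip())
    return merged

# Duplicates across sources collapse; earlier sources win on casing.
terms = merge_keyterms(
    ["metformin", "Dr. Alvarez"],   # service-level keyterms
    ["Metformin", "copay"],         # workspace voice settings
    ["copay", "MRN"],               # system defaults
)
```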
Emotion Detection - Parallel audio analysis with zero impact on the voice pipeline. Three concurrent models analyze every audio segment on a single persistent connection:
| Model | Input | Output | Notes |
|---|---|---|---|
| Prosody | 2-second audio segments | 48 emotions from vocal tone, pitch, rhythm, timbre | Real-time voice quality analysis |
| Vocal Burst | Same audio segments | 67 non-speech vocal types (laughs, sighs, cries, gasps, groans) | Captures sounds that transcription loses entirely |
| Language | Final STT transcripts | 53 emotions + 9-point sentiment + 6-category toxicity | Detects sarcasm, tiredness, annoyance, disapproval, enthusiasm - 5 emotions unavailable from audio alone |
Dual-payload multiplexing: Audio segments request prosody + burst models; text transcripts request the language model. Responses are unambiguous - each contains only the models requested. This separation is architecturally important: the language model requires text input, not audio, and runs on STT output rather than raw audio - ensuring it analyzes what the caller said, not just how they sounded.
Audio buffering: 2-second segments with non-blocking queues (max 5 segments for audio, max 20 for text). The emotion pipeline never blocks the voice pipeline, and dropped segments gracefully degrade to slightly less precise emotion detection rather than failure.
Circuit breaker protection: Emotion detection is protected by a circuit breaker (2 failures triggers 10-second recovery). If the emotion service is degraded, calls continue smoothly with workspace defaults - the circuit breaker prevents cascading latency from affecting the critical voice path.
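The circuit breaker's contract (2 consecutive failures open the breaker; a 10-second window before retry) can be sketched as a small state machine. This is a minimal illustration of the documented behavior, not the production class; names and the injectable clock are assumptions:

```python
import time

class EmotionCircuitBreaker:
    """After `failure_threshold` consecutive failures, skip the emotion
    path for `recovery_seconds` so a degraded emotion service cannot
    add latency to the critical voice path. (Illustrative sketch.)"""

    def __init__(self, failure_threshold=2, recovery_seconds=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True                      # closed: normal operation
        if self.clock() - self.opened_at >= self.recovery_seconds:
            self.opened_at = None            # half-open: probe again
            self.failures = 0
            return True
        return False                         # open: use workspace defaults

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()    # trip the breaker
```

While the breaker is open, the call simply proceeds with workspace-default tone and no emotional steering.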
Layer 2: Intelligence
The Emotional State maintains a rolling 30-second window (~15 segments) of recent caller signals with recency-weighted linear averaging - the most recent signals have the highest influence. The agent responds to the caller's current emotional state, not an average of the whole call.
Valence/Arousal computation: Every emotion maps to a (valence, arousal) coordinate via a complete emotion-dimension mapping. Weighted sums across all detected emotions per segment, then recency-weighted across the rolling window, produce stable yet responsive emotional tracking.
Trend detection: Compares first-half vs second-half valence of the rolling window. A delta > 0.1 → improving; delta < -0.1 → deteriorating; otherwise stable. This powers the call-phase escalation system - deteriorating trends trigger increasingly urgent adaptation.
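The recency weighting and half-window trend comparison above reduce to a few lines. A sketch under the stated assumptions (linear weights 1..n with the newest segment heaviest, and the documented ±0.1 trend thresholds):

```python
def recency_weighted_valence(valences):
    """Linear recency weighting over the rolling window:
    weight 1..n, newest segment has the highest influence."""
    weights = range(1, len(valences) + 1)
    return sum(w * v for w, v in zip(weights, valences)) / sum(weights)

def trend(valences):
    """Compare first-half vs second-half mean valence of the window."""
    mid = len(valences) // 2
    first, second = valences[:mid], valences[mid:]
    delta = sum(second) / len(second) - sum(first) / len(first)
    if delta > 0.1:
        return "improving"
    if delta < -0.1:
        return "deteriorating"
    return "stable"
```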
Coherence (prosody vs language agreement): Measures whether what the caller says matches how they sound:
| Condition | Coherence | Behavior |
|---|---|---|
| Same valence sign (both positive or both negative) | High (0.7–1.0) | Normal adaptation |
| Opposite valence (words say fine, voice says distressed) | Low (0.0–0.3) | Trust the vocal tone over the words |
| One signal neutral | Mild (0.8) | Use the available signal |
When coherence < 0.4: The caller may be masking their true state. The agent's emotional steering instruction shifts: "Respond to how they sound, not what they claim." This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy.
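One way to realize the coherence bands above is a piecewise score over the two valence signals. This is a hedged sketch: the neutral band and interpolation factors are assumptions chosen so the output lands in the documented ranges, not the production formula:

```python
def coherence(prosody_valence, language_valence, neutral_band=0.1):
    """Score prosody-vs-language agreement into the documented bands:
    same sign -> high (0.7-1.0), one neutral -> 0.8, opposite -> low (0.0-0.3).
    (Illustrative; exact production scoring may differ.)"""
    p_neutral = abs(prosody_valence) < neutral_band
    l_neutral = abs(language_valence) < neutral_band
    if p_neutral or l_neutral:
        return 0.8                                   # one signal neutral: mild
    gap = abs(prosody_valence - language_valence)
    if (prosody_valence > 0) == (language_valence > 0):
        return 1.0 - gap * 0.3                       # same sign: 0.7-1.0
    return max(0.0, 0.3 - gap * 0.1)                 # opposite sign: 0.0-0.3

# coherence(-0.6, 0.5) falls below 0.4 -> steer toward the vocal tone
```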
Behavioral signal tracking: Updated in real-time from the session and turn controller:
| Signal | Meaning | Threshold | Interpretation |
|---|---|---|---|
| barge_in_count | Caller interrupts agent speech | ≥ 2 | Frustration - agent talking too much |
| short_response_streak | Consecutive responses ≤ 4 words | ≥ 3 | Disengagement - caller withdrawing |
| silence_gap_count | Gaps ≥ 5 seconds | ≥ 2 | Confusion, hesitation, or distress |
Semantic barge-in detection - Barge-in detection uses semantic confirmation - it requires actual recognized words from the STT engine (not just voice activity detection). This filters false triggers from coughs, breathing, echo, and background noise. Minimum speech duration is 0.5 seconds with recognized words, with a 1.0-second fallback for delayed word recognition.
These behavioral signals are injected into the system prompt alongside emotional steering - the LLM receives a complete picture of both how the caller sounds and how they're behaving.
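Applying the thresholds above to produce prompt-injectable flags is a direct mapping. A minimal sketch (the flag wording is a hypothetical stand-in for the actual steering text):

```python
def behavioral_flags(barge_in_count, short_response_streak, silence_gap_count):
    """Map the documented behavioral thresholds to steering flags
    for the system prompt. (Illustrative; flag text is assumed.)"""
    flags = []
    if barge_in_count >= 2:
        flags.append("caller is interrupting - shorten your responses")
    if short_response_streak >= 3:
        flags.append("caller is disengaging - re-engage and check in")
    if silence_gap_count >= 2:
        flags.append("long silences - caller may be confused or distressed")
    return flags
```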
Layer 3: Context Graph Engine
Each turn processes through a two-stage LLM pipeline:
Navigator - Selects the next action or exit condition from the current context graph state. Also generates a filler phrase to cover processing latency and determines whether to trigger audio verification for structured data capture. Uses structured output validation with automatic retry (up to 3 attempts) and fallback to first valid action.
Engage LLM - Generates the caller-facing response, informed by the selected action, full conversation history with per-message emotion annotations ([VOICE: EmotionName, valence=V.VVV]), audio correction results, emotional steering context, ambient patient context, and available tools.
Emotion reaches the LLM via two independent paths:
| Path | Delivery | Content |
|---|---|---|
| Per-message annotations | Every user message | Inline [VOICE: Anxiety, valence=-0.312] - the LLM sees the emotional trajectory across the full conversation |
| Session-level steering | System prompt | Dominant emotion + trend, quadrant-specific adaptation instructions, behavioral signals, call-phase urgency, coherence warnings |
Communication micro-behaviors: The engage template contains hardcoded guidelines that instruct the LLM on micro-level conversational behaviors that are always active - not gated by emotion:
| Behavior | Guideline |
|---|---|
| Speech rhythm mirroring | If the caller speaks in short bursts, respond concisely; if conversational, match warmth |
| Emotional name usage | Use the caller's name at moments of emotional significance, not mechanically |
| Pause injection | When delivering difficult information, pause naturally before the key detail |
| Pace inversion | When the caller is rushing, slow the pace with longer sentences and gentle transitions |
| Completion inference | When the caller trails off mid-sentence, acknowledge what they were trying to say |
| Emotion concealment | Never explicitly mention that the system can detect emotions |
| Natural laughter | Contextual laughter available for naturally warm moments - used sparingly |
Layer 4: Audio Output
The engage LLM's text streams to the TTS engine for speech synthesis with per-turn dynamic controls:
Emotion - Derived from the voice tone priority chain
Speed - From workspace voice settings
Volume - From workspace voice settings
Word-level timestamps are collected for every generated word - start time and end time - enabling transcript-to-audio scrubbing in the call playback UI. This is critical for the review queue workflow where operators need to jump to specific moments in a call.
Layer 5: Post-Call Intelligence
Two optional analyses run after every call (controlled via voice settings):
Transcript verification - Re-transcribes the full call audio with a high-accuracy batch model and computes Word Error Rate (WER) against the real-time transcript. Produces verified_transcript, verified_words, and transcript_accuracy - enabling quality comparisons between the real-time and batch transcription.
Quality analysis - Listens to the full stereo recording (caller + agent) and scores on 5 dimensions (1–5 each):
| Dimension | Question |
|---|---|
| Task Completion | Did the agent achieve the caller's goal? |
| Information Accuracy | Was the information provided correct? |
| Conversation Flow | Was the conversation natural and smooth? |
| Error Recovery | How well did the agent recover from mistakes? |
| Caller Experience | How did the caller feel at the end? |
Self-improving feedback loop: Quality analysis also produces stt_suggestions - words the STT misheard, formatted as recognition keywords for future calls. This creates a closed loop: words misrecognized on today's calls become recognition keyterms that improve transcription on tomorrow's.
How Calls Work
Every call runs inside a conference architecture - a multi-party audio bridge that enables the caller, AI agent, and optionally a human operator to all participate simultaneously.
Inbound Call Flow (Instant Greeting)
The system eliminates dead air at call start through parallel pre-warming - the engine, greeting, and agent connection all initialize while the phone is still ringing.
Key insight: The telephony conference API accepts friendly names, not just IDs. The conference name is known at webhook time. The agent leg is created immediately - the conference is created on-demand when the agent joins. This means the agent can be fully connected and waiting before the caller even picks up.
Timeline comparison:
| Step | Standard | Pre-warmed |
|---|---|---|
| Webhook → Engine ready | After pickup (+1–3s) | During ring (hidden) |
| Agent leg creation | After pickup (+200ms) | During ring (hidden) |
| WebSocket connection | After pickup (+200ms) | During ring (hidden) |
| Greeting generation | After pickup (+500ms) | During ring (hidden) |
| Total dead air | ~1200ms | ~200–300ms (TTS streaming only) |
Safety guarantees:
Caller hangs up during ring → cache entry expires (30s TTL), resources cleaned up lazily
WebSocket lands on different pod → cache miss, standard initialization (no degradation)
Pre-warm exceeds timeout → TwiML returned anyway, standard initialization on pickup
Session capacity is NOT consumed during pre-warm (no active session yet)
Pre-warm is best-effort. If initialization takes longer than expected, the system falls back to standard initialization - no degradation in call quality, just a slightly longer time to first greeting.
Outbound Call Flow
Outbound calls are world-model-native - scheduled as outbound_task entities via the schedule_outbound_call tool during inbound calls, then dispatched by the connector runner when they become due.
Five business logic patterns can produce outbound tasks:
| Pattern | Shape | Example |
|---|---|---|
| Scheduled | Decision made, execution deferred | "I'll call you back tomorrow at 2pm" |
| Event-Reactive | Trigger → evaluate → maybe act | New lab result → is it critical? → call patient |
| Continuous Monitoring | Periodic population sweep | Patients with no contact in 30 days |
| Conversational Follow-Through | Track preconditions from agent promises | "I'll call after the doctor reviews" → pending on doctor event |
| Orchestrated Campaign | Achieve outcome for population over time | "Get all 200 patients to complete annual wellness by Q4" |
Each outbound task carries: patient reference, reason, goal, priority (1–10), business-hours window (timezone-aware), retry config (max attempts with configurable backoff), and rich context from the patient's world model projection. The dispatch loop enriches the system prompt so the agent starts the call with full patient knowledge - the agent never needs to "look up" the patient.
Conference Architecture
Conference architecture - telephony details
The conference architecture supports multiple simultaneous audio participants with independent per-participant streams:
| Participant | Role | Transport | STT |
|---|---|---|---|
| Caller | Person who called or was called | PSTN | Dedicated per-participant stream |
| Agent | AI voice agent | Bidirectional WebSocket | Main session STT |
| Operator | Human monitor/takeover (optional) | PSTN or browser WebRTC | Dedicated per-participant stream |
Three-party speaker resolution: When multiple parties are on the call, speaker attribution uses a priority chain: operator STT → caller STT → default (caller). Every turn in the call record carries speaker_id and speaker_role for accurate attribution in the transcript.
Context Graph (HSM) Engine
The voice agent executes a Hierarchical State Machine (HSM) loaded from the service's version set. Each call gets its own engine instance with an in-memory state database for zero-latency state tracking, flushed to persistent storage after the call ends.
State Types
| State type | Behavior | LLM involvement |
|---|---|---|
| ActionState | Agent performs actions and evaluates exit conditions to transition | Yes - Engage LLM |
| DecisionState | Agent evaluates conditions and chooses a transition | Yes - Navigator only |
| ReflectionState | Agent reasons deeply over a problem with optional tool calls | Yes - deep reasoning |
| ToolCallState | Enforces execution of a designated tool before transitioning | No - automatic |
| RecallState | Retrieves information from memory before transitioning | No - automatic |
| AnnotationState | Injects an inner thought and transitions immediately | No - automatic |
Per-Turn Flow
The navigator handles multi-state traversal automatically - decision states, annotation states, and recall states are resolved without user interaction before landing on an action state for the engage LLM.
Navigator resilience: Structured output validation with automatic retry (up to 3 total attempts). On all retries exhausted, falls back to the first valid action or exit. Filler text from earlier attempts is preserved across retries (first-wins) - the caller never hears silence even during recovery.
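The retry-with-first-wins-filler contract can be sketched as a small wrapper. This is a hedged illustration: `attempt_navigation` is a hypothetical callable standing in for one structured-output attempt, and the fallback sentinel is an assumption:

```python
def navigate_with_retry(attempt_navigation, max_attempts=3):
    """Up to 3 attempts; filler from the earliest attempt is preserved
    (first-wins) so the caller never hears silence during recovery.
    On exhaustion, fall back to the first valid action.
    (Illustrative sketch of the documented contract.)"""
    filler = None
    for _ in range(max_attempts):
        try:
            action, candidate_filler = attempt_navigation()
            # First-wins: keep the earliest filler even on a later success.
            return action, filler or candidate_filler
        except ValueError as exc:                 # invalid structured output
            partial = getattr(exc, "filler", None)
            if filler is None and partial:
                filler = partial
    return "FALLBACK_FIRST_VALID_ACTION", filler  # assumed fallback sentinel
```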
Terminal State & Auto-Hangup
When the context graph reaches its terminal state (an ActionState with one action and zero exits), the agent speaks its goodbye and automatically ends the call:
1. Navigator lands on terminal state → is_terminal = true
2. Agent speaks the goodbye response
3. Waits for TTS to finish + grace period (audio buffer flush)
4. Terminates the call via telephony API
Silence detection: When the caller goes silent, the silence monitor fires check-ins at increasing intervals (10s → 20s → 40s). After 3 unanswered check-ins, the agent says a brief goodbye and auto-disconnects.
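The doubling check-in schedule (10s → 20s → 40s, then goodbye after 3 unanswered check-ins) is simple to express. A sketch, with the function name assumed:

```python
def silence_checkin_schedule(base=10.0, max_checkins=3):
    """Doubling intervals for silence check-ins; after the last unanswered
    check-in the agent says a brief goodbye and disconnects."""
    return [base * (2 ** i) for i in range(max_checkins)]

# silence_checkin_schedule() → [10.0, 20.0, 40.0]
```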
Session shutdown contract: Every code path that stops the session must also stop the audio speaker - otherwise the speaker blocks indefinitely. This is enforced across all shutdown triggers: hangup, STT failure, WebSocket disconnect, and terminal state.
World Model Integration
The voice agent connects to the workspace's world model through three data channels - this architecture is informed by the Liquid World Model thesis where the distinction between data infrastructure and intelligence dissolves.
Channel 1: Ambient (Pushed)
Data the LLM should always have without asking. Injected into the system prompt at session start and refreshed as the conversation evolves:
Patient demographics - name, DOB, MRN, phone, email, address
Clinical context - active conditions, medications, allergies (filtered to text-only for LLM consumption)
Upcoming appointments - with patient entity references for cross-referencing
Insurance coverage - active plans and subscriber info
Location context - clinic details, available appointment types, hours (resolved from the inbound phone number)
Design principle: ambient over queried. If the LLM will almost certainly need this data, push it into context. Don't make it ask. A voice agent that already has the patient's insurance in context doesn't need to dispatch a tool call to look it up.
Channel 2: Queried (Pulled)
Data that can't be ambient because the search space is too large. The agent calls built-in clinical tools to retrieve specific information.
Key simplification: Queried tools return human-readable results, not database internals. Slot search returns doctor names and times, not template IDs and slot UUIDs. When the agent says "book the 1:45 with Dr. Jones," the system resolves scheduling internals from cached slot data. The LLM never touches scheduling internals.
Channel 3: Extracted (Captured)
Structured data mentioned in conversation - insurance details, contact information, preferences - is automatically captured and written to the world model without requiring explicit tool calls. This eliminates the mode switch where the LLM stops being a conversationalist and becomes a database operator. The conversation IS the data entry.
Extracted data is written with moderate confidence (below verified threshold) - the LLM can still use explicit write tools for high-stakes data where precision matters. Extraction is a complement, not a replacement.
Multi-Stage Verification
All data written by the voice agent during calls starts at a low confidence level and must pass through a verification pipeline before syncing to external systems. This is the trust architecture for autonomous agents acting on noisy phone audio.
Three-stage automated review:
| Stage | Question | Outcome |
|---|---|---|
| Call Classifier | Is this a real clinical call or junk? (prank, ad, bot, silence) | Real → continue; Junk → reject all session events |
| Per-Event Judge | Cross-references each event against transcript + existing entity state | Approve, auto-correct (formatting), or flag for human review |
| Session Coherence | Do all events tell a coherent story? Contradictions? Missing data? | Upgrade confidence if coherent, flag if contradictions found |
Why three stages, not one: Per-event review catches data-level errors (wrong phone format, impossible DOB, name doesn't match transcript). Session-level review catches narrative-level errors (contradictions between events, discussed insurance but no coverage event recorded). These are fundamentally different kinds of errors requiring different analysis approaches.
Patient Safety Isolation
A write scope is enforced per session - write tools can only target the patient identified in the current call. This prevents cross-patient data errors. Write tools are also deduplicated - identical calls within the same session return cached results rather than creating duplicate records (30-second TTL, successful results only - errors are always retryable).
Emotional Adaptation
The voice agent adapts across four independent output channels simultaneously based on real-time caller emotion. Each row in the matrix below is a detected situation; columns show how each output channel responds. All adaptation is automatic - workspace managers control only the baseline via voice settings.
Valence–Arousal Model
Every detected emotion maps to a two-dimensional (valence, arousal) coordinate. The system tracks these coordinates across a rolling window to build a stable yet responsive picture of the caller's emotional state:
| Caller state | Strategy | Voice tone | Response guidance |
|---|---|---|---|
| High-arousal negative (anger, frustration) | De-escalate | calm | Direct, concise, acknowledge frustration, skip pleasantries, match urgency |
| Low-arousal negative (sadness, disappointment) | Comfort | sympathetic | Warm, patient, gentle language, give extra space, do not rush |
| High-arousal positive (excitement, joy) | Match energy | enthusiastic | Enthusiastic language, keep momentum, match positive energy |
| Low-arousal positive (contentment, relief) | Maintain | content | Warm and steady, reinforce positive outcome, conversational |
| Confusion (high confidence) | Clarify | calm | Simplify explanations, break into small pieces, check understanding, offer to repeat |
| Anxiety (high confidence) | Reassure | sympathetic | Calm and reassuring, provide clear next steps, avoid uncertainty |
Voice Tone Priority Chain
The agent's voice tone is determined by a six-level priority chain - each layer fires only if the previous returned no signal:
Why this ordering matters:
Bursts are the highest-priority signal because they capture the most immediate emotional state. A caller who just laughed should hear warmth immediately - not the rolling average of the last 30 seconds. Burst detection (within last 5 seconds, confidence ≥ 0.5) overrides everything.
Tone momentum (layer 4) prevents jarring voice tone changes. When the current emotional signal is weak (score < 0.25) or doesn't map to a tone, the previous turn's tone persists. Only a strong contradictory signal changes the tone - making the voice feel continuous across the conversation:
Proactive topic sensitivity (layer 3) fires before the caller shows distress. When the agent is about to discuss test results, billing, surgery, or other loaded topics, the voice tone preemptively shifts to sympathetic - even without an emotion signal.
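The six-level chain reduces to a fall-through function. A hedged sketch: input shapes, field names, and the final system fallback are assumptions; the documented levels are burst (within 5s, confidence ≥ 0.5), emotion-derived tone (score ≥ 0.25), sensitive-topic softening, tone momentum, and the workspace default:

```python
def resolve_tts_tone(burst, emotion, upcoming_topics, previous_tone,
                     workspace_tone, sensitive_topics, now):
    """Each layer fires only if the previous returned no signal.
    (Illustrative sketch; dict shapes are hypothetical.)"""
    # 1. Recent vocal burst overrides everything - most immediate signal.
    if burst and now - burst["at"] <= 5.0 and burst["confidence"] >= 0.5:
        return burst["tone"]
    # 2. Emotion-derived tone from the rolling window, if the signal is strong.
    if emotion and emotion["score"] >= 0.25 and emotion.get("tone"):
        return emotion["tone"]
    # 3. Proactive sensitive-topic softening - fires before distress appears.
    if any(topic in sensitive_topics for topic in upcoming_topics):
        return "sympathetic"
    # 4. Tone momentum: weak or unmappable signal keeps last turn's tone.
    if previous_tone:
        return previous_tone
    # 5. Workspace baseline tone from voice settings.
    if workspace_tone:
        return workspace_tone
    # 6. System fallback (assumed).
    return "calm"
```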
Emotion → Response Matrix
All four adaptation channels respond simultaneously to each caller state. The agent mirrors empathy, not the caller's emotion:
| Caller emotions | Voice tone | Fillers | Response guidance | Strategy |
|---|---|---|---|---|
| Anger, Annoyance, Contempt | calm | Suppressed | Direct, concise, acknowledge frustration | De-escalate - don't mirror aggression |
| Anxiety, Fear, Distress | sympathetic | Reassuring | Calm, clear next steps, avoid uncertainty | Reassure - steady presence |
| Sadness, Disappointment, Guilt | sympathetic | Warm | Patient, supportive, don't rush | Warm empathy - give space |
| Confusion | calm | Simple | Simplify, small pieces, check understanding | Patient clarity |
| Excitement, Joy, Enthusiasm | enthusiastic | Warm, matching | Match positive energy, keep momentum | Mirror positive energy |
| Contentment, Relief, Gratitude | content | Warm | Steady, reinforce outcome | Warm and grounding |
| Interest, Concentration | curious | Engaged | Engaged tone, match intellectual focus | Show interest |
| Embarrassment, Doubt | calm | Encouraging | Non-judgmental, encouraging | Put at ease |
| Boredom, Tiredness | enthusiastic | Concise | Re-engage with energy, be efficient | Re-energize |
| Sarcasm | calm | Professional | Respond to underlying concern, not surface tone | Stay professional |
Burst-to-experience mapping: 25 vocal burst types are mapped to specific agent tones and caller state interpretations:
| Burst | Agent tone | Interpreted caller state |
|---|---|---|
| Laugh, Giggle | enthusiastic | Amused |
| Sigh | sympathetic | Weary |
| Cry, Sob, Whimper | sympathetic | Distressed |
| Gasp | calm | Alarmed |
| Groan, Ugh | sympathetic / calm | Frustrated |
| Growl, Tsk | calm | Angry |
| Hmm, Mhm | calm | Thinking / Acknowledging |
| Aww | sympathetic | Touched |
Filler Speech
Fillers cover processing latency so the caller never hears silence. The system uses principle-based guidance - not hardcoded phrase lists - generating contextually appropriate fillers from emotional context, the current action, and the expected latency.
Three-layer filler generation:
| Layer | When active | Behavior |
|---|---|---|
| Latency adaptation | Always | Filler length matches expected processing time (2-4 words for normal latency, 3-5 words for audio verification processing) |
| Emotional attunement | When emotion data available | Emotional register matches the caller's state - not specific phrases, but principles like "gentle and reassuring" or "a verbal hand on the shoulder" |
| Action context | Always | Current context graph action description injected so the filler hints at what the agent is about to do |
Per-action filler hints: Context graph actions can include optional PM-configured filler suggestions. These are weak steering - emotion-adaptive principles always dominate. The LLM sees hints as suggestions to draw from, not commands.
Suppression rule: When valence < -0.2 AND arousal > 0.4 AND emotion is NOT Anxiety/Fear/Distress → fillers disabled entirely. Frustrated callers don't want acknowledgments - they want the answer. Exception: Anxious callers still receive reassuring fillers, because anxiety benefits from reassurance while frustration does not.
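The suppression rule above is a single predicate. A direct sketch of the documented condition (function name assumed):

```python
def fillers_enabled(valence, arousal, dominant_emotion):
    """Disable fillers for frustrated callers (negative valence, high
    arousal), but keep reassuring fillers for anxious callers."""
    anxious = dominant_emotion in {"Anxiety", "Fear", "Distress"}
    if valence < -0.2 and arousal > 0.4 and not anxious:
        return False   # frustrated: skip acknowledgments, get to the answer
    return True
```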
Call Phase Escalation
The system automatically increases urgency as calls extend with negative sentiment:
| Phase | Time | Condition | Steering |
|---|---|---|---|
| Early | < 5 min | Any | Standard emotional adaptation |
| Mid | 5–10 min | Trend deteriorating | "Focus on resolution speed. Shorten responses." |
| Late | ≥ 10 min | Negative valence | URGENCY. "Prioritize resolution. Be maximally concise. Escalate if unable to resolve." |
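The escalation policy is a duration-plus-sentiment lookup. A sketch under the documented thresholds (function name and abbreviated steering strings are assumptions):

```python
def phase_steering(minutes_elapsed, trend, valence):
    """Map call duration and sentiment to escalating steering text.
    (Illustrative; thresholds from the call-phase table.)"""
    if minutes_elapsed >= 10 and valence < 0:
        return ("URGENCY: prioritize resolution, be maximally concise, "
                "escalate if unable to resolve")
    if minutes_elapsed >= 5 and trend == "deteriorating":
        return "Focus on resolution speed. Shorten responses."
    return "Standard emotional adaptation"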
Proactive Intelligence
The system detects emotionally sensitive topics from the current context graph action before the caller shows distress:
sensitive_topics is configurable via voice settings. Falls back to healthcare defaults: test results, diagnosis, billing, payment, insurance, denial, emergency, referral, specialist, surgery, procedure, medication.
This fires at priority level 3 in the TTS emotion chain - below burst and prosody (which have actual data about the caller's current state) but above tone momentum and workspace defaults.
Coherence Detection
When what the caller says doesn't match how they sound (coherence < 0.4), the system shifts its steering: "The caller's words suggest X but voice sounds Y. Trust the vocal tone over the words - respond to how they sound, not what they claim."
This is injected into the system prompt without the agent ever explicitly mentioning the discrepancy to the caller.
Control Plane ↔ Adaptation
How each workspace voice setting interacts with the automatic emotion adaptation system:
| Setting | Purpose | Emotion-system interaction | When it applies |
|---|---|---|---|
| tone | Baseline voice emotion for neutral callers | Emotion-derived tone replaces it | Any non-neutral emotion detected (score ≥ 0.25) |
| speed | Base speech rate | Never overridden | Your choice always respected |
| volume | Base volume | Never overridden | Your choice always respected |
| voice_id | Voice persona | Per-agent voice config overrides | Agent version has voice config set |
| keyterms | Domain vocabulary for STT boost | Merged with service keyterms | Always additive, never overridden |
| correction_categories | Domain hints for audio correction | None | Used as additional context |
| sensitive_topics | Topics for proactive tone softening | Falls back to healthcare defaults | Preemptive, not reactive |
| post_call_analysis_enabled | Quality scoring on/off | None | Full PM control |
| transcript_correction_enabled | Re-verification on/off | None | Full PM control |
Key principle: Workspace managers control the baseline experience and domain knowledge. The emotion intelligence system overrides the baseline only when it detects a strong signal - and always in the direction of more empathy, never less.
Graceful Degradation
Every intelligence layer is best-effort with an explicit fallback. A failed intelligence layer must never fail a call.
| Layer | Failure mode | Fallback | Impact |
|---|---|---|---|
| Emotion connection | Auth error, billing, timeout | Session continues without emotion detection | No emotional adaptation, workspace defaults used |
| Emotion segment | Processing error, connection close | Consecutive failure counter → disable after 5 | Degrades gracefully to less data |
| Emotion detection | Insufficient data (< 2 segments) | No emotional steering, default fillers | First few seconds may lack adaptation |
| Burst detection | No burst events | Falls through to prosody-derived emotion | Loses immediate reaction, uses rolling average |
| Language model | No language results | Coherence defaults to 1.0 (agreement assumed) | Loses word-vs-tone disagreement detection |
| Audio verification | Timeout or error | No corrections injected, call continues | Relies on raw STT only |
| Voice settings | Parse error | Defaults (filler on, emotion on) | Baseline experience still works |
| Post-call analysis | Any error | Logged, not raised (fire-and-forget) | Quality data missing, call unaffected |
| TTS connection | Close/error mid-stream | Auto-reconnect on next turn | Brief silence, then recovery |
| STT connection | Connection loss | Exponential backoff reconnect (max 3 attempts) | Brief gap in transcription |
| Context graph engine | Backend unavailable | Falls back to static prompt mode (no HSM) | Agent still converses, just without state machine navigation |
Tool Execution
Skills configured in the context graph execute asynchronously during calls - the agent acknowledges the action and continues speaking while tools run in the background. Results arrive as continuation turns.
Execution Tiers
Tool calls are routed through an execution tier system that matches the tool's complexity to the right execution model:
| Tier | Mode | Execution model | Latency | Examples |
|---|---|---|---|---|
| T1 | Direct | Single integration API call, no LLM | < 2s | Patient lookup, allergy check, medication list |
| T2 | Orchestrated | Multi-turn LLM agent with tool access | 2–30s | Eligibility cascades, multi-step writes |
| T3 | Autonomous | Extended agent loop with checkpointing and MCP tools | 30s–5min | Complex prior auth, cross-system reconciliation |
T3 autonomous agents use a full agent SDK with:
Custom MCP tools injected per-task (world model tools, integration tools)
Session checkpointing for pause/resume across retries
Cost caps per task to prevent runaway execution
Isolated working directories per task
Write-tool deduplication: All write tools are deduplicated within a session (30-second TTL). Identical tool calls return cached results. Only successful results are cached - errors are always retryable.
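The dedup rule (identical calls within 30s return cached results; only successes are cached so errors stay retryable) can be sketched as a session-scoped cache. An illustrative class; names, the JSON-based key, and the `{"ok": ...}` result shape are assumptions:

```python
import json
import time

class WriteToolDeduper:
    """Session-scoped write-tool dedup: identical calls within the TTL
    return the cached result; failed writes are never cached, so they
    remain retryable. (Illustrative sketch.)"""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self.cache = ttl, clock, {}

    def _key(self, tool, args):
        # Canonical key: tool name + sorted-JSON arguments.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def call(self, tool, args, execute):
        key = self._key(tool, args)
        hit = self.cache.get(key)
        if hit and self.clock() - hit[0] < self.ttl:
            return hit[1]                    # duplicate within TTL: cached result
        result = execute(tool, args)
        if result.get("ok"):                 # cache successful results only
            self.cache[key] = (self.clock(), result)
        return result
```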
Built-in Clinical Tools
Healthcare workspaces get 13 built-in tools automatically - no integration configuration required:
Read tools:
| Tool | Description | Notes |
|---|---|---|
| Patient lookup | Search by DOB, name, phone, or MRN | DOB preferred for accuracy |
| Slot search | Available appointment slots by location and date | Returns human-readable times + doctor names, caches slot internals |
| Appointment lookup | Patient's existing appointments | Returns appointment references for cancel/confirm |
| Semantic patient search | Fuzzy, embedding-based patient matching | Handles misspellings and partial information |
| Semantic event search | Embedding-based search across clinical events | Optionally scoped to a specific patient |
Write tools:
| Tool | Description | Notes |
|---|---|---|
| Patient create | Create patient with automatic deduplication | Dedup by name + DOB |
| Patient update | Update contact info (phone, email, address) | Requires entity reference |
| Save patient | Create-or-update with dedup check | Accepts natural field names and flexible date formats |
| Schedule appointment | Book from slot search results or explicit times | Accepts slot_ref from slot search - auto-resolves booking details |
| Cancel appointment | Cancel by appointment reference | Writes cancellation event |
| Confirm appointment | Confirm a booked appointment | Writes confirmation event |
| Create insurance | Insurance record with carrier fuzzy-matching | Supports policy holder info |
| Schedule outbound call | Schedule a future callback | Creates outbound_task entity atomically |
All write tools pass through the multi-stage verification pipeline before data reaches external systems. All write tools enforce patient safety isolation.
Call Forwarding
A built-in forward_call tool transfers the caller to a human. Two modes:
Static forwarding - per-phone-number fallback, configured via Phone Numbers
Location-based forwarding - the agent selects from location phone numbers in the patient's context
The agent cannot specify arbitrary phone numbers - the destination always comes from the resolved config or location entity state. When the caller requests a human, the agent is required to invoke the tool - the actual transfer happens via the telephony system, not through words alone.
Deferred transfer - Call transfers are deferred until the agent's goodbye message finishes playing. The transfer is cancellable by barge-in or operator join.
Audio Verification
When the agent needs to capture structured data (names, dates, phone numbers, insurance IDs), it can trigger audio verification - sending the caller's raw audio for AI-powered correction alongside the real-time transcript.
This catches STT errors on structured data that streaming transcription commonly gets wrong: proper names, alphanumeric IDs, phone numbers, and dates.
Domain-aware: correction_categories from voice settings are injected as domain hints. This tells the correction model: "This workspace commonly handles medication names, insurance carriers. STT frequently gets these wrong. Pay extra attention."
Correction Output
Corrections are structured as field-level pairs showing what STT heard vs. the corrected value:
Correction Confidence

| Confidence | Score | Agent behavior |
| --- | --- | --- |
| Certain | 8–9 | Use corrected value directly without confirming |
| Likely | 5–7 | Confirm with caller ("I have [value], is that correct?") |
| Uncertain | 1–4 | Ask caller to spell out or repeat slowly |
| Both models wrong | - | Audio quality is poor; ask for letter-by-letter spelling |
Observer events include the original STT value, the corrected value, and the numeric confidence - enabling frontend visualization of correction accuracy.
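The confidence-to-behavior mapping above is a simple threshold ladder. A minimal sketch, assuming a hypothetical `correction_action` helper and representing "both models wrong" as `None`:

```python
def correction_action(score):
    """Map an audio-verification confidence score (1-9) to the agent
    behavior from the correction-confidence table; None means both
    models disagreed (poor audio)."""
    if score is None:
        return "spell_letter_by_letter"   # both models wrong
    if score >= 8:
        return "use_directly"             # Certain
    if score >= 5:
        return "confirm_with_caller"      # Likely
    return "ask_to_spell_or_repeat"       # Uncertain
```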
Safety & Monitoring
Conversation Monitor
An embedding-based safety detection system evaluates every turn against configured safety concepts using a two-stage pipeline:
Standalone fallback: If semantic similarity exceeds a high threshold (default 0.85), escalation triggers immediately without waiting for the AI judge - providing a safety net even if the judge model is unavailable.
Default safety concepts (always active): suicidal ideation, self harm, domestic violence, adverse drug reaction, post-discharge red flag. Custom concepts can be added via the Safety API with pre-computed embeddings.
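The two-stage shape of the monitor can be sketched as follows: cosine similarity against pre-computed concept embeddings, with the documented 0.85 standalone-fallback threshold escalating immediately, and a mid-band score deferring to the AI judge. The lower `JUDGE_BAND` bound and the function names are assumptions for illustration:

```python
import math

HARD_THRESHOLD = 0.85   # standalone-fallback threshold from the docs
JUDGE_BAND = 0.60       # assumed lower bound for invoking the AI judge

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def evaluate_turn(turn_embedding, concept_embeddings, judge=None):
    """Stage 1: embedding similarity against every configured concept.
    A score >= HARD_THRESHOLD escalates even if the judge model is
    unavailable; a mid-band score asks the judge to decide."""
    best = max(cosine(turn_embedding, c) for c in concept_embeddings.values())
    if best >= HARD_THRESHOLD:
        return "escalate"                      # standalone fallback
    if best >= JUDGE_BAND and judge is not None:
        return "escalate" if judge(best) else "pass"
    return "pass"
```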
Auto-Escalation
When an escalation triggers, the system:
Writes an escalation event to the world model (dual-entity: both call and operator entities)
Notifies the operator dashboard
For hard escalations - immediately suspends the AI agent pending human intervention
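The three steps above can be sketched as one handler. All interfaces here (`write_event`, `notify`, `suspend`, the `severity` field) are hypothetical stand-ins for illustration, not the platform's actual API:

```python
def handle_escalation(world_model, dashboard, agent, escalation):
    """Sketch of the escalation flow: dual-entity world-model write,
    operator notification, and agent suspension on hard escalations."""
    event = {"type": "escalation", **escalation}
    world_model.write_event(entity="call", event=event)       # dual-entity
    world_model.write_event(entity="operator", event=event)   # write
    dashboard.notify(event)
    if escalation.get("severity") == "hard":
        agent.suspend()   # AI paused pending human intervention
```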
Observer WebSocket
Monitor active calls in real time via a cross-pod WebSocket connection:
Requires a valid workspace API key. Any observer instance can monitor any active call in the workspace, regardless of which pod handles the call (events are distributed via pub/sub).
Late-join replay: Observers connecting mid-call receive a buffered replay of recent events before transitioning to the live stream. Events carry monotonic sequence numbers for ordering.
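Because events carry monotonic sequence numbers, a late-joining observer can stitch the buffered replay onto the live stream and drop any event that spans the boundary. A minimal client-side sketch (the `seq` field name is an assumption):

```python
def merge_replay_and_live(replay, live):
    """Yield the buffered replay followed by the live stream, skipping
    any event whose sequence number was already seen -- duplicates can
    occur where the replay buffer and live stream overlap."""
    last_seq = -1
    for event in list(replay) + list(live):
        if event["seq"] <= last_seq:
            continue              # duplicate across the replay/live boundary
        last_seq = event["seq"]
        yield event
```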
Event Types
| Event | Payload | Source |
| --- | --- | --- |
| `session_start` | call_sid, service_id, workspace_id, initial_state, trace_id | Session init |
| `session_info` | Full call snapshot (sent on observer connect) | Observer connect |
| `user_transcript` | transcript, emotion_label, emotion_valence | Turn controller |
| `agent_transcript` | transcript, action, interrupted | Speaker |
| `state_transition` | previous_state, next_state | Turn controller |
| `tool_call_started` | tool_name, call_id, input | Turn controller |
| `tool_call_completed` | tool_name, duration_ms, output (truncated), succeeded, error_message | Turn controller |
| `nav_timing` | nav_ms, render_ms, total_ms, input_tokens, output_tokens, model, state | Turn controller |
| `latency` | e2e_ttfb_ms, engine_ms, nav_ms, render_ms, audio_ttfb_ms, continuation | Speaker |
| `emotion` | dominant, valence, arousal | Transport |
| `session_end` | call_sid, duration_s, turns, completion_reason, final_state | Session shutdown |
| `ping` | (empty) | Keepalive (30s) |
Call Record & Persistence
Every call produces a detailed record persisted to the database:
Turns - Each turn carries a 5-layer timing model (all fields in milliseconds):
Layer 1 (STT): `user_speech_start_ms`, `user_speech_end_ms` - speech boundaries
Layer 2 (Engine): `engine_ms`, `nav_ms`, `render_ms`, `audio_ttfb_ms` - processing latency breakdown
Layer 4 (TTS/Transport): `agent_speech_start_ms`, `agent_speech_end_ms` - when agent audio played
Tool calls - Name, input, output, duration, success/failure
State transitions - Full HSM navigation history
Emotional summary - See below
Escalation history - Full escalation lifecycle if operator joined
Config snapshot - Version set, agent version, HSM version used
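The per-turn timing fields can be combined into a few headline figures. A sketch, assuming a turn is a dict keyed by the field names above; the derived metric names and the breakdown sum are illustrative, not part of the record schema:

```python
def turn_latency(turn):
    """Derive headline latencies (all ms) from a turn's timing fields:
    how long the caller spoke, the silence gap before the agent replied,
    and the sum of the engine's sub-stage timings."""
    return {
        "user_speech_ms": turn["user_speech_end_ms"] - turn["user_speech_start_ms"],
        "response_gap_ms": turn["agent_speech_start_ms"] - turn["user_speech_end_ms"],
        "engine_breakdown_ms": turn["nav_ms"] + turn["render_ms"] + turn["audio_ttfb_ms"],
    }
```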
Calls API
Active Calls
Lists all currently active calls across the workspace. Active call state is maintained in a distributed registry - any API pod can serve this request regardless of which pod handles the call.
Call History
Call Detail
Full call record including turns with timing model, tool calls, state transitions, emotional summary, escalation history, safety state, and config snapshot.
Recordings
GET /calls/{call_id}/recording/stereo
Stereo WAV (caller left channel, agent right channel)
GET /calls/{call_id}/recording/waveform
Amplitude envelope for timeline visualization
GET /calls/{call_id}/recording/{channel}
Single channel WAV (caller or agent)
POST /calls/{call_id}/verify-transcript
Re-transcribe with high-accuracy batch model for ground-truth timestamps
Outbound Calls
Emotional Summary
At call end, the system persists a complete emotional record available in the call detail response:
Roadmap: Toward Deeper Empathy
The emotional intelligence system is actively evolving. These are areas where we're investing to push beyond current capabilities:
| Area | Today | Where we're headed |
| --- | --- | --- |
| Prosodic rhythm | Text-level rhythm guidance (shorter sentences for urgency, gentle transitions when rushing) | Audio-level prosodic planning: breath-like pauses, per-word speed variation, rhythm that matches the emotional weight of each sentence |
| Emotional response time | Emotion applied on the next turn after detection (~2-4s); burst detection (laughs, sighs) provides faster sub-segment signals | Sub-second emotional adaptation: responding to a voice crack within the same conversational beat |
| Emotional memory across calls | Each call persists a full emotional summary; patient context injected from the world model | Cross-call emotional profiles: "this patient was anxious about test results last call" surfaced proactively in future calls |
| Mixed-emotion voice | Single emotion label per generation; text structure conveys nuance | Emotion blending: "warm concern with a hint of encouragement" expressed in a single sentence through TTS-level control |
API Reference