Audio Pipeline
How speech recognition, emotion detection, filler speech, barge-in detection, and TTS work together in real time.
Media-Plane Architecture
The voice pipeline uses a split architecture where audio processing and conversation reasoning run as separate services. The media gateway handles all real-time audio concerns - receiving caller audio, streaming it to speech recognition, synthesizing agent responses to audio, and managing playback. The agent engine handles all reasoning concerns - interpreting transcripts, running context graphs, executing skills, and deciding what the agent should say.
The two services communicate over a bidirectional relay. The media gateway sends transcripts to the agent engine and receives commands back: enqueue an utterance for playback, interrupt the current playback (barge-in), reconfigure speech recognition mid-call, or end the session. This separation means audio latency is isolated from reasoning latency, and each layer can scale independently.
Each voice session runs four concurrent pipelines:
Audio receive - Caller audio flows from the telephony transport to the speech recognition service in real time
Transcript relay - Completed transcripts are forwarded to the agent engine for processing
Command dispatch - Agent engine commands (speak, interrupt, stop, reconfigure) are dispatched to the appropriate audio subsystem
Playback - Text-to-speech synthesis runs over a persistent streaming connection, with each utterance tracked independently for interruption and lifecycle management
Sessions enforce a maximum call duration and handle graceful cleanup when any pipeline exits or encounters an error. The gateway also enforces per-pod capacity limits, rejecting new connections when at capacity to protect active sessions from overload.
Viewed by ownership, the media layer owns the telephony connection, speech-to-text, and text-to-speech, while the agent engine owns reasoning, tool execution, and conversation state. Because the two layers communicate only over the internal relay protocol, the audio processing components can be updated, scaled, or replaced without affecting the agent logic, and vice versa.
The relay protocol defines typed messages in each direction. The media layer sends session lifecycle events (start, end), speech transcripts, and playback status to the agent engine. The agent engine sends utterance requests, playback controls (interrupt, stop, drain), speech recognition configuration updates, and session termination commands back to the media layer. All messages carry a type discriminator for efficient dispatch.
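As a rough illustration of type-discriminated dispatch, the sketch below encodes and decodes three relay messages. The class names, field names, JSON encoding, and discriminator values are all hypothetical; the actual relay schema is not specified in this document.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical message shapes, for illustration only.
@dataclass
class Transcript:            # media layer -> agent engine
    session_id: str
    text: str
    type: str = "transcript"

@dataclass
class EnqueueUtterance:      # agent engine -> media layer
    session_id: str
    text: str
    type: str = "enqueue_utterance"

@dataclass
class Interrupt:             # agent engine -> media layer
    session_id: str
    type: str = "interrupt"

# Dispatch table keyed on the type discriminator.
MESSAGE_TYPES = {cls.type: cls for cls in (Transcript, EnqueueUtterance, Interrupt)}

def encode(message) -> str:
    """Serialize a message, including its type discriminator."""
    return json.dumps(asdict(message))

def decode(raw: str):
    """Look up the message class from the discriminator and rebuild it."""
    payload = json.loads(raw)
    cls = MESSAGE_TYPES.get(payload.get("type"))
    if cls is None:
        raise ValueError(f"unknown message type: {payload.get('type')!r}")
    return cls(**payload)
```

A single dictionary lookup on the discriminator replaces a chain of shape checks, which is one way to read "efficient dispatch" here.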
This architecture supports graceful rolling deployments. During a deploy, the media layer stops accepting new sessions while allowing in-flight calls to complete naturally. Health probes distinguish between liveness (the process is running) and readiness (the service is accepting new calls), so the load balancer can drain traffic from a pod without terminating active calls.
The audio pipeline converts a caller's voice into text, processes it through the agent's reasoning, and converts the response back to speech. Two independent streams handle the input side: one for speech recognition, one for emotion analysis. They run in parallel and never block each other.
Signal Capture
Audio arrives from the telephony layer as a standard telephony audio stream. The system splits it into two parallel paths the moment it arrives:
Speech-to-text - Converts audio to transcript text in real time
Emotion detection - Analyzes vocal qualities for emotional signals (covered in Emotion Detection)
If either path fails, the other continues unaffected. A failure in emotion detection does not delay transcription. A failure in transcription does not block emotion analysis.
Progressive Initialization
Before the caller picks up, the voice agent runs a prewarm phase during ring time: fetching workspace configuration, skill definitions, and integration credentials. When the call connects, only three components are on the critical path before the greeting plays: configuration resolution, the reasoning engine, and text-to-speech. Speech-to-text and conversation monitoring initialize in parallel with the greeting. The caller hears the agent within two seconds of pickup.
This progressive approach applies to both inbound and outbound calls. For outbound calls, prewarm runs during the dialing and ringing phase (typically 5-15 seconds before the patient answers). When the patient picks up, the engine and greeting are already initialized, so the patient hears the greeting right away instead of several seconds of silence.
STT is ready before the greeting finishes, so by the time the caller speaks their first words, transcription is active. Monitoring embeddings load in the background without affecting latency.
Greeting
The agent delivers the greeting to the caller as soon as the connection is established, without waiting for the speech recognition pipeline to finish initializing. Speech recognition connects concurrently in the background and is ready before the caller begins speaking. This overlap eliminates unnecessary startup delay and keeps first-message latency under 2 seconds.
For conference-mode calls where the agent leg is created during ring time, the platform waits for the caller to actually join the conference before releasing the greeting. This prevents the greeting from playing into an empty conference.
Speech Handling
While the agent plays its greeting, any caller speech is discarded. The caller is "not heard" until the greeting finishes. This prevents the agent from interpreting ambient noise, simultaneous "hello" responses, or partial utterances as meaningful input before the conversation has properly started. Once the greeting completes, the speech-to-text pipeline begins processing caller audio normally.
Speech-to-Text
The speech-to-text stage converts caller audio into text transcripts in real time. For multilingual deployments, the platform detects the caller's spoken language automatically and consolidates on a dominant language once sufficient audio has been processed. This language detection feeds downstream components - when the caller's language is identified, the text-to-speech output language is switched to match, ensuring the agent responds in the same language the caller is speaking.
Amigo uses a streaming speech recognition engine for real-time transcription. Audio is transcribed with sub-300ms latency, meaning the system has text available almost as fast as the caller speaks.
The STT engine is not statically configured at call start. It adapts mid-conversation as context changes - reconfiguring recognition vocabulary, end-of-turn sensitivity, and language hints without reconnecting or interrupting the audio stream.
Language Selection
The platform supports English-optimized and multilingual STT models, selected per service:
English - English-optimized model (lowest error rate). Suited to monolingual English-speaking populations
Multilingual - Multi-language model with code-switching. Suited to populations that switch between languages mid-conversation
Auto - Multi-language model with auto-detection. Suited to callers whose language is unknown; narrows automatically once detected
In auto mode, the platform tracks which language the caller is speaking on each turn. Once a dominant language reaches high confidence, the STT hints narrow automatically to improve recognition accuracy for that language. This happens mid-call without interruption.
When a patient's preferred language is known from their world model record, the platform uses it automatically - selecting the optimal STT model for that language before the caller needs to speak enough for auto-detection. This is particularly useful for multilingual populations where patient demographics already capture language preference. The priority order is: patient's recorded language preference, then workspace-level voice setting, then default. For clinical copilot sessions, the same logic applies using the patient context loaded during pre-encounter setup.
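The priority order above amounts to a first-non-empty lookup. A minimal sketch, assuming language values are plain strings and that the final default is English (the document does not say what the default is):

```python
from typing import Optional

DEFAULT_LANGUAGE = "en"  # assumption: the document does not name the default

def resolve_stt_language(
    patient_preference: Optional[str],
    workspace_setting: Optional[str],
    default: str = DEFAULT_LANGUAGE,
) -> str:
    """Return the first configured language in priority order."""
    for candidate in (patient_preference, workspace_setting):
        if candidate:
            return candidate
    return default
```

For example, `resolve_stt_language("es", "en")` picks the patient's recorded preference, while `resolve_stt_language(None, None)` falls through to the default.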
Keyterm Boosting
Medical terminology, provider names, medication names, and organization-specific vocabulary are difficult for general-purpose speech recognition. Three layers of keyterm boosting improve recognition accuracy:
Service-level - Managed by workspace administrators; applied to all calls for a given service
Workspace-level - Set through API configuration; per-workspace vocabulary (clinic names, local terminology)
System defaults - Maintained by Amigo engineering; baseline medical and scheduling vocabulary
All three layers are merged and deduplicated at call start. Then they update dynamically:
Patient context injection - When the patient's identity resolves and their record loads, the platform extracts medical vocabulary from their medications, allergies, and active conditions. These terms are injected into the active STT session mid-call. A patient on metformin and lisinopril gets those drug names boosted as soon as their record is resolved.
Clinical copilot parity - The clinical copilot uses the same keyterm resolution. After patient context loads during an encounter, medical vocabulary is injected into the STT stream.
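The merge-then-inject flow can be sketched as an order-preserving merge. The function name is hypothetical, and case-insensitive deduplication is an assumption; the document only says the layers are "merged and deduplicated".

```python
def merge_keyterms(*layers):
    """Merge keyterm lists in order, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for layer in layers:
        for term in layer:
            cleaned = term.strip()
            key = cleaned.casefold()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# At call start: service, workspace, and system layers.
call_start = merge_keyterms(
    ["Dr. Patel"],
    ["Lakeside Clinic"],
    ["appointment", "dr. patel"],
)

# Mid-call: terms extracted from the resolved patient record are merged in.
with_patient = merge_keyterms(call_start, ["metformin", "lisinopril", "Appointment"])
```

Re-running the same merge with the patient terms as an extra layer means the mid-call update cannot reintroduce duplicates.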
End-of-Turn Detection
The system must determine when the caller has finished speaking so the agent can respond. This uses configurable confidence thresholds that balance two concerns:
Responding too early cuts the caller off mid-sentence
Responding too late creates awkward silence
Base thresholds are configurable per workspace. On top of that, context graph states can override end-of-turn sensitivity through turn policy settings:
Data collection states (collecting a date of birth, spelling a name) use higher thresholds and longer silence timeouts, because the caller is thinking and pausing between pieces of information
Action states (confirming an appointment, answering a yes/no question) use default thresholds for snappy turn-taking
These overrides take effect on state transitions - when the agent moves to a data collection state, the STT engine reconfigures mid-call to be more patient with pauses.
Speculative Processing
The voice pipeline does not wait for full end-of-turn confirmation before starting work. When the STT engine signals moderate confidence that the caller may have finished speaking, the system begins the navigation step speculatively - processing the transcript through the context graph engine in the background. If the caller continues speaking, the speculative result is discarded. If the end-of-turn is confirmed and the transcript matches, the pre-computed navigation result is used immediately, saving the cost of a redundant LLM call.
The caller hears the agent respond faster without any change in response quality. When speculation fails (the caller was mid-sentence), the only cost is a discarded background computation - the caller experience is unaffected.
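The control flow of speculation can be sketched as a one-slot cache keyed on the transcript. This is a simplified, synchronous illustration: the class and method names are hypothetical, and in the real pipeline the speculative navigation would run in the background rather than inline.

```python
from typing import Callable, Optional, Tuple

class SpeculativeNavigator:
    """Caches one speculative navigation result, keyed by its transcript."""

    def __init__(self, navigate: Callable[[str], str]):
        self._navigate = navigate
        self._cached: Optional[Tuple[str, str]] = None
        self.calls = 0  # counts navigation calls, to show the saving

    def _run(self, transcript: str) -> str:
        self.calls += 1
        return self._navigate(transcript)

    def on_possible_end_of_turn(self, transcript: str) -> None:
        # Moderate confidence: compute ahead of time and keep the result.
        self._cached = (transcript, self._run(transcript))

    def on_speech_resumed(self) -> None:
        # The caller kept talking: discard the speculative result.
        self._cached = None

    def on_confirmed_end_of_turn(self, transcript: str) -> str:
        cached, self._cached = self._cached, None
        if cached is not None and cached[0] == transcript:
            return cached[1]          # reuse, no second navigation call
        return self._run(transcript)  # transcript changed: recompute
```

When the confirmed transcript matches the speculative one, `calls` stays at 1 for that turn; a mismatch costs one extra call, which is the "discarded background computation" described above.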
Text-to-Speech
Per-Language Provider Routing
The platform supports routing text-to-speech to different providers based on the caller's detected language. This is configured through a language-provider map that associates language codes with specific TTS providers and voice configurations.
When a caller's language is detected, the platform resolves the TTS provider through a priority matrix:
Exact language match - e.g., ar-SA matches an Arabic (Saudi Arabia) entry
Base language match - e.g., ar-SA falls back to an ar entry
Multilingual fallback - a catch-all entry for any language not explicitly mapped
At each level, service configuration takes priority over agent configuration, which takes priority over workspace configuration. The first match wins.
If no language-specific entry matches, the platform uses the standard TTS provider selection (service > agent > workspace > default). Per-language configuration is isolated - when a language-specific provider is selected, only that entry's voice settings are used, preventing configuration for one provider from affecting another.
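One reading of this resolution order is a nested loop: language specificity on the outside, configuration scope on the inside. The sketch below assumes each scope's language-provider map is a plain dictionary keyed by language code, and uses a hypothetical "multilingual" key for the catch-all entry.

```python
from typing import Optional

MULTILINGUAL_KEY = "multilingual"  # hypothetical name for the catch-all entry

def resolve_tts_entry(language: str, service: dict, agent: dict, workspace: dict) -> Optional[dict]:
    """Return the first matching language entry, or None if nothing matches.

    Levels are tried in order (exact, base, multilingual); within each level
    the service map beats the agent map, which beats the workspace map.
    """
    base = language.split("-")[0]
    for key in (language, base, MULTILINGUAL_KEY):
        for scope in (service, agent, workspace):
            if key in scope:
                # Only this entry's settings are returned, so configuration
                # for one provider cannot leak into another.
                return scope[key]
    return None  # caller falls back to standard provider selection

workspace_map = {"ar": {"provider": "c"}, "multilingual": {"provider": "b"}}
service_map = {"ar-SA": {"provider": "b", "voice": "v1"}}
```

Under this reading, an exact match at workspace scope still wins over a base-language match at service scope, because the level is checked before the scope.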
The platform supports multiple TTS providers, selectable at the workspace level. Each provider offers different trade-offs across latency, voice quality, language support, and expressive capabilities. Workspace administrators choose the provider and configure provider-specific parameters (voice, model, speed, quality tuning) through the Developer Console voice settings page or the API. The Developer Console includes a voice library where operators can browse available voices across providers, preview audio samples, and select a voice for the workspace without needing to look up voice identifiers manually.
Provider-specific settings are isolated - changing providers does not affect call recordings, session management, or downstream analytics. If a configured provider is unavailable, the system falls back to the default provider.
The available providers differ as follows:
Default provider - Low-latency WebSocket streaming with emotion-aware prosody adjustments and pronunciation dictionary support.
Provider B - Persistent multi-stream WebSocket with per-turn context isolation, word-level timing alignment, and regional endpoint support for data residency requirements.
Provider C - Ultra-fast REST-based synthesis with sentence-level pipelining (audio starts before the full response is generated), vocal direction tags for expressive delivery, and automatic Arabic language detection.
All providers integrate with the same text generation pipeline: a streaming language model produces text fragments that are forwarded to the selected TTS engine in real time. Each provider has its own circuit breaker for fault isolation - a degradation in one provider does not affect the others.
Provider selection is transparent to callers and does not change the call experience, recordings, or API behavior.
The TTS engine converts the agent's generated text into spoken audio with dynamic per-turn control over emotion, speed, and volume. Each utterance - fillers, responses, empathy pauses - carries its own voice parameters, so a warm filler at reduced speed can precede a normal-pace informational response without shared state.
Emotion Priority Chain
The agent's vocal tone is not a single static setting. A six-level priority chain selects the most contextually appropriate emotion on every turn. Each level fires only if every higher-priority level produced no signal:
1. Vocal burst - Caller laughed, sighed, gasped, or cried in the last 5 seconds. Example: caller laughs → agent responds with warm enthusiasm immediately
2. Prosody - Acoustic emotion model detects a strong signal from the caller's voice. Example: anxiety detected → sympathetic tone
3. Proactive topic - The current context graph action matches a sensitive topic. Example: agent about to discuss test results → preemptive sympathetic tone
4. Tone momentum - Previous turn's tone is carried forward when the current signal is weak. Example: tone stays sympathetic across a brief neutral pause
5. Workspace baseline - The service's configured default tone. Example: friendly baseline for scheduling services
6. System default - Engineering fallback. Example: calm
This means the agent's voice adapts in real time to what is happening, not to a pre-configured setting. The workspace baseline sets the floor, but any strong emotional signal overrides it - always in the direction of more empathy, never less.
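The chain behaves like a first-signal-wins lookup. A minimal sketch, assuming each level reports either a tone name or None when it has no signal; the function and parameter names are hypothetical:

```python
from typing import Optional

def select_tone(
    vocal_burst: Optional[str] = None,
    prosody: Optional[str] = None,
    proactive_topic: Optional[str] = None,
    tone_momentum: Optional[str] = None,
    workspace_baseline: Optional[str] = None,
    system_default: str = "calm",
) -> str:
    """Return the tone from the highest-priority level that produced a signal."""
    chain = (vocal_burst, prosody, proactive_topic, tone_momentum, workspace_baseline)
    for tone in chain:
        if tone is not None:
            return tone
    return system_default
```

For instance, `select_tone(prosody="sympathetic", workspace_baseline="friendly")` yields "sympathetic", because the prosody level outranks the workspace baseline.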
Split Model Architecture
Navigation and response generation use different LLM models optimized for their respective jobs:
Navigation uses a smaller, faster model. The nav output is roughly 5 tokens (a structured code line), so raw intelligence matters less than speed. The smaller model shaves latency off every turn without sacrificing decision quality on the constrained output format.
Response generation uses the full-size model. Response text is what the caller actually hears, so quality matters more than throughput. For typical voice responses (under 20 tokens), the speed difference between the two models adds less than 200ms - a good trade for noticeably better phrasing.
Situation-Response Adaptation
The voice pipeline adapts across four independent dimensions simultaneously. Each dimension operates on different output channels, so the agent can change what it says, how it says it, and whether it fills silence - all independently and in real time.
Emotion → Voice Tone
The agent responds with empathy rather than mirroring the caller's emotion. An angry caller hears a calm voice (de-escalation), not an angry one. An anxious caller hears a sympathetic voice (reassurance). A happy caller hears enthusiasm (matching energy).
Emotion → Filler Behavior
Filler speech adapts to the caller's emotional state. Anxious callers hear reassuring fillers ("Of course," "I'm here to help"). Frustrated callers with high arousal hear no fillers at all - the system suppresses them because frustrated callers want answers, not acknowledgments. Happy callers hear warm, matching fillers.
Emotion → Response Content
Emotional context is injected into every prompt the response model receives. The injection includes the caller's dominant emotion, trend direction, and adaptation guidance. A caller with deteriorating mood gets responses prioritizing resolution speed. A confused caller gets simplified explanations broken into small pieces. The agent adapts what it says based on how the caller is feeling, not just how it sounds.
Behavioral Signals → Response Content
Three behavioral signals are tracked in real time and injected into prompts when thresholds are crossed:
Interruption count - Caller has interrupted the agent multiple times. Guidance: shorten responses, because the agent is talking too much
Short response streak - Caller is giving very brief answers consecutively. Indicates the caller is disengaging or withdrawing
Silence gaps - Extended silence from the caller. Indicates confusion, hesitation, or distress
These signals augment the emotion detection system rather than replace it. A caller who is interrupting frequently may not sound frustrated in their voice, but the behavioral pattern tells the engine to shorten its responses.
Response Micro-Behaviors
The response generation model follows a set of communication guidelines that produce natural conversational behavior regardless of emotional state:
Speech rhythm mirroring - Short bursts from the caller produce concise responses; conversational callers get warmer, flowing replies
Emotional name usage - The caller's name is used at moments of emotional significance, not mechanically
Pause injection - When delivering difficult information, the agent pauses naturally before the key detail
Pace inversion - When the caller is rushing, the agent slows down with longer sentences and gentle transitions
Completion inference - When a caller trails off mid-sentence, the agent acknowledges what they were trying to say
The agent never mentions that it can detect the caller's emotions. Emotional adaptation is experienced as natural attentiveness, not surveillance.
Voice Timeline
The voice pipeline applies the same cut/navigate/engage pattern that drives conversation-level reasoning - but within each turn, managing what the caller hears and when.
A single actor processes signals sequentially. Fillers, responses, empathy pauses, and tool progress narration are not separate systems competing for the audio output. They are sequential states in one timeline, managed by one actor, producing one stream of utterances. "Let me check on that" followed by "Her appointment is Thursday" is one trajectory in two parts - not two unrelated items from different subsystems.
Three Operations
Cut - A signal arrives (the caller stopped speaking, a tool started, empathy shifted). The actor asks: did something change? Between two cuts, dozens of raw events may arrive - emotion scores, behavioral signals, audio segments. The cut compresses them into a handful of causally relevant fields and discards the rest. This compression is what makes the next step tractable.
Navigate - Given the compressed state and the trajectory of previous states, select the next voice state. Navigation is a pure decision with no side effects - it can be called speculatively without producing audio.
Engage - Enqueue an utterance with its own emotion and speed baked in, then set a deadline. When the deadline fires, it becomes the next signal - the system re-enters cut/navigate/engage. The actor is self-driving.
Signal-to-State Mapping
Each signal produces a specific voice state:
Caller finished speaking → Breath - Brief pause before the agent responds (configurable, default 200ms)
Navigation complete → Transition - Filler window opens; if the response is not ready by the deadline, a filler plays
Tool started → Progress - Tool wait narration on a repeating interval ("Let me check on that...")
Tool finished → Response - Agent delivers the tool result
All audio finished → Listen - Silence deadline starts; check-ins escalate if the caller stays quiet
Empathy tier shifted → Hold - Intentional silence; the agent pauses to give the caller space
Caller started speaking → Listen - Pending fillers drain; the caller has the floor
Deadline expired → Next state - Self-signal; the actor re-enters cut/navigate/engage
Deadlines are what make this self-driving. A transition state sets a deadline: "if no response arrives in 800ms, play a filler." If the response arrives first, the deadline is cancelled. If the deadline fires, it becomes a signal. No polling loops, no scattered timers.
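The race between a response and a transition deadline can be sketched with asyncio. This is an illustrative sketch, not the platform's actor: the function names are hypothetical, and the simulation cancels the response task afterwards only to keep the example tidy.

```python
import asyncio

async def await_response_or_filler(response, deadline_s: float = 0.8):
    """Return ("response", text) if the response beats the deadline,
    otherwise ("filler", None) so the caller can play a filler."""
    try:
        # shield() keeps the response task running if the deadline fires first.
        text = await asyncio.wait_for(asyncio.shield(response), timeout=deadline_s)
        return ("response", text)
    except asyncio.TimeoutError:
        return ("filler", None)

async def simulate(response_delay_s: float, deadline_s: float):
    async def generate() -> str:
        await asyncio.sleep(response_delay_s)
        return "Her appointment is Thursday"

    task = asyncio.ensure_future(generate())
    outcome = await await_response_or_filler(task, deadline_s)
    task.cancel()  # cleanup for the demo; a no-op if the task already finished
    return outcome

fast = asyncio.run(simulate(response_delay_s=0.01, deadline_s=0.5))
slow = asyncio.run(simulate(response_delay_s=0.5, deadline_s=0.01))
```

Here `fast` resolves to the response and `slow` resolves to the filler branch. The deadline is a single awaited timeout, so there is no polling loop to maintain.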
Per-Utterance Voice Parameters
Each utterance carries its own emotion and speed, set at engage time. The audio output reads parameters directly from the utterance - not from shared state that could change between enqueue and playback. A filler set to "sympathetic" at 0.85x speed cannot be overwritten by a response that arrives a moment later. Races are eliminated by construction, not by adding synchronization.
The utterance queue is the boundary between the actor and the audio output. The actor writes utterances in. The audio output reads them out. No callbacks, no shared mutable state, no coordination needed beyond the queue itself.
Voice Timing Configuration
The voice timeline exposes two categories of configuration per service:
When - timing in milliseconds:
Post end-of-turn pause (default 200ms) - Breath duration after the caller finishes speaking
Transition deadline (default 800ms) - Maximum silence before a filler plays
Progress interval (default 3000ms) - How often to narrate during tool waits
Empathy hold (default 500ms) - Intentional silence for empathetic moments
Filler cooldown (default 2000ms) - Minimum gap between consecutive fillers
What - vocabulary and style:
Filler style - Phrase, backchannel, or silent (see below)
Filler vocabulary - Custom backchannel words ("Mm," "Yeah," "Mhm")
Progress vocabulary - Custom tool-wait phrases ("One moment...," "Let me check...")
Everything else - signal routing, deadline management, state transitions, trajectory tracking - is derived from the signal-to-state mapping and these two knobs.
Filler Styles
Three filler styles are available, configurable per service:
Phrase - Contextual phrases like "Let me check that for you" or "One moment". Suited to general-purpose services where the agent should sound active
Backchannel - Short acknowledgments like "Mm," "Yeah," "Mhm". Suited to services where brief, natural-sounding turn-taking is preferred
Silent - No filler at all; the agent pauses until the response is ready. Suited to services where silence between turns is acceptable or preferred
The filler style is enforced end-to-end: when a service is configured as silent, the pipeline suppresses filler generation, filler guidelines in the navigation prompt, and filler text in the response - not just the final audio output. Receipt and working fillers ("Got it," "Let me check") inherit the nav-selected emotion, so they sound consistent with the rest of the turn. Backchannel vocabulary is customizable per service.
When navigation is skipped - typically in single-action context graphs where the agent always stays in the same state - the orchestrator starts a short timer (configurable per service). If the response has not produced audio by the time the timer fires, a backchannel sound plays to hold the conversational rhythm. If the response arrives first, the timer is cancelled. Services using the "silent" filler style suppress this timer entirely.
Empathy-Gated Filler Behavior
Filler behavior is controlled by the caller's empathy tier. At higher tiers, silence replaces fillers because silence is the empathy:
T0-T1 - Normal filler emission. At T1, the filler type is set to "empathy" (warmer, acknowledging) rather than "receipt" or "working."
T2 Full Empathy - The system inserts a 0.5-second pause before speaking any filler. This anti-filler silence gives the caller space to continue.
T3 Hold Space - Fillers are suppressed entirely. The agent pauses for one second, then delivers a pure empathy response.
When the caller's empathy tier changes mid-turn, the orchestrator reacts immediately - shifting from normal filler behavior to empathy-appropriate silence or vice versa without waiting for the next turn boundary.
When empathy fillers do play, the TTS engine switches to a presence mode: 0.85x speed with neutral emotion. This prevents the uncanny effect of strong emotion applied to very short phrases (two to four words), where the TTS engine stretches vowels in ways that sound artificial.
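The tier rules can be summarized as a small lookup. This is a sketch under stated assumptions: the type and function names are hypothetical, and the filler type at T2 is assumed to be "empathy" (the document specifies the 0.5-second pause at T2 but not the filler type).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FillerPlan:
    pause_ms: int               # silence inserted before anything is spoken
    filler_type: Optional[str]  # None means fillers are suppressed

def plan_filler(tier: int, default_type: str = "receipt") -> FillerPlan:
    """Map an empathy tier (0-3) to filler behavior."""
    if tier <= 0:
        return FillerPlan(pause_ms=0, filler_type=default_type)
    if tier == 1:
        return FillerPlan(pause_ms=0, filler_type="empathy")
    if tier == 2:
        # Assumption: T2 keeps the empathy filler type after the pause.
        return FillerPlan(pause_ms=500, filler_type="empathy")
    # T3 Hold Space: one second of silence, then a pure empathy response.
    return FillerPlan(pause_ms=1000, filler_type=None)
```

Returning an immutable plan per turn fits the per-utterance parameter model described earlier: the pause and filler type travel with the decision rather than living in shared state.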
Principle-Based Filler Generation
Filler phrases are not drawn from a hardcoded list. The system generates them using an LLM with the current emotional context and conversation intent as inputs. This means fillers adapt to the situation: a caller who sounds anxious gets "I'm looking into that for you right now" rather than a generic "One moment." The generation is guided by emotional guidelines and the current context graph action, so fillers stay contextually appropriate.
Filler emissions are rate-limited to prevent cascading acknowledgements when rapid turns or false barge-ins cause multiple navigation cycles in quick succession. The orchestrator also caps consecutive filler emissions per turn, so extended processing time produces at most two filler phrases before the agent falls silent.
Tool-Wait Progress Hints
Fillers emitted while a tool is running can be shaped per state and per tool, not just per service. A progress hint describes the shape of the wait rather than supplying a phrase list: what kind of work the tool is doing (record lookup, write, external call, computation, multi-step workflow), roughly how long it is expected to take, and how the agent should cover the wait (auto, verbal, backchannel, or silent).
The orchestrator turns the hint into utterances at runtime using tool semantics, the caller's current emotion, and conversation context. Retries produce attempt-aware language: an initial acknowledgement, then a brief apology, then a "still working on it" update, each scaled to how long the tool has actually been running. A tool-level hint field-merges with the state's channel-level hint, so a state can declare the default wait shape for its channel and individual tools only override the fields that actually differ.
For tools with expected latency of four seconds or more, a custom phrase can override the generated progress text. This gives agent engineers precise control over what callers hear during long waits - for example, a tool that queries multiple external systems. The custom phrase is bounded to 30 words and requires a progress class as fallback.
When the orchestrator is handling tool progress, it takes over from the default filler pipeline entirely - there is no duplicate filler injection from multiple sources during tool execution.
Result Persistence Modes
Tool call specs support a result_persistence setting that controls how tool results accumulate in the agent's prompt context:
accumulate (default) - Every tool result is retained in the prompt history. Suited to tools called once or a few times per conversation
override - Only the latest result per tool name is retained; previous results for the same tool are replaced. Suited to polling tools called repeatedly (availability checks, status lookups) where only the most recent result matters
Override mode prevents context bloat from tools that the agent calls multiple times during a conversation. A scheduling agent that checks availability five times during a complex multi-provider booking only carries the most recent availability snapshot in its context, not all five results stacked up. This keeps the prompt focused and reduces token usage without losing the information the agent actually needs.
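The difference between the two modes can be illustrated with a small helper that maintains the prompt's tool-result history. The function name and the (tool name, result) tuple representation are hypothetical; only the mode names come from the setting described above.

```python
def record_tool_result(history, tool_name, result, mode="accumulate"):
    """Return a new history list with the result recorded under the given mode."""
    if mode == "override":
        # Drop earlier results for the same tool; other tools are untouched.
        history = [entry for entry in history if entry[0] != tool_name]
    elif mode != "accumulate":
        raise ValueError(f"unknown result_persistence mode: {mode!r}")
    return history + [(tool_name, result)]

history = []
for slots in (["9:00"], ["9:00", "10:30"], ["10:30"]):
    history = record_tool_result(history, "check_availability", slots, mode="override")
history = record_tool_result(history, "lookup_patient", {"id": 1})
```

After three availability checks in override mode, `history` holds one availability entry (the latest) plus the patient lookup, rather than four entries.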
Context Window Management
The engine tracks cumulative token usage across a conversation and applies automatic policies when utilization crosses configurable thresholds:
Warning (default 60%) - The engine compresses conversation history by capping the number of retained turns, reducing context size while preserving the most recent and most relevant exchanges
Exhaustion (default 80%) - The engine flags the conversation for operator escalation, ensuring a human can take over before the context window is fully consumed
Token tracking is cumulative - it counts all input and output tokens across the conversation's lifetime, not just the current turn. This catches gradually growing conversations (long scheduling sessions, multi-topic calls) before they hit hard context limits. The warning threshold triggers silent degradation that the caller does not notice. The exhaustion threshold surfaces a handoff signal through the standard operator escalation path.
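The threshold logic reduces to comparing cumulative utilization against the two cutoffs. A minimal sketch, assuming utilization is cumulative tokens divided by the context limit and that a threshold counts as crossed at or above its value; the function name and return labels are hypothetical:

```python
def context_policy(
    cumulative_tokens: int,
    context_limit: int,
    warning: float = 0.60,
    exhaustion: float = 0.80,
) -> str:
    """Classify cumulative token utilization against the two thresholds."""
    utilization = cumulative_tokens / context_limit
    if utilization >= exhaustion:
        return "escalate"   # hand off through operator escalation
    if utilization >= warning:
        return "compress"   # cap retained turns, invisible to the caller
    return "ok"
```

The exhaustion check comes first so that a conversation past 80% is escalated rather than merely compressed.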
Barge-In Detection
If the caller starts speaking while the agent is talking, the system needs to decide whether to stop the agent's audio. Barge-in uses semantic confirmation - it requires actual recognized words rather than just acoustic energy. This filters out coughs, breathing, background conversation, and echo from the agent's own audio that would otherwise cause false interruptions.
The decision is based on four conditions evaluated together:
Whether the caller's speech contains actual recognized words from the speech-to-text engine (not breathing, echo, or background noise). Voice activity detection alone is not sufficient - the system requires at least one recognized word before triggering a barge-in.
Whether the speech has lasted long enough with recognized words (minimum duration is configurable per service). The default threshold is low enough that short responses like "yes," "no," and "okay" (200-400ms of speech) can trigger barge-in.
Whether the cooldown period has elapsed since the last barge-in (configurable per service; prevents rapid false triggers).
Whether the agent is currently speaking.
When all conditions are met, the agent's audio stops and the system returns to listening mode. This prevents the agent from talking over a caller who is trying to ask a question or correct a misunderstanding.
There is also a fast path for end-of-turn interrupts. When the speech engine produces a complete transcript with an end-of-turn signal while the agent is speaking, the system interrupts the agent's audio immediately from the recognition listener rather than waiting for the transcript to pass through the processing queue. This eliminates queue latency on short-phrase interrupts where the caller finishes speaking quickly.
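The four-condition check described above can be sketched as a single predicate. The config class, field names, and default values are assumptions for illustration; the document states only that minimum duration and cooldown are configurable per service.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class BargeInConfig:
    min_speech_ms: int = 200   # assumed default, low enough for "yes" / "no"
    cooldown_ms: int = 1000    # assumed default

def should_barge_in(
    recognized_words: List[str],
    speech_ms: int,
    ms_since_last_barge_in: Optional[int],  # None if no barge-in has happened yet
    agent_speaking: bool,
    config: BargeInConfig = BargeInConfig(),
) -> bool:
    """All four conditions must hold before the agent's audio is stopped."""
    has_words = len(recognized_words) >= 1  # semantic confirmation, not just energy
    long_enough = speech_ms >= config.min_speech_ms
    cooled_down = (
        ms_since_last_barge_in is None
        or ms_since_last_barge_in >= config.cooldown_ms
    )
    return agent_speaking and has_words and long_enough and cooled_down
```

A cough that triggers voice activity but yields no recognized words fails the first condition, so it never interrupts playback regardless of its duration.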
Response Length Enforcement
The voice pipeline enforces maximum response length at the streaming level - the TTS engine stops generating audio when the configured sentence or word cap is reached. This is a mechanical limit, not a prompt instruction, so it cannot be exceeded regardless of what the LLM generates. Response caps are configurable per service, allowing scheduling services to keep responses brief while clinical services allow longer explanations.
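A streaming cap of this kind can be sketched as a generator that stops forwarding text once either limit is reached. This is a simplified illustration: it assumes the stream arrives as whole words and treats ".", "!", or "?" at the end of a word as a sentence boundary, which the real pipeline may handle differently.

```python
from typing import Iterable, Iterator

def cap_words(words: Iterable[str], max_sentences: int, max_words: int) -> Iterator[str]:
    """Yield words until the sentence cap or the word cap is reached."""
    sentence_count = 0
    word_count = 0
    for word in words:
        if word_count >= max_words or sentence_count >= max_sentences:
            return  # stop forwarding text to synthesis
        yield word
        word_count += 1
        # Ignore trailing quotes or parentheses when looking for the terminator.
        if word.rstrip("\"')").endswith((".", "!", "?")):
            sentence_count += 1

stream = "Your appointment is Thursday. Please arrive early. Bring your card.".split()
two_sentences = " ".join(cap_words(stream, max_sentences=2, max_words=50))
three_words = " ".join(cap_words(stream, max_sentences=5, max_words=3))
```

Because the cap sits on the stream itself, anything the model generates past the limit is simply never forwarded, which is what makes it a mechanical limit rather than a prompt instruction.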
Call Completion
When the agent reaches a terminal state in the context graph and decides to end the call, it signals its intent to hang up but does not disconnect immediately. The system waits for signal convergence: the agent's closing utterance must finish playing and any in-flight tool results must resolve before the call disconnects. If the caller speaks during this window (barge-in) or a transfer is initiated, the hangup intent is retracted and the conversation continues.
This ensures the caller always hears the agent's full closing message, even when the terminal state involves a tool call with a follow-up utterance.
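The convergence rule can be sketched as a small state holder. This is an assumption-laden illustration: `HangupTracker` and its field names are hypothetical, and the real system may track these signals differently.

```python
from dataclasses import dataclass


@dataclass
class HangupTracker:
    hangup_requested: bool = False
    closing_utterance_playing: bool = False
    tools_in_flight: int = 0

    def request_hangup(self) -> None:
        """The agent reached a terminal state and signalled intent to hang up."""
        self.hangup_requested = True

    def retract(self) -> None:
        """Caller barge-in or a transfer cancels the pending hangup."""
        self.hangup_requested = False

    def ready_to_disconnect(self) -> bool:
        """Disconnect only once every signal has converged."""
        return (
            self.hangup_requested
            and not self.closing_utterance_playing
            and self.tools_in_flight == 0
        )
```

A session loop would call `ready_to_disconnect()` whenever playback finishes or a tool result arrives, and `retract()` when a barge-in or transfer is detected.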
Per-Service Voice Configuration
Voice behavior is configurable at the service level, allowing different services within the same workspace to have different voice characteristics. Configuration follows a three-level hierarchy described in the Voice Control Plane section. Per-service settings cover:
Filler behavior - Style (phrase, backchannel, or silent), custom vocabulary, backchannel timing
Barge-in sensitivity - Minimum speech duration and cooldown period
Response limits - Maximum sentences and words per response
End-of-turn detection - Eagerness threshold and timeout
TTS settings - Model selection and buffer delay
Voice timing - The "when" and "what" knobs described in Voice Timing Configuration above
Call forwarding - Whether the agent can transfer calls to external numbers (opt-in, disabled by default). Supports both pre-configured forwarding numbers (from EHR location data or workspace settings) and dynamic forwarding to any E.164 number provided at call time
These settings are managed through the Platform API and the Agent Forge CLI.
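As a rough illustration of how layered settings could resolve, the sketch below merges three levels so that the most specific one wins per key. The level names (`platform_defaults`, `workspace`, `service`) and all setting keys are placeholders, not the Platform API's actual schema; the real hierarchy is the one described in the Voice Control Plane section.

```python
from typing import Any, Dict, Mapping

Settings = Mapping[str, Mapping[str, Any]]


def resolve_voice_config(*levels: Settings) -> Dict[str, Dict[str, Any]]:
    """Merge levels left to right; later (more specific) levels override."""
    resolved: Dict[str, Dict[str, Any]] = {}
    for level in levels:
        for section, values in level.items():
            resolved.setdefault(section, {}).update(values)
    return resolved


# Hypothetical values for illustration only.
platform_defaults = {
    "barge_in": {"min_speech_ms": 300, "cooldown_ms": 1000},
    "call_forwarding": {"enabled": False},  # opt-in, disabled by default
}
workspace = {"response_limits": {"max_sentences": 3}}
service = {"barge_in": {"min_speech_ms": 200}}

resolved = resolve_voice_config(platform_defaults, workspace, service)
```

Here the service overrides only `min_speech_ms`, so `cooldown_ms` and the call-forwarding default are inherited from the broader levels.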
Real-Time Audio Correction
The voice agent runs a parallel audio verification layer alongside the live STT stream. When the navigator detects that a turn contains structured data - names, dates of birth, phone numbers, insurance IDs, or medication names - it flags the turn for audio verification. A separate correction model cross-checks the STT transcript against the raw audio buffer for that segment, catching misrecognized digits, transposed characters, and phonetically similar names before they enter the agent's reasoning pipeline.
Domain-Aware Correction Hints
Each workspace can configure voice settings with domain-specific correction hints that tell the verification model what kinds of structured data to listen for. A dental practice might hint on insurance group numbers and procedure codes; a primary care clinic might hint on medication names and dosage quantities. These hints narrow the correction model's focus so it spends its budget on the data types that matter most for that service, rather than re-verifying every word in the transcript.
Confidence-Gated Correction
When the correction model produces a candidate correction, it assigns a confidence score (1-9) that determines what happens next:
8-9 (Certain) - The corrected value replaces the original transcript silently. The caller is not asked to confirm.
5-7 (Likely) - The agent confirms the corrected value with the caller ("I heard your date of birth as March 12, 1985 - is that right?").
1-4 (Uncertain) - The agent asks the caller to repeat the information ("Could you say that date of birth again for me?").
This prevents the agent from acting on low-confidence corrections while avoiding unnecessary confirmation prompts when the correction model is highly confident. The confidence thresholds are tuned for healthcare data where accuracy matters more than conversational speed.
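The three confidence bands map directly onto a small dispatch function. The sketch below uses the documented score ranges; the names `CorrectionAction` and `action_for_confidence` are hypothetical.

```python
from enum import Enum


class CorrectionAction(Enum):
    APPLY_SILENTLY = "apply_silently"            # 8-9: certain
    CONFIRM_WITH_CALLER = "confirm_with_caller"  # 5-7: likely
    ASK_TO_REPEAT = "ask_to_repeat"              # 1-4: uncertain


def action_for_confidence(score: int) -> CorrectionAction:
    """Map a correction confidence score (1-9) to the agent's next step."""
    if not 1 <= score <= 9:
        raise ValueError(f"confidence score must be 1-9, got {score}")
    if score >= 8:
        return CorrectionAction.APPLY_SILENTLY
    if score >= 5:
        return CorrectionAction.CONFIRM_WITH_CALLER
    return CorrectionAction.ASK_TO_REPEAT
```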
This is distinct from post-call re-transcription. Real-time correction happens during the conversation, so the agent reasons from corrected text - not from raw STT output that might contain errors.
Post-Call Processing
The real-time STT stream prioritizes speed over accuracy, so it can miss or misrecognize words.
After a call ends, the system runs a higher-accuracy batch transcription of the full recording to catch those errors. It also adds speaker diarization, identifying which speaker said each word. The result is a verified, speaker-attributed transcript where every segment is tagged with the speaker who produced it. For voice calls, this distinguishes between the caller and the agent. For clinical copilot sessions, it distinguishes between the clinician, the patient, and any additional speakers present.
The system also feeds recognition accuracy data back into the STT configuration, identifying which keyterms were recognized correctly and which were missed. Over time, this feedback improves transcription accuracy for your specific vocabulary.
Post-Call Text Intelligence
After re-transcription, the verified transcript is analyzed for sentiment, topics, intents, and a structured summary. These results are stored alongside the call intelligence data and are available through the same API endpoints. This analysis runs on the text transcript (not the audio), so it captures conversational content that audio-only analysis misses - for example, whether the caller's stated intent matched the outcome.
Post-Call Quality Scoring
After a call ends, the system runs a structured quality analysis on the stereo call recording (caller audio on one channel, agent audio on the other). The analysis scores the call across five dimensions:
Task completion - Did the agent accomplish what the caller needed? Fully, partially, or not at all.
Information accuracy - Were speech recognition results correct? Did the agent act on accurate transcriptions?
Conversation flow - Was the conversation natural? Were there awkward pauses, unnecessary repetitions, or disjointed transitions?
Error recovery - When confusion occurred, did the agent recover gracefully or compound the problem?
Caller experience - Based on tone and interaction patterns, did the caller seem satisfied with the exchange?
Each dimension is scored on a 1-5 scale. The system also produces a summary, an outcome classification (succeeded, partially succeeded, failed, or abandoned), and specific STT correction suggestions that feed back into keyterm boosting.
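A quality result with five 1-5 scores and one of four outcomes could be represented as a validated record. This is only a sketch of the shape: `QualityReport` and the snake_case field names are hypothetical, not the stored schema.

```python
from dataclasses import dataclass
from typing import Dict

DIMENSIONS = (
    "task_completion",
    "information_accuracy",
    "conversation_flow",
    "error_recovery",
    "caller_experience",
)
OUTCOMES = ("succeeded", "partially_succeeded", "failed", "abandoned")


@dataclass(frozen=True)
class QualityReport:
    scores: Dict[str, int]
    outcome: str
    summary: str = ""

    def __post_init__(self) -> None:
        # Every dimension must be present, and nothing else.
        if set(self.scores) != set(DIMENSIONS):
            raise ValueError("scores must cover exactly the five dimensions")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be scored 1-5, got {value}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")
```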
Call Intelligence
In addition to post-call quality scoring, the system computes a structured intelligence summary from in-memory session state at the moment the call ends. This captures operational telemetry that the quality scoring (which runs asynchronously on recordings) cannot see.
Emotion - Dominant emotion, valence/arousal/dominance averages, peak negative emotion, emotional shifts, final trend, language sentiment score, toxicity categories, recent vocal bursts, speaker-normalized energy deltas
Risk - Composite risk score, risk level, contributing signals with individual weights
Latency - Engine response time (avg, p50, p95), audio time-to-first-byte, silence ratio
Conversation - Turn count, states visited, loop count, barge-in count, completion reason, final state
Tool - Total tool calls, success/failure counts, failure rate, per-tool breakdown
Safety - Safety rule matches during the call, escalation triggers
Operator - Whether escalation occurred, operator connect time, resolution
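For readers unfamiliar with the latency figures, the sketch below shows one common way to derive avg, p50, and p95 from per-turn response times. It uses the nearest-rank percentile method as an assumption; the system's actual percentile method is not stated, and `percentile` and `latency_summary` are hypothetical helpers.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(response_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize engine response times collected during one call."""
    return {
        "avg": mean(response_ms),
        "p50": percentile(response_ms, 50),
        "p95": percentile(response_ms, 95),
    }
```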
Composite Quality Score
A rule-based quality score (0-100) is computed from the intelligence summaries. The score starts at 100 and applies penalties for negative signals:
High response latency - p95 audio TTFB > 1s: -5 to -15
Excessive silence - Silence ratio > 20%: -10 to -20
Caller barge-ins - More than 2 interruptions: -5 to -15
Agent loops - Revisited states: -10 to -20
Operator escalation - Any: -10
Tool failures - Failure rate > 5%: -5 to -15
The quality score is designed for dashboard filtering and trend analysis - it identifies calls that need attention without requiring someone to listen to every recording.
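A penalty-based score of this kind can be sketched as follows. The thresholds and penalty ranges come from the rules above, but how a penalty is chosen within its range is not documented: this sketch assumes linear scaling up to a hypothetical saturation point (for example, 3 s TTFB or 50% silence), and every saturation value is an illustrative guess.

```python
def scaled_penalty(
    value: float,
    threshold: float,
    saturation: float,
    min_penalty: float,
    max_penalty: float,
) -> float:
    """No penalty at or below threshold; otherwise scale from min to max."""
    if value <= threshold:
        return 0.0
    fraction = min(1.0, (value - threshold) / (saturation - threshold))
    return min_penalty + fraction * (max_penalty - min_penalty)


def composite_quality_score(
    p95_ttfb_s: float,
    silence_ratio: float,
    barge_ins: int,
    loops: int,
    escalated: bool,
    tool_failure_rate: float,
) -> int:
    """Start at 100 and subtract a penalty for each negative signal."""
    penalty = 0.0
    penalty += scaled_penalty(p95_ttfb_s, 1.0, 3.0, 5, 15)           # latency
    penalty += scaled_penalty(silence_ratio, 0.20, 0.50, 10, 20)     # silence
    penalty += scaled_penalty(barge_ins, 2, 6, 5, 15)                # barge-ins
    penalty += scaled_penalty(loops, 0, 3, 10, 20)                   # loops
    penalty += 10 if escalated else 0                                # escalation
    penalty += scaled_penalty(tool_failure_rate, 0.05, 0.50, 5, 15)  # tools
    return max(0, round(100 - penalty))
```

Under these assumptions a clean call scores 100, and a call whose only negative signal is an operator escalation scores 90.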
Call Intelligence API
Two API endpoints expose call intelligence data:
Completed call intelligence - Full intelligence profile for a finished call, joining the persisted summaries with per-turn data reconstructed from conversation transcripts. Includes emotion trajectory, risk timeline, latency waterfall, tool performance breakdown, and conversation quality events (loops, barge-ins).
Active call intelligence - All currently active calls enriched with a live intelligence overlay: current emotion, risk score and trend, turn count, escalation status, and current conversation state. Updated after every caller speech turn.
The live intelligence overlay is written after each caller turn and refreshed alongside the active call heartbeat. If a call ends or the session is lost, the live data expires automatically.
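Automatic expiry of this kind is commonly implemented with a time-to-live that each write or heartbeat resets. The sketch below is an in-memory stand-in under that assumption; `LiveOverlayStore`, the TTL value, and the explicit `now_s` clock argument are all hypothetical and say nothing about the actual storage backend.

```python
from typing import Any, Dict, Optional, Tuple


class LiveOverlayStore:
    """In-memory stand-in: entries vanish ttl_s seconds after the last refresh."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def write(self, call_id: str, overlay: Dict[str, Any], now_s: float) -> None:
        """After each caller turn: store the overlay and reset its expiry."""
        self._entries[call_id] = (now_s + self._ttl_s, overlay)

    def heartbeat(self, call_id: str, now_s: float) -> None:
        """With the active-call heartbeat: extend expiry of a live entry."""
        overlay = self.read(call_id, now_s)
        if overlay is not None:
            self._entries[call_id] = (now_s + self._ttl_s, overlay)

    def read(self, call_id: str, now_s: float) -> Optional[Dict[str, Any]]:
        entry = self._entries.get(call_id)
        if entry is None:
            return None
        expires_at, overlay = entry
        if now_s >= expires_at:
            del self._entries[call_id]  # lazily evict the expired entry
            return None
        return overlay
```

If heartbeats stop because the call ended or the session was lost, the entry simply ages out and later reads return nothing.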
Per-turn reconstruction is computed on read from the stored conversation turns - it is not stored separately. This means the intelligence profile reflects the full turn data without duplicating storage.
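Compute-on-read can be illustrated with a function that derives per-turn views from stored turns each time a profile is requested. The turn fields used here (`state`, `emotion`, `barge_in`) and the function name are hypothetical, and a loop is assumed to mean re-entering a state that was visited earlier in the call.

```python
from typing import Any, Dict, List


def build_intelligence_profile(
    summary: Dict[str, Any], turns: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Join the persisted summary with data derived from stored turns."""
    seen_states = set()
    previous_state = None
    emotion_trajectory = []
    quality_events = []
    for index, turn in enumerate(turns):
        state = turn["state"]
        # Re-entering an earlier state counts as a loop; staying put does not.
        if state != previous_state and state in seen_states:
            quality_events.append({"turn": index, "type": "loop", "state": state})
        seen_states.add(state)
        previous_state = state
        if turn.get("barge_in"):
            quality_events.append({"turn": index, "type": "barge_in"})
        if "emotion" in turn:
            emotion_trajectory.append({"turn": index, "emotion": turn["emotion"]})
    # Nothing derived here is written back; it is rebuilt on every read.
    return {
        **summary,
        "emotion_trajectory": emotion_trajectory,
        "quality_events": quality_events,
    }
```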