Audio Pipeline

How speech recognition, emotion detection, filler speech, barge-in detection, and TTS work together in real time.

Media-Plane Architecture

The voice pipeline uses a split architecture where audio processing and conversation reasoning run as separate services. The media gateway handles all real-time audio concerns - receiving caller audio, streaming it to speech recognition, synthesizing agent responses to audio, and managing playback. The agent engine handles all reasoning concerns - interpreting transcripts, running context graphs, executing skills, and deciding what the agent should say.

The two services communicate over a bidirectional relay. The media gateway sends transcripts to the agent engine and receives commands back: enqueue an utterance for playback, interrupt the current playback (barge-in), reconfigure speech recognition mid-call, or end the session. This separation means audio latency is isolated from reasoning latency, and each layer can scale independently.

Each voice session runs concurrent pipelines:

  • Audio receive - Caller audio flows from the telephony transport to the speech recognition service in real time

  • Transcript relay - Completed transcripts are forwarded to the agent engine for processing

  • Command dispatch - Agent engine commands (speak, interrupt, stop, reconfigure) are dispatched to the appropriate audio subsystem

  • Playback - Text-to-speech synthesis runs over a persistent streaming connection, with each utterance tracked independently for interruption and lifecycle management

Sessions enforce a maximum call duration and handle graceful cleanup when any pipeline exits or encounters an error. The gateway also enforces per-pod capacity limits, rejecting new connections when at capacity to protect active sessions from overload.

The voice audio pipeline is split into two logical layers: a media layer that owns the telephony connection, speech-to-text, and text-to-speech, and an agent engine that owns reasoning, tool execution, and conversation state. The two layers communicate over an internal relay protocol, allowing them to scale and deploy independently. This separation means the audio processing components can be updated, scaled, or replaced without affecting the agent logic, and vice versa.

The relay protocol defines typed messages in each direction. The media layer sends session lifecycle events (start, end), speech transcripts, and playback status to the agent engine. The agent engine sends utterance requests, playback controls (interrupt, stop, drain), speech recognition configuration updates, and session termination commands back to the media layer. All messages carry a type discriminator for efficient dispatch.
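As a rough illustration of dispatch on a type discriminator, the sketch below decodes two relay messages. The class names, field names, and type strings are hypothetical; the actual relay schema is not described here.

```python
from dataclasses import dataclass

# Hypothetical message shapes, for illustration only.
@dataclass(frozen=True)
class Transcript:
    session_id: str
    text: str
    type: str = "transcript"

@dataclass(frozen=True)
class Interrupt:
    session_id: str
    type: str = "interrupt"

# The discriminator value selects the message class in one dictionary lookup.
REGISTRY = {"transcript": Transcript, "interrupt": Interrupt}

def decode(raw: dict):
    """Build a typed message from a decoded payload using its 'type' field."""
    cls = REGISTRY.get(raw.get("type"))
    if cls is None:
        raise ValueError(f"unknown message type: {raw.get('type')!r}")
    return cls(**raw)
```

A single lookup keeps dispatch cost constant as message types are added, which is the usual reason for carrying a discriminator on every message.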

This architecture supports graceful rolling deployments. During a deploy, the media layer stops accepting new sessions while allowing in-flight calls to complete naturally. Health probes distinguish between liveness (the process is running) and readiness (the service is accepting new calls), so the load balancer can drain traffic from a pod without terminating active calls.

The audio pipeline converts a caller's voice into text, processes it through the agent's reasoning, and converts the response back to speech. Two independent streams handle the input side: one for speech recognition, one for emotion analysis. They run in parallel and never block each other.

Signal Capture

Audio arrives from the telephony layer as a standard telephony audio stream. The system splits it into two parallel paths the moment it arrives:

  1. Speech-to-text - Converts audio to transcript text in real time

  2. Emotion detection - Analyzes vocal qualities for emotional signals (covered in Emotion Detection)

If either path fails, the other continues unaffected. A failure in emotion detection does not delay transcription. A failure in transcription does not block emotion analysis.
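The failure isolation described above can be sketched with two coroutines run side by side. This is a simplified model, assuming per-chunk processing; `transcribe` and `detect_emotion` are placeholder functions, and a production pipeline would more likely run two long-lived streaming tasks.

```python
import asyncio

async def transcribe(chunk: bytes) -> str:
    # Placeholder for the speech-to-text path.
    return f"transcript of {len(chunk)} bytes"

async def detect_emotion(chunk: bytes) -> str:
    # Placeholder that simulates an outage in the emotion path.
    raise RuntimeError("emotion service unavailable")

async def process_chunk(chunk: bytes) -> dict:
    # return_exceptions=True stops a failure in one path from cancelling the other.
    stt, emotion = await asyncio.gather(
        transcribe(chunk), detect_emotion(chunk), return_exceptions=True
    )
    return {
        "transcript": None if isinstance(stt, Exception) else stt,
        "emotion": None if isinstance(emotion, Exception) else emotion,
    }

result = asyncio.run(process_chunk(b"\x00" * 160))
```

Here the transcript is still produced even though the emotion path raised.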

Voice timeline: signals drive cut/navigate/engage phases to produce voice states

Progressive Initialization

Before the caller picks up, the voice agent runs a prewarm phase during ring time: fetching workspace configuration, skill definitions, and integration credentials. When the call connects, only three components are on the critical path before the greeting plays: configuration resolution, the reasoning engine, and text-to-speech. Speech-to-text and conversation monitoring initialize in parallel with the greeting. The caller hears the agent within two seconds of pickup.

This progressive approach applies to both inbound and outbound calls. For outbound calls, prewarm runs during the dialing and ringing phase (typically 5-15 seconds before the patient answers). When the patient picks up, the engine and greeting are already initialized - the patient hears an instant greeting instead of several seconds of silence.

STT is ready before the greeting finishes, so by the time the caller speaks their first words, transcription is active. Monitoring embeddings load in the background without affecting latency.

Greeting

The agent delivers the greeting to the caller as soon as the connection is established, without waiting for the speech recognition pipeline to finish initializing. Speech recognition connects concurrently in the background and is ready before the caller begins speaking. This overlap eliminates unnecessary startup delay and keeps first-message latency under 2 seconds.

For conference-mode calls where the agent leg is created during ring time, the platform waits for the caller to actually join the conference before releasing the greeting. This prevents the greeting from playing into an empty conference.

Speech Handling

While the agent plays its greeting, any caller speech is discarded. The caller is "not heard" until the greeting finishes. This prevents the agent from interpreting ambient noise, simultaneous "hello" responses, or partial utterances as meaningful input before the conversation has properly started. Once the greeting completes, the speech-to-text pipeline begins processing caller audio normally.

Speech-to-Text

The speech-to-text stage converts caller audio into text transcripts in real time. For multilingual deployments, the platform detects the caller's spoken language automatically and consolidates on a dominant language once sufficient audio has been processed. This language detection feeds downstream components - when the caller's language is identified, the text-to-speech output language is switched to match, ensuring the agent responds in the same language the caller is speaking.

Amigo uses a streaming speech recognition engine for real-time transcription. Audio is transcribed with sub-300ms latency, meaning the system has text available almost as fast as the caller speaks.

The STT engine is not statically configured at call start. It adapts mid-conversation as context changes - reconfiguring recognition vocabulary, end-of-turn sensitivity, and language hints without reconnecting or interrupting the audio stream.

Language Selection

The platform supports English-optimized and multilingual STT models, selected per-service:

| Setting | Model | Best For |
| --- | --- | --- |
| English | English-optimized (lowest error rate) | Monolingual English-speaking populations |
| Multilingual | Multi-language with code-switching | Populations that switch between languages mid-conversation |
| Auto | Multi-language with auto-detection | Unknown caller language; narrows automatically once detected |

In auto mode, the platform tracks which language the caller is speaking on each turn. Once a dominant language reaches high confidence, the STT hints narrow automatically to improve recognition accuracy for that language. This happens mid-call without interruption.

When a patient's preferred language is known from their world model record, the platform uses it automatically - selecting the optimal STT model for that language before the caller needs to speak enough for auto-detection. This is particularly useful for multilingual populations where patient demographics already capture language preference. The priority order is: patient's recorded language preference, then workspace-level voice setting, then default. For clinical copilot sessions, the same logic applies using the patient context loaded during pre-encounter setup.
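The priority order can be expressed as a first-match lookup. This is a minimal sketch: the function name is hypothetical, and the default value of "auto" is an assumption, since the text above does not say what the default is.

```python
from typing import Optional

DEFAULT_LANGUAGE = "auto"  # assumed default; not stated in the documentation

def resolve_stt_language(patient_preference: Optional[str],
                         workspace_setting: Optional[str]) -> str:
    """Return the first configured value in priority order."""
    for candidate in (patient_preference, workspace_setting):
        if candidate:
            return candidate
    return DEFAULT_LANGUAGE
```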

Keyterm Boosting

Medical terminology, provider names, medication names, and organization-specific vocabulary are difficult for general-purpose speech recognition. Three layers of keyterm boosting improve recognition accuracy:

| Level | Managed By | Scope |
| --- | --- | --- |
| Service-level | Workspace administrators | Applied to all calls for a given service |
| Workspace-level | API configuration | Per-workspace vocabulary (clinic names, local terminology) |
| System defaults | Amigo engineering | Baseline medical and scheduling vocabulary |

All three layers are merged and deduplicated at call start. Then they update dynamically:

  • Patient context injection - When the patient's identity resolves and their record loads, the platform extracts medical vocabulary from their medications, allergies, and active conditions. These terms are injected into the active STT session mid-call. A patient on metformin and lisinopril gets those drug names boosted as soon as their record is resolved.

  • Clinical copilot parity - The clinical copilot uses the same keyterm resolution. After patient context loads during an encounter, medical vocabulary is injected into the STT stream.
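One way to picture the merge-and-deduplicate step, followed by a mid-call injection, is the sketch below. The helper name and the sample vocabulary are illustrative, and case-insensitive matching is an assumption about how duplicates are detected.

```python
def merge_keyterms(*layers):
    """Merge keyterm layers in order, dropping case-insensitive duplicates."""
    seen, merged = set(), []
    for layer in layers:
        for term in layer:
            cleaned = term.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# At call start: service, workspace, and system layers (sample data).
call_start = merge_keyterms(
    ["Dr. Patel", "telehealth"],
    ["Lakeside Clinic", "telehealth"],
    ["copay", "referral"],
)

# Mid-call: terms extracted from the resolved patient record are added.
mid_call = merge_keyterms(call_start, ["metformin", "lisinopril", "Copay"])
```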

End-of-Turn Detection

The system must determine when the caller has finished speaking so the agent can respond. This uses configurable confidence thresholds that balance two concerns:

  • Responding too early cuts the caller off mid-sentence

  • Responding too late creates awkward silence

Base thresholds are configurable per workspace. On top of that, context graph states can override end-of-turn sensitivity through turn policy settings:

  • Data collection states (collecting a date of birth, spelling a name) use higher thresholds and longer silence timeouts, because the caller is thinking and pausing between pieces of information

  • Action states (confirming an appointment, answering a yes/no question) use default thresholds for snappy turn-taking

These overrides take effect on state transitions - when the agent moves to a data collection state, the STT engine reconfigures mid-call to be more patient with pauses.
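A state-level override on top of workspace base thresholds might look like the following sketch. The field names, state names, and numeric values are all illustrative assumptions rather than documented settings.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TurnPolicy:
    eot_threshold: float       # confidence required before the turn is treated as over
    silence_timeout_ms: int    # how long a pause is tolerated

# Illustrative workspace base values.
WORKSPACE_BASE = TurnPolicy(eot_threshold=0.7, silence_timeout_ms=1000)

# Only data collection states declare overrides; action states use the base policy.
STATE_OVERRIDES = {
    "collect_date_of_birth": {"eot_threshold": 0.85, "silence_timeout_ms": 2500},
}

def policy_for_state(state: str, base: TurnPolicy = WORKSPACE_BASE) -> TurnPolicy:
    """Apply a state's overrides, if any, on top of the workspace base policy."""
    return replace(base, **STATE_OVERRIDES.get(state, {}))
```

On a state transition, the resulting policy would be sent to the STT engine as a reconfiguration.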

Speculative Processing

The voice pipeline does not wait for full end-of-turn confirmation before starting work. When the STT engine signals moderate confidence that the caller may have finished speaking, the system begins the navigation step speculatively - processing the transcript through the context graph engine in the background. If the caller continues speaking, the speculative result is discarded. If the end-of-turn is confirmed and the transcript matches, the pre-computed navigation result is used immediately, saving the cost of a redundant LLM call.

The caller hears the agent respond faster without any change in response quality. When speculation fails (the caller was mid-sentence), the only cost is a discarded background computation - the caller experience is unaffected.
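The reuse-or-discard logic can be sketched as a small cache keyed by transcript. This is a synchronous simplification with hypothetical method names; in the pipeline described above, the speculative call would run in the background.

```python
from typing import Callable, Optional, Tuple

class SpeculativeNavigator:
    def __init__(self, navigate: Callable[[str], str]):
        self._navigate = navigate
        self._cached: Optional[Tuple[str, str]] = None  # (transcript, result)
        self.calls = 0  # counts navigation calls, to show the saving

    def _run(self, transcript: str) -> str:
        self.calls += 1
        return self._navigate(transcript)

    def on_moderate_confidence(self, transcript: str) -> None:
        # Start work early on the transcript heard so far.
        self._cached = (transcript, self._run(transcript))

    def on_caller_continued(self) -> None:
        # The caller kept talking: discard the speculative result.
        self._cached = None

    def on_end_of_turn(self, transcript: str) -> str:
        cached, self._cached = self._cached, None
        if cached is not None and cached[0] == transcript:
            return cached[1]          # reuse, no second call
        return self._run(transcript)  # transcript changed, compute fresh
```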

Text-to-Speech

Per-Language Provider Routing

The platform supports routing text-to-speech to different providers based on the caller's detected language. This is configured through a language-provider map that associates language codes with specific TTS providers and voice configurations.

When a caller's language is detected, the platform resolves the TTS provider through a priority matrix:

  1. Exact language match - e.g., ar-SA matches an Arabic (Saudi Arabia) entry

  2. Base language match - e.g., ar-SA falls back to an ar entry

  3. Multilingual fallback - a catch-all entry for any language not explicitly mapped

At each level, service configuration takes priority over agent configuration, which takes priority over workspace configuration. The first match wins.

If no language-specific entry matches, the platform uses the standard TTS provider selection (service > agent > workspace > default). Per-language configuration is isolated - when a language-specific provider is selected, only that entry's voice settings are used, preventing configuration for one provider from affecting another.
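The resolution order can be sketched as two nested loops: match level on the outside, configuration scope on the inside. The map shape and the "multi" catch-all key are assumptions made for illustration.

```python
from typing import Dict, Optional

LanguageMap = Dict[str, dict]

def resolve_tts(language: str, service: LanguageMap, agent: LanguageMap,
                workspace: LanguageMap) -> Optional[dict]:
    """Return the first matching entry, or None when nothing matches."""
    base = language.split("-")[0]
    # Match levels: exact tag, base language, then a hypothetical catch-all key.
    for key in (language, base, "multi"):
        # Within a level, service beats agent, which beats workspace.
        for scope in (service, agent, workspace):
            if key in scope:
                return scope[key]
    return None  # caller falls back to the standard provider selection
```

For example, with a workspace entry for `ar-SA`, an agent entry for `ar`, and a service catch-all, a caller speaking `ar-SA` gets the workspace entry because the exact-match level is checked before any broader level.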

The platform supports multiple TTS providers, selectable at the workspace level. Each provider offers different trade-offs across latency, voice quality, language support, and expressive capabilities. Workspace administrators choose the provider and configure provider-specific parameters (voice, model, speed, quality tuning) through the Developer Console voice settings page or the API. The Developer Console includes a voice library where operators can browse available voices across providers, preview audio samples, and select a voice for the workspace without needing to look up voice identifiers manually.

Provider-specific settings are isolated - changing providers does not affect call recordings, session management, or downstream analytics. If a configured provider is unavailable, the system falls back to the default provider.

The available providers differ in the following ways:

  • Default provider - Low-latency WebSocket streaming with emotion-aware prosody adjustments and pronunciation dictionary support.

  • Provider B - Persistent multi-stream WebSocket with per-turn context isolation, word-level timing alignment, and regional endpoint support for data residency requirements.

  • Provider C - Ultra-fast REST-based synthesis with sentence-level pipelining (audio starts before the full response is generated), vocal direction tags for expressive delivery, and automatic Arabic language detection.

All providers integrate with the same text generation pipeline: a streaming language model produces text fragments that are forwarded to the selected TTS engine in real time. Each provider has its own circuit breaker for fault isolation - a degradation in one provider does not affect the others.

Provider selection is transparent to callers and does not change the call experience, recordings, or API behavior.

The TTS engine converts the agent's generated text into spoken audio with dynamic per-turn control over emotion, speed, and volume. Each utterance - fillers, responses, empathy pauses - carries its own voice parameters, so a warm filler at reduced speed can precede a normal-pace informational response without shared state.

Emotion Priority Chain

The agent's vocal tone is not a single static setting. A six-level priority chain selects the most contextually appropriate emotion on every turn. Each level fires only if the previous one produced no signal:

| Priority | Source | Example |
| --- | --- | --- |
| 1. Vocal burst | Caller laughed, sighed, gasped, or cried in the last 5 seconds | Caller laughs → agent responds with warm enthusiasm immediately |
| 2. Prosody | Acoustic emotion model detects a strong signal from the caller's voice | Anxiety detected → sympathetic tone |
| 3. Proactive topic | The current context graph action matches a sensitive topic | Agent about to discuss test results → preemptive sympathetic tone |
| 4. Tone momentum | Previous turn's tone is carried forward when the current signal is weak | Tone stays sympathetic across a brief neutral pause |
| 5. Workspace baseline | The service's configured default tone | Friendly baseline for scheduling services |
| 6. System default | Engineering fallback | Calm |

This means the agent's voice adapts in real time to what is happening, not to a pre-configured setting. The workspace baseline sets the floor, but any strong emotional signal overrides it - always in the direction of more empathy, never less.
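The chain behaves like a first-non-empty lookup over the six levels. In this sketch the level keys and the `signals` dictionary are hypothetical; each level is assumed to report either an emotion label or nothing.

```python
# Levels 1-5 in priority order; level 6 is the system default.
PRIORITY = (
    "vocal_burst",
    "prosody",
    "proactive_topic",
    "tone_momentum",
    "workspace_baseline",
)

def select_emotion(signals: dict, system_default: str = "calm") -> str:
    """Return the emotion from the highest-priority level that produced a signal."""
    for level in PRIORITY:
        emotion = signals.get(level)
        if emotion:
            return emotion
    return system_default
```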

Six-level TTS emotion priority chain from vocal burst to system default

Split Model Architecture

Navigation and response generation use different LLM models optimized for their respective jobs:

  • Navigation uses a smaller, faster model. The nav output is roughly 5 tokens (a structured code line), so raw intelligence matters less than speed. The smaller model shaves latency off every turn without sacrificing decision quality on the constrained output format.

  • Response generation uses the full-size model. Response text is what the caller actually hears, so quality matters more than throughput. For typical voice responses (under 20 tokens), the speed difference between the two models adds less than 200ms - a good trade for noticeably better phrasing.

Situation-Response Adaptation

Four-dimensional adaptation: voice tone, filler behavior, response content, and behavioral signals

The voice pipeline adapts across four independent dimensions simultaneously. Each dimension operates on different output channels, so the agent can change what it says, how it says it, and whether it fills silence - all independently and in real time.

Emotion → Voice Tone

The agent mirrors empathy, not the caller's emotion. An angry caller hears a calm voice (de-escalation), not an angry one. An anxious caller hears a sympathetic voice (reassurance). A happy caller hears enthusiasm (matching energy).

Emotion → Filler Behavior

Filler speech adapts to the caller's emotional state. Anxious callers hear reassuring fillers ("Of course," "I'm here to help"). Frustrated callers with high arousal hear no fillers at all - the system suppresses them because frustrated callers want answers, not acknowledgments. Happy callers hear warm, matching fillers.

Emotion → Response Content

Emotional context is injected into every prompt the response model receives. The injection includes the caller's dominant emotion, trend direction, and adaptation guidance. A caller with deteriorating mood gets responses prioritizing resolution speed. A confused caller gets simplified explanations broken into small pieces. The agent adapts what it says based on how the caller is feeling, not just how it sounds.

Behavioral Signals → Response Content

Three behavioral signals are tracked in real time and injected into prompts when thresholds are crossed:

| Signal | What It Detects | What It Tells the Agent |
| --- | --- | --- |
| Interruption count | Caller has interrupted the agent multiple times | Shorten responses - the agent is talking too much |
| Short response streak | Caller is giving very brief answers consecutively | The caller is disengaging or withdrawing |
| Silence gaps | Extended silence from the caller | Confusion, hesitation, or distress |

These signals augment, not replace, the emotion detection system. A caller who is interrupting frequently may not sound frustrated in their voice, but the behavioral pattern tells the engine to shorten its responses.

Response Micro-Behaviors

The response generation model follows a set of communication guidelines that produce natural conversational behavior regardless of emotional state:

  • Speech rhythm mirroring - Short bursts from the caller produce concise responses; conversational callers get warmer, flowing replies

  • Emotional name usage - The caller's name is used at moments of emotional significance, not mechanically

  • Pause injection - When delivering difficult information, the agent pauses naturally before the key detail

  • Pace inversion - When the caller is rushing, the agent slows down with longer sentences and gentle transitions

  • Completion inference - When a caller trails off mid-sentence, the agent acknowledges what they were trying to say

The agent never mentions that it can detect the caller's emotions. Emotional adaptation is experienced as natural attentiveness, not surveillance.

Voice Timeline

The voice pipeline applies the same cut/navigate/engage pattern that drives conversation-level reasoning - but within each turn, managing what the caller hears and when.

A single actor processes signals sequentially. Fillers, responses, empathy pauses, and tool progress narration are not separate systems competing for the audio output. They are sequential states in one timeline, managed by one actor, producing one stream of utterances. "Let me check on that" followed by "Her appointment is Thursday" is one trajectory in two parts - not two unrelated items from different subsystems.

Three Operations

  1. Cut - A signal arrives (the caller stopped speaking, a tool started, empathy shifted). The actor asks: did something change? Between two cuts, dozens of raw events may arrive - emotion scores, behavioral signals, audio segments. The cut compresses them into a handful of causally relevant fields and discards the rest. This compression is what makes the next step tractable.

  2. Navigate - Given the compressed state and the trajectory of previous states, select the next voice state. Navigation is a pure decision with no side effects - it can be called speculatively without producing audio.

  3. Engage - Enqueue an utterance with its own emotion and speed baked in, then set a deadline. When the deadline fires, it becomes the next signal - the system re-enters cut/navigate/engage. The actor is self-driving.

Signal-to-State Mapping

Each signal produces a specific voice state:

| Signal | Voice State | What Happens |
| --- | --- | --- |
| Caller finished speaking | Breath | Brief pause before the agent responds (configurable, default 200ms) |
| Navigation complete | Transition | Filler window opens - if the response is not ready by the deadline, a filler plays |
| Tool started | Progress | Tool wait narration on a repeating interval ("Let me check on that...") |
| Tool finished | Response | Agent delivers the tool result |
| All audio finished | Listen | Silence deadline starts - check-ins escalate if the caller stays quiet |
| Empathy tier shifted | Hold | Intentional silence - the agent pauses to give the caller space |
| Caller started speaking | Listen | Pending fillers drain - the caller has the floor |
| Deadline expired | Next state | Self-signal - the actor re-enters cut/navigate/engage |

Deadlines are what make this self-driving. A transition state sets a deadline: "if no response arrives in 800ms, play a filler." If the response arrives first, the deadline is cancelled. If the deadline fires, it becomes a signal. No polling loops, no scattered timers.
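A deadline that is cancelled by an early response can be sketched with `asyncio.wait_for`. This is one possible shape, not the actor's actual implementation; the `transition` and `demo` helpers are hypothetical, and the deadline is scaled down from 800ms so the example runs quickly.

```python
import asyncio

async def transition(response: asyncio.Future, deadline_s: float, play) -> None:
    try:
        # shield() keeps the pending response alive if the deadline wins the race.
        text = await asyncio.wait_for(asyncio.shield(response), timeout=deadline_s)
    except asyncio.TimeoutError:
        play("filler")         # the deadline fired and acts as the next signal
        text = await response  # keep waiting for the real response
    play(text)

async def demo(response_delay_s: float) -> list:
    played = []
    loop = asyncio.get_running_loop()
    response = loop.create_future()
    loop.call_later(response_delay_s, response.set_result, "Her appointment is Thursday")
    await transition(response, 0.1, played.append)
    return played

fast = asyncio.run(demo(0.01))  # response arrives before the deadline
slow = asyncio.run(demo(0.3))   # deadline fires first, so a filler plays
```

In the fast case only the response is played; in the slow case the filler precedes it.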

Per-Utterance Voice Parameters

Each utterance carries its own emotion and speed, set at engage time. The audio output reads parameters directly from the utterance - not from shared state that could change between enqueue and playback. A filler set to "sympathetic" at 0.85x speed cannot be overwritten by a response that arrives a moment later. Races are eliminated by construction, not by adding synchronization.

The utterance queue is the boundary between the actor and the audio output. The actor writes utterances in. The audio output reads them out. No callbacks, no shared mutable state, no coordination needed beyond the queue itself.
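The idea of parameters travelling with the utterance can be shown with an immutable record and a queue. The `Utterance` type and `next_playback` helper are hypothetical names; a real pipeline would likely use an async queue rather than a plain deque.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: parameters cannot change after enqueue
class Utterance:
    text: str
    emotion: str
    speed: float

queue = deque()
# The actor writes utterances in, each with its own voice parameters.
queue.append(Utterance("Let me check on that", "sympathetic", 0.85))
queue.append(Utterance("Her appointment is Thursday", "calm", 1.0))

def next_playback(q: deque) -> tuple:
    """The audio output reads parameters from the utterance itself."""
    utterance = q.popleft()
    return (utterance.text, utterance.emotion, utterance.speed)
```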

Voice Timing Configuration

The voice timeline exposes two categories of configuration per service:

When - timing in milliseconds:

| Parameter | Default | What It Controls |
| --- | --- | --- |
| Post end-of-turn pause | 200ms | Breath duration after the caller finishes speaking |
| Transition deadline | 800ms | Maximum silence before a filler plays |
| Progress interval | 3000ms | How often to narrate during tool waits |
| Empathy hold | 500ms | Intentional silence for empathetic moments |
| Filler cooldown | 2000ms | Minimum gap between consecutive fillers |

What - vocabulary and style:

| Parameter | What It Controls |
| --- | --- |
| Filler style | Phrase, backchannel, or silent (see below) |
| Filler vocabulary | Custom backchannel words ("Mm," "Yeah," "Mhm") |
| Progress vocabulary | Custom tool-wait phrases ("One moment...," "Let me check...") |

Everything else - signal routing, deadline management, state transitions, trajectory tracking - is derived from the signal-to-state mapping and these two knobs.

Filler Styles

Three filler styles are available, configurable per service:

| Style | Behavior | Best For |
| --- | --- | --- |
| Phrase | Contextual phrases like "Let me check that for you" or "One moment" | General-purpose services where the agent should sound active |
| Backchannel | Short acknowledgments like "Mm," "Yeah," "Mhm" | Services where brief, natural-sounding turn-taking is preferred |
| Silent | No filler at all - the agent pauses until the response is ready | Services where silence between turns is acceptable or preferred |

The filler style is enforced end-to-end: when a service is configured as silent, the pipeline suppresses filler generation, filler guidelines in the navigation prompt, and filler text in the response - not just the final audio output. Receipt and working fillers ("Got it," "Let me check") inherit the nav-selected emotion, so they sound consistent with the rest of the turn. Backchannel vocabulary is customizable per service.

When navigation is skipped - typically in single-action context graphs where the agent always stays in the same state - the orchestrator starts a short timer (configurable per service). If the response has not produced audio by the time the timer fires, a backchannel sound plays to hold the conversational rhythm. If the response arrives first, the timer is cancelled. Services using the "silent" filler style suppress this timer entirely.

Empathy-Gated Filler Behavior

Filler behavior is controlled by the caller's empathy tier. At higher tiers, silence replaces fillers because silence is the empathy:

  • T0-T1 - Normal filler emission. At T1, the filler type is set to "empathy" (warmer, acknowledging) rather than "receipt" or "working."

  • T2 Full Empathy - The system inserts a 0.5-second pause before speaking any filler. This anti-filler silence gives the caller space to continue.

  • T3 Hold Space - Fillers are suppressed entirely. The agent pauses for one second, then delivers a pure empathy response.

When the caller's empathy tier changes mid-turn, the orchestrator reacts immediately - shifting from normal filler behavior to empathy-appropriate silence or vice versa without waiting for the next turn boundary.

When empathy fillers do play, the TTS engine switches to a presence mode: 0.85x speed with neutral emotion. This prevents the uncanny effect of strong emotion applied to very short phrases (two to four words), where the TTS engine stretches vowels in ways that sound artificial.
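The tier rules can be summarized as a small decision function. The `FillerPlan` type is hypothetical, and treating the T2 filler as the "empathy" type is an assumption; the text above specifies only the pause at that tier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FillerPlan:
    pause_ms: int                # silence inserted before anything is spoken
    filler_type: Optional[str]   # None means no filler is spoken

def plan_filler(tier: int, default_type: str = "working") -> FillerPlan:
    if tier >= 3:
        return FillerPlan(1000, None)      # T3: fillers suppressed, one-second hold
    if tier == 2:
        return FillerPlan(500, "empathy")  # T2: pause first (filler type assumed)
    if tier == 1:
        return FillerPlan(0, "empathy")    # T1: warmer filler type
    return FillerPlan(0, default_type)     # T0: normal receipt or working filler
```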

Principle-Based Filler Generation

Filler phrases are not drawn from a hardcoded list. The system generates them using an LLM with the current emotional context and conversation intent as inputs. This means fillers adapt to the situation: a caller who sounds anxious gets "I'm looking into that for you right now" rather than a generic "One moment." The generation is guided by emotional guidelines and the current context graph action, so fillers stay contextually appropriate.

Filler emissions are rate-limited to prevent cascading acknowledgements when rapid turns or false barge-ins cause multiple navigation cycles in quick succession. The orchestrator also caps consecutive filler emissions per turn, so extended processing time produces at most two filler phrases before the agent falls silent.
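A limiter that combines a cooldown with a per-turn cap could be sketched as follows. The class is hypothetical; the 2000ms cooldown matches the documented filler cooldown default, and the cap of two follows the limit described above.

```python
from typing import Optional

class FillerLimiter:
    def __init__(self, cooldown_ms: int = 2000, max_per_turn: int = 2):
        self.cooldown_ms = cooldown_ms
        self.max_per_turn = max_per_turn
        self._last_ms: Optional[int] = None
        self._count = 0

    def start_turn(self) -> None:
        # The per-turn cap resets; the cooldown carries across turns.
        self._count = 0

    def allow(self, now_ms: int) -> bool:
        if self._count >= self.max_per_turn:
            return False
        if self._last_ms is not None and now_ms - self._last_ms < self.cooldown_ms:
            return False
        self._last_ms = now_ms
        self._count += 1
        return True
```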

Tool-Wait Progress Hints

Fillers emitted while a tool is running can be shaped per state and per tool, not just per service. A progress hint describes the shape of the wait rather than supplying a phrase list: what kind of work the tool is doing (record lookup, write, external call, computation, multi-step workflow), roughly how long it is expected to take, and how the agent should cover the wait (auto, verbal, backchannel, or silent).

The orchestrator turns the hint into utterances at runtime using tool semantics, the caller's current emotion, and conversation context. Retries produce attempt-aware language: an initial acknowledgement, then a brief apology, then a "still working on it" update, each scaled to how long the tool has actually been running. A tool-level hint field-merges with the state's channel-level hint, so a state can declare the default wait shape for its channel and individual tools only override the fields that actually differ.
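The field-level merge can be illustrated with two dictionaries. The key names (`work`, `expected_ms`, `cover`) are invented for this sketch and are not the platform's schema.

```python
def merge_hints(channel_hint: dict, tool_hint: dict) -> dict:
    """Start from the channel default and override only the fields the tool sets."""
    merged = dict(channel_hint)
    merged.update({k: v for k, v in tool_hint.items() if v is not None})
    return merged

# The state declares the default wait shape for its channel.
channel = {"work": "record_lookup", "expected_ms": 1500, "cover": "auto"}
# One slow tool overrides only the expected duration.
slow_tool = {"expected_ms": 6000}
merged = merge_hints(channel, slow_tool)
```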

For tools with expected latency of four seconds or more, a custom phrase can override the generated progress text. This gives agent engineers precise control over what callers hear during long waits - for example, a tool that queries multiple external systems. The custom phrase is bounded to 30 words and requires a progress class as fallback.

When the orchestrator is handling tool progress, it takes over from the default filler pipeline entirely - there is no duplicate filler injection from multiple sources during tool execution.

Result Persistence Modes

Tool call specs support a result_persistence setting that controls how tool results accumulate in the agent's prompt context:

| Mode | Behavior | Best For |
| --- | --- | --- |
| accumulate (default) | Every tool result is retained in the prompt history | Tools called once or a few times per conversation |
| override | Only the latest result per tool name is retained; previous results for the same tool are replaced | Polling tools called repeatedly (availability checks, status lookups) where only the most recent result matters |

Override mode prevents context bloat from tools that the agent calls multiple times during a conversation. A scheduling agent that checks availability five times during a complex multi-provider booking only carries the most recent availability snapshot in its context, not all five results stacked up. This keeps the prompt focused and reduces token usage without losing the information the agent actually needs.
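The difference between the two modes can be shown on a list of results. The `record_result` helper and the entry shape are hypothetical; only the mode names come from the table above.

```python
def record_result(history: list, tool_name: str, result, mode: str = "accumulate") -> list:
    """Return a new history with the result recorded under the given mode."""
    if mode == "override":
        # Drop earlier results from the same tool before appending the new one.
        history = [entry for entry in history if entry["tool"] != tool_name]
    elif mode != "accumulate":
        raise ValueError(f"unknown result_persistence mode: {mode!r}")
    return history + [{"tool": tool_name, "result": result}]

history = []
for slots in (["9:00"], ["9:00", "10:30"], ["10:30"]):
    history = record_result(history, "check_availability", slots, mode="override")
```

After three availability checks in override mode, the history holds a single entry with the latest snapshot.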

Context Window Management

The engine tracks cumulative token usage across a conversation and applies automatic policies when utilization crosses configurable thresholds:

| Threshold | Action |
| --- | --- |
| Warning (default 60%) | The engine compresses conversation history by capping the number of retained turns, reducing context size while preserving the most recent and most relevant exchanges |
| Exhaustion (default 80%) | The engine flags the conversation for operator escalation, ensuring a human can take over before the context window is fully consumed |

Token tracking is cumulative - it counts all input and output tokens across the conversation's lifetime, not just the current turn. This catches gradually growing conversations (long scheduling sessions, multi-topic calls) before they hit hard context limits. The warning threshold triggers silent degradation that the caller does not notice. The exhaustion threshold surfaces a handoff signal through the standard operator escalation path.
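The threshold check reduces to a comparison against two ratios. In this sketch the function name, the action labels, and the notion of a single token budget are assumptions; the 60% and 80% defaults come from the table above.

```python
def context_action(used_tokens: int, budget_tokens: int,
                   warning: float = 0.60, exhaustion: float = 0.80) -> str:
    """Map cumulative token utilization to a policy action."""
    utilization = used_tokens / budget_tokens
    if utilization >= exhaustion:
        return "escalate"   # surface a handoff signal to an operator
    if utilization >= warning:
        return "compress"   # cap retained turns, unnoticed by the caller
    return "none"
```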

Barge-In Detection

If the caller starts speaking while the agent is talking, the system needs to decide whether to stop the agent's audio. Barge-in uses semantic confirmation - it requires actual recognized words rather than just acoustic energy. This filters out coughs, breathing, background conversation, and echo from the agent's own audio that would otherwise cause false interruptions.

The decision is based on four conditions evaluated together:

  1. Whether the caller's speech contains actual recognized words from the speech-to-text engine (not breathing, echo, or background noise). Voice activity detection alone is not sufficient - the system requires at least one recognized word before triggering a barge-in.

  2. Whether the speech has lasted long enough with recognized words (minimum duration is configurable per service). The default threshold is low enough that short responses like "yes," "no," and "okay" (200-400ms of speech) can trigger barge-in.

  3. Whether the cooldown period has elapsed since the last barge-in (configurable per service, prevents rapid false triggers)

  4. Whether the agent is currently speaking

When all conditions are met, the agent's audio stops and the system returns to listening mode. This prevents the agent from talking over a caller who is trying to ask a question or correct a misunderstanding.
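The four conditions can be written as a single predicate. The config type, parameter names, and default values below are illustrative assumptions; the documentation states only that both settings are configurable per service.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class BargeInConfig:
    min_speech_ms: int = 200   # illustrative value
    cooldown_ms: int = 1000    # illustrative value

def should_barge_in(recognized_words: List[str], speech_ms: int, now_ms: int,
                    last_barge_in_ms: Optional[int], agent_speaking: bool,
                    cfg: BargeInConfig = BargeInConfig()) -> bool:
    has_words = len(recognized_words) >= 1           # 1. real words, not just energy
    long_enough = speech_ms >= cfg.min_speech_ms     # 2. minimum speech duration
    cooled_down = (last_barge_in_ms is None          # 3. cooldown has elapsed
                   or now_ms - last_barge_in_ms >= cfg.cooldown_ms)
    return has_words and long_enough and cooled_down and agent_speaking  # 4. agent talking
```

A cough produces no recognized words, so it fails the first condition regardless of how long or loud it is.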

There is also a fast path for end-of-turn interrupts. When the speech engine produces a complete transcript with an end-of-turn signal while the agent is speaking, the system interrupts the agent's audio immediately from the recognition listener rather than waiting for the transcript to pass through the processing queue. This eliminates queue latency on short-phrase interrupts where the caller finishes speaking quickly.

Response Length Enforcement

The voice pipeline enforces maximum response length at the streaming level - the TTS engine stops generating audio when the configured sentence or word cap is reached. This is a mechanical limit, not a prompt instruction, so it cannot be exceeded regardless of what the LLM generates. Response caps are configurable per service, allowing scheduling services to keep responses brief while clinical services allow longer explanations.
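
One way to picture a mechanical cap is a filter over the streamed text that stops forwarding once either limit is hit. The sketch below assumes the stream arrives as whitespace-delimited words and treats a word ending in ".", "!", or "?" as the end of a sentence; both assumptions and the function name are illustrative only.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def cap_response(
    words: Iterable[str], max_sentences: int, max_words: int
) -> Iterator[str]:
    """Yield words until the sentence cap or the word cap is reached."""
    sentences = 0
    count = 0
    for word in words:
        # Check the caps before forwarding, so nothing past the limit is emitted.
        if count >= max_words or sentences >= max_sentences:
            return
        yield word
        count += 1
        if word.endswith(SENTENCE_END):
            sentences += 1
```

Since the check sits in the streaming path rather than in the prompt, whatever the LLM produces beyond the cap is never forwarded for synthesis.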

Call Completion

When the agent reaches a terminal state in the context graph and decides to end the call, it signals its intent to hang up but does not disconnect immediately. The system waits for signal convergence: the agent's closing utterance must finish playing and any in-flight tool results must resolve before the call disconnects. If the caller speaks during this window (barge-in) or a transfer is initiated, the hangup intent is retracted and the conversation continues.

This ensures the caller always hears the agent's full closing message, even when the terminal state involves a tool call with a follow-up utterance.
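
Signal convergence can be sketched as a small state object: the call disconnects only while the hangup intent still stands and nothing is outstanding. The class, method, and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HangupState:
    """Tracks the signals that must converge before disconnecting (hypothetical names)."""
    hangup_intended: bool = False
    closing_utterance_playing: bool = False
    pending_tool_calls: int = 0

    def request_hangup(self) -> None:
        # The agent signals intent; nothing disconnects yet.
        self.hangup_intended = True

    def on_caller_speech(self) -> None:
        # A barge-in during the closing window retracts the intent.
        self.hangup_intended = False

    def on_transfer_initiated(self) -> None:
        # A transfer also retracts the intent.
        self.hangup_intended = False

    def ready_to_disconnect(self) -> bool:
        return (
            self.hangup_intended
            and not self.closing_utterance_playing
            and self.pending_tool_calls == 0
        )
```

In this sketch a session loop would poll `ready_to_disconnect()` after each playback or tool event and disconnect only once it returns true.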

Per-Service Voice Configuration

Voice control plane: per-service config overrides workspace settings overrides defaults, with automatic emotional adaptation

Voice behavior is configurable at the service level, allowing different services within the same workspace to have different voice characteristics. Configuration follows a three-level hierarchy described in the Voice Control Plane section. Per-service settings cover:

  • Filler behavior - Style (phrase, backchannel, or silent), custom vocabulary, backchannel timing

  • Barge-in sensitivity - Minimum speech duration and cooldown period

  • Response limits - Maximum sentences and words per response

  • End-of-turn detection - Eagerness threshold and timeout

  • TTS settings - Model selection and buffer delay

  • Voice timing - The "when" and "what" knobs described in Voice Timing Configuration above

  • Call forwarding - Whether the agent can transfer calls to external numbers (opt-in, disabled by default). Supports both pre-configured forwarding numbers (from EHR location data or workspace settings) and dynamic forwarding to any E.164 number provided at call time

These settings are managed through the Platform API and the Agent Forge CLI.
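
The override order (defaults, then workspace settings, then per-service config) can be illustrated with a layered merge. The setting keys are invented for this example, and treating `None` as "not set at this level" is an assumption, not documented behavior of the Platform API.

```python
from typing import Any, Dict, Mapping


def resolve_voice_config(
    defaults: Mapping[str, Any],
    workspace: Mapping[str, Any],
    service: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge the three levels; later levels win over earlier ones."""
    resolved = dict(defaults)
    for layer in (workspace, service):
        # Assumption: a None value means "inherit from the level below".
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


# Hypothetical setting names, for illustration only.
defaults = {"filler_style": "phrase", "max_sentences": 3, "tts_buffer_ms": 120}
workspace = {"max_sentences": 4}
service = {"filler_style": "backchannel", "max_sentences": None}

effective = resolve_voice_config(defaults, workspace, service)
```

Here the service overrides the filler style, inherits the workspace's sentence cap, and falls back to the default buffer delay.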

Real-Time Audio Correction

The voice agent runs a parallel audio verification layer alongside the live STT stream. When the navigator detects that a turn contains structured data - names, dates of birth, phone numbers, insurance IDs, or medication names - it flags the turn for audio verification. A separate correction model cross-checks the STT transcript against the raw audio buffer for that segment, catching misrecognized digits, transposed characters, and phonetically similar names before they enter the agent's reasoning pipeline.

Domain-Aware Correction Hints

Each workspace can configure voice settings with domain-specific correction hints that tell the verification model what kinds of structured data to listen for. A dental practice might hint on insurance group numbers and procedure codes; a primary care clinic might hint on medication names and dosage quantities. These hints narrow the correction model's focus so it spends its budget on the data types that matter most for that service, rather than re-verifying every word in the transcript.

Confidence-Gated Correction

When the correction model produces a candidate correction, it assigns a confidence score (1-9) that determines what happens next:

| Confidence | Label | Behavior |
| --- | --- | --- |
| 8-9 | Certain | The corrected value replaces the original transcript silently. The caller is not asked to confirm. |
| 5-7 | Likely | The agent confirms the corrected value with the caller ("I heard your date of birth as March 12, 1985 - is that right?"). |
| 1-4 | Uncertain | The agent asks the caller to repeat the information ("Could you say that date of birth again for me?"). |

This prevents the agent from acting on low-confidence corrections while avoiding unnecessary confirmation prompts when the correction model is highly confident. The confidence thresholds are tuned for healthcare data where accuracy matters more than conversational speed.
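
The three confidence bands map directly onto a small dispatch function. The score ranges come from the bands described above; the function name and the action labels are placeholders chosen for this sketch.

```python
def correction_action(confidence: int) -> str:
    """Map a correction confidence score (1-9) to the agent's next step."""
    if not 1 <= confidence <= 9:
        raise ValueError(f"confidence must be in 1-9, got {confidence}")
    if confidence >= 8:
        return "replace_silently"      # Certain: use the corrected value as-is
    if confidence >= 5:
        return "confirm_with_caller"   # Likely: read the value back for confirmation
    return "ask_to_repeat"             # Uncertain: ask the caller to say it again
```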

Real-time audio intelligence: parallel STT and audio buffer feed verification with confidence-gated correction

This is distinct from post-call re-transcription. Real-time correction happens during the conversation, so the agent reasons from corrected text - not from raw STT output that might contain errors.

Post-Call Processing

The real-time STT stream prioritizes speed over accuracy. Post-call re-transcription catches words the live stream may have missed.

After a call ends, the system runs a higher-accuracy batch transcription of the full recording. This re-transcription catches words that the real-time stream may have missed or misrecognized and adds speaker diarization - identifying which speaker said each word. The result is a verified, speaker-attributed transcript where every segment is tagged with the speaker who produced it. For voice calls, this distinguishes between the caller and the agent. For clinical copilot sessions, it distinguishes between the clinician, the patient, and any additional speakers present.

The system also feeds recognition accuracy data back into the STT configuration, identifying which keyterms were recognized correctly and which were missed. As a result, transcription accuracy for your specific vocabulary improves over time.

Post-Call Text Intelligence

After re-transcription, the verified transcript is analyzed for sentiment, topics, intents, and a structured summary. These results are stored alongside the call intelligence data and are available through the same API endpoints. This analysis runs on the text transcript (not the audio), so it captures conversational content that audio-only analysis misses - for example, whether the caller's stated intent matched the outcome.

Post-Call Quality Scoring

After a call ends, the system runs a structured quality analysis on the stereo call recording (caller audio on one channel, agent audio on the other). The analysis scores the call across five dimensions:

| Dimension | What It Measures |
| --- | --- |
| Task completion | Did the agent accomplish what the caller needed? Fully, partially, or not at all. |
| Information accuracy | Were speech recognition results correct? Did the agent act on accurate transcriptions? |
| Conversation flow | Was the conversation natural? Were there awkward pauses, unnecessary repetitions, or disjointed transitions? |
| Error recovery | When confusion occurred, did the agent recover gracefully or compound the problem? |
| Caller experience | Based on tone and interaction patterns, did the caller seem satisfied with the exchange? |

Each dimension is scored on a 1-5 scale. The system also produces a summary, an outcome classification (succeeded, partially succeeded, failed, or abandoned), and specific STT correction suggestions that feed back into keyterm boosting.

Call Intelligence

In addition to post-call quality scoring, the system computes a structured intelligence summary from in-memory session state at the moment the call ends. This captures operational telemetry that the quality scoring (which runs asynchronously on recordings) cannot see.

| Summary | What It Captures |
| --- | --- |
| Emotion | Dominant emotion, valence/arousal/dominance averages, peak negative emotion, emotional shifts, final trend, language sentiment score, toxicity categories, recent vocal bursts, speaker-normalized energy deltas |
| Risk | Composite risk score, risk level, contributing signals with individual weights |
| Latency | Engine response time (avg, p50, p95), audio time-to-first-byte, silence ratio |
| Conversation | Turn count, states visited, loop count, barge-in count, completion reason, final state |
| Tool | Total tool calls, success/failure counts, failure rate, per-tool breakdown |
| Safety | Safety rule matches during the call, escalation triggers |
| Operator | Whether escalation occurred, operator connect time, resolution |

Composite Quality Score

A rule-based quality score (0-100) is computed from the intelligence summaries. The score starts at 100 and applies penalties for negative signals:

| Signal | Threshold | Penalty |
| --- | --- | --- |
| High response latency | p95 audio TTFB > 1s | -5 to -15 |
| Excessive silence | Silence ratio > 20% | -10 to -20 |
| Caller barge-ins | > 2 interruptions | -5 to -15 |
| Agent loops | Revisited states | -10 to -20 |
| Operator escalation | Any | -10 |
| Tool failures | Failure rate > 5% | -5 to -15 |

The quality score is designed for dashboard filtering and trend analysis - it identifies calls that need attention without requiring someone to listen to every recording.
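
The penalty rules can be sketched as a small scoring function. The thresholds and penalty ranges are the ones listed for each signal; how a penalty scales within its range is not specified, so this sketch assumes linear interpolation up to a made-up "severe" value per signal. Those severe values, the field names, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    p95_audio_ttfb_s: float
    silence_ratio: float       # 0.0-1.0
    barge_in_count: int
    loop_count: int            # number of revisited states
    escalated: bool
    tool_failure_rate: float   # 0.0-1.0


def scaled_penalty(value: float, threshold: float, severe: float,
                   low: float, high: float) -> float:
    """0 at or below the threshold; otherwise low..high, capped at high.

    Linear scaling between threshold and `severe` is an assumption.
    """
    if value <= threshold:
        return 0.0
    fraction = min((value - threshold) / (severe - threshold), 1.0)
    return low + fraction * (high - low)


def composite_quality_score(s: CallSignals) -> int:
    penalty = 0.0
    penalty += scaled_penalty(s.p95_audio_ttfb_s, 1.0, 3.0, 5, 15)
    penalty += scaled_penalty(s.silence_ratio, 0.20, 0.50, 10, 20)
    penalty += scaled_penalty(s.barge_in_count, 2, 6, 5, 15)
    penalty += scaled_penalty(s.loop_count, 0, 3, 10, 20)
    penalty += 10 if s.escalated else 0
    penalty += scaled_penalty(s.tool_failure_rate, 0.05, 0.50, 5, 15)
    # Start at 100, subtract penalties, and clamp at 0 as a safety floor.
    return max(0, round(100 - penalty))
```

A call with no negative signals scores 100, and a call whose only issue is an operator escalation scores 90.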

Call Intelligence API

Two API endpoints expose call intelligence data:

| Endpoint | What It Returns |
| --- | --- |
| Completed call intelligence | Full intelligence profile for a finished call - joins the persisted summaries with per-turn data reconstructed from conversation transcripts. Includes emotion trajectory, risk timeline, latency waterfall, tool performance breakdown, and conversation quality events (loops, barge-ins). |
| Active call intelligence | All currently active calls enriched with a live intelligence overlay - current emotion, risk score and trend, turn count, escalation status, and current conversation state. Updated after every caller speech turn. |

The live intelligence overlay is written after each caller turn and refreshed alongside the active call heartbeat. If a call ends or the session is lost, the live data expires automatically.

Per-turn reconstruction is computed on read from the stored conversation turns - it is not stored separately. This means the intelligence profile reflects the full turn data without duplicating storage.

Post-call processing: call intelligence, diarized re-transcription, quality analysis, STT improvement
