# Audio Pipeline

## How the Audio Pipeline Works

The audio pipeline converts a caller's voice into text, processes the transcript through the agent's reasoning, and converts the response back to speech. Speech recognition, emotion analysis, tool execution, and speech output are coordinated so one slow component does not block the rest of the conversation.

Each voice session manages four user-visible concerns:

* **Listen** - Capture caller speech and detect when the caller has finished a turn.
* **Understand** - Produce transcripts and emotional context for the agent.
* **Decide** - Use the current context graph, patient context, and tool results to choose the next response.
* **Speak** - Deliver the response with the right timing, voice, and interruption behavior.

This design lets Amigo update recognition, emotion detection, and response generation independently while preserving a consistent call experience for the patient.

## Signal Capture

Audio arrives from the telephony layer as a standard telephony audio stream. The system splits it into two parallel paths the moment it arrives:

1. **Speech-to-text** - Converts audio to transcript text in real time
2. **Emotion detection** - Analyzes vocal qualities for emotional signals (covered in [Emotion Detection](/channels/voice/emotion-detection.md))

If either path fails, the other continues unaffected. A failure in emotion detection does not delay transcription. A failure in transcription does not block emotion analysis.
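The failure isolation described above can be illustrated with a minimal sketch. The function names and the simulated emotion failure are hypothetical stand-ins, not the platform's actual components; the point is only that the two paths run concurrently and one path's exception does not cancel the other.

```python
import asyncio

async def transcribe(chunk: bytes) -> str:
    # Hypothetical stand-in for a streaming speech-to-text call.
    return f"transcript({len(chunk)} bytes)"

async def detect_emotion(chunk: bytes) -> str:
    # Hypothetical stand-in for the emotion model; it fails here on purpose.
    raise RuntimeError("emotion service unavailable")

async def process_chunk(chunk: bytes) -> dict:
    # return_exceptions=True returns a failure as a value instead of
    # raising, so one path's error never aborts the other path.
    stt, emotion = await asyncio.gather(
        transcribe(chunk), detect_emotion(chunk), return_exceptions=True
    )
    return {
        "transcript": None if isinstance(stt, Exception) else stt,
        "emotion": None if isinstance(emotion, Exception) else emotion,
    }

result = asyncio.run(process_chunk(b"\x00" * 160))
```

In this sketch the transcript is still produced even though the emotion path raised.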

<figure><img src="/files/MjTMQk1HO46w0H1eLIOR" alt="Voice timeline: signals drive cut/navigate/engage phases to produce voice states"><figcaption></figcaption></figure>

### Progressive Initialization

Before the caller picks up, the voice agent runs a prewarm phase during ring time: fetching workspace configuration, skill definitions, and integration credentials. When the call connects, only three components are on the critical path before the greeting plays: configuration resolution, the reasoning engine, and text-to-speech. Speech-to-text and conversation monitoring initialize in parallel with the greeting. The caller hears the agent within two seconds of pickup.

This progressive approach applies to both inbound and outbound calls. For outbound calls, prewarm runs during the dialing and ringing phase (typically 5-15 seconds before the patient answers). When the patient picks up, the engine and greeting are already initialized - the patient hears an instant greeting instead of several seconds of silence.

Speech-to-text (STT) is ready before the greeting finishes, so transcription is already active when the caller speaks their first words. Monitoring embeddings load in the background without affecting latency.

### Greeting

The agent delivers the greeting to the caller as soon as the connection is established, without waiting for the speech recognition pipeline to finish initializing. Speech recognition connects concurrently in the background and is ready before the caller begins speaking. This overlap eliminates unnecessary startup delay and keeps first-message latency under 2 seconds.

For conference-mode calls where the agent leg is created during ring time, the platform waits for the caller to actually join the conference before releasing the greeting. This prevents the greeting from playing into an empty conference.

### Speech Handling

While the agent plays its greeting, any caller speech is discarded. The caller is "not heard" until the greeting finishes. This prevents the agent from interpreting ambient noise, simultaneous "hello" responses, or partial utterances as meaningful input before the conversation has properly started. Once the greeting completes, the speech-to-text pipeline begins processing caller audio normally.

## Speech-to-Text

The speech-to-text stage converts caller audio into text transcripts in real time. For multilingual deployments, the platform detects the caller's spoken language automatically and consolidates on a dominant language once sufficient audio has been processed. This language detection feeds downstream components - when the caller's language is identified, the text-to-speech output language is switched to match, ensuring the agent responds in the same language the caller is speaking.

Amigo uses a streaming speech recognition engine for real-time transcription. Audio is transcribed with sub-300ms latency, meaning the system has text available almost as fast as the caller speaks.

The STT engine is not statically configured at call start. It adapts mid-conversation as context changes - reconfiguring recognition vocabulary, end-of-turn sensitivity, and language hints without reconnecting or interrupting the audio stream.

### Language Selection

The platform supports English-optimized and multilingual STT models, selected per-service:

| Setting          | Model                                 | Best For                                                     |
| ---------------- | ------------------------------------- | ------------------------------------------------------------ |
| **English**      | English-optimized (lowest error rate) | Monolingual English-speaking populations                     |
| **Multilingual** | Multi-language with code-switching    | Populations that switch between languages mid-conversation   |
| **Auto**         | Multi-language with auto-detection    | Unknown caller language; narrows automatically once detected |

In auto mode, the platform tracks which language the caller is speaking on each turn. Once a dominant language reaches high confidence, the STT hints narrow automatically to improve recognition accuracy for that language. This happens mid-call without interruption.

When a patient's preferred language is known from their [world model](/data/world-model.md) record, the platform uses it automatically - selecting the optimal STT model for that language before the caller needs to speak enough for auto-detection. This is particularly useful for multilingual populations where patient demographics already capture language preference. The priority order is: patient's recorded language preference, then workspace-level voice setting, then default.
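As a rough sketch of that priority order, assuming each source is either a language setting string or unset (`None`). The function name and the `"auto"` default value are illustrative assumptions, not documented platform behavior.

```python
from typing import Optional

def resolve_stt_language(
    patient_preference: Optional[str],
    workspace_setting: Optional[str],
    default: str = "auto",  # assumed default, for illustration only
) -> str:
    """Return the first configured source in priority order."""
    for candidate in (patient_preference, workspace_setting):
        if candidate:
            return candidate
    return default
```

For example, `resolve_stt_language("es", "en")` picks the patient's recorded preference, while `resolve_stt_language(None, "en")` falls through to the workspace setting.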

### Keyterm Boosting

Medical terminology, provider names, medication names, and organization-specific vocabulary are difficult for general-purpose speech recognition. Three layers of keyterm boosting improve recognition accuracy:

| Level               | Managed By               | Scope                                                      |
| ------------------- | ------------------------ | ---------------------------------------------------------- |
| **Service-level**   | Workspace administrators | Applied to all calls for a given service                   |
| **Workspace-level** | API configuration        | Per-workspace vocabulary (clinic names, local terminology) |
| **System defaults** | Amigo engineering        | Baseline medical and scheduling vocabulary                 |

All three layers are merged and deduplicated at call start. The merged list then updates dynamically during the call:

* **Patient context injection** - When the patient's identity resolves and their record loads, the platform extracts medical vocabulary from their medications, allergies, and active conditions. These terms are injected into the active STT session mid-call. A patient on metformin and lisinopril gets those drug names boosted as soon as their record is resolved.
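The merge-and-deduplicate step, followed by mid-call injection, might look like the sketch below. Case-insensitive deduplication and the helper name `merge_keyterms` are assumptions made for illustration; the source does not specify how duplicates are compared.

```python
from typing import Iterable, List

def merge_keyterms(*layers: Iterable[str]) -> List[str]:
    """Merge keyterm layers in order, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for layer in layers:
        for term in layer:
            cleaned = term.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# Call start: service, workspace, and system layers (example values).
active = merge_keyterms(["Dr. Lee"], ["Northside Clinic"], ["copay", "referral"])

# Mid-call: the patient's record resolves and their terms are injected.
active = merge_keyterms(active, ["metformin", "lisinopril", "Copay"])
```

Reusing the same merge for injection keeps already-boosted terms from being added twice.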

### End-of-Turn Detection

The system must determine when the caller has finished speaking so the agent can respond. This uses configurable confidence thresholds that balance two concerns:

* **Responding too early** cuts the caller off mid-sentence
* **Responding too late** creates awkward silence

Base thresholds are configurable per workspace. On top of that, context graph states can override end-of-turn sensitivity through turn policy settings:

* **Data collection states** (collecting a date of birth, spelling a name) use higher thresholds and longer silence timeouts, because the caller is thinking and pausing between pieces of information
* **Action states** (confirming an appointment, answering a yes/no question) use default thresholds for snappy turn-taking

These overrides take effect on state transitions - when the agent moves to a data collection state, the STT engine reconfigures mid-call to be more patient with pauses.
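One way to picture a per-state override is a base policy with selected fields replaced on a state transition. The field names and numeric values below are hypothetical; the actual turn policy settings are not listed in this document.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class TurnPolicy:
    eot_threshold: float      # confidence needed to declare end-of-turn
    silence_timeout_ms: int   # silence tolerated before closing the turn

def effective_policy(base: TurnPolicy, override: Optional[dict]) -> TurnPolicy:
    """Apply a state's override fields on top of the workspace base policy."""
    return replace(base, **(override or {}))

workspace_base = TurnPolicy(eot_threshold=0.7, silence_timeout_ms=800)

# A data collection state is more patient with pauses (illustrative numbers).
data_collection = effective_policy(
    workspace_base, {"eot_threshold": 0.85, "silence_timeout_ms": 2000}
)
# An action state declares no override and keeps the base policy.
action = effective_policy(workspace_base, None)
```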

#### Speculative Processing

The voice pipeline does not wait for full end-of-turn confirmation before starting work. When the STT engine signals moderate confidence that the caller may have finished speaking, the system begins the navigation step speculatively - processing the transcript through the context graph engine in the background. If the caller continues speaking, the speculative result is discarded. If the end-of-turn is confirmed and the transcript matches, the pre-computed navigation result is used immediately, saving the cost of a redundant LLM call.

The caller hears the agent respond faster without any change in response quality. When speculation fails (the caller was mid-sentence), the only cost is a discarded background computation - the caller experience is unaffected.
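The speculate-then-confirm flow can be sketched with a background task that is reused only when the confirmed transcript matches. `SpeculativeNavigator` and `fake_navigate` are hypothetical names, and the real navigation step is an LLM call rather than this stub.

```python
import asyncio

class SpeculativeNavigator:
    def __init__(self, navigate):
        self._navigate = navigate   # async callable: transcript -> result
        self._task = None
        self._transcript = None

    def on_maybe_end_of_turn(self, transcript: str) -> None:
        # Moderate confidence: start navigating in the background.
        if self._task is not None:
            self._task.cancel()
        self._transcript = transcript
        self._task = asyncio.create_task(self._navigate(transcript))

    async def on_confirmed_end_of_turn(self, transcript: str):
        task, speculated = self._task, self._transcript
        self._task = self._transcript = None
        if task is not None and speculated == transcript:
            return await task       # hit: reuse the pre-computed result
        if task is not None:
            task.cancel()           # miss: discard the speculation
        return await self._navigate(transcript)

calls = []

async def fake_navigate(transcript: str) -> str:
    calls.append(transcript)
    await asyncio.sleep(0)
    return f"state-for:{transcript}"

async def demo():
    nav = SpeculativeNavigator(fake_navigate)
    nav.on_maybe_end_of_turn("I need to reschedule")
    hit = await nav.on_confirmed_end_of_turn("I need to reschedule")
    nav.on_maybe_end_of_turn("I need to")
    miss = await nav.on_confirmed_end_of_turn("I need to cancel")
    return hit, miss

hit, miss = asyncio.run(demo())
```

On a hit, navigation runs once for that transcript; on a miss, the speculative task is cancelled and navigation runs on the confirmed transcript.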

## Text-to-Speech

### Per-Language Provider Routing

The platform supports routing text-to-speech to different providers based on the caller's detected language. This is configured through a language-provider map that associates language codes with specific TTS providers and voice configurations.

When a caller's language is detected, the platform resolves the TTS provider through a priority matrix:

1. **Exact language match** - e.g., `ar-SA` matches an Arabic (Saudi Arabia) entry
2. **Base language match** - e.g., `ar-SA` falls back to an `ar` entry
3. **Multilingual fallback** - a catch-all entry for any language not explicitly mapped

At each level, service configuration takes priority over agent configuration, which takes priority over workspace configuration. The first match wins.

If no language-specific entry matches, the platform uses the standard TTS provider selection (service > agent > workspace > default). Per-language configuration is isolated - when a language-specific provider is selected, only that entry's voice settings are used, preventing configuration for one provider from affecting another.
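A minimal sketch of that resolution order follows. It assumes each scope holds a dictionary keyed by language code and that the catch-all entry uses the key `"multilingual"`; both the data shape and the key name are illustrative, not the platform's actual schema.

```python
from typing import Optional

SCOPES = ("service", "agent", "workspace")  # highest priority first

def resolve_tts(language: str, maps: dict) -> Optional[dict]:
    """Return the first matching language entry, or None if nothing matches."""
    base = language.split("-")[0]
    # Match level is the outer loop: exact, then base language, then catch-all.
    for key in (language, base, "multilingual"):
        for scope in SCOPES:
            entry = maps.get(scope, {}).get(key)
            if entry is not None:
                return entry
    return None  # caller falls back to standard provider selection

maps = {
    "service": {"ar": {"provider": "B"}},
    "workspace": {"ar-SA": {"provider": "C"}},
}
```

With this data, `ar-SA` resolves to the workspace's exact entry even though the service defines a base `ar` entry, because an exact match at any scope is tried before any base-language match.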

The platform supports multiple TTS providers, selectable at the workspace level. Each provider offers different trade-offs across latency, voice quality, language support, and expressive capabilities. Workspace administrators choose the provider and configure provider-specific parameters (voice, model, speed, quality tuning) through the Developer Console voice settings page or the API. The Developer Console includes a voice library where operators can browse available voices across providers, preview audio samples, and select a voice for the workspace without needing to look up voice identifiers manually.

Provider-specific settings are isolated - changing providers does not affect call recordings, session management, or downstream analytics. If a configured provider is unavailable, the system falls back to the default provider.

The available providers differ in their capabilities:

* **Default provider** - Low-latency WebSocket streaming with emotion-aware prosody adjustments and pronunciation dictionary support.
* **Provider B** - Persistent multi-stream WebSocket with per-turn context isolation, word-level timing alignment, and regional endpoint support for data residency requirements.
* **Provider C** - Ultra-fast REST-based synthesis with sentence-level pipelining (audio starts before the full response is generated), vocal direction tags for expressive delivery, and automatic Arabic language detection.

All providers integrate with the same text generation pipeline: a streaming language model produces text fragments that are forwarded to the selected TTS engine in real time. Each provider has its own circuit breaker for fault isolation - a degradation in one provider does not affect the others.

Provider selection is transparent to callers and does not change the call experience, recordings, or API behavior.

The TTS engine converts the agent's generated text into spoken audio with dynamic per-turn control over emotion, speed, and volume. Each utterance - fillers, responses, empathy pauses - carries its own voice parameters, so a warm filler at reduced speed can precede a normal-pace informational response without shared state.

### Emotion Priority Chain

The agent's vocal tone is not a single static setting. A six-level priority chain selects the most contextually appropriate emotion on every turn. Each level fires only if every level above it produced no signal:

| Priority                  | Source                                                                                                                         | Example                                                           |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| 1. **Vocal burst**        | Caller laughed, sighed, gasped, or cried in the last 5 seconds                                                                 | Caller laughs → agent responds with warm enthusiasm immediately   |
| 2. **Prosody**            | Acoustic emotion model detects a strong signal from the caller's voice                                                         | Anxiety detected → sympathetic tone                               |
| 3. **Proactive topic**    | The current context graph action matches a [sensitive topic](/channels/voice/emotion-detection.md#pre-emptive-tone-adjustment) | Agent about to discuss test results → preemptive sympathetic tone |
| 4. **Tone momentum**      | Previous turn's tone is carried forward when the current signal is weak                                                        | Tone stays sympathetic across a brief neutral pause               |
| 5. **Workspace baseline** | The service's configured default tone                                                                                          | Friendly baseline for scheduling services                         |
| 6. **System default**     | Engineering fallback                                                                                                           | Calm                                                              |

This means the agent's voice adapts in real time to what is happening, not to a pre-configured setting. The workspace baseline sets the floor, but any strong emotional signal overrides it - always in the direction of more empathy, never less.
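The chain behaves like a first-match lookup over ordered sources. In this sketch the `signals` dictionary and its key names are hypothetical; each key holds the emotion that level proposes, or is absent when the level produced no signal.

```python
from typing import Tuple

PRIORITY_CHAIN = (
    "vocal_burst",
    "prosody",
    "proactive_topic",
    "tone_momentum",
    "workspace_baseline",
)

def select_emotion(signals: dict) -> Tuple[str, str]:
    """Return (level, emotion) from the highest-priority level with a signal."""
    for level in PRIORITY_CHAIN:
        emotion = signals.get(level)
        if emotion:
            return level, emotion
    return "system_default", "calm"
```

For example, `select_emotion({"prosody": "sympathetic", "workspace_baseline": "friendly"})` returns the prosody-driven tone, and an empty dictionary falls through to the calm system default.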

<figure><img src="/files/UNUeHyGRNIQmdqkqFj1h" alt="Six-level TTS emotion priority chain from vocal burst to system default"><figcaption></figcaption></figure>

### Split Model Architecture

Navigation and response generation use different language models, each optimized for its job:

* **Navigation** uses a smaller, faster model. The nav output is roughly 5 tokens (a structured code line), so raw intelligence matters less than speed. The smaller model shaves latency off every turn without sacrificing decision quality on the constrained output format.
* **Response generation** uses the full-size model. Response text is what the caller actually hears, so quality matters more than throughput. For typical voice responses (under 20 tokens), the speed difference between the two models adds less than 200ms - a good trade for noticeably better phrasing.

### Situation-Response Adaptation

<figure><img src="/files/SOEp42rdZNRCPgEaaqSr" alt="Four-dimensional adaptation: voice tone, filler behavior, response content, and behavioral signals"><figcaption></figcaption></figure>

The voice pipeline adapts across four independent dimensions simultaneously. Each dimension operates on different output channels, so the agent can change *what* it says, *how* it says it, and *whether* it fills silence - all independently and in real time.

#### Emotion → Voice Tone

The agent responds with empathy rather than copying the caller's emotion. An angry caller hears a calm voice (de-escalation), not an angry one. An anxious caller hears a sympathetic voice (reassurance). A happy caller hears enthusiasm (matching energy).

#### Emotion → Filler Behavior

Filler speech adapts to the caller's emotional state. Anxious callers hear reassuring fillers ("Of course," "I'm here to help"). Frustrated callers with high arousal hear no fillers at all - the system suppresses them because frustrated callers want answers, not acknowledgments. Happy callers hear warm, matching fillers.

#### Emotion → Response Content

Emotional context is injected into every prompt the response model receives. The injection includes the caller's dominant emotion, trend direction, and adaptation guidance. A caller with deteriorating mood gets responses prioritizing resolution speed. A confused caller gets simplified explanations broken into small pieces. The agent adapts what it says based on how the caller is feeling, not just how it sounds.

#### Behavioral Signals → Response Content

Three behavioral signals are tracked in real time and injected into prompts when thresholds are crossed:

| Signal                    | What It Detects                                   | What It Tells the Agent                           |
| ------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| **Interruption count**    | Caller has interrupted the agent multiple times   | Shorten responses - the agent is talking too much |
| **Short response streak** | Caller is giving very brief answers consecutively | The caller is disengaging or withdrawing          |
| **Silence gaps**          | Extended silence from the caller                  | Confusion, hesitation, or distress                |

These signals augment, not replace, the emotion detection system. A caller who is interrupting frequently may not sound frustrated in their voice, but the behavioral pattern tells the engine to shorten its responses.

### Response Micro-Behaviors

The response generation model follows a set of communication guidelines that produce natural conversational behavior regardless of emotional state:

* **Speech rhythm mirroring** - Short bursts from the caller produce concise responses; conversational callers get warmer, flowing replies
* **Emotional name usage** - The caller's name is used at moments of emotional significance, not mechanically
* **Pause injection** - When delivering difficult information, the agent pauses naturally before the key detail
* **Pace inversion** - When the caller is rushing, the agent slows down with longer sentences and gentle transitions
* **Completion inference** - When a caller trails off mid-sentence, the agent acknowledges what they were trying to say

The agent never mentions that it can detect the caller's emotions. Emotional adaptation is experienced as natural attentiveness, not surveillance.

### Voice Timeline

The voice pipeline applies the same [cut/navigate/engage](/agent/reasoning-engine.md#cut--navigate--engage) pattern that drives conversation-level reasoning - but within each turn, managing what the caller hears and when.

Fillers, responses, empathy pauses, and tool progress narration are treated as states in one timeline. "Let me check on that" followed by "Her appointment is Thursday" is one conversation trajectory in two parts, not two unrelated audio events.

#### Three Operations

1. **Cut** - A signal arrives (the caller stopped speaking, a tool started, empathy shifted). The actor asks: *did something change?* Between two cuts, dozens of raw events may arrive - emotion scores, behavioral signals, audio segments. The cut compresses them into a handful of causally relevant fields and discards the rest. This compression is what makes the next step tractable.
2. **Navigate** - Given the compressed state and the trajectory of previous states, select the next voice state. Navigation is a pure decision with no side effects - it can be called speculatively without producing audio.
3. **Engage** - Prepare an utterance with its own emotion, pace, and timing policy. If the main response is still pending, the voice timeline can move through filler, progress, or hold states without losing the caller's place.

#### Signal-to-State Mapping

Each signal produces a specific voice state:

| Signal                       | Voice State | What Happens                                                            |
| ---------------------------- | ----------- | ----------------------------------------------------------------------- |
| **Caller finished speaking** | Breath      | Brief pause before the agent responds                                   |
| **Navigation complete**      | Transition  | Filler window opens if the response is not ready                        |
| **Tool started**             | Progress    | Tool wait narration on a repeating interval ("Let me check on that...") |
| **Tool finished**            | Response    | Agent delivers the tool result                                          |
| **All audio finished**       | Listen      | Silence deadline starts - check-ins escalate if the caller stays quiet  |
| **Empathy tier shifted**     | Hold        | Intentional silence - the agent pauses to give the caller space         |
| **Caller started speaking**  | Listen      | Pending fillers drain - the caller has the floor                        |
| **Deadline expired**         | Next state  | Self-signal - the actor re-enters cut/navigate/engage                   |

Timing policies are what make this self-driving. A transition state can wait briefly for the real response, play a filler when the response is not ready, or hold silence when empathy calls for it. If the response arrives first, the pending filler is skipped.
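The table above can be read as a lookup from signal to voice state. The identifiers below are hypothetical names for the signals and states in the table. Returning `None` for the deadline self-signal is an assumption about how "re-enter cut/navigate/engage" might be represented; this simple lookup returns `None` for unrecognized signal names as well.

```python
from enum import Enum
from typing import Optional

class VoiceState(Enum):
    BREATH = "breath"
    TRANSITION = "transition"
    PROGRESS = "progress"
    RESPONSE = "response"
    LISTEN = "listen"
    HOLD = "hold"

SIGNAL_TO_STATE = {
    "caller_finished_speaking": VoiceState.BREATH,
    "navigation_complete": VoiceState.TRANSITION,
    "tool_started": VoiceState.PROGRESS,
    "tool_finished": VoiceState.RESPONSE,
    "all_audio_finished": VoiceState.LISTEN,
    "empathy_tier_shifted": VoiceState.HOLD,
    "caller_started_speaking": VoiceState.LISTEN,
}

def state_for(signal: str) -> Optional[VoiceState]:
    # "deadline_expired" has no fixed target state, so None tells the
    # caller to run cut/navigate/engage again to choose the next state.
    return SIGNAL_TO_STATE.get(signal)
```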

#### Per-Utterance Voice Parameters

Each utterance carries its own emotion and speed, set when the utterance is prepared. This keeps filler speech, progress narration, and final responses from overwriting one another when they are close together in time.

#### Voice Timing Configuration

The voice timeline exposes two categories of configuration per service:

**When** - timing policies for pauses, filler windows, progress narration, empathy holds, and cooldowns.

**What** - vocabulary and style:

| Parameter               | What It Controls                                              |
| ----------------------- | ------------------------------------------------------------- |
| **Filler style**        | Phrase, backchannel, or silent (see below)                    |
| **Filler vocabulary**   | Custom backchannel words ("Mm," "Yeah," "Mhm")                |
| **Progress vocabulary** | Custom tool-wait phrases ("One moment...," "Let me check...") |

Everything else - signal routing, timing management, state transitions, trajectory tracking - is derived from the signal-to-state mapping and these settings.

#### Filler Styles

Three filler styles are available, configurable per service:

| Style           | Behavior                                                            | Best For                                                        |
| --------------- | ------------------------------------------------------------------- | --------------------------------------------------------------- |
| **Phrase**      | Contextual phrases like "Let me check that for you" or "One moment" | General-purpose services where the agent should sound active    |
| **Backchannel** | Short acknowledgments like "Mm," "Yeah," "Mhm"                      | Services where brief, natural-sounding turn-taking is preferred |
| **Silent**      | No filler at all - the agent pauses until the response is ready     | Services where silence between turns is acceptable or preferred |

The filler style is enforced end-to-end: when a service is configured as silent, the pipeline suppresses filler generation, filler guidelines in the navigation prompt, and filler text in the response - not just the final audio output. Receipt and working fillers ("Got it," "Let me check") inherit the nav-selected emotion, so they sound consistent with the rest of the turn. Backchannel vocabulary is customizable per service.

When navigation is skipped - typically in single-action context graphs where the agent always stays in the same state - the orchestrator starts a short timer (configurable per service). If the response has not produced audio by the time the timer fires, a backchannel sound plays to hold the conversational rhythm. If the response arrives first, the timer is cancelled. Services using the "silent" filler style suppress this timer entirely.
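The timer race can be sketched as a wait with a timeout on the pending response. The function names, the `"Mm"` sound, and the timing values are illustrative assumptions; `play` stands in for whatever hands text to the TTS engine.

```python
import asyncio
from typing import Awaitable, Callable

async def speak_with_backchannel(
    response: Awaitable[str],
    timer_s: float,
    play: Callable[[str], None],
    filler_style: str = "backchannel",
) -> None:
    task = asyncio.ensure_future(response)
    if filler_style != "silent":
        # Wait for the response, but only until the timer would fire.
        done, _pending = await asyncio.wait({task}, timeout=timer_s)
        if not done:
            play("Mm")  # timer fired first: hold the rhythm
    play(await task)

async def slow_response() -> str:
    await asyncio.sleep(0.2)
    return "Her appointment is Thursday."

async def fast_response() -> str:
    return "Her appointment is Thursday."

async def demo():
    slow, fast, silent = [], [], []
    await speak_with_backchannel(slow_response(), 0.01, slow.append)
    await speak_with_backchannel(fast_response(), 0.5, fast.append)
    await speak_with_backchannel(slow_response(), 0.01, silent.append, "silent")
    return slow, fast, silent

slow, fast, silent = asyncio.run(demo())
```

Only the slow, non-silent case plays the backchannel before the response.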

#### Empathy-Gated Filler Behavior

Filler behavior is controlled by the caller's [empathy tier](/channels/voice/emotion-detection.md#empathy-tier-classification). At higher tiers, silence replaces fillers because silence *is* the empathy:

* **T0-T1** - Normal filler emission. At T1, the filler type is set to "empathy" (warmer, acknowledging) rather than "receipt" or "working."
* **T2 Full Empathy** - The system inserts a brief pause before speaking any filler. This anti-filler silence gives the caller space to continue.
* **T3 Hold Space** - Fillers are suppressed entirely. The agent pauses, then delivers a pure empathy response.

When the caller's empathy tier changes mid-turn, the orchestrator reacts immediately - shifting from normal filler behavior to empathy-appropriate silence or vice versa without waiting for the next turn boundary.

When empathy fillers do play, the TTS engine switches to a quieter presence mode with neutral emotion. This prevents the uncanny effect of strong emotion applied to very short phrases, where the TTS engine can stretch vowels in ways that sound artificial.

#### Principle-Based Filler Generation

Filler phrases are not drawn from a hardcoded list. The system generates them using an LLM with the current emotional context and conversation intent as inputs. This means fillers adapt to the situation: a caller who sounds anxious gets "I'm looking into that for you right now" rather than a generic "One moment." The generation is guided by emotional guidelines and the current context graph action, so fillers stay contextually appropriate.

Filler emissions are rate-limited to prevent cascading acknowledgements when rapid turns or false barge-ins cause multiple navigation cycles in quick succession. The orchestrator also caps consecutive filler emissions per turn, so extended processing time produces at most two filler phrases before the agent falls silent.
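A sketch of the two limits combined: a minimum interval between fillers and a per-turn cap of two. The class name and the interval value are assumptions; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional

class FillerLimiter:
    def __init__(self, min_interval_s: float, max_per_turn: int = 2) -> None:
        self.min_interval_s = min_interval_s
        self.max_per_turn = max_per_turn
        self._last_emit: Optional[float] = None
        self._turn_count = 0

    def start_turn(self) -> None:
        # The per-turn cap resets; the rate limit carries across turns.
        self._turn_count = 0

    def allow(self, now: float) -> bool:
        if self._turn_count >= self.max_per_turn:
            return False  # cap reached: the agent falls silent
        if self._last_emit is not None and now - self._last_emit < self.min_interval_s:
            return False  # too soon after the previous filler
        self._last_emit = now
        self._turn_count += 1
        return True

limiter = FillerLimiter(min_interval_s=1.5)
limiter.start_turn()
decisions = [limiter.allow(t) for t in (0.0, 0.5, 2.0, 4.0)]
```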

#### Tool-Wait Progress Hints

Fillers emitted while a tool is running can be shaped per state and per tool, not just per service. A progress hint describes *the shape of the wait* rather than supplying a phrase list: what kind of work the tool is doing (record lookup, write, external call, computation, multi-step workflow), roughly how long it is expected to take, and how the agent should cover the wait (`auto`, `verbal`, `backchannel`, or `silent`).

The orchestrator turns the hint into utterances at runtime using tool semantics, the caller's current emotion, and conversation context. Retries produce attempt-aware language: an initial acknowledgement, then a brief apology, then a "still working on it" update, each scaled to how long the tool has actually been running. A tool-level hint field-merges with the state's channel-level hint, so a state can declare the default wait shape for its channel and individual tools only override the fields that actually differ.
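Field-merging a tool-level hint over a state-level hint could work as in the sketch below. The field names (`work_kind`, `expected_seconds`, `cover`) are hypothetical; only the cover values come from the text above.

```python
def merge_hints(state_hint: dict, tool_hint: dict) -> dict:
    """Start from the state's default wait shape; the tool overrides set fields."""
    merged = dict(state_hint)
    merged.update({k: v for k, v in tool_hint.items() if v is not None})
    return merged

state_hint = {"work_kind": "record_lookup", "expected_seconds": 2, "cover": "auto"}
tool_hint = {"expected_seconds": 6, "cover": None}  # overrides only the duration

merged = merge_hints(state_hint, tool_hint)
```

Fields the tool leaves unset keep the state's default, so each tool declares only what differs.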

For tools with expected latency of four seconds or more, a custom phrase can override the generated progress text. This gives agent engineers precise control over what callers hear during long waits - for example, a tool that queries multiple external systems. The custom phrase is bounded to 30 words and requires a progress class as fallback.

When the orchestrator is handling tool progress, it takes over from the default filler pipeline entirely - there is no duplicate filler injection from multiple sources during tool execution.

#### Result Persistence Modes

Tool call specs support a `result_persistence` setting that controls how tool results accumulate in the agent's prompt context:

| Mode                     | Behavior                                                                                          | Best For                                                                                                        |
| ------------------------ | ------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| **accumulate** (default) | Every tool result is retained in the prompt history                                               | Tools called once or a few times per conversation                                                               |
| **override**             | Only the latest result per tool name is retained; previous results for the same tool are replaced | Polling tools called repeatedly (availability checks, status lookups) where only the most recent result matters |

Override mode prevents context bloat from tools that the agent calls multiple times during a conversation. A scheduling agent that checks availability five times during a complex multi-provider booking only carries the most recent availability snapshot in its context, not all five results stacked up. This keeps the prompt focused and reduces token usage without losing the information the agent actually needs.
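The difference between the two modes can be shown with a small sketch. The history shape and the function name are assumptions; only the `accumulate` and `override` mode names come from the table above.

```python
from typing import List

def record_result(history: List[dict], tool: str, result: str,
                  mode: str = "accumulate") -> List[dict]:
    """Return a new history with the tool result added under the given mode."""
    if mode == "override":
        # Drop earlier results for the same tool; keep other tools' results.
        history = [entry for entry in history if entry["tool"] != tool]
    return history + [{"tool": tool, "result": result}]

history: List[dict] = []
history = record_result(history, "lookup_patient", "found")
for snapshot in ("3 slots", "2 slots", "1 slot"):
    history = record_result(history, "check_availability", snapshot, mode="override")
```

After three availability checks, the history holds the patient lookup and only the latest availability snapshot.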

### Context Window Management

The engine tracks cumulative token usage across a conversation and applies automatic policies when utilization crosses configurable thresholds:

| Threshold                    | Action                                                                                                                                                                 |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Warning** (default 60%)    | The engine compresses conversation history by capping the number of retained turns, reducing context size while preserving the most recent and most relevant exchanges |
| **Exhaustion** (default 80%) | The engine flags the conversation for operator escalation, ensuring a human can take over before the context window is fully consumed                                  |

Token tracking is cumulative - it counts all input and output tokens across the conversation's lifetime, not just the current turn. This catches gradually growing conversations (long scheduling sessions, multi-topic calls) before they hit hard context limits. At the warning threshold, history compression happens silently and the caller does not notice it. At the exhaustion threshold, the engine surfaces a handoff signal through the standard [operator escalation](/channels/operators.md) path.
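The threshold check reduces to a comparison on utilization. In this sketch the function name, the returned labels, and the inclusive comparison at the boundary are assumptions; the 60% and 80% defaults come from the table above.

```python
def context_action(used_tokens: int, window_tokens: int,
                   warning: float = 0.60, exhaustion: float = 0.80) -> str:
    """Map cumulative token utilization to the policy that applies."""
    utilization = used_tokens / window_tokens
    if utilization >= exhaustion:
        return "escalate"   # flag for operator handoff
    if utilization >= warning:
        return "compress"   # cap retained turns
    return "none"
```

For a hypothetical 100,000-token window, 65,000 cumulative tokens would trigger compression and 82,000 would trigger escalation.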

### Barge-In Detection

If the caller starts speaking while the agent is talking, the system needs to decide whether to stop the agent's audio. Barge-in uses semantic confirmation - it requires actual recognized words rather than just acoustic energy. This filters out coughs, breathing, background conversation, and echo from the agent's own audio that would otherwise cause false interruptions.

The decision is based on four conditions evaluated together:

1. Whether the caller's speech contains actual recognized words from the speech-to-text engine (not breathing, echo, or background noise). Voice activity detection alone is not sufficient - the system requires at least one recognized word before triggering a barge-in.
2. Whether the speech has lasted long enough with recognized words (minimum duration is configurable per service). The default threshold is low enough that short responses like "yes," "no," and "okay" (200-400ms of speech) can trigger barge-in.
3. Whether the cooldown period has elapsed since the last barge-in. The cooldown is configurable per service and prevents rapid false triggers.
4. Whether the agent is currently speaking.

When all conditions are met, the agent's audio stops and the system returns to listening mode. This prevents the agent from talking over a caller who is trying to ask a question or correct a misunderstanding.
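
The four conditions can be read as a single conjunction. The sketch below is illustrative only: `BargeInConfig`, its field names, and its default values are assumptions, since the actual defaults are not stated here.

```python
from dataclasses import dataclass


@dataclass
class BargeInConfig:
    # Hypothetical field names and values for illustration.
    min_speech_ms: int = 200
    cooldown_ms: int = 1000


def should_barge_in(
    recognized_words: int,
    speech_ms: float,
    ms_since_last_barge_in: float,
    agent_speaking: bool,
    config: BargeInConfig,
) -> bool:
    """Return True only when all four conditions hold together."""
    has_words = recognized_words >= 1                           # 1. semantic confirmation
    long_enough = speech_ms >= config.min_speech_ms             # 2. minimum duration
    cooled_down = ms_since_last_barge_in >= config.cooldown_ms  # 3. cooldown elapsed
    return has_words and long_enough and cooled_down and agent_speaking  # 4. agent talking


cfg = BargeInConfig()
# A cough: voice activity but no recognized words, so no interruption.
cough = should_barge_in(0, 400, float("inf"), True, cfg)
# A short "yes" while the agent is talking.
short_yes = should_barge_in(1, 250, float("inf"), True, cfg)
# The same "yes" arriving inside the cooldown window.
too_soon = should_barge_in(1, 250, 300, True, cfg)
```

Here `float("inf")` stands in for "no previous barge-in on this call".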

There is also a fast path for end-of-turn interrupts. When the speech engine produces a complete transcript with an end-of-turn signal while the agent is speaking, the system interrupts the agent's audio immediately from the recognition listener rather than waiting for the transcript to pass through the processing queue. This eliminates queue latency on short-phrase interrupts where the caller finishes speaking quickly.

### Response Length Enforcement

The voice pipeline enforces maximum response length at the streaming level - the TTS engine stops generating audio when the configured sentence or word cap is reached. This is a mechanical limit, not a prompt instruction, so it cannot be exceeded regardless of what the LLM generates. Response caps are configurable per service, allowing scheduling services to keep responses brief while clinical services allow longer explanations.
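
A mechanical cap of this kind can be sketched as a filter on the text stream before it reaches speech synthesis. The function name and parameters below are hypothetical. The sketch also assumes that sentences end at `.`, `!`, or `?`, and that the word cap may cut a sentence short; the real pipeline may handle boundaries differently.

```python
from typing import Iterable, Iterator


def cap_response(words: Iterable[str], max_sentences: int, max_words: int) -> Iterator[str]:
    """Yield words from a response stream until either cap is reached."""
    sentences = 0
    for count, word in enumerate(words, start=1):
        if count > max_words:
            return  # word cap reached: stop regardless of what remains
        yield word
        if word.endswith((".", "!", "?")):
            sentences += 1
            if sentences >= max_sentences:
                return  # sentence cap reached


llm_output = "Your visit is booked. It is on Monday at nine. Please bring your card.".split()
spoken = " ".join(cap_response(llm_output, max_sentences=2, max_words=50))
clipped = " ".join(cap_response(llm_output, max_sentences=2, max_words=3))
```

Because the cap sits between generation and synthesis, anything past it is never spoken, whatever the model produces.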

### Call Completion

When the agent reaches a terminal state in the context graph and decides to end the call, it signals its intent to hang up but does not disconnect immediately. The system waits for signal convergence: the agent's closing utterance must finish playing and any in-flight tool results must resolve before the call disconnects. If the caller speaks during this window (barge-in) or a transfer is initiated, the hangup intent is retracted and the conversation continues.

This ensures the caller always hears the agent's full closing message, even when the terminal state involves a tool call with a follow-up utterance.
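
Signal convergence can be pictured as a small piece of state that is checked before disconnecting. The class and field names below are hypothetical and only illustrate the rule described above.

```python
from dataclasses import dataclass


@dataclass
class HangupState:
    # Hypothetical state flags for illustration.
    hangup_intended: bool = False
    closing_audio_done: bool = False
    pending_tool_calls: int = 0

    def on_caller_speech_or_transfer(self) -> None:
        """A barge-in or a transfer retracts the hangup intent."""
        self.hangup_intended = False

    def ready_to_disconnect(self) -> bool:
        """Disconnect only once every signal has converged."""
        return (
            self.hangup_intended
            and self.closing_audio_done
            and self.pending_tool_calls == 0
        )


state = HangupState(hangup_intended=True, pending_tool_calls=1)
before = state.ready_to_disconnect()     # closing audio and a tool call still pending
state.closing_audio_done = True
state.pending_tool_calls = 0
after = state.ready_to_disconnect()      # all signals converged
state.on_caller_speech_or_transfer()
retracted = state.ready_to_disconnect()  # caller spoke, so the call continues
```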

### Per-Service Voice Configuration

<figure><img src="/files/g9H4JnZZMZlBiAmbBiUc" alt="Voice control plane: per-service config overrides workspace settings overrides defaults, with automatic emotional adaptation"><figcaption></figcaption></figure>

Voice behavior is configurable at the service level, allowing different services within the same workspace to have different voice characteristics. Configuration follows a three-level hierarchy described in the [Voice Control Plane](/agent/reasoning-engine.md#voice-control-plane) section. Per-service settings cover:

* **Filler behavior** - Style (phrase, backchannel, or silent), custom vocabulary, backchannel timing
* **Barge-in sensitivity** - Minimum speech duration and cooldown period
* **Response limits** - Maximum sentences and words per response
* **End-of-turn detection** - Eagerness threshold and timeout
* **TTS settings** - Model selection and buffer delay
* **Voice timing** - The "when" and "what" knobs described in [Voice Timing Configuration](#voice-timing-configuration) above
* **Call forwarding** - Whether the agent can transfer calls to external numbers (opt-in, disabled by default). Supports both pre-configured forwarding numbers (from EHR location data or workspace settings) and dynamic forwarding to any E.164 number provided at call time

These settings are managed through the Platform API and the Agent Forge CLI.
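
The override order (service over workspace over defaults) can be illustrated with a layered lookup. The setting keys and values below are invented for the example and do not reflect the Platform API schema.

```python
from collections import ChainMap

# Hypothetical keys and values, for illustration only.
DEFAULTS = {
    "filler_style": "phrase",
    "barge_in_min_speech_ms": 200,
    "max_sentences": 3,
    "call_forwarding": False,  # opt-in, disabled by default
}
workspace_settings = {"max_sentences": 4}
scheduling_service = {"max_sentences": 2, "filler_style": "backchannel"}


def resolve(service: dict, workspace: dict, defaults: dict) -> dict:
    """Return effective settings; earlier mappings win over later ones."""
    return dict(ChainMap(service, workspace, defaults))


effective = resolve(scheduling_service, workspace_settings, DEFAULTS)
```

A service that sets nothing inherits the workspace value, and a workspace that sets nothing falls through to the default.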

## Real-Time Audio Correction

The voice agent runs a parallel audio verification layer alongside the live STT stream. When the navigator detects that a turn contains structured data - names, dates of birth, phone numbers, insurance IDs, or medication names - it flags the turn for audio verification. A separate correction model cross-checks the STT transcript against the raw audio buffer for that segment, catching misrecognized digits, transposed characters, and phonetically similar names before they enter the agent's reasoning pipeline.

### Domain-Aware Correction Hints

Each workspace can configure voice settings with domain-specific correction hints that tell the verification model what kinds of structured data to listen for. A dental practice might hint on insurance group numbers and procedure codes; a primary care clinic might hint on medication names and dosage quantities. These hints narrow the correction model's focus so it spends its budget on the data types that matter most for that service, rather than re-verifying every word in the transcript.

### Confidence-Gated Correction

When the correction model produces a candidate correction, it assigns a confidence score (1-9) that determines what happens next:

| Confidence | Label     | Behavior                                                                                                                  |
| ---------- | --------- | ------------------------------------------------------------------------------------------------------------------------- |
| **8-9**    | Certain   | The corrected value replaces the original transcript silently. The caller is not asked to confirm.                        |
| **5-7**    | Likely    | The agent confirms the corrected value with the caller ("I heard your date of birth as March 12, 1985 - is that right?"). |
| **1-4**    | Uncertain | The agent asks the caller to repeat the information ("Could you say that date of birth again for me?").                   |

This prevents the agent from acting on low-confidence corrections while avoiding unnecessary confirmation prompts when the correction model is highly confident. The confidence thresholds are tuned for healthcare data where accuracy matters more than conversational speed.
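
The confidence bands in the table above map directly to a three-way branch. The enum and function names in this sketch are hypothetical; only the band boundaries come from the table.

```python
from enum import Enum


class CorrectionAction(Enum):
    APPLY_SILENTLY = "apply_silently"            # 8-9: Certain
    CONFIRM_WITH_CALLER = "confirm_with_caller"  # 5-7: Likely
    ASK_TO_REPEAT = "ask_to_repeat"              # 1-4: Uncertain


def gate_correction(confidence: int) -> CorrectionAction:
    """Map a 1-9 confidence score to the behavior for that band."""
    if not 1 <= confidence <= 9:
        raise ValueError(f"confidence must be between 1 and 9, got {confidence}")
    if confidence >= 8:
        return CorrectionAction.APPLY_SILENTLY
    if confidence >= 5:
        return CorrectionAction.CONFIRM_WITH_CALLER
    return CorrectionAction.ASK_TO_REPEAT


bands = {score: gate_correction(score) for score in range(1, 10)}
```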

<figure><img src="/files/h2eq0Ds4DTf1B7uiKRgX" alt="Real-time audio intelligence: parallel STT and audio buffer feed verification with confidence-gated correction"><figcaption></figcaption></figure>

This is distinct from post-call re-transcription. Real-time correction happens during the conversation, so the agent reasons from corrected text - not from raw STT output that might contain errors.

## Post-Call Processing

{% hint style="info" %}
The real-time STT stream prioritizes speed over accuracy. Post-call re-transcription catches words the live stream may have missed.
{% endhint %}

After a call ends, the system runs a higher-accuracy batch transcription of the full recording. This re-transcription catches words that the real-time stream may have missed or misrecognized and adds speaker diarization - identifying which speaker said each word. The result is a verified, speaker-attributed transcript where every segment is tagged with the speaker who produced it, distinguishing between the caller and the agent.

The system also feeds recognition accuracy data back into the STT configuration, identifying which keyterms were recognized correctly and which were missed. As a result, transcription accuracy for your specific vocabulary improves over time.

### Post-Call Text Intelligence

After re-transcription, the verified transcript is analyzed for sentiment, topics, intents, and a structured summary. These results are stored alongside the call intelligence data and are available through the same API endpoints. This analysis runs on the text transcript (not the audio), so it captures conversational content that audio-only analysis misses - for example, whether the caller's stated intent matched the outcome.

## Post-Call Quality Scoring

After a call ends, the system runs a structured quality analysis on the stereo call recording (caller audio on one channel, agent audio on the other). The analysis scores the call across five dimensions:

| Dimension                | What It Measures                                                                                             |
| ------------------------ | ------------------------------------------------------------------------------------------------------------ |
| **Task completion**      | Did the agent accomplish what the caller needed? Fully, partially, or not at all.                            |
| **Information accuracy** | Were speech recognition results correct? Did the agent act on accurate transcriptions?                       |
| **Conversation flow**    | Was the conversation natural? Were there awkward pauses, unnecessary repetitions, or disjointed transitions? |
| **Error recovery**       | When confusion occurred, did the agent recover gracefully or compound the problem?                           |
| **Caller experience**    | Based on tone and interaction patterns, did the caller seem satisfied with the exchange?                     |

Each dimension is scored on a 1-5 scale. The system also produces a summary, an outcome classification (succeeded, partially succeeded, failed, or abandoned), and specific STT correction suggestions that feed back into keyterm boosting.

## Call Intelligence

In addition to post-call quality scoring, the system computes a structured intelligence summary from in-memory session state at the moment the call ends. This captures operational telemetry that the quality scoring (which runs asynchronously on recordings) cannot see.

| Summary          | What It Captures                                                                                                                                                                                                 |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Emotion**      | Dominant emotion, valence/arousal/dominance averages, peak negative emotion, emotional shifts, final trend, language sentiment score, toxicity categories, recent vocal bursts, speaker-normalized energy deltas |
| **Risk**         | Composite risk score, risk level, contributing signals with individual weights                                                                                                                                   |
| **Latency**      | Engine response time (avg, p50, p95), audio time-to-first-byte, silence ratio                                                                                                                                    |
| **Conversation** | Turn count, states visited, loop count, barge-in count, completion reason, final state                                                                                                                           |
| **Tool**         | Total tool calls, success/failure counts, failure rate, per-tool breakdown                                                                                                                                       |
| **Safety**       | Safety rule matches during the call, escalation triggers                                                                                                                                                         |
| **Operator**     | Whether escalation occurred, operator connect time, resolution                                                                                                                                                   |

### Composite Quality Score

A rule-based quality score (0-100) is computed from the intelligence summaries. The score starts at 100 and applies penalties for negative signals:

| Signal                | Threshold           | Penalty    |
| --------------------- | ------------------- | ---------- |
| High response latency | p95 audio TTFB > 1s | -5 to -15  |
| Excessive silence     | Silence ratio > 20% | -10 to -20 |
| Caller barge-ins      | > 2 interruptions   | -5 to -15  |
| Agent loops           | Revisited states    | -10 to -20 |
| Operator escalation   | Any                 | -10        |
| Tool failures         | Failure rate > 5%   | -5 to -15  |

The quality score is designed for dashboard filtering and trend analysis - it identifies calls that need attention without requiring someone to listen to every recording.
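
The penalty table can be sketched as a simple subtractive rule set. Two assumptions are baked into this example and are not confirmed by the text above: each triggered signal costs the smallest penalty in its documented range (how penalties scale within a range is not specified), and any revisited state counts as a loop. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    # Hypothetical field names for the inputs listed in the table.
    p95_audio_ttfb_s: float
    silence_ratio: float
    barge_in_count: int
    loop_count: int
    escalated: bool
    tool_failure_rate: float


def quality_score(s: CallSignals) -> int:
    """Start at 100 and subtract one penalty per triggered signal.

    Assumption: the minimum of each documented penalty range is applied.
    """
    score = 100
    if s.p95_audio_ttfb_s > 1.0:    # high response latency
        score -= 5
    if s.silence_ratio > 0.20:      # excessive silence
        score -= 10
    if s.barge_in_count > 2:        # caller barge-ins
        score -= 5
    if s.loop_count > 0:            # agent loops (assumed: any revisit)
        score -= 10
    if s.escalated:                 # operator escalation
        score -= 10
    if s.tool_failure_rate > 0.05:  # tool failures
        score -= 5
    return score


clean_call = quality_score(CallSignals(0.6, 0.10, 0, 0, False, 0.0))
rough_call = quality_score(CallSignals(1.4, 0.10, 3, 0, True, 0.0))
```

In this sketch the rough call loses points for latency, barge-ins, and escalation only; the silence, loop, and tool signals stay under their thresholds.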

### Call Intelligence API

Two API endpoints expose call intelligence data:

| Endpoint                        | What It Returns                                                                                                                                                                                                                                                                              |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Completed call intelligence** | Full intelligence profile for a finished call - joins the persisted summaries with per-turn data reconstructed from conversation transcripts. Includes emotion trajectory, risk timeline, latency waterfall, tool performance breakdown, and conversation quality events (loops, barge-ins). |
| **Active call intelligence**    | All currently active calls enriched with a live intelligence overlay - current emotion, risk score and trend, turn count, escalation status, and current conversation state. Updated after every caller speech turn.                                                                         |

The live intelligence overlay is written after each caller turn and refreshed alongside the active call heartbeat. If a call ends or the session is lost, the live data expires automatically.

{% hint style="info" %}
Per-turn reconstruction is computed on read from the stored conversation turns - it is not stored separately. This means the intelligence profile reflects the full turn data without duplicating storage.
{% endhint %}

<figure><img src="/files/1N5vvDEkqBValUdC3byK" alt="Post-call processing: call intelligence, diarized re-transcription, quality analysis, STT improvement"><figcaption></figcaption></figure>

