Audio Pipeline
How speech recognition, emotion detection, filler speech, barge-in detection, and TTS work together in real time.
The audio pipeline converts a caller's voice into text, processes it through the agent's reasoning, and converts the response back to speech. Two independent streams handle the input side: one for speech recognition, one for emotion analysis. They run in parallel and never block each other.
Signal Capture
Audio arrives from the telephony layer as a standard audio stream. The system splits it into two parallel paths the moment it arrives:
Speech-to-text - Converts audio to transcript text in real time
Emotion detection - Analyzes vocal qualities for emotional signals (covered in Emotion Detection)
If either path fails, the other continues unaffected. A failure in emotion detection does not delay transcription. A failure in transcription does not block emotion analysis.
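The fan-out and failure isolation described above can be sketched with asyncio. Everything here is illustrative, not Amigo's real API: the function names, the chunk format, and the choice to represent a failed path as None.

```python
import asyncio

async def fan_out(audio_chunk, transcribe, analyze_emotion):
    """Run STT and emotion analysis on the same chunk in parallel.

    A failure in one path is caught and returned as None so the
    other path's result is still delivered. (Sketch; names are
    illustrative.)
    """
    async def guarded(task):
        try:
            return await task
        except Exception:
            return None  # isolate the failure; the sibling path continues

    return await asyncio.gather(
        guarded(transcribe(audio_chunk)),
        guarded(analyze_emotion(audio_chunk)),
    )

async def demo():
    async def stt(chunk):
        return f"transcript:{chunk}"

    async def emotion(chunk):
        raise RuntimeError("emotion service down")  # simulated path failure

    return await fan_out("hello", stt, emotion)
```

Because each path is wrapped individually, the emotion failure in `demo` yields `["transcript:hello", None]` rather than cancelling the transcription.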
Parallel Initialization (Prewarm)
Before the caller picks up, the voice agent runs a two-phase initialization. Phase 1 (during the ring time) fetches workspace configuration, skill definitions, and integration credentials. Phase 2 (on answer) sets up TTS, emotion detection, call recording, and escalation channels. By splitting initialization into prewarm and session setup, the agent is fully loaded before the first word is spoken. This is what enables the instant greeting described in How It Works.
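The two-phase shape can be sketched as follows; the class, its fields, and all the fetch/start callbacks are illustrative stand-ins for the real initialization work.

```python
from dataclasses import dataclass, field

@dataclass
class CallSession:
    """Illustrative two-phase init: prewarm during ring, finish on answer."""
    config: dict = field(default_factory=dict)
    ready: bool = False

    def prewarm(self, fetch_config, fetch_skills, fetch_credentials):
        # Phase 1 (during the ring): pull everything that does not
        # need live audio - configuration, skills, credentials.
        self.config = {
            "workspace": fetch_config(),
            "skills": fetch_skills(),
            "credentials": fetch_credentials(),
        }

    def on_answer(self, start_tts, start_emotion, start_recording):
        # Phase 2 (on answer): bring up the per-call audio services.
        start_tts()
        start_emotion()
        start_recording()
        self.ready = True  # agent can greet immediately
```

The point of the split is that the slow, network-bound fetches in phase 1 overlap with ring time the caller experiences anyway, so phase 2 has only fast local setup left to do.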
Speech-to-Text
Amigo uses a streaming speech recognition engine for real-time transcription. Audio is transcribed with sub-300ms latency, meaning the system has text available almost as fast as the caller speaks.
Keyterm Boosting
Medical terminology, provider names, medication names, and organization-specific vocabulary are difficult for general-purpose speech recognition. Amigo addresses this with three layers of keyterm boosting that improve recognition accuracy for domain-specific words:
Service-level - Configured by workspace administrators; applied to all calls for a given service
Workspace-level - Configured through the API; per-workspace vocabulary (clinic names, local terminology)
System defaults - Maintained by Amigo engineering; baseline medical and scheduling vocabulary
All three layers are merged and deduplicated at the start of each call. The STT engine receives a single combined vocabulary list.
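A minimal sketch of that merge step, assuming case-insensitive, order-preserving deduplication with the more specific layers taking precedence (the actual merge rules are not specified here; the example terms are invented):

```python
def merge_keyterms(service, workspace, defaults):
    """Merge the three boosting layers into one deduplicated list.

    Dedup is case-insensitive and order-preserving, with the more
    specific layers first, so a service-level spelling wins over a
    duplicate in a broader layer. (Sketch; rules are assumed.)
    """
    merged, seen = [], set()
    for term in [*service, *workspace, *defaults]:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged
```

For example, `merge_keyterms(["Lipitor"], ["lipitor", "Dr. Alvarez"], ["refill"])` keeps the service-level `"Lipitor"` and drops the workspace duplicate.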
End-of-Turn Detection
The system must determine when the caller has finished speaking so the agent can respond. This uses configurable confidence thresholds that balance two competing concerns:
Responding too early cuts the caller off mid-sentence
Responding too late creates awkward silence
The thresholds are tunable per workspace to match the pace and style of your patient population.
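One way to picture the trade-off: combine trailing silence with an end-of-turn confidence signal from the STT engine, and only respond when both clear their thresholds. The signals and default values below are invented for illustration; the real thresholds are the workspace-tunable settings described above.

```python
def end_of_turn(silence_ms, eot_confidence, *, min_silence_ms=600, min_confidence=0.7):
    """Decide whether the caller has finished speaking (sketch).

    Both signals must agree: enough trailing silence AND a confident
    end-of-turn prediction. Raising the thresholds waits longer
    (more silence); lowering them risks cutting the caller off.
    """
    return silence_ms >= min_silence_ms and eot_confidence >= min_confidence
```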
Text-to-Speech
Amigo uses a purpose-built text-to-speech engine for voice synthesis. The TTS engine converts the agent's generated text into spoken audio with control over:
Emotion - Tone adjusts based on the caller's detected emotional state
Speed - Pace adapts to match the conversation's urgency and the caller's communication style
Emphasis - Key information (dates, times, names) receives natural emphasis
Producing the output is a two-stage process. First, the LLM generates the response text. Then, TTS converts that text to speech with the appropriate vocal qualities. These are separate steps because the voice characteristics depend on the emotional context at the moment of speaking, not at the moment of text generation.
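The ordering matters, and a tiny sketch makes it concrete (all three callbacks are illustrative): the emotional context is sampled after the text exists, immediately before synthesis.

```python
def speak(generate_text, synthesize, emotional_context):
    """Two-stage output: generate text first, then voice it with the
    emotional context sampled at the moment of speaking (sketch)."""
    text = generate_text()      # stage 1: LLM produces the response text
    tone = emotional_context()  # sampled now, not at generation time
    return synthesize(text, tone)  # stage 2: TTS applies vocal qualities
```

If the caller's emotional state shifts while the text is being generated, the later sample means the delivery reflects the state at speaking time.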
Tone Momentum
The TTS engine does not compute voice emotion from scratch on every turn. It maintains a tone momentum cache that stores the previous turn's emotional parameters. When the emotion detection signal is weak or temporarily unavailable (circuit breaker open, low-confidence prosody signal), the system falls back to the cached tone from the previous turn rather than snapping to a neutral default.
This prevents jarring vocal shifts mid-conversation. If the agent has been speaking warmly for the last three turns and the emotion signal drops for one turn, the agent continues with warm delivery rather than switching to a flat, neutral voice.
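A minimal sketch of the cache-and-fall-back behavior; the tone dictionary shape, confidence field, and threshold are all assumptions for illustration.

```python
class ToneMomentum:
    """Cache of the previous turn's vocal-emotion parameters (sketch).

    A confident live signal is used and cached; a weak or missing
    signal (e.g. circuit breaker open) falls back to the cached tone
    instead of snapping to the neutral default.
    """
    def __init__(self, neutral=None, min_confidence=0.5):
        self.last_tone = neutral or {"warmth": 0.0, "energy": 0.0}
        self.min_confidence = min_confidence

    def tone_for_turn(self, signal):
        # signal: {"tone": {...}, "confidence": float}, or None when
        # the emotion path is unavailable this turn.
        if signal and signal["confidence"] >= self.min_confidence:
            self.last_tone = signal["tone"]
        return self.last_tone
```

After three warm turns, a turn with `signal=None` (or a low-confidence reading) keeps returning the cached warm parameters, which is exactly the continuity the text describes.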
Filler Speech
There is an inherent latency between when the caller finishes speaking and when the agent's full response is ready; the gap averages roughly 900ms. Rather than leaving silence, the agent produces filler speech: brief, contextually appropriate phrases ("Let me check that for you," "One moment") that play while the full response is being generated.
Filler speech serves two purposes:
It signals to the caller that the system heard them and is working
It prevents the caller from repeating themselves or hanging up due to perceived silence
Filler emissions are rate-limited to one every three seconds. This prevents cascading acknowledgements when rapid turns or false barge-ins cause multiple navigation cycles in quick succession.
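The rate limit can be sketched as a simple gate keyed on a monotonic clock; the class and its interface are illustrative.

```python
class FillerGate:
    """Rate-limit filler emissions to one every `interval` seconds (sketch)."""

    def __init__(self, interval=3.0):
        self.interval = interval
        self.last_emit = None  # timestamp of the last allowed filler

    def try_emit(self, now):
        # `now` is a monotonic timestamp in seconds (e.g. time.monotonic()).
        if self.last_emit is None or now - self.last_emit >= self.interval:
            self.last_emit = now
            return True  # play this filler
        return False     # suppress: too soon after the previous one
```

Three navigation cycles arriving within a second would produce one filler, not three, because the gate swallows the second and third attempts.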
Principle-Based Filler Generation
Filler phrases are not drawn from a hardcoded list. The system generates them using an LLM with the current emotional context and conversation intent as inputs. This means fillers adapt to the situation: a caller who sounds anxious gets "I'm looking into that for you right now" rather than a generic "One moment." The generation is guided by emotional guidelines and the current HSM action, so fillers stay contextually appropriate.
Barge-In Detection
If the caller starts speaking while the agent is talking, the system needs to decide whether to stop the agent's audio. Barge-in uses semantic confirmation - it requires actual recognized words rather than just acoustic energy. This filters out coughs, breathing, background conversation, and echo from the agent's own audio that would otherwise cause false interruptions.
The decision is based on four conditions evaluated together:
Whether the caller's speech contains actual words (not just breathing, echo, or background noise) - the system checks for recognized words from the speech-to-text engine, not just voice activity detection
Whether the speech has lasted at least 0.5 seconds with recognized words (or 1.0 seconds as a fallback if word recognition is delayed)
Whether a 1.5-second cooldown has elapsed since the last barge-in (prevents rapid false triggers)
Whether the agent is currently speaking
When all conditions are met, the agent's audio stops and the system returns to listening mode. This prevents the agent from talking over a caller who is trying to ask a question or correct a misunderstanding.
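The four-condition check can be sketched as a single predicate. The field names and the interpretation of the 1.0-second fallback (treat sustained voice activity as sufficient when word recognition lags) are assumptions; the durations come from the list above.

```python
def should_barge_in(state):
    """Evaluate the four barge-in conditions together (sketch).

    `state` field names are illustrative; durations are in seconds.
    """
    has_words = state["recognized_words"] > 0
    # Conditions 1+2: real recognized words for >= 0.5 s, or a 1.0 s
    # voice-activity fallback when word recognition is running behind.
    sustained_speech = (
        (has_words and state["speech_duration"] >= 0.5)
        or state["speech_duration"] >= 1.0
    )
    cooled_down = state["since_last_barge_in"] >= 1.5  # anti-retrigger cooldown
    return sustained_speech and cooled_down and state["agent_speaking"]
```

A short cough (no recognized words, well under a second of audio) fails the first pair of conditions, so the agent keeps talking; a caller saying "wait, actually..." passes all four and stops the audio.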
Real-Time Audio Correction
While the live STT stream prioritizes speed, the voice agent runs a parallel audio verification layer during the call. This layer uses a separate LLM to cross-check transcription accuracy in real time, catching misrecognized medical terms, drug names, and proper nouns before they enter the agent's reasoning pipeline.
This is distinct from post-call re-transcription. Real-time correction happens during the conversation, so the agent reasons from corrected text - not from raw STT output that might contain errors.
Post-Call Processing
The real-time STT stream prioritizes speed over accuracy; post-call re-transcription recovers what the live stream missed.
After a call ends, the system runs a higher-accuracy batch transcription of the full recording, catching words the real-time stream missed or misrecognized. The verified transcript becomes the canonical record of the call.
The system also feeds recognition accuracy data back into the STT configuration, identifying which keyterms were recognized correctly and which were missed. This creates a self-improving loop where transcription accuracy increases over time for your specific vocabulary.
Post-Call Quality Scoring
After a call ends, the system runs a structured quality analysis on the stereo call recording (caller audio on one channel, agent audio on the other). The analysis scores the call across five dimensions:
Task completion - Did the agent accomplish what the caller needed? Fully, partially, or not at all.
Information accuracy - Were speech recognition results correct? Did the agent act on accurate transcriptions?
Conversation flow - Was the conversation natural? Were there awkward pauses, unnecessary repetitions, or disjointed transitions?
Error recovery - When confusion occurred, did the agent recover gracefully or compound the problem?
Caller experience - Based on tone and interaction patterns, did the caller seem satisfied with the exchange?
Each dimension is scored on a 1-5 scale. The system also produces a summary, an outcome classification (succeeded, partially succeeded, failed, or abandoned), and specific STT correction suggestions that feed back into keyterm boosting.
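The shape of the result can be sketched as a small data structure; every field name here is an illustrative guess at how such a report might be organized, not Amigo's actual schema.

```python
from dataclasses import dataclass

# The five scored dimensions described above (identifier forms are ours).
DIMENSIONS = (
    "task_completion", "information_accuracy", "conversation_flow",
    "error_recovery", "caller_experience",
)

OUTCOMES = {"succeeded", "partially_succeeded", "failed", "abandoned"}

@dataclass
class QualityReport:
    """Illustrative shape of a post-call quality result."""
    scores: dict           # each dimension -> integer 1..5
    summary: str           # narrative summary of the call
    outcome: str           # one of OUTCOMES
    stt_corrections: list  # keyterm suggestions fed back into boosting

    def validate(self):
        assert set(self.scores) == set(DIMENSIONS)
        assert all(1 <= s <= 5 for s in self.scores.values())
        assert self.outcome in OUTCOMES
        return True
```

The `stt_corrections` field is what closes the loop with keyterm boosting: terms the scorer flags as misrecognized become candidates for the next call's vocabulary list.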