Audio Pipeline

How speech recognition, emotion detection, filler speech, barge-in detection, and TTS work together in real time.

The audio pipeline converts a caller's voice into text, processes it through the agent's reasoning, and converts the response back to speech. Two independent streams handle the input side: one for speech recognition, one for emotion analysis. They run in parallel and never block each other.

Signal Capture

Audio arrives from the telephony layer as a standard telephony stream. The system splits it into two parallel paths the moment it arrives:

  1. Speech-to-text - Converts audio to transcript text in real time

  2. Emotion detection - Analyzes vocal qualities for emotional signals (covered in Emotion Detection)

If either path fails, the other continues unaffected. A failure in emotion detection does not delay transcription. A failure in transcription does not block emotion analysis.
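The failure isolation above can be sketched as two concurrent consumers of the same frame stream, where an error in one path is absorbed locally. This is a minimal illustration, not the production implementation; the `stt` and `emotion` stand-ins are invented for the example:

```python
import asyncio

def stt(frame: str) -> str:
    # Stand-in for the streaming speech-to-text path.
    return frame.upper()

def emotion(frame: str) -> str:
    # Stand-in for the emotion path; fails on one frame to show isolation.
    if frame == "noise":
        raise ValueError("low-confidence prosody")
    return "calm"

async def run_path(handler, frames):
    out = []
    for frame in frames:
        try:
            out.append(handler(frame))
        except Exception:
            out.append(None)  # a failure here never blocks the sibling path
        await asyncio.sleep(0)  # yield so both paths interleave
    return out

async def fan_out(frames):
    # Both paths receive every frame and run concurrently.
    return await asyncio.gather(run_path(stt, frames), run_path(emotion, frames))

stt_out, emo_out = asyncio.run(fan_out(["hi", "noise", "there"]))
# stt_out == ["HI", "NOISE", "THERE"]; emo_out == ["calm", None, "calm"]
```

Note that the emotion failure on the second frame produces a gap in its own output while transcription continues uninterrupted.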

Parallel Initialization (Prewarm)

Before the caller picks up, the voice agent runs a two-phase initialization. Phase 1 (during the ring time) fetches workspace configuration, skill definitions, and integration credentials. Phase 2 (on answer) sets up TTS, emotion detection, call recording, and escalation channels. By splitting initialization into prewarm and session setup, the agent is fully loaded before the first word is spoken. This is what enables the instant greeting described in How It Works.
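The two-phase split can be sketched as follows. The phase contents come from the text above; the function shapes and the `state` dictionary are illustrative assumptions:

```python
# Phase 1 runs during ring time; phase 2 runs the moment the caller answers.
PHASE1 = ["workspace_config", "skill_definitions", "integration_credentials"]
PHASE2 = ["tts", "emotion_detection", "call_recording", "escalation_channels"]

def prewarm(state: dict) -> dict:
    # During the ring: fetch configuration and credentials.
    for item in PHASE1:
        state[item] = "loaded"
    return state

def on_answer(state: dict) -> dict:
    # On answer: set up the session-scoped audio services.
    for item in PHASE2:
        state[item] = "ready"
    return state

state = on_answer(prewarm({}))
ready = all(state.get(item) for item in PHASE1 + PHASE2)
# ready is True: everything is loaded before the first word is spoken
```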

Speech-to-Text

Amigo uses a streaming speech recognition engine for real-time transcription. Audio is transcribed with sub-300ms latency, meaning the system has text available almost as fast as the caller speaks.

Keyterm Boosting

Medical terminology, provider names, medication names, and organization-specific vocabulary are difficult for general-purpose speech recognition. Amigo addresses this with three layers of keyterm boosting that improve recognition accuracy for domain-specific words:

| Level | Managed By | Scope |
| --- | --- | --- |
| Service-level | Workspace administrators | Applied to all calls for a given service |
| Workspace-level | API configuration | Per-workspace vocabulary (clinic names, local terminology) |
| System defaults | Amigo engineering | Baseline medical and scheduling vocabulary |

All three layers are merged and deduplicated at the start of each call. The STT engine receives a single combined vocabulary list.
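The merge-and-deduplicate step can be sketched as an order-preserving union, assuming higher-priority layers are listed first and first occurrence wins. The sample terms are invented:

```python
# Illustrative layer contents; real vocabularies come from the three sources above.
SERVICE_TERMS = ["reschedule", "telehealth"]
WORKSPACE_TERMS = ["Lakeside Clinic", "metformin", "Dr. Okafor"]
SYSTEM_DEFAULTS = ["metformin", "copay", "reschedule"]

def merge_keyterms(*layers):
    seen, merged = set(), []
    for layer in layers:
        for term in layer:
            key = term.lower()           # case-insensitive dedup
            if key not in seen:
                seen.add(key)
                merged.append(term)
    return merged  # single combined list handed to the STT engine

vocab = merge_keyterms(SERVICE_TERMS, WORKSPACE_TERMS, SYSTEM_DEFAULTS)
# Duplicates ("metformin", "reschedule") appear once, in priority order.
```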

End-of-Turn Detection

The system must determine when the caller has finished speaking so the agent can respond. This uses configurable confidence thresholds that balance two competing concerns:

  • Responding too early cuts the caller off mid-sentence

  • Responding too late creates awkward silence

The thresholds are tunable per workspace to match the pace and style of your patient population.
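A minimal sketch of the trade-off: respond only when the end-of-turn confidence clears a threshold and enough trailing silence has accumulated. The parameter names and default values are illustrative, not the actual configuration keys:

```python
def end_of_turn(eot_confidence: float, silence_ms: int,
                threshold: float = 0.7, min_silence_ms: int = 500) -> bool:
    # Both knobs are tunable per workspace: a lower threshold makes the
    # agent snappier; a higher one avoids cutting callers off mid-sentence.
    return eot_confidence >= threshold and silence_ms >= min_silence_ms

finished = end_of_turn(0.9, 600)        # confident and silent: respond
still_talking = end_of_turn(0.9, 200)   # brief pause: keep listening
```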

Text-to-Speech

Amigo uses a purpose-built text-to-speech engine for voice synthesis. The TTS engine converts the agent's generated text into spoken audio with control over:

  • Emotion - Tone adjusts based on the caller's detected emotional state

  • Speed - Pace adapts to match the conversation's urgency and the caller's communication style

  • Emphasis - Key information (dates, times, names) receives natural emphasis

Speech output is a two-stage process. First, the LLM generates the response text. Then, TTS converts that text to speech with the appropriate vocal qualities. These are separate steps because the voice characteristics depend on the emotional context at the moment of speaking, not at the moment of text generation.

Tone Momentum

The TTS engine does not compute voice emotion from scratch on every turn. It maintains a tone momentum cache that stores the previous turn's emotional parameters. When the emotion detection signal is weak or temporarily unavailable (circuit breaker open, low-confidence prosody signal), the system falls back to the cached tone from the previous turn rather than snapping to a neutral default.

This prevents jarring vocal shifts mid-conversation. If the agent has been speaking warmly for the last three turns and the emotion signal drops for one turn, the agent continues with warm delivery rather than switching to a flat, neutral voice.
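The fallback behavior can be sketched as a small cache keyed on signal confidence. The tone parameters and the 0.6 confidence cutoff are invented for illustration:

```python
NEUTRAL = {"warmth": 0.5, "energy": 0.5}

class ToneMomentum:
    """Caches the previous turn's tone; weak signals reuse it."""

    def __init__(self):
        self.cached = dict(NEUTRAL)

    def next_tone(self, signal, confidence):
        if signal is not None and confidence >= 0.6:
            self.cached = dict(signal)   # strong signal: adopt and cache it
        return dict(self.cached)         # weak or absent: keep momentum

tones = ToneMomentum()
warm = tones.next_tone({"warmth": 0.9, "energy": 0.4}, confidence=0.8)
held = tones.next_tone(None, confidence=0.0)  # signal dropped for one turn
# held == warm: delivery stays warm instead of snapping to neutral
```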

Filler Speech

There is an inherent latency between when the caller finishes speaking and when the agent's full response is ready. This gap is roughly 900ms on average. Rather than leaving silence, the agent produces filler speech: brief, contextually appropriate phrases ("Let me check that for you," "One moment") that play while the full response is being generated.

Filler speech serves two purposes:

  • It signals to the caller that the system heard them and is working

  • It prevents the caller from repeating themselves or hanging up due to perceived silence

Filler emissions are rate-limited to one every three seconds. This prevents cascading acknowledgements when rapid turns or false barge-ins cause multiple navigation cycles in quick succession.
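The rate limit described above amounts to a simple time gate, sketched here with the 3-second interval from the text (the class shape is illustrative):

```python
FILLER_INTERVAL_S = 3.0

class FillerGate:
    def __init__(self):
        self.last_emit = float("-inf")

    def may_emit(self, now: float) -> bool:
        if now - self.last_emit >= FILLER_INTERVAL_S:
            self.last_emit = now
            return True
        return False  # a rapid follow-up cycle is absorbed silently

gate = FillerGate()
emitted = [gate.may_emit(t) for t in (0.0, 1.0, 2.9, 3.0, 6.5)]
# emitted == [True, False, False, True, True]
```

The second and third cycles arrive within the window of the first and are suppressed, which is exactly the cascading-acknowledgement case the limit exists for.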

Principle-Based Filler Generation

Filler phrases are not drawn from a hardcoded list. The system generates them using an LLM with the current emotional context and conversation intent as inputs. This means fillers adapt to the situation: a caller who sounds anxious gets "I'm looking into that for you right now" rather than a generic "One moment." The generation is guided by emotional guidelines and the current HSM action, so fillers stay contextually appropriate.

Barge-In Detection

If the caller starts speaking while the agent is talking, the system needs to decide whether to stop the agent's audio. Barge-in uses semantic confirmation - it requires actual recognized words rather than just acoustic energy. This filters out coughs, breathing, background conversation, and echo from the agent's own audio that would otherwise cause false interruptions.

The decision is based on four conditions evaluated together:

  1. Whether the caller's speech contains actual words (not just breathing, echo, or background noise) - the system checks for recognized words from the speech-to-text engine, not just voice activity detection

  2. Whether the speech has lasted at least 0.5 seconds with recognized words (or 1.0 seconds as a fallback if word recognition is delayed)

  3. Whether a 1.5-second cooldown has elapsed since the last barge-in (prevents rapid false triggers)

  4. Whether the agent is currently speaking

When all conditions are met, the agent's audio stops and the system returns to listening mode. This prevents the agent from talking over a caller who is trying to ask a question or correct a misunderstanding.
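The four conditions can be sketched as a single predicate. The durations (0.5 s, 1.0 s fallback, 1.5 s cooldown) come from the text; the argument names are illustrative:

```python
def should_barge_in(has_recognized_words: bool,
                    speech_duration_s: float,
                    since_last_barge_s: float,
                    agent_speaking: bool) -> bool:
    # Conditions 1+2: semantic confirmation within 0.5 s, or a 1.0 s
    # sustained-speech fallback when word recognition is delayed.
    long_enough = (speech_duration_s >= 0.5 if has_recognized_words
                   else speech_duration_s >= 1.0)
    # Condition 3: 1.5 s cooldown. Condition 4: agent must be speaking.
    return long_enough and since_last_barge_s >= 1.5 and agent_speaking

interrupt = should_barge_in(True, 0.6, 2.0, True)   # real interruption: stop audio
cough = should_barge_in(False, 0.3, 2.0, True)      # brief noise, no words: ignore
too_soon = should_barge_in(True, 0.6, 1.0, True)    # within cooldown: ignore
```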

Real-Time Audio Correction

While the live STT stream prioritizes speed, the voice agent runs a parallel audio verification layer during the call. This layer uses a separate LLM to cross-check transcription accuracy in real time, catching misrecognized medical terms, drug names, and proper nouns before they enter the agent's reasoning pipeline.

This is distinct from post-call re-transcription. Real-time correction happens during the conversation, so the agent reasons from corrected text - not from raw STT output that might contain errors.

Post-Call Processing


After a call ends, the system runs a higher-accuracy batch transcription of the full recording. This re-transcription catches words that the real-time stream may have missed or misrecognized. The verified transcript becomes the canonical record of the call.

The system also feeds recognition accuracy data back into the STT configuration, identifying which keyterms were recognized correctly and which were missed. This creates a self-improving loop where transcription accuracy increases over time for your specific vocabulary.
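The feedback step can be sketched as a per-term comparison between the live and verified transcripts: terms the caller actually said (present in the verified transcript) are checked against the live stream's output. The matching logic and sample strings are illustrative:

```python
def keyterm_hit_rates(keyterms, live_transcript, verified_transcript):
    live, verified = live_transcript.lower(), verified_transcript.lower()
    report = {}
    for term in keyterms:
        t = term.lower()
        if t in verified:              # the caller actually said it
            report[term] = t in live   # did the live stream catch it?
    return report

report = keyterm_hit_rates(
    ["metformin", "telehealth"],
    live_transcript="i need a refill of met foreman",
    verified_transcript="i need a refill of metformin",
)
# report == {"metformin": False}: a miss, so "metformin" is a candidate
# for stronger boosting on future calls
```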

Post-Call Quality Scoring

After a call ends, the system runs a structured quality analysis on the stereo call recording (caller audio on one channel, agent audio on the other). The analysis scores the call across five dimensions:

| Dimension | What It Measures |
| --- | --- |
| Task completion | Did the agent accomplish what the caller needed? Fully, partially, or not at all. |
| Information accuracy | Were speech recognition results correct? Did the agent act on accurate transcriptions? |
| Conversation flow | Was the conversation natural? Were there awkward pauses, unnecessary repetitions, or disjointed transitions? |
| Error recovery | When confusion occurred, did the agent recover gracefully or compound the problem? |
| Caller experience | Based on tone and interaction patterns, did the caller seem satisfied with the exchange? |

Each dimension is scored on a 1-5 scale. The system also produces a summary, an outcome classification (succeeded, partially succeeded, failed, or abandoned), and specific STT correction suggestions that feed back into keyterm boosting.
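The shape of the resulting record can be sketched as follows. The dimension names, 1-5 scale, and outcome labels come from the text; the class itself and its field names are illustrative:

```python
from dataclasses import dataclass, field

DIMENSIONS = ("task_completion", "information_accuracy", "conversation_flow",
              "error_recovery", "caller_experience")
OUTCOMES = ("succeeded", "partially_succeeded", "failed", "abandoned")

@dataclass
class CallQuality:
    scores: dict                 # dimension -> score on a 1-5 scale
    outcome: str                 # one of OUTCOMES
    summary: str = ""
    stt_corrections: list = field(default_factory=list)  # fed back into boosting

    def __post_init__(self):
        assert set(self.scores) == set(DIMENSIONS)
        assert all(1 <= v <= 5 for v in self.scores.values())
        assert self.outcome in OUTCOMES

q = CallQuality(
    scores={d: 4 for d in DIMENSIONS},
    outcome="succeeded",
    stt_corrections=["boost: 'metformin'"],
)
avg = sum(q.scores.values()) / len(q.scores)  # overall score across dimensions
```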
