Emotion Detection
Three parallel models analyze vocal prosody, burst patterns, and language to track caller emotional state in real time.
During every call, the three models run continuously, producing a composite picture of how the caller is feeling from three angles: how they sound, what nonverbal vocalizations they produce, and what their words convey.
In healthcare phone calls, a frustrated caller needs a different interaction style than a confused one. A caller whose voice is shaking needs the agent to slow down and acknowledge difficulty, not barrel through a scheduling script.
Three Parallel Models
All three models run on a single WebSocket connection using dual-payload multiplexing. Audio segments request the prosody and vocal burst models; text transcripts request the language model. Each response contains only the models that were requested, so there is no ambiguity in parsing results.
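A minimal sketch of this multiplexing scheme, assuming a hypothetical JSON wire format (the field names `data` and `models` and the model keys are illustrative, not the actual protocol): audio payloads request the two audio models, text payloads request the language model, and responses are routed by which model keys they contain.

```python
import json

# Hypothetical payload shapes; the actual wire format is an assumption here.
def audio_request(segment_b64: str) -> str:
    """Request prosody + vocal burst analysis for one 2-second audio segment."""
    return json.dumps({
        "data": segment_b64,
        "models": {"prosody": {}, "burst": {}},  # audio-only models
    })

def text_request(transcript: str) -> str:
    """Request language-model analysis for one transcript chunk."""
    return json.dumps({
        "data": transcript,
        "models": {"language": {}},
    })

def route_response(raw: str) -> dict:
    """Each response carries only the models that were requested,
    so the payload type is unambiguous from the keys present."""
    msg = json.loads(raw)
    if "prosody" in msg or "burst" in msg:
        return {"kind": "audio", "scores": msg}
    return {"kind": "text", "scores": msg}
```

Because audio and text requests never share model keys, the single connection needs no correlation IDs to keep the two streams apart.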
Prosody Model
Analyzes the acoustic properties of speech: pitch, rhythm, timbre, pace, and vocal quality. Processes 2-second audio segments and returns scores across dozens of distinct emotions.
This model captures things that words alone cannot. A caller who says "I'm fine" in a flat, low-energy tone registers differently than one who says the same words with bright intonation. The prosody model detects frustration, anxiety, sadness, confusion, and relief from vocal qualities alone.
Vocal Burst Model
Analyzes non-speech vocalizations: sighs, laughs, groans, gasps, cries, and other sounds that fall outside of normal speech. Processes the same 2-second audio segments as the prosody model and classifies them across dozens of vocal types.
These signals are important because transcription systems discard them entirely. A deep sigh before answering a question, a nervous laugh, or a groan of pain carries information that never appears in a transcript. The vocal burst model ensures these signals contribute to the emotional picture.
Burst-to-Experience Mapping
Raw vocal burst classifications are mapped to the TTS engine's emotion parameters. The mapping translates non-speech sounds into voice delivery adjustments:
Laugh -> Enthusiastic: match positive energy
Sigh -> Sympathetic: acknowledge weariness or frustration
Cry -> Sympathetic: respond with care
Gasp -> Concerned: react to surprise or alarm
Groan -> Sympathetic: acknowledge discomfort
This mapping covers 25 vocal burst types. The effect is that when a caller sighs before answering, the agent's next response is delivered with a warmer, more empathetic tone - even if the caller's words are neutral.
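The mapping above amounts to a lookup table. A sketch covering the five documented entries (the burst names and TTS emotion labels are assumptions; the full production table has 25 entries):

```python
# Illustrative subset of the 25-entry burst-to-experience mapping.
# Keys and emotion labels are assumed names, not the production vocabulary.
BURST_TO_TTS = {
    "laugh": ("enthusiastic", "match positive energy"),
    "sigh":  ("sympathetic",  "acknowledge weariness or frustration"),
    "cry":   ("sympathetic",  "respond with care"),
    "gasp":  ("concerned",    "react to surprise or alarm"),
    "groan": ("sympathetic",  "acknowledge discomfort"),
}

def tts_emotion_for(burst: str, default: str = "neutral") -> str:
    """Translate a classified vocal burst into a TTS emotion parameter."""
    emotion, _behavior = BURST_TO_TTS.get(burst, (default, ""))
    return emotion
```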
Language Model
Analyzes the text of what the caller said (from the speech-to-text transcript) and returns scores across dozens of emotions, a sentiment scale, and toxicity categories.
The language model runs on transcript text, not audio. This is an intentional separation: it analyzes what the caller said, not how they sounded. This catches emotional signals that audio alone misses, including sarcasm, tiredness, annoyance, disapproval, and enthusiasm - five emotions that are difficult or impossible to detect from audio properties.
Rolling Window
The system does not react to a single data point. Emotional state is computed over a rolling 30-second window of approximately 15 segments. Within this window, recent signals carry more weight than older ones through recency-weighted linear averaging.
This design means:
A single frustrated utterance does not cause the agent to overreact
A sustained shift in emotional tone is detected within seconds
The agent responds to the caller's current state, not a running average of the entire call
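The windowing described above can be sketched as follows, assuming linearly increasing weights from oldest to newest segment (the exact weighting function is not specified in the text, so this is one plausible reading of "recency-weighted linear averaging"):

```python
from collections import deque

WINDOW_SEGMENTS = 15  # ~30 s of 2-second segments

class RollingEmotion:
    """Recency-weighted linear average over the last ~30 s of segment scores.
    Weights rise linearly from oldest (1) to newest (n), so one outlier
    segment barely moves the result, but a sustained shift dominates quickly.
    """
    def __init__(self, maxlen: int = WINDOW_SEGMENTS):
        self.scores = deque(maxlen=maxlen)  # old segments age out automatically

    def add(self, score: float) -> None:
        self.scores.append(score)

    def current(self) -> float:
        n = len(self.scores)
        if n == 0:
            return 0.0
        weights = range(1, n + 1)  # newest segment weighs the most
        total = sum(w * s for w, s in zip(weights, self.scores))
        return total / sum(weights)
```

With three segments scored 0, 0, 1, the window reports 0.5 rather than 1.0: the fresh signal counts, but does not instantly dominate.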
Output Signals
The rolling window produces four output signals that downstream systems consume:
Valence
Positive or negative emotional direction. Is the caller trending toward satisfaction or distress?
Arousal
Emotional intensity. Is the caller calm or agitated, regardless of whether the emotion is positive or negative?
Trend
Direction of change. Is the emotional state improving, stable, or deteriorating over the window?
Coherence
Agreement across models. When prosody, burst, and language models agree, confidence is high. When they disagree, the situation may be more complex than any single signal suggests.
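One way the four signals could be derived from window data, as an illustrative sketch only (the production formulas are not given in the text; trend here is a simple half-window difference, and coherence is one minus the spread of the per-model valences):

```python
from statistics import mean, pstdev

def window_signals(valence_pts, arousal_pts, model_valences):
    """Illustrative computation of the four window outputs.
    valence_pts / arousal_pts: per-segment scores, oldest first (2+ points).
    model_valences: latest valence per model (prosody, burst, language).
    """
    half = len(valence_pts) // 2
    # Positive trend = second half of the window is better than the first.
    trend = mean(valence_pts[half:]) - mean(valence_pts[:half])
    # Models in agreement -> low spread -> coherence near 1.0.
    coherence = max(0.0, 1.0 - pstdev(model_valences))
    return {
        "valence": mean(valence_pts),
        "arousal": mean(arousal_pts),
        "trend": trend,
        "coherence": coherence,
    }
```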
Coherence as a Trust Signal
When prosody and language channels disagree - for example, a caller says "I'm fine" but their voice is flat and low-energy - the coherence score drops. Low coherence triggers a specific behavior: the system prioritizes the vocal tone over the words. This catches situations where callers minimize their concerns verbally but reveal their actual state through how they sound.
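The prioritization rule might look like the following sketch. The threshold value and the even blend in the high-coherence branch are assumptions; only the behavior (low coherence means trust the voice over the words) comes from the text.

```python
COHERENCE_FLOOR = 0.6  # hypothetical threshold, not a documented value

def effective_valence(prosody_v: float, language_v: float,
                      coherence: float) -> float:
    """When the channels disagree (low coherence), trust the voice over
    the words; otherwise blend the two channels evenly."""
    if coherence < COHERENCE_FLOOR:
        return prosody_v  # "I'm fine" said flatly -> follow the flat tone
    return (prosody_v + language_v) / 2
```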
How Emotion Steers Agent Behavior
Emotional signals influence the agent at two points in the pipeline:
Navigation decisions. The context graph engine receives emotional context when selecting the next action. High distress may trigger a different conversational path than calm engagement. The emotion data does not override the graph structure, but it informs which transitions are most appropriate.
Response generation and delivery. The response LLM receives emotional context as part of its prompt, influencing word choice and tone. The TTS engine receives emotion parameters that adjust the agent's vocal delivery: speaking pace, warmth, volume, and emphasis.
The result is an agent that does not just say different things based on how the caller feels, but says them differently. A scheduling confirmation delivered to a frustrated caller sounds different from the same information delivered to a cheerful one.
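How the window signals might fan out to the two consumers is sketched below. The field names, thresholds, and prompt wording are all assumptions; the structure (one text fragment for the LLM prompt, one parameter set for the TTS engine) follows the two steering points above.

```python
def steering_inputs(signals: dict) -> dict:
    """Fan window signals out to the two pipeline consumers.
    Thresholds and field names are illustrative."""
    distressed = signals["valence"] < 0 and signals["arousal"] > 0.6
    return {
        # Appended to the response LLM's prompt: shapes word choice and tone.
        "llm_context": (
            "Caller seems distressed; acknowledge before informing."
            if distressed else
            "Caller seems calm; proceed normally."
        ),
        # Passed to the TTS engine: shapes vocal delivery.
        "tts_params": {
            "pace": "slow" if distressed else "normal",
            "warmth": "high" if distressed else "medium",
        },
    }
```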
Proactive Emotional Intelligence
The voice agent does not wait for the caller to show distress before adjusting its approach. It detects sensitive topics from the context graph's current action content - before the caller has reacted. If the agent is about to discuss a difficult diagnosis, a billing dispute, or a missed appointment, it preemptively shifts to a more careful, empathetic delivery.
This means the agent can be gentle about a sensitive topic from the first word, rather than detecting distress after delivering information bluntly and then trying to recover.
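A simplified sketch of the proactive check: the real system inspects the context graph's current action content, which is approximated here by a hypothetical keyword screen over the upcoming action's text.

```python
# Hypothetical marker list; the examples come from the text above.
SENSITIVE_MARKERS = ("diagnosis", "billing dispute", "missed appointment")

def preemptive_delivery(next_action_text: str) -> str:
    """Shift to careful, empathetic delivery before the caller reacts,
    based on what the agent is about to discuss."""
    lowered = next_action_text.lower()
    if any(marker in lowered for marker in SENSITIVE_MARKERS):
        return "empathetic"
    return "neutral"
```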
Call-Phase Adaptation
The agent adjusts its behavior based on call duration combined with emotional trajectory:
After 5+ minutes with a deteriorating mood trend, the agent increases its speaking pace and becomes more direct - respecting the caller's time when things aren't going well
After 10+ minutes with sustained negative emotion, the system raises an urgency flag that can trigger escalation to an operator
These thresholds prevent calls from dragging on when the caller is clearly unhappy, without cutting short calls where the caller is engaged and the conversation is productive.
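The two thresholds above can be sketched as a small policy function. The 5-minute and 10-minute cutoffs come from the text; the sign conventions (negative trend = deteriorating, negative valence = unhappy) and output encoding are assumptions.

```python
def phase_adaptation(call_seconds: float, trend: float, valence: float) -> dict:
    """Call-phase policy: 5+ min with a deteriorating trend speeds up
    delivery; 10+ min with sustained negative emotion raises urgency."""
    out = {"pace": "normal", "urgency_flag": False}
    if call_seconds >= 300 and trend < 0:
        out["pace"] = "faster"       # respect the caller's time, be direct
    if call_seconds >= 600 and valence < 0:
        out["urgency_flag"] = True   # may trigger escalation to an operator
    return out
```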
Fault Tolerance
Emotion detection failures never interrupt or degrade calls. The system continues with workspace-default emotional settings when the emotion pipeline is unavailable.
Emotion detection is protected by a circuit breaker. If the emotion analysis service fails twice in a row, the circuit opens for 10 seconds; during this recovery period, the agent continues operating with workspace-default emotional settings.
Audio segments are buffered in non-blocking queues (maximum 5 segments for audio, 20 for text). If the emotion pipeline is slow, segments are dropped rather than queued indefinitely. The effect of dropped segments is slightly less precise emotion detection, not failure.
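Both mechanisms are straightforward to sketch. The failure count, recovery period, and queue sizes come from the text; the class shape and half-open behavior after the recovery period are assumptions.

```python
import time
from collections import deque

class EmotionBreaker:
    """Two consecutive failures open the circuit for 10 s (per the text);
    while open, the caller falls back to workspace-default settings."""
    FAILURES_TO_OPEN = 2
    RECOVERY_SECONDS = 10.0

    def __init__(self):
        self.failures = 0
        self.opened_at = None

    def is_open(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return False
        if now - self.opened_at >= self.RECOVERY_SECONDS:
            self.opened_at = None    # recovery elapsed: allow traffic again
            self.failures = 0
            return False
        return True

    def record_failure(self, now=None) -> None:
        self.failures += 1
        if self.failures >= self.FAILURES_TO_OPEN:
            self.opened_at = time.monotonic() if now is None else now

    def record_success(self) -> None:
        self.failures = 0

# Bounded, non-blocking buffers: when full, the oldest segment is
# dropped rather than blocking the call pipeline.
audio_queue = deque(maxlen=5)   # audio segments
text_queue = deque(maxlen=20)   # transcript chunks
```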
Audio Embeddings
The platform supports native audio embeddings that capture paralinguistic features - tone, urgency, hesitation, confidence - directly from audio segments without transcription. These embeddings enable semantic search over how something was said, not just what was said.
This is distinct from the three emotion models described above, which produce structured signals (valence, arousal, trend). Audio embeddings produce dense vectors that can be compared across conversations, enabling queries like "find calls where the caller sounded similar to this one" without relying on keyword matching or emotion labels.
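The similarity query described above reduces to nearest-neighbor search over the embedding vectors. A minimal sketch using cosine similarity (the vector dimensionality, storage layout, and ranking method are assumptions; production systems would typically use an approximate index rather than a linear scan):

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_calls(query_vec, call_vecs: dict, top_k: int = 3) -> list:
    """Rank stored calls by how similar they *sound* to the query
    segment, with no keywords or emotion labels involved."""
    ranked = sorted(call_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [call_id for call_id, _ in ranked[:top_k]]
```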