Emotion Detection

Three parallel models analyze vocal prosody, burst patterns, and language to track caller emotional state in real time.

These models run continuously during every call, providing a composite picture of how the caller is feeling based on how they sound, what nonverbal vocalizations they produce, and what their words convey.

In healthcare phone calls, a frustrated caller needs a different interaction style than a confused one. A caller whose voice is shaking needs the agent to slow down and acknowledge difficulty, not barrel through a scheduling script.

Three Parallel Models

All three models run on a single WebSocket connection using dual-payload multiplexing. Audio segments request the prosody and vocal burst models; text transcripts request the language model. Each response contains only the models that were requested, so there is no ambiguity in parsing results.
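The dual-payload multiplexing described above can be sketched as follows. The message shapes and field names here are illustrative assumptions, not the actual wire format; the point is that each payload names the models it wants, and each response echoes back only those models.

```python
import base64
import json

def audio_payload(segment_bytes: bytes) -> str:
    # A 2-second audio segment requests the prosody and vocal burst models.
    # (Message shape is illustrative; the real wire format may differ.)
    return json.dumps({
        "models": {"prosody": {}, "burst": {}},
        "data": base64.b64encode(segment_bytes).decode("ascii"),
    })

def text_payload(transcript: str) -> str:
    # A transcript chunk requests only the language model.
    return json.dumps({
        "models": {"language": {}},
        "data": transcript,
    })

def parse_response(message: str) -> dict:
    # Each response contains only the models that were requested, so the
    # payload type is recovered unambiguously from the keys present.
    result = json.loads(message)
    return {model: scores for model, scores in result.items()
            if model in ("prosody", "burst", "language")}
```

Because audio and text responses never share model keys, a single receive loop can route results without tagging requests.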

Prosody Model

Analyzes the acoustic properties of speech: pitch, rhythm, timbre, pace, and vocal quality. Processes 2-second audio segments and returns scores across dozens of distinct emotions.

This model captures things that words alone cannot. A caller who says "I'm fine" in a flat, low-energy tone registers differently than one who says the same words with bright intonation. The prosody model detects frustration, anxiety, sadness, confusion, and relief from vocal qualities alone.

Vocal Burst Model

Analyzes non-speech vocalizations: sighs, laughs, groans, gasps, cries, and other sounds that fall outside of normal speech. Processes the same 2-second audio segments as the prosody model and classifies them across dozens of vocal types.

These signals are important because transcription systems discard them entirely. A deep sigh before answering a question, a nervous laugh, or a groan of pain carries information that never appears in a transcript. The vocal burst model ensures these signals contribute to the emotional picture.

Burst-to-Experience Mapping

Raw vocal burst classifications are mapped to the TTS engine's emotion parameters. The mapping translates non-speech sounds into voice delivery adjustments:

| Vocal Burst | TTS Emotion | Why |
| --- | --- | --- |
| Laugh | Enthusiastic | Match positive energy |
| Sigh | Sympathetic | Acknowledge weariness or frustration |
| Cry | Sympathetic | Respond with care |
| Gasp | Concerned | React to surprise or alarm |
| Groan | Sympathetic | Acknowledge discomfort |

This mapping covers 25 vocal burst types. The effect is that when a caller sighs before answering, the agent's next response is delivered with a warmer, more empathetic tone - even if the caller's words are neutral.
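The mapping is a straightforward lookup table. A minimal sketch, showing only the five documented rows (the full mapping covers 25 burst types, and the label spellings here are assumptions):

```python
# Burst-to-experience mapping: vocal burst label -> TTS emotion parameter.
# Only the five documented rows are shown; the full table covers 25 types.
BURST_TO_TTS_EMOTION = {
    "laugh": "enthusiastic",   # match positive energy
    "sigh": "sympathetic",     # acknowledge weariness or frustration
    "cry": "sympathetic",      # respond with care
    "gasp": "concerned",       # react to surprise or alarm
    "groan": "sympathetic",    # acknowledge discomfort
}

def tts_emotion_for_burst(burst_label: str, default: str = "neutral") -> str:
    # Unmapped burst types fall back to the default delivery.
    return BURST_TO_TTS_EMOTION.get(burst_label.lower(), default)
```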

Language Model

Analyzes the text of what the caller said (from the speech-to-text transcript) and returns scores across dozens of emotions, a sentiment scale, and toxicity categories.

The language model runs on transcript text, not audio. This is an intentional separation: it analyzes what the caller said, not how they sounded. This catches emotional signals that audio alone misses, including sarcasm, tiredness, annoyance, disapproval, and enthusiasm - five emotions that are difficult or impossible to detect from audio properties.

Rolling Window

The system does not react to a single data point. Emotional state is computed over a rolling 30-second window of approximately 15 segments. Within this window, recent signals carry more weight than older ones through recency-weighted linear averaging.

This design means:

  • A single frustrated utterance does not cause the agent to overreact

  • A sustained shift in emotional tone is detected within seconds

  • The agent responds to the caller's current state, not a running average of the entire call
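The recency-weighted linear averaging over the rolling window can be sketched as follows. The window arithmetic (30 seconds of 2-second segments, so a cap of ~15 entries) comes from the text; the linear weight scheme is the simplest reading of "recency-weighted linear averaging":

```python
from collections import deque

WINDOW_SECONDS = 30
SEGMENT_SECONDS = 2  # ~15 segments per window

def recency_weighted_average(scores: list[float]) -> float:
    """Linear recency weighting: the newest segment gets the largest weight."""
    if not scores:
        return 0.0
    weights = range(1, len(scores) + 1)  # 1 (oldest) .. n (newest)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

class RollingWindow:
    def __init__(self, max_segments: int = WINDOW_SECONDS // SEGMENT_SECONDS):
        # deque(maxlen=...) drops the oldest segment automatically.
        self.valence = deque(maxlen=max_segments)

    def add(self, valence: float) -> None:
        self.valence.append(valence)

    def current_valence(self) -> float:
        return recency_weighted_average(list(self.valence))
```

A single negative spike among mostly positive segments shifts the average only modestly, while a sustained run of negative segments dominates within a few updates.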

Output Signals

The rolling window produces four output signals that downstream systems consume:

| Signal | What It Represents |
| --- | --- |
| Valence | Positive or negative emotional direction. Is the caller trending toward satisfaction or distress? |
| Arousal | Emotional intensity. Is the caller calm or agitated, regardless of whether the emotion is positive or negative? |
| Trend | Direction of change. Is the emotional state improving, stable, or deteriorating over the window? |
| Coherence | Agreement across models. When prosody, burst, and language models agree, confidence is high. When they disagree, the situation may be more complex than any single signal suggests. |

Coherence as a Trust Signal

When prosody and language channels disagree - for example, a caller says "I'm fine" but their voice is flat and low-energy - the coherence score drops. Low coherence triggers a specific behavior: the system prioritizes the vocal tone over the words. This catches situations where callers minimize their concerns verbally but reveal their actual state through how they sound.
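The low-coherence behavior can be sketched as a simple selection rule. The threshold value and function names are assumptions for illustration; the documented behavior is only that low coherence makes the system trust tone over words:

```python
COHERENCE_THRESHOLD = 0.5  # illustrative; the real cutoff is configurable

def effective_valence(prosody_valence: float, language_valence: float,
                      coherence: float) -> float:
    """When channels disagree (low coherence), trust the vocal tone over the words."""
    if coherence < COHERENCE_THRESHOLD:
        return prosody_valence
    # When channels agree, either source works; average for stability.
    return (prosody_valence + language_valence) / 2.0
```

For the "I'm fine" case above, positive words with flat, negative prosody yield low coherence, so the negative prosody valence wins.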

How Emotion Steers Agent Behavior

Emotional signals influence the agent at two points in the pipeline:

Navigation decisions. The context graph engine receives emotional context when selecting the next action. High distress may trigger a different conversational path than calm engagement. The emotion data does not override the graph structure, but it informs which transitions are most appropriate.

Response generation and delivery. The response LLM receives emotional context as part of its prompt, influencing word choice and tone. The TTS engine receives emotion parameters that adjust the agent's vocal delivery: speaking pace, warmth, volume, and emphasis.

The result is an agent that does not just say different things based on how the caller feels, but says them differently. A scheduling confirmation delivered to a frustrated caller sounds different from the same information delivered to a cheerful one.
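The two injection points can be sketched together. The thresholds, prompt wording, and TTS parameter names here are all illustrative assumptions; the documented behavior is that one signal feeds the response LLM's prompt and the other feeds the TTS engine:

```python
def steering_inputs(valence: float, arousal: float) -> dict:
    """Translate rolling-window signals into the two pipeline injection points."""
    distressed = valence < -0.3 and arousal > 0.5  # illustrative thresholds
    return {
        # Injected into the response LLM prompt: influences word choice and tone.
        "prompt_context": (
            "The caller sounds distressed; acknowledge difficulty and keep it brief."
            if distressed
            else "The caller sounds calm and engaged."
        ),
        # Passed to the TTS engine: adjusts pace, warmth, volume, emphasis.
        "tts_params": {
            "pace": "slow" if distressed else "normal",
            "warmth": "high" if distressed else "default",
        },
    }
```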

Proactive Emotional Intelligence

The voice agent does not wait for the caller to show distress before adjusting its approach. It detects sensitive topics from the context graph's current action content - before the caller has reacted. If the agent is about to discuss a difficult diagnosis, a billing dispute, or a missed appointment, it preemptively shifts to a more careful, empathetic delivery.

This means the agent can be gentle about a sensitive topic from the first word, rather than detecting distress after delivering information bluntly and then trying to recover.
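A minimal sketch of the preemptive shift, assuming the detector scans the upcoming action's content for sensitive topics (the topic list and matching strategy here are illustrative, not the actual detection method):

```python
# Illustrative topic list; the real detector reads the context graph's
# current action content, and its matching may be more sophisticated.
SENSITIVE_TOPICS = ("diagnosis", "billing dispute", "missed appointment")

def preemptive_delivery(action_content: str) -> str:
    """Shift to careful, empathetic delivery BEFORE the caller reacts."""
    text = action_content.lower()
    if any(topic in text for topic in SENSITIVE_TOPICS):
        return "empathetic"
    return "default"
```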

Call-Phase Adaptation

The agent adjusts its behavior based on call duration combined with emotional trajectory:

  • After 5+ minutes with a deteriorating mood trend, the agent increases its speaking pace and becomes more direct - respecting the caller's time when things aren't going well

  • After 10+ minutes with sustained negative emotion, the system raises an urgency flag that can trigger escalation to an operator

These thresholds prevent calls from dragging on when the caller is clearly unhappy, without cutting short calls where the caller is engaged and the conversation is productive.
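The two threshold rules above can be sketched directly (function and field names are illustrative):

```python
def call_phase_adjustments(elapsed_minutes: float, trend: str,
                           sustained_negative: bool) -> dict:
    """Apply the duration + emotional-trajectory thresholds described above."""
    adjustments = {"pace": "normal", "urgency_flag": False}
    # 5+ minutes with a deteriorating mood trend: speed up, be more direct.
    if elapsed_minutes >= 5 and trend == "deteriorating":
        adjustments["pace"] = "faster"
    # 10+ minutes with sustained negative emotion: raise the urgency flag,
    # which can trigger escalation to an operator.
    if elapsed_minutes >= 10 and sustained_negative:
        adjustments["urgency_flag"] = True
    return adjustments
```

Note that neither rule fires on duration alone: a long but productive call keeps normal pacing and never raises the flag.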

Fault Tolerance

Emotion detection is protected by a circuit breaker. If the emotion analysis service experiences two consecutive failures, the circuit opens for 10 seconds. During this recovery period, the agent continues operating with workspace-default emotional settings. Calls are never interrupted or degraded by emotion detection failures.
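A minimal sketch of that circuit breaker, using the documented parameters (two consecutive failures, 10-second recovery); the class shape and injectable clock are assumptions for testability:

```python
import time

class EmotionCircuitBreaker:
    """Opens after two consecutive failures; recovers after 10 seconds."""
    FAILURE_THRESHOLD = 2
    RECOVERY_SECONDS = 10.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.RECOVERY_SECONDS:
            # Recovery period elapsed: close the circuit and allow traffic.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.FAILURE_THRESHOLD:
            self._opened_at = self._clock()

    def record_success(self) -> None:
        self._failures = 0  # failures must be consecutive to trip the breaker
```

While `is_open()` is true, the caller skips the emotion service and uses workspace-default emotional settings, so the call itself never stalls.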

Audio segments are buffered in non-blocking queues (maximum 5 segments for audio, 20 for text). If the emotion pipeline is slow, segments are dropped rather than queued indefinitely. The effect of dropped segments is slightly less precise emotion detection, not failure.
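The drop-rather-than-block behavior can be sketched with a bounded queue and a non-blocking put (queue sizes from the text; the helper is illustrative):

```python
import queue

AUDIO_QUEUE_MAX = 5   # audio segments
TEXT_QUEUE_MAX = 20   # text transcripts

def enqueue_or_drop(q: queue.Queue, item) -> bool:
    """Non-blocking put: if the emotion pipeline is slow, drop instead of waiting."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Dropped segment: slightly less precise detection, never a stall.
        return False
```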

Audio Embeddings

The platform supports native audio embeddings that capture paralinguistic features - tone, urgency, hesitation, confidence - directly from audio segments without transcription. These embeddings enable semantic search over how something was said, not just what was said.

This is distinct from the three emotion models described above, which produce structured signals (valence, arousal, trend). Audio embeddings produce dense vectors that can be compared across conversations, enabling queries like "find calls where the caller sounded similar to this one" without relying on keyword matching or emotion labels.
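A query like that reduces to nearest-neighbor search over the embedding vectors. A minimal sketch with cosine similarity (the corpus layout and function names are assumptions; a production system would use a vector index rather than a linear scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_call(query_embedding: list[float],
                      corpus: dict[str, list[float]]) -> str:
    """'Find calls where the caller sounded similar to this one.'"""
    return max(corpus,
               key=lambda call_id: cosine_similarity(query_embedding,
                                                     corpus[call_id]))
```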
