Emotion Detection

Dual-model prosody, vocal burst detection, language analysis, and speaker verification track caller emotional state and identity in real time, with per-caller normalization, context fusion, and compou

The voice agent analyzes the caller's emotional state in real time through three complementary analysis layers: acoustic prosody (how they sound), vocal burst detection (non-speech vocalizations like sighs and laughs), and language analysis (what their words convey). These run continuously during every call, providing a composite picture of the caller's emotional state.

In healthcare phone calls, a frustrated caller needs a different response than a confused one. A caller whose voice is shaking needs the agent to slow down and acknowledge difficulty, not push through a scheduling script.

Analysis Layers

Audio segments feed the prosody and vocal burst analysis; text transcripts feed the language analysis.

Prosody Analysis

Analyzes the acoustic properties of speech using two complementary approaches on each 2-second audio segment:

  • Categorical model - Classifies the segment across distinct emotion categories (angry, sad, happy, fearful, surprised, disgusted, neutral, and others). This determines what the caller is feeling and drives TTS emotion routing.

  • Dimensional model - Directly predicts continuous valence (positive/negative), arousal (calm/agitated), and dominance (submissive/assertive) values from the audio waveform. This determines how much the caller is feeling and drives empathy tier classification and risk scoring.

Separating categorical and dimensional inference avoids a fundamental limitation: deriving continuous emotional intensity from discrete category labels requires lookup tables that lose information. By predicting dimensional values directly from the audio signal, the system produces stable, precise measurements even on telephony-grade audio where categorical scores can be ambiguous.

Telephony audio (8 kHz) is lower quality than the audio the models were trained on. Before inference, the system upsamples to the models' native sample rate using high-quality interpolation. The valence and arousal baselines are calibrated specifically for telephony audio characteristics, which differ systematically from studio-quality recordings. Without this telephony-specific calibration, the prosody model would over-report negative emotions on normal calls because 8 kHz audio inherently sounds flatter and more compressed than the training distribution.

The system also gates dominant emotion labels behind an extremity threshold. On calls where the emotional signal is weak or ambiguous (average telephony calls), the system reports neutral rather than claiming a specific named emotion. Only clearly strong signals produce named emotion labels. This prevents the system from labeling every routine call with a specific emotion and keeps the signal meaningful when it does appear.
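The gating rule above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the `ProsodyReading` type, the `gated_label` function, the value ranges, and the `EXTREMITY_THRESHOLD` value are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not published in this document.
EXTREMITY_THRESHOLD = 0.6


@dataclass
class ProsodyReading:
    label: str      # top categorical label, e.g. "angry"
    valence: float  # assumed range: -1.0 (negative) to 1.0 (positive)
    arousal: float  # assumed range: -1.0 (calm) to 1.0 (agitated)


def gated_label(reading: ProsodyReading,
                threshold: float = EXTREMITY_THRESHOLD) -> str:
    """Report the named emotion only when the dimensional signal is extreme."""
    extremity = max(abs(reading.valence), abs(reading.arousal))
    return reading.label if extremity >= threshold else "neutral"
```

With these assumed values, a weak reading such as `ProsodyReading("angry", -0.2, 0.3)` is reported as neutral, while `ProsodyReading("angry", -0.8, 0.7)` keeps its label.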

Prosody analysis catches signals invisible in text. A caller who says "I'm fine" in a flat, low-energy tone registers differently than one who says the same words with bright intonation. The prosody model detects frustration, anxiety, sadness, confusion, and relief from vocal qualities alone.

Vocal Burst Detection

Detects non-speech vocalizations - sighs, laughs, groans, gasps, cries - from the same 2-second audio segments as the prosody analysis. The detector identifies acoustic signatures specific to each vocalization type:

  • Laugh: Periodic energy bursts characteristic of the "ha-ha-ha" pattern

  • Sigh: Breathy exhale with falling energy and low pitch

  • Gasp: Sharp inhale with rapid energy onset

  • Cry: Sustained vocalization with high pitch variation, confirmed by negative emotion context from the prosody model

A deep sigh before answering a question, a nervous laugh, or a groan of pain carries information that never appears in a transcript. Vocal burst detection captures these signals and feeds them into the emotional picture. Detected vocalizations feed directly into the TTS emotion priority chain, where bursts take the highest priority above prosody-derived emotion signals.

Burst-to-Experience Mapping

Raw vocal burst classifications are mapped to voice delivery emotion parameters. The mapping translates non-speech sounds into voice delivery adjustments:

| Vocal Burst | Delivery Emotion | Why |
| --- | --- | --- |
| Laugh | Enthusiastic | Match positive energy |
| Sigh | Sympathetic | Acknowledge weariness or frustration |
| Cry | Sympathetic | Respond with care |
| Gasp | Concerned | React to surprise or alarm |
| Groan | Sympathetic | Acknowledge discomfort |

This mapping covers 25 vocal burst types. The effect is that when a caller sighs before answering, the agent's next response is delivered with a warmer, more empathetic tone - even if the caller's words are neutral.
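As a rough sketch, the mapping and the priority rule (bursts ahead of prosody-derived emotion) could look like the following. Only the five rows listed in this section are included; the function name, the lowercase keys, and the `"neutral"` default are assumptions for illustration.

```python
from typing import Optional

# The five burst types listed in this section; the full mapping covers 25.
BURST_TO_DELIVERY = {
    "laugh": "enthusiastic",
    "sigh": "sympathetic",
    "cry": "sympathetic",
    "gasp": "concerned",
    "groan": "sympathetic",
}


def select_delivery_emotion(burst: Optional[str],
                            prosody_emotion: Optional[str],
                            default: str = "neutral") -> str:
    """Pick the TTS delivery emotion; a detected burst outranks prosody."""
    if burst is not None and burst in BURST_TO_DELIVERY:
        return BURST_TO_DELIVERY[burst]
    if prosody_emotion is not None:
        return prosody_emotion
    return default
```

A sigh detected alongside a neutral prosody reading therefore yields a sympathetic delivery, which matches the behavior described above.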

Language Analysis

Analyzes the text of what the caller said (from the speech-to-text transcript) and returns scores across emotions, a sentiment scale, and toxicity categories.

Language analysis runs on transcript text, not audio. This is an intentional separation: it analyzes what the caller said, not how they sounded. This catches emotional signals that audio alone misses, including sarcasm, tiredness, annoyance, disapproval, and enthusiasm - emotions that are difficult or impossible to detect from audio properties.

Speaker Normalization

Raw emotion models are trained on population-average baselines, which creates blind spots for callers at the extremes. A naturally quiet speaker who raises their voice slightly is escalating - but the model sees "neutral" because the absolute energy is still low. A naturally loud speaker who gets quiet is disengaging - but the model sees "calm."

Speaker normalization solves this by building a per-call acoustic profile for each caller. Starting from the first audio segment, the system tracks running statistics on three features: energy (volume level), pitch, and speech rate. After a brief warmup period (approximately 10 seconds), the system begins producing normalized deltas - how far the caller's current acoustic features deviate from their own baseline, not from a population average.

This means "getting louder" is measured relative to how loud this particular caller normally speaks, not relative to the average caller. The normalized deltas feed into the emotion models alongside raw scores, improving detection accuracy for speakers whose baseline differs significantly from the training population.

The system also tracks energy trends over a sliding window. A consistently rising energy trend reinforces escalation signals even when individual segments look ambiguous. A falling trend supports disengagement detection.
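One way to picture per-caller normalization is a running mean and standard deviation per feature, with deltas expressed as z-scores against the caller's own history. The sketch below uses Welford's online algorithm; the class names, the z-score formulation, and the feature keys are assumptions, and only the 2-second segment length and the roughly 10-second warmup come from the text.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

SEGMENT_SECONDS = 2.0   # segment length stated in this document
WARMUP_SECONDS = 10.0   # "approximately 10 seconds" in this document


@dataclass
class RunningStat:
    """Welford's online mean and variance for one acoustic feature."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


@dataclass
class SpeakerProfile:
    stats: dict = field(default_factory=lambda: {
        "energy": RunningStat(), "pitch": RunningStat(), "rate": RunningStat()})
    seconds_seen: float = 0.0

    def observe(self, features: dict) -> Optional[dict]:
        """Return deltas versus the caller's own baseline, or None in warmup."""
        deltas = None
        if self.seconds_seen >= WARMUP_SECONDS:
            deltas = {}
            for name, value in features.items():
                stat = self.stats[name]
                sd = stat.std()
                deltas[name] = (value - stat.mean) / sd if sd > 0 else 0.0
        # Update the baseline after scoring so a spike is not diluted by itself.
        for name, value in features.items():
            self.stats[name].update(value)
        self.seconds_seen += SEGMENT_SECONDS
        return deltas
```

In this sketch a quiet caller whose energy jumps from about 0.1 to 0.3 produces a large positive energy delta even though 0.3 might look unremarkable against a population average.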

Rolling Window

The system does not react to a single data point. Emotional state is computed over a short rolling window of the most recent audio segments. Within this window, recent signals carry more weight than older ones through recency-weighted linear averaging.

The window is deliberately short (seconds, not minutes) so the agent tracks mood shifts in real time rather than averaging over stale history. This design means:

  • A single frustrated utterance does not cause the agent to overreact

  • A sustained shift in emotional tone is detected within seconds

  • The agent responds to the caller's current state, not a running average of the entire call

Rolling Average vs. Per-Segment

The emotion system produces two parallel signal layers:

  • Rolling average - The smoothed emotional state computed across the window. This is what the agent adapts to: empathy tier classification, response tone, and context graph navigation all use the rolling average. It filters out momentary noise.

  • Per-segment - The raw emotional reading from the most recent 2-second segment. This is what is happening right now. Per-segment values are available in call intelligence and the observer feed for real-time monitoring, so operators can see instantaneous emotional shifts even before they affect the rolling average.

Output Signals

The rolling window produces five output signals that downstream systems consume:

| Signal | What It Represents |
| --- | --- |
| Valence | Positive or negative emotional direction. Is the caller trending toward satisfaction or distress? |
| Arousal | Emotional intensity. Is the caller calm or agitated, regardless of whether the emotion is positive or negative? |
| Dominance | Perceived control. Is the caller assertive and in control of the conversation, or submissive and deferential? Low dominance combined with negative valence is a strong indicator of helplessness. |
| Trend | Direction of change. Is the emotional state improving, stable, or deteriorating over the window? |
| Coherence | Agreement across models. When prosody, burst, and language models agree, confidence is high. When they disagree, the situation may be more complex than any single signal suggests. |

Coherence as a Trust Signal

When prosody and language channels disagree - for example, a caller says "I'm fine" but their voice is flat and low-energy - the coherence score drops. Low coherence triggers a specific behavior: the system prioritizes the vocal tone over the words. This catches situations where callers minimize their concerns verbally but reveal their actual state through how they sound.

Coherence also factors in toxicity. A caller who sounds calm but is using hostile or abusive language produces low coherence, because the acoustic signal (calm) contradicts the language signal (hostile). In these cases, the language signal takes priority: the system recognizes the situation as hostility rather than calmness, regardless of how composed the voice sounds. This prevents the agent from responding warmly to someone who is being abusive simply because their tone of voice is controlled.
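The two priority rules can be made concrete with a small sketch. The coherence formula, the way toxicity caps the language valence, and both thresholds are assumptions invented for illustration; the document does not specify how coherence is computed.

```python
from dataclasses import dataclass

COHERENCE_FLOOR = 0.5   # hypothetical: below this, channels are "in conflict"
TOXICITY_LIMIT = 0.7    # hypothetical: at or above this, language is hostile


@dataclass
class ChannelReadings:
    prosody_valence: float   # assumed range -1.0 to 1.0
    language_valence: float  # assumed range -1.0 to 1.0
    toxicity: float          # assumed range 0.0 to 1.0


def coherence(r: ChannelReadings) -> float:
    """1.0 when channels agree, approaching 0.0 as they diverge."""
    # Toxic language caps the language valence, so a calm voice paired
    # with abusive words registers as disagreement.
    language = min(r.language_valence, 1.0 - 2.0 * r.toxicity)
    return 1.0 - abs(r.prosody_valence - language) / 2.0


def trusted_channel(r: ChannelReadings) -> str:
    """Decide which channel drives interpretation for this turn."""
    if r.toxicity >= TOXICITY_LIMIT:
        return "language"   # hostility outranks a composed voice
    if coherence(r) < COHERENCE_FLOOR:
        return "prosody"    # tone outranks minimizing words
    return "both"
```

Under these assumptions, "I'm fine" spoken in a flat, low voice resolves to the prosody channel, while calm but abusive speech resolves to the language channel.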

Display Label Gating

The Developer Console applies confidence gating to emotion labels before displaying them. Dominant emotion labels only appear when the signal shows clearly extreme valence or arousal values. When the signal is moderate - as it is on the majority of typical telephony calls - the display shows "Neutral" rather than a specific emotion name. This prevents low-confidence detections from appearing as authoritative assessments during routine conversations.

Compound emotion signals are similarly gated behind a higher confidence threshold. Only compound detections with strong confidence scores are surfaced in the UI. Display names for both dominant and compound emotions use approachable, non-clinical terminology (for example, "Quiet stretch" instead of "Disengagement") to better reflect the provisional nature of the signal.

Compound Emotions

Basic emotion categories (angry, sad, happy, fearful) are useful for broad classification but miss the nuance of real conversations. A caller who is simultaneously sad and angry is experiencing bitterness, not simply sadness or anger. A caller who has been negative for five turns with low energy is resigned, not just sad.

The compound emotion resolver runs at each caller turn boundary and fuses all available evidence - acoustic scores, dimensional values, sentiment, toxicity, behavioral signals, and conversation context - to identify these nuanced emotional states.

Five signal layers contribute to compound resolution:

| Layer | What It Detects | Examples |
| --- | --- | --- |
| Emotion co-activation | When two basic emotions are present simultaneously | Bitterness (sad + angry), Contempt (angry + disgusted), Despair (sad + fearful), Delight (happy + surprised) |
| Temporal trajectory | Emotional patterns sustained or changing across multiple turns | Resignation (sustained negative + low energy), Escalating (rising arousal), Recovering (improving valence), Ambivalence (oscillating positive/negative) |
| Behavioral amplifiers | Turn-level behaviors that reveal emotional state beyond what voice or words show | Impatience (repeated barge-ins + negative valence), Withdrawal (very short responses + negative sentiment), Disengagement (long silences + low arousal) |
| Contextual modulation | Conversation state that colors the emotional interpretation | Process Frustration (tool failures + negative valence), Helplessness (stuck conversation), Relief (just resolved) |
| Linguistic override | Language content that overrides or contradicts acoustic signals | Cold Hostility (high toxicity + calm voice), Masked Distress (very negative words + neutral-sounding voice), Sarcasm (negative sentiment + positive-sounding voice) |

Compound emotions require stronger evidence than basic emotion labels before they fire. Disengagement requires multiple sustained silences with low arousal, not just a quiet caller. Resignation requires negative valence, low arousal, and confirming negative language in the transcript, not just a flat voice. These conjunctive rules prevent the system from claiming nuanced emotional states based on ambiguous signals.
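A conjunctive rule set of this kind might be sketched as below. The co-activation pairs and the conditions for Resignation and Disengagement follow this section; the `TurnEvidence` fields and every numeric threshold are hypothetical, and the multi-turn "sustained" requirement is left out for brevity.

```python
from dataclasses import dataclass

# Pairs taken from the co-activation examples in this section.
CO_ACTIVATION = {
    "bitterness": ("sad", "angry"),
    "contempt": ("angry", "disgusted"),
    "despair": ("sad", "fearful"),
    "delight": ("happy", "surprised"),
}
ACTIVE = 0.5   # hypothetical score at which a basic emotion counts as present
LOW = -0.3     # hypothetical cutoff for "negative" or "low"


@dataclass
class TurnEvidence:
    scores: dict        # categorical scores, e.g. {"sad": 0.7, "angry": 0.6}
    valence: float
    arousal: float
    sentiment: float    # language sentiment, assumed range -1.0 to 1.0
    silence_gaps: int = 0


def resolve_compounds(e: TurnEvidence) -> list:
    """Return compound labels whose conditions ALL hold for this turn."""
    found = []
    for name, (a, b) in CO_ACTIVATION.items():
        if e.scores.get(a, 0.0) >= ACTIVE and e.scores.get(b, 0.0) >= ACTIVE:
            found.append(name)
    # Resignation needs voice AND language agreement, not just a flat voice.
    if e.valence < LOW and e.arousal < LOW and e.sentiment < LOW:
        found.append("resignation")
    # Disengagement needs repeated silences AND low arousal.
    if e.silence_gaps >= 2 and e.arousal < LOW:
        found.append("disengagement")
    return found
```

A flat, negative voice with neutral wording returns no compound label here; the same voice with clearly negative wording returns resignation.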

Compound emotions are persisted at two levels. Each turn in the call record carries the compound emotions detected during that turn, so you can trace exactly when frustration emerged or when resignation set in. At the call level, the intelligence summary aggregates the peak score for each compound emotion across all turns, giving a single view of which emotional patterns appeared during the call and how strongly.
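The call-level aggregation amounts to a per-label maximum over turns. The sketch below assumes each turn stores its compound emotions as a name-to-score dictionary; that record shape and the function name are illustrative and not the platform's schema.

```python
def peak_compound_scores(turns: list) -> dict:
    """Aggregate per-turn compound scores into the peak score per emotion."""
    peaks = {}
    for turn in turns:
        for name, score in turn.items():
            peaks[name] = max(score, peaks.get(name, 0.0))
    return peaks


# Hypothetical per-turn records for a three-turn call.
example_turns = [
    {"process_frustration": 0.4},
    {"process_frustration": 0.7, "withdrawal": 0.3},
    {"withdrawal": 0.2},
]
example_summary = peak_compound_scores(example_turns)
```

The per-turn records still show when each state emerged; the summary only records how strongly it peaked.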

Compound emotions are also available in real time through the observer feed, enabling live monitoring dashboards to show nuanced emotional states as they develop. They enable more targeted agent responses - the agent can distinguish between a caller who is frustrated with the process (Process Frustration) and one who is emotionally withdrawing (Withdrawal), even when the basic emotion classification shows "negative" for both.

Empathy Tier Classification

Before the agent generates a response, the system classifies each caller turn into one of four empathy tiers. This classification runs as a rule-based engine (no LLM call), so it completes in under 100 milliseconds and adds no perceptible latency to the voice pipeline.

The empathy tier is not a style modifier - it gates what the pipeline does. Higher tiers change whether the agent speaks, how much it says, how fast it talks, and whether it advances the task at all.

| Tier | Name | Pipeline Behavior |
| --- | --- | --- |
| T0 | Functional | Normal task-oriented pipeline. Standard fillers and pacing. |
| T1 | Light Touch | Empathy filler before task content. Slightly warmer delivery. |
| T2 | Full Empathy | 0.5-second pause before speaking. Response leads with empathy, task is secondary. |
| T3 | Hold Space | 1-second pause. Pure empathy response. Zero task advancement. Fillers suppressed entirely. |

What Triggers Each Tier

The classifier evaluates transcript keywords, emotion signals (valence, arousal, dominant emotion), and emotional trend history. Higher tiers take precedence - the system checks T3 first, then T2, then T1.

T3 Hold Space fires on crisis language (grief, loss of a loved one, suicidal ideation), implicit grief markers (funeral arrangements, hospice), or extreme negative emotion with high arousal.

T2 Full Empathy fires on significant negative valence, distress emotions (anxiety, fear, sadness), sustained emotional decline across three or more turns, helplessness patterns ("I've tried everything," "nobody will help"), or financial distress. When T2 is triggered by the categorical emotion label alone, the system requires valence agreement - the dimensional model must independently confirm negative valence before T2 activates. This conjunctive check (matching the pattern already used for T3) prevents false empathy escalation on ambiguous audio.

T1 Light Touch fires on mild negative valence, concern keywords (pain, discomfort, nervousness), concern for dependents (calling about a child or elderly parent), or vulnerability cues (embarrassment, difficulty discussing a topic).
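The precedence order (T3, then T2, then T1) and the valence-agreement check for T2 can be sketched as a small rule function. The keyword lists are tiny illustrative samples, the thresholds are invented, and plain substring matching is a simplification; none of these reflect the production rule set.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    FUNCTIONAL = 0
    LIGHT_TOUCH = 1
    FULL_EMPATHY = 2
    HOLD_SPACE = 3


# Illustrative samples only; the real keyword sets are not published here.
CRISIS_TERMS = ("passed away", "funeral", "hospice")
HELPLESSNESS_TERMS = ("i've tried everything", "nobody will help")
CONCERN_TERMS = ("pain", "discomfort", "nervous")
DISTRESS_LABELS = {"anxious", "fearful", "sad"}


@dataclass
class TurnSignals:
    transcript: str
    valence: float
    arousal: float
    dominant_emotion: str
    declining_turns: int = 0


def classify_tier(s: TurnSignals) -> Tier:
    text = s.transcript.lower()
    # T3 is checked first: crisis language or extreme negative, high arousal.
    if any(t in text for t in CRISIS_TERMS) or (s.valence <= -0.7 and s.arousal >= 0.7):
        return Tier.HOLD_SPACE
    # A distress label alone is not enough; valence must agree.
    label_distress = s.dominant_emotion in DISTRESS_LABELS and s.valence < 0.0
    if (s.valence <= -0.4 or label_distress or s.declining_turns >= 3
            or any(t in text for t in HELPLESSNESS_TERMS)):
        return Tier.FULL_EMPATHY
    if s.valence <= -0.15 or any(t in text for t in CONCERN_TERMS):
        return Tier.LIGHT_TOUCH
    return Tier.FUNCTIONAL
```

Note how a "sad" label with slightly positive valence stays at T0 in this sketch, which is the false-escalation case the conjunctive check is meant to prevent.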

Linguistic Safeguards

The classifier handles three categories of false signals that would otherwise produce bizarre or patronizing responses:

  • Negation detection - "I'm not worried" and "I don't feel scared" are excluded. The system checks a three-word window before each keyword for negation words.

  • Idiomatic expressions - "Dying to know," "kills me," and "dead serious" do not trigger crisis mode. Known figurative phrases are excluded before keyword matching.

  • Resolved emotions - "I was worried but I'm fine now" is recognized as past-tense resolution and excluded from the distress tier.
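The three safeguards can be approximated with a short filter that runs before keyword matching. The word lists, the resolution pattern, and the tokenizer are simplified assumptions; only the three-word negation window and the example phrases come from the text.

```python
import re

NEGATION_WINDOW = 3  # three-word window stated in this document
NEGATIONS = {"not", "no", "never", "don't", "isn't", "wasn't", "can't"}
IDIOMS = ("dying to know", "kills me", "dead serious")
KEYWORDS = {"worried", "scared", "dying"}  # tiny illustrative sample
# Hypothetical pattern for "I was X but I'm fine now" style resolution.
RESOLVED = re.compile(r"\bwas\b.*\bbut\b.*\b(fine|okay|ok|better)\b")


def distress_keywords(utterance: str) -> list:
    """Return distress keywords that survive all three safeguards."""
    text = utterance.lower()
    if RESOLVED.search(text):
        return []  # past-tense emotion that the caller says is resolved
    for idiom in IDIOMS:
        text = text.replace(idiom, " ")  # drop figurative phrases first
    tokens = re.findall(r"[a-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token not in KEYWORDS:
            continue
        window = tokens[max(0, i - NEGATION_WINDOW):i]
        if any(word in NEGATIONS for word in window):
            continue  # negated within the preceding three words
        hits.append(token)
    return hits
```

This sketch handles only straight apostrophes and a handful of negation words; a production version would need a broader lexicon.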

Empathy Baseline Decay

After the caller shows distress, the system maintains an empathy baseline that decays gradually (0.15 per turn). This prevents the agent from snapping back to a chipper, transactional tone one turn after someone was in tears. The baseline keeps empathetic delivery in place as the conversation moves forward.
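A minimal sketch of the decay follows, assuming the 0.15 is subtracted linearly each turn and that a new distress reading can raise the baseline again. Both assumptions are illustrative; the document gives only the per-turn figure.

```python
DECAY_PER_TURN = 0.15  # per-turn decay stated in this document


def next_baseline(previous: float, turn_distress: float) -> float:
    """Decay the empathy baseline by one turn, never below current distress."""
    decayed = max(0.0, previous - DECAY_PER_TURN)
    return max(decayed, turn_distress)


def baseline_trace(start: float, turns: int) -> list:
    """Baseline values over several calm turns (distress 0.0 each turn)."""
    values, current = [], start
    for _ in range(turns):
        current = next_baseline(current, 0.0)
        values.append(round(current, 2))
    return values
```

Starting from a baseline of 1.0, this linear form takes seven calm turns to return to zero, so the agent's delivery softens gradually instead of resetting on the next turn.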

Context Fusion

The same words carry different emotional weight depending on what is happening in the conversation. "Okay, fine" after three failed appointment searches is frustrated resignation. "Okay, fine" after a successful booking is satisfied agreement. Raw acoustic and language models cannot distinguish between these because they analyze each segment in isolation.

Context fusion closes this gap by feeding conversation state into the emotion engine alongside audio and transcript data. After each turn, the agent sends a context summary that includes the current state in the context graph, what the agent just did, how many tool calls have failed, and whether escalation is active. The emotion engine uses this context to apply targeted adjustments:

  • Stuck conversations (repeated tool failures or no state progress) amplify negative emotional signals - the system becomes more sensitive to frustration and resignation

  • Resolved conversations (task completed, booking confirmed) amplify positive signals - ambiguous expressions are interpreted more favorably

  • Active escalation heightens arousal sensitivity, ensuring distress signals are not underweighted during handoff

  • Long unresolved calls (10+ turns without resolution) generate a fatigue signal that contributes to empathy tier classification

  • Sensitive topics (crisis, loss, abuse) boost empathy sensitivity, lowering the threshold for higher empathy tiers

Context fusion does not override the acoustic models. It provides weighting adjustments that shift interpretation when the conversation context makes certain emotional states more likely. The raw model output is always available alongside the contextually adjusted signals.
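The adjustments listed above can be pictured as multipliers and flags applied to a copy of the raw signals. Every field name and gain value below is a hypothetical placeholder; only the 10-turn fatigue threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    valence: float
    arousal: float
    empathy_threshold_shift: float = 0.0  # negative lowers tier thresholds
    fatigue: bool = False


@dataclass(frozen=True)
class CallContext:
    failed_tool_calls: int = 0
    task_resolved: bool = False
    escalation_active: bool = False
    turns_without_resolution: int = 0
    sensitive_topic: bool = False


def _clamp(x: float) -> float:
    return max(-1.0, min(1.0, x))


def fuse(raw: Signals, ctx: CallContext) -> Signals:
    """Return context-adjusted signals; the raw reading is left untouched."""
    valence, arousal = raw.valence, raw.arousal
    if ctx.failed_tool_calls >= 2 and valence < 0:
        valence *= 1.25   # stuck conversation: amplify negative signals
    if ctx.task_resolved and valence > 0:
        valence *= 1.25   # resolved conversation: amplify positive signals
    if ctx.escalation_active:
        arousal *= 1.2    # escalation: heighten arousal sensitivity
    shift = -0.1 if ctx.sensitive_topic else 0.0
    fatigue = ctx.turns_without_resolution >= 10
    return Signals(_clamp(valence), _clamp(arousal), shift, fatigue)
```

Because `fuse` returns a new object, both the raw and the adjusted signals remain available to downstream consumers, in line with the paragraph above.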

Behavioral Signal Tracking

Beyond acoustic and linguistic analysis, the system tracks three behavioral signals updated in real time during every conversation:

| Signal | What It Detects | Threshold | Meaning |
| --- | --- | --- | --- |
| Interruption count | Number of times the caller has interrupted the agent | 2+ | Caller frustration, agent talking too much |
| Short response streak | Consecutive responses of four words or fewer | 3+ | Disengagement - the caller is withdrawing |
| Silence gaps | Extended pauses from the caller (5+ seconds) | 2+ | Confusion, hesitation, or distress |

These signals are independent of emotion detection. A caller who is interrupting frequently may not sound frustrated in their voice, but the behavioral pattern reveals impatience. A caller giving one-word answers may sound calm but is disengaging.

When thresholds are crossed, the signals are injected into the agent's response generation prompt as context. The agent sees "2 barge-ins this call" or "3 consecutive short responses" and adapts its behavior accordingly - shortening responses for impatient callers, offering reassurance for withdrawing ones.
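A tracker for these three counters might look like the sketch below. The thresholds and the first two note strings follow this section; the class name, method names, and the wording of the silence note are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BehaviorTracker:
    interruptions: int = 0
    short_streak: int = 0
    silence_gaps: int = 0

    def on_barge_in(self) -> None:
        self.interruptions += 1

    def on_caller_response(self, text: str) -> None:
        # Four words or fewer extends the streak; anything longer resets it.
        if len(text.split()) <= 4:
            self.short_streak += 1
        else:
            self.short_streak = 0

    def on_silence(self, seconds: float) -> None:
        if seconds >= 5.0:
            self.silence_gaps += 1

    def prompt_notes(self) -> list:
        """Context lines to inject once a threshold is crossed."""
        notes = []
        if self.interruptions >= 2:
            notes.append(f"{self.interruptions} barge-ins this call")
        if self.short_streak >= 3:
            notes.append(f"{self.short_streak} consecutive short responses")
        if self.silence_gaps >= 2:
            notes.append(f"{self.silence_gaps} extended silences this call")
        return notes
```

Below the thresholds `prompt_notes()` returns an empty list, so the response prompt is unchanged for callers who show none of these patterns.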

Behavioral signals combine with emotion detection for compound state detection. Angry voice + frequent interruptions strengthens the escalation signal. Sad voice + extended silence suggests the caller needs space. Confused voice + high coherence (words match tone) means the agent should simplify its language.

How Emotion Steers Agent Behavior

Emotional signals influence the agent at two points in the pipeline:

Navigation decisions. The context graph engine receives emotional context when selecting the next action. High distress may trigger a different conversational path than calm engagement. The emotion data does not override the graph structure, but it informs which transitions are most appropriate.

Response generation and delivery. The response generation receives emotional context, influencing word choice and tone. At T2 and above, the response generation is explicitly instructed to lead with empathy before any task content. At T3, the entire response is empathy - the agent does not collect information, ask questions, or offer solutions. The text-to-speech engine receives emotion parameters that adjust the agent's vocal delivery: speaking pace, warmth, volume, and emphasis.

The empathy engine reduces TTS speed after detecting caller distress to convey warmth. A configurable minimum TTS speed prevents over-modulation - without a floor, the speed reduction can make the agent sound unnaturally slow. The minimum speed is set per service in the voice configuration.
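The speed floor reduces to a clamp. In the sketch below the per-tier reduction and the default floor are invented values; the document says only that the minimum is configurable per service.

```python
def tts_speed(base_speed: float, empathy_tier: int,
              min_speed: float = 0.85,
              reduction_per_tier: float = 0.08) -> float:
    """Slow delivery as the empathy tier rises, but never below the floor."""
    return max(min_speed, base_speed - reduction_per_tier * empathy_tier)
```

With these placeholder numbers, T3 would otherwise slow speech to 0.76 of normal speed, and the floor holds it at 0.85.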

The agent changes both what it says and how it says it based on the caller's emotional state. A scheduling confirmation delivered to a frustrated caller sounds different from the same information delivered to a cheerful one.

Four-dimensional adaptation: voice tone, filler behavior, response content, and behavioral signals adapt simultaneously

The adaptation operates across four independent dimensions simultaneously. Each dimension uses a different output channel, so the agent can change its vocal tone, suppress fillers, adjust response content, and react to behavioral patterns - all independently and in real time. See Situation-Response Adaptation for the complete mapping.

Pre-Emptive Tone Adjustment

The voice agent does not wait for the caller to show distress before adjusting its approach. It detects sensitive topics from the context graph's current action content before the caller has reacted. If the agent is about to discuss a difficult diagnosis, a billing dispute, or a missed appointment, it shifts to a more careful, empathetic delivery before the first word.

Without this, the agent would deliver information bluntly, detect distress, and then try to recover.

Call-Phase Adaptation

The agent adjusts its behavior based on call duration combined with emotional trajectory:

  • After 5+ minutes with a deteriorating mood trend, the agent increases its speaking pace and becomes more direct to respect the caller's time

  • After 10+ minutes with sustained negative emotion, the system raises an urgency flag that can trigger escalation to an operator

These thresholds prevent calls from dragging on when the caller is clearly unhappy, without cutting short calls where the caller is engaged and the conversation is productive.
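The two rules combine duration with emotional state, which a short sketch makes explicit. The action names and the string values for the trend are hypothetical labels; the 5-minute and 10-minute thresholds come from the list above.

```python
def call_phase_actions(elapsed_minutes: float, trend: str,
                       sustained_negative: bool) -> set:
    """Return adaptation actions for the current call phase."""
    actions = set()
    if elapsed_minutes >= 5 and trend == "deteriorating":
        actions.update({"faster_pace", "more_direct"})
    if elapsed_minutes >= 10 and sustained_negative:
        actions.add("urgency_flag")
    return actions
```

A 12-minute call with a stable or improving mood returns an empty set, so long but productive calls are left alone.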

Fault Tolerance


Emotion detection is protected by automatic fault isolation. If the emotion pipeline experiences consecutive failures, the system temporarily falls back to workspace-default emotional settings. Calls are never interrupted or degraded by emotion detection failures.

If the emotion pipeline is slow, older segments are dropped rather than queued indefinitely. The effect of dropped segments is slightly less precise emotion detection, not failure.
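Both behaviors map onto familiar patterns: a circuit breaker for consecutive failures and a bounded queue that discards the oldest item. The sketch below is one possible shape; the failure count, cooldown, queue size, and all names are assumptions.

```python
import time
from collections import deque


class EmotionCircuitBreaker:
    """Fall back to default settings after consecutive pipeline failures."""

    def __init__(self, max_failures: int = 3, cooldown_seconds: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.open_until = float("-inf")

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open the breaker: use defaults until the cooldown passes.
            self.open_until = self.clock() + self.cooldown_seconds
            self.failures = 0

    def use_defaults(self) -> bool:
        return self.clock() < self.open_until


def make_segment_queue(max_pending: int = 4) -> deque:
    """Bounded queue: appending when full silently drops the oldest segment."""
    return deque(maxlen=max_pending)
```

The call path only ever checks `use_defaults()`, so a failing or slow emotion pipeline degrades precision without blocking audio handling.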

Speaker Verification

The same audio analysis infrastructure supports voice biometric verification - identifying the caller based on how they sound, not just what they say.

Speaker verification works through enrollment and matching. During enrollment, the platform captures a voice sample and generates a voiceprint - a compact numerical representation of the speaker's unique vocal characteristics. On subsequent interactions, incoming audio is compared against enrolled voiceprints to confirm the speaker's identity.

This operates independently from the emotion models. Emotion detection analyzes how the caller feels. Speaker verification determines who the caller is.

Enrollment

Voiceprints are enrolled through the Platform API using a short audio sample of natural speech. The enrollment is stored as a biometric event in the world model on the person entity, following the same event-sourced pattern as all other patient data.

Enrollment status (whether a voiceprint exists, when it was enrolled) is visible on the entity state. Raw biometric data remains in the event store and is not surfaced through entity queries - only metadata is projected.

Matching

During a call or clinical encounter, audio is compared against enrolled voiceprints. When a match is confirmed, the system attributes audio segments to the identified speaker. This speaker identity flows through to transcripts and call intelligence data.

The verification threshold is configurable per workspace, allowing organizations to balance security (higher threshold, fewer false matches) against usability (lower threshold, fewer false rejections). Speaker verification is optional and can be enabled or disabled at the workspace level without affecting other audio analysis capabilities.
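To illustrate the threshold trade-off, the sketch below treats voiceprints as plain numeric vectors and compares them with cosine similarity. The similarity measure, the default threshold, and the function names are assumptions; the document does not describe how voiceprints are actually compared.

```python
import math
from typing import Optional


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify_speaker(sample: list, enrolled: dict,
                   threshold: float = 0.75) -> Optional[str]:
    """Return the best-matching person ID, or None if below the threshold."""
    best_id, best_score = None, -1.0
    for person_id, voiceprint in enrolled.items():
        score = cosine_similarity(sample, voiceprint)
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= threshold else None
```

Raising `threshold` rejects more borderline samples (fewer false matches), while lowering it accepts more of them (fewer false rejections), which is the balance described above.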

Audio Embeddings

The platform supports native audio embeddings that capture paralinguistic features - tone, urgency, hesitation, confidence - directly from audio segments without transcription. These embeddings enable semantic search over how something was said, not just what was said.

This is distinct from the three emotion models described above, which produce structured signals (valence, arousal, trend). Audio embeddings produce dense vectors that can be compared across conversations, enabling queries like "find calls where the caller sounded similar to this one" without relying on keyword matching or emotion labels.
