
# Emotion Detection

When emotion detection is enabled, the primary voice runtime analyzes caller audio and transcript text while keeping both analyses off the critical path for speech recognition. The resulting signals can shape response guidance, filler behavior, and text-to-speech delivery. They are probabilistic conversation signals, not clinical assessments.

## Live Signal Flow

```mermaid
flowchart LR
    audio["Caller Audio"] --> segment["Voiced 2-Second Segments"]
    segment --> categorical["Categorical Acoustic Model\n9-Class Scores"]
    segment --> dimensional["Dimensional Acoustic Model\nValence + Arousal"]
    segment --> profile["Per-Call Acoustic Profile"]

    transcript["Final Transcript"] --> language["Sentiment + Toxicity"]

    categorical --> state["Rolling Emotional State\n4 Segments, About 8 Seconds"]
    dimensional --> state
    language --> state

    state --> empathy["Empathy Tier"]
    state --> tone["Prompt + TTS Steering"]
    state --> compounds["5-Turn Compound Resolver"]
    profile --> observer["Live Observer Payload"]
    state --> observer
```

There is no separate live vocal-burst classifier. Sighs, laughs, gasps, cries, and similar sounds are not routed through a dedicated burst model or a burst-first TTS priority path.

## Acoustic Analysis

The emotion path processes short voiced segments and skips audio below its silence threshold. Segmentation and analysis run independently from live transcription.
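As a rough illustration of skipping audio below a silence threshold, the Python sketch below cuts a sample buffer into two-second windows and keeps only those whose RMS level clears a threshold. The sample rate, the threshold value, the RMS test itself, and the name `voiced_segments` are assumptions made for this example; the page does not describe how the service detects voiced audio.

```python
import math
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
SEGMENT_SECONDS = 2    # from the diagram: two-second segments
SILENCE_RMS = 0.01     # hypothetical threshold for samples in [-1, 1]


def voiced_segments(samples: Sequence[float]) -> Iterator[List[float]]:
    """Yield full two-second windows whose RMS level clears the threshold."""
    size = SAMPLE_RATE * SEGMENT_SECONDS
    # A trailing partial window is ignored in this sketch.
    for start in range(0, len(samples) - size + 1, size):
        window = list(samples[start:start + size])
        rms = math.sqrt(sum(x * x for x in window) / size)
        if rms >= SILENCE_RMS:
            yield window
```

Because the generator only reads the buffer it is given, it can run on its own schedule without touching the transcription path.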

The service combines two acoustic outputs:

* **Categorical classification** - Nine scores: angry, disgusted, fearful, happy, neutral, other, sad, surprised, and unknown.
* **Dimensional inference** - Continuous acoustic attributes. The voice agent maintains valence and arousal in its rolling state and uses them for downstream guidance.

The rolling runtime label is derived from the combined dimensional region rather than assuming that the largest categorical score is authoritative, especially when the categorical distribution is nearly flat.

The agent maps the nine categorical labels into its runtime vocabulary. For example, `angry` becomes `Anger`, `happy` becomes `Joy`, and `neutral` becomes `Calmness`. The `other` and `unknown` classes are mapped to a non-emotional placeholder and excluded from the displayed score distribution.
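The mapping can be pictured as a lookup table. In this minimal Python sketch, only the three pairs named above come from this page. The names `RUNTIME_LABELS`, `PLACEHOLDER_CLASSES`, and `displayed_scores` are hypothetical, and classes whose runtime names are not listed here pass through unchanged rather than being guessed.

```python
from typing import Dict

# Pairs stated on this page; runtime names for the other classes are not listed here.
RUNTIME_LABELS: Dict[str, str] = {
    "angry": "Anger",
    "happy": "Joy",
    "neutral": "Calmness",
}

# Classes mapped to a non-emotional placeholder and hidden from the display.
PLACEHOLDER_CLASSES = {"other", "unknown"}


def displayed_scores(raw: Dict[str, float]) -> Dict[str, float]:
    """Rename known classes and drop placeholder classes from the display."""
    shown: Dict[str, float] = {}
    for name, score in raw.items():
        if name in PLACEHOLDER_CLASSES:
            continue
        # Fall back to the raw class name where this sketch has no mapping.
        shown[RUNTIME_LABELS.get(name, name)] = score
    return shown
```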

{% hint style="info" %}
Dominance is not maintained as a live rolling-state or empathy-control dimension. Do not rely on it for agent behavior in the current runtime.
{% endhint %}

## Transcript Analysis

Final caller transcripts are also sent to the emotion service. A separate text path returns:

* **Sentiment** - one score from negative to positive
* **Toxicity** - category scores such as toxicity, threat, insult, and identity attack

Transcript analysis is asynchronous and best-effort. If it fails, the call continues, and both the acoustic signals and the transcript content remain available to the rest of the voice pipeline.

The text path does not return a second set of named emotions. Labels such as Masked Distress, Cold Hostility, and Sarcasm are derived later by the compound resolver from relationships among sentiment, toxicity, valence, and arousal.

## Speaker Profile

The emotion service builds a per-call acoustic profile from each analyzed segment. It tracks:

* Relative energy
* A pitch-related acoustic proxy
* A speech-rate-related acoustic proxy

After five analyzed segments, approximately 10 seconds of voiced audio, the service emits normalized deltas relative to that caller's running baseline. It can also report whether recent energy is rising, falling, or stable.
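The warmup behavior can be sketched for a single feature such as relative energy. This Python example assumes the baseline is a running mean that includes the current segment and that the delta is the relative deviation from that mean. Both choices, along with the name `RunningBaseline`, are illustrative assumptions; only the five-segment warmup comes from this page.

```python
from dataclasses import dataclass
from typing import Optional

WARMUP_SEGMENTS = 5  # from this page: deltas appear after five analyzed segments


@dataclass
class RunningBaseline:
    """Running mean of one acoustic feature for a single call."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> Optional[float]:
        """Fold in a new segment; return a normalized delta once warmed up."""
        self.count += 1
        self.mean += (value - self.mean) / self.count  # incremental mean
        if self.count < WARMUP_SEGMENTS or self.mean == 0.0:
            return None
        # Assumed normalization: relative deviation from the running mean.
        return (value - self.mean) / abs(self.mean)
```

Returning `None` during warmup mirrors the idea that the payload carries no profile values until enough of the caller's own speech has been observed.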

These speaker-profile values are currently observability data. They appear in the live emotion payload after warmup, but they are not fed back into acoustic inference, the rolling emotional state, empathy classification, or the compound resolver.

## Rolling Emotional State

The voice agent keeps at most four recent acoustic segments, approximately eight seconds of voiced audio. Valence and arousal are averaged with linear recency weights, so newer segments influence the state more than older ones.
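A minimal Python sketch of a four-segment window with linear recency weights follows. It assumes weights of 1, 2, ..., n from oldest to newest, which is one common reading of "linear"; the exact weights used by the runtime are not documented here, and `RollingState` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Tuple

MAX_SEGMENTS = 4  # from this page: at most four recent segments


class RollingState:
    def __init__(self) -> None:
        # Each entry is (valence, arousal); the deque drops the oldest entry itself.
        self.segments: Deque[Tuple[float, float]] = deque(maxlen=MAX_SEGMENTS)

    def add(self, valence: float, arousal: float) -> None:
        self.segments.append((valence, arousal))

    def averages(self) -> Tuple[float, float]:
        """Weighted means using weights 1, 2, ..., n from oldest to newest."""
        if not self.segments:
            raise ValueError("no segments yet")
        weights = range(1, len(self.segments) + 1)
        total = sum(weights)
        valence = sum(w * v for w, (v, _) in zip(weights, self.segments)) / total
        arousal = sum(w * a for w, (_, a) in zip(weights, self.segments)) / total
        return valence, arousal
```

With a full window under this weighting, the newest segment carries 4/10 of the total weight and the oldest carries 1/10.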

The rolling state derives a dominant runtime label from the aggregated valence-arousal region. It also applies two cross-channel checks:

* Mildly negative acoustic valence requires negative transcript sentiment before the state commits to Sadness.
* High toxicity with negative sentiment can replace an otherwise calm or positive label with Hostility or Contempt.

Trend classification starts once four segments are available. It compares the first and second halves of the four-segment window and reports improving, stable, or deteriorating.
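The half-window comparison can be sketched as below. This Python example assumes the comparison is made on valence and uses a hypothetical dead band, `TREND_EPSILON`, to decide when a change counts as stable; neither detail is specified on this page.

```python
from typing import List, Optional

TREND_EPSILON = 0.1  # hypothetical dead band; the real threshold is not documented


def classify_trend(valences: List[float]) -> Optional[str]:
    """Compare the older and newer halves of a four-segment valence window."""
    if len(valences) < 4:
        return None  # trend classification starts at four segments
    window = valences[-4:]
    older = (window[0] + window[1]) / 2
    newer = (window[2] + window[3]) / 2
    change = newer - older
    if change > TREND_EPSILON:
        return "improving"
    if change < -TREND_EPSILON:
        return "deteriorating"
    return "stable"
```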

### Signal Boundaries

| Signal                          | Current Use                                                                            |
| ------------------------------- | -------------------------------------------------------------------------------------- |
| **Categorical scores**          | Per-segment observer data and current-turn compound dyads                              |
| **Rolling valence and arousal** | Empathy classification, prompt guidance, TTS tone selection, and compound trajectories |
| **Dominant runtime label**      | Prompt annotations, empathy checks, TTS tone mapping, and call summary                 |
| **Trend**                       | Prompt guidance and call summary                                                       |
| **Sentiment and toxicity**      | Cross-channel label checks, compound signals, observer data, and call summary          |
| **Coherence**                   | Diagnostic agreement score in observer data and the terminal emotion summary           |
| **Speaker profile**             | Live observer data after warmup                                                        |

The terminal emotion summary stores aggregate evidence, including the final dominant label, average valence and arousal, peak negative valence, shift count, final trend, segment count, behavioral counters, coherence, sentiment, and toxicity. It is not a guaranteed per-segment timeline.

## Empathy Tier Classification

Before navigation for each caller turn, a rule-based classifier assigns one of four empathy tiers. It uses the current transcript plus the rolling valence, arousal, dominant label, and recent acoustic valence history. It does not make an additional model call.

| Tier | Name             | Runtime Behavior                                                             |
| ---- | ---------------- | ---------------------------------------------------------------------------- |
| T0   | **Functional**   | Normal task-oriented response and filler behavior                            |
| T1   | **Light Touch**  | Empathy-oriented filler content before normal task content                   |
| T2   | **Full Empathy** | A configured empathy hold and an empathy-first response prompt               |
| T3   | **Hold Space**   | A longer hold, filler suppression, and no task advancement for that response |

The classifier checks higher tiers first:

* **T3** - Explicit crisis or loss language; implicit grief markers such as funeral, bereavement, hospice, or palliative-care context
* **T2** - Strong negative valence; Fear or Sadness with agreeing negative valence; three recent negative acoustic valence readings; distress, helplessness, or financial-distress language
* **T1** - Mild negative valence, mild concern language, concern for a dependent, vulnerability cues, or negative high arousal

Negation, figurative crisis phrases, and resolved past-tense statements are filtered before keyword rules are applied. For example, "I'm not worried," "dying to know," and "I was worried but I'm fine now" do not take the same path as current distress.

The empathy tier is classified before the current turn's compound snapshot is created. Compound scores do not raise or lower the empathy tier in the current runtime.
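The highest-tier-first ordering can be sketched as a chain of rule checks. In this Python example every threshold is hypothetical, the language flags stand in for keyword rules that have already passed negation and past-tense filtering, and "three recent negative readings" is read as the last three readings all being below zero. `TurnSignals` and `classify_tier` are illustrative names, not the runtime's API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; the runtime's real values are not documented here.
STRONG_NEGATIVE = -0.5
MILD_NEGATIVE = -0.2
HIGH_AROUSAL = 0.6


@dataclass
class TurnSignals:
    valence: float
    arousal: float
    label: str
    recent_valences: List[float]
    crisis_or_grief: bool = False    # T3 language, already filtered
    distress_language: bool = False  # T2 language, already filtered
    concern_language: bool = False   # T1 language, already filtered


def classify_tier(s: TurnSignals) -> int:
    """Return 0-3, checking the highest tier first."""
    if s.crisis_or_grief:
        return 3
    last_three = s.recent_valences[-3:]
    if (
        s.valence <= STRONG_NEGATIVE
        or (s.label in ("Fear", "Sadness") and s.valence < 0)
        or (len(last_three) == 3 and all(v < 0 for v in last_three))
        or s.distress_language
    ):
        return 2
    if (
        s.valence <= MILD_NEGATIVE
        or (s.valence < 0 and s.arousal >= HIGH_AROUSAL)
        or s.concern_language
    ):
        return 1
    return 0
```

The function takes only rule inputs and makes no model call, matching the rule-based design described in this section.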

### Empathy Baseline

The controller keeps a separate empathy baseline so delivery does not snap back immediately after a difficult turn. Each tier contributes a bounded signal, and the baseline decays across later caller turns. When elevated, it can reduce configured TTS speed, subject to the service's minimum-speed floor.
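One way to picture a decaying baseline is shown below. This Python sketch assumes a multiplicative per-turn decay, takes the larger of the decayed baseline and the new tier signal, and scales TTS speed linearly with the baseline. All constants and the update rule are hypothetical; this page states only that the signal is bounded, that it decays, and that a minimum-speed floor applies.

```python
TIER_SIGNAL = {0: 0.0, 1: 0.3, 2: 0.7, 3: 1.0}  # hypothetical bounded contributions
DECAY = 0.6          # hypothetical per-turn decay factor
MAX_SLOWDOWN = 0.15  # hypothetical largest fractional speed reduction
MIN_SPEED = 0.8      # hypothetical minimum-speed floor


class EmpathyBaseline:
    def __init__(self) -> None:
        self.value = 0.0

    def on_turn(self, tier: int) -> float:
        """Decay the carried baseline, then keep the larger of it and the new signal."""
        self.value = max(self.value * DECAY, TIER_SIGNAL[tier])
        return self.value

    def tts_speed(self, configured_speed: float) -> float:
        """Slow delivery in proportion to the baseline, never below the floor."""
        slowed = configured_speed * (1.0 - MAX_SLOWDOWN * self.value)
        # Never speed up: the result stays at or below the configured speed.
        return min(configured_speed, max(slowed, MIN_SPEED))
```

Under these assumptions, a T3 turn followed by two T0 turns leaves the baseline at 0.36, so delivery remains slightly slower instead of snapping back.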

## Behavioral Signals

The rolling state also tracks caller behavior:

| Signal                    | Rolling-State Threshold                         | Prompt Effect                   |
| ------------------------- | ----------------------------------------------- | ------------------------------- |
| **Barge-ins**             | 2 or more during the call                       | Notes repeated interruptions    |
| **Short response streak** | 3 or more consecutive turns of 4 words or fewer | Notes sustained terse responses |
| **Long silence count**    | 2 or more gaps of at least 5 seconds            | Notes repeated extended pauses  |

These counters can appear in emotional prompt guidance once at least two acoustic segments are available. The compound resolver uses related per-turn observations over its own five-turn window, with stricter cross-signal conditions described in [Compound Emotions](/channels/voice/compound-emotions.md).
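The thresholds in the table can be sketched as simple counters. The thresholds and the two-segment gate in this Python example come from this page; the class name, method names, note strings, and the shape of the per-turn inputs are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BehaviorCounters:
    barge_ins: int = 0
    short_streak: int = 0
    long_silences: int = 0

    def on_caller_turn(self, transcript: str, barged_in: bool, gap_seconds: float) -> None:
        """Update counters from one caller turn and the silence that preceded it."""
        if barged_in:
            self.barge_ins += 1
        # Four words or fewer extends the streak; a longer turn resets it.
        if len(transcript.split()) <= 4:
            self.short_streak += 1
        else:
            self.short_streak = 0
        if gap_seconds >= 5.0:
            self.long_silences += 1

    def prompt_notes(self, acoustic_segments: int) -> List[str]:
        """Return notes only once at least two acoustic segments exist."""
        if acoustic_segments < 2:
            return []
        notes: List[str] = []
        if self.barge_ins >= 2:
            notes.append("repeated interruptions")
        if self.short_streak >= 3:
            notes.append("sustained terse responses")
        if self.long_silences >= 2:
            notes.append("repeated extended pauses")
        return notes
```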

## TTS and Prompt Steering

After enough acoustic evidence is available, the rolling state can add concise adaptation guidance to the response prompt and map caller state to an empathetic TTS tone. The mapping responds to the caller without imitating negative emotion: an angry or disgusted signal maps to calm delivery, Fear or Sadness maps to sympathetic delivery, and Joy maps to enthusiastic delivery.

Weak or unmapped acoustic turns retain the previous successfully derived emotion tone instead of resetting immediately. This tone momentum applies to the emotion-derived TTS tone; navigation can still select a response emotion, and a configured workspace tone can override the computed voice-context tone. See [TTS Tone Resolution](/channels/voice/audio-pipeline.md#tts-tone-resolution) for the complete order.
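Tone momentum can be sketched as a mapper that remembers its last successful result. The tone pairs in this Python example follow the mapping described above, but the label name `Disgust`, the tone strings, and the class `EmotionTone` are assumptions. The sketch covers only the emotion-derived tone and leaves out navigation and workspace overrides.

```python
from typing import Dict, Optional

# Label-to-tone pairs described on this page; exact identifiers are assumed.
TONE_BY_LABEL: Dict[str, str] = {
    "Anger": "calm",
    "Disgust": "calm",
    "Fear": "sympathetic",
    "Sadness": "sympathetic",
    "Joy": "enthusiastic",
}


class EmotionTone:
    def __init__(self) -> None:
        self.last_tone: Optional[str] = None

    def update(self, label: Optional[str]) -> Optional[str]:
        """Map a label to a tone, keeping the previous tone on weak or unmapped turns."""
        tone = TONE_BY_LABEL.get(label) if label is not None else None
        if tone is not None:
            self.last_tone = tone
        return self.last_tone
```

A turn labeled `Calmness` after a `Sadness` turn therefore keeps the sympathetic tone in this sketch, because `Calmness` has no entry in the table.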

If no emotion-derived tone is available, the current context-graph action can trigger a sympathetic fallback when it matches a configured sensitive topic. This is a delivery fallback, not a prediction that the caller is distressed.

Call-duration guidance is prompt-only. During an extended call with deteriorating or negative evidence, the prompt can ask for shorter, resolution-focused responses and suggest considering escalation. These checks do not automatically transfer or escalate the call.

## Compound Emotions

At each caller turn, the runtime resolves zero or more scored compounds from acoustic, linguistic, and behavioral evidence over the most recent five caller turns.

{% content-ref url="/pages/IeSDYAEqjzKUcXkDFduG" %}
[Compound Emotions](/channels/voice/compound-emotions.md)
{% endcontent-ref %}

## Fault Tolerance

Emotion analysis is optional to the live conversation path:

* A connection failure is non-fatal and the call continues without emotion-derived steering.
* Backpressure can skip best-effort emotion segments rather than delaying speech processing.
* Repeated receive or parsing failures can disable the emotion stream for the rest of that session.
* Transcript-analysis errors do not interrupt acoustic analysis or the call.

When emotion data is unavailable, the voice runtime uses navigation, configured voice settings, and provider defaults. It does not synthesize missing emotion evidence.
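The skip-on-backpressure and disable-on-failure behaviors can be sketched with a bounded queue. This Python example uses the standard library `queue.Queue`. The queue capacity, the failure limit, and the `EmotionStream` class are hypothetical, and the page does not say whether failures must be consecutive to disable the stream.

```python
import queue

MAX_FAILURES = 3  # hypothetical limit before the stream is disabled


class EmotionStream:
    def __init__(self, capacity: int = 2) -> None:
        self.pending: "queue.Queue[bytes]" = queue.Queue(maxsize=capacity)
        self.failures = 0
        self.skipped = 0
        self.enabled = True

    def offer(self, segment: bytes) -> bool:
        """Queue a segment without blocking; skip it when the queue is full."""
        if not self.enabled:
            return False
        try:
            self.pending.put_nowait(segment)
            return True
        except queue.Full:
            self.skipped += 1  # best-effort: drop rather than delay speech
            return False

    def record_failure(self) -> None:
        """Count a receive or parsing failure; disable the stream at the limit."""
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.enabled = False
```

Because `offer` never blocks, the caller's speech-processing loop is not held up by a slow or unavailable emotion service.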

