# Reasoning Engine

The reasoning engine is the core intelligence layer of the Amigo platform. It processes input signals from any channel - voice, SMS, simulation, or API - through a unified pipeline that navigates context graphs, executes tools, adapts to the caller's emotional state, and generates responses. The same reasoning path runs regardless of how the conversation arrives.

## Why a Unified Engine Matters

Early voice AI systems tightly couple reasoning logic with audio transport. The agent's decision-making is interleaved with speech-to-text timing, filler audio generation, and WebSocket management. This coupling means every new channel (SMS, simulation, API webhooks) must reimplement the reasoning loop from scratch, and bugs fixed in one channel don't propagate to others.

Amigo separates these concerns cleanly. The reasoning engine never touches audio, transport protocols, or channel-specific I/O. Instead, modality adapters convert channel-specific input into typed signals, feed them to the engine, and execute the effects that come back.

<figure><img src="https://3635224444-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvcLyiHRcwv7g83p6vxAd%2Fuploads%2Fgit-blob-ac6638b0de913d4f92da5c583ca16983ebe33e02%2Freasoning-engine-blue.svg?alt=media" alt="Unified reasoning engine: modality adapters feed signals to Perceive, Reason, Execute pipeline"><figcaption></figcaption></figure>

## Cut / Navigate / Engage

Three operations drive the entire system:

1. **Cut** - A signal arrives. The system asks: does this end the current state? A cut is a compression - between two cuts, dozens of raw signals may arrive (audio segments, emotion scores, behavioral patterns, tool events). The cut absorbs them into a handful of causally relevant fields and discards the rest. This compression is what makes navigation tractable.
2. **Navigate** - Given the compressed state and the trajectory of previous states, select the next state. Navigation is a pure decision: no side effects, no I/O. It can be called speculatively, cached, or replayed.
3. **Engage** - Execute the selected state: generate a response, call a tool, play audio, or set a deadline that will produce the next signal.

These three operations are fractal - the same pattern at every scale:

| Scale            | Cut                                         | Navigate                                    | Engage                                  |
| ---------------- | ------------------------------------------- | ------------------------------------------- | --------------------------------------- |
| **Conversation** | Signal ends the current context graph state | Select the next state from the graph        | Generate response, call tools           |
| **Turn**         | Signal ends the current voice state         | Select breath, filler, hold, or response    | Enqueue utterance with voice parameters |
| **Audio**        | Utterance boundary                          | Select emotion and speed for this utterance | Stream audio to the caller              |

Every voice behavior - fillers, silence, empathy pauses, tool progress narration, barge-in recovery - is a special case of cut/navigate/engage. Not a separate mechanism. One pattern, applied at the right scale. The [voice timeline](https://docs.amigo.ai/channels/voice/audio-pipeline#voice-timeline) describes how this pattern operates within each turn.
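
To make the pattern concrete, here is a minimal sketch of the three operations at the conversation scale. All names (`RawSignal`, `CompressedState`, the specific fields) are illustrative assumptions, not the platform's actual API; the point is the shape: cut compresses, navigate is pure, engage emits.

```typescript
// Illustrative sketch of cut / navigate / engage at the conversation scale.
// All names and field choices are hypothetical.
type RawSignal = { kind: string; payload?: unknown };

interface CompressedState {
  stateId: string;                  // the context graph state this cut closed
  fields: Record<string, unknown>;  // the handful of causally relevant fields
}

// Cut: decide whether the batch of raw signals ends the current state and,
// if so, compress it. Everything not captured in `fields` is discarded.
function cut(current: CompressedState, signals: RawSignal[]): CompressedState | null {
  const endsState = signals.some((s) => s.kind === "utterance" || s.kind === "tool_result");
  if (!endsState) return null;      // no cut yet: keep accumulating raw signals
  return { stateId: current.stateId, fields: { signalCount: signals.length } };
}

// Navigate: a pure decision over the trajectory of compressed states.
// No side effects or I/O, so it can be cached, replayed, or run speculatively.
function navigate(trajectory: CompressedState[]): string {
  const latest = trajectory[trajectory.length - 1];
  return latest.fields["toolSucceeded"] ? "confirm_and_close" : latest.stateId;
}

// Engage: execute the selected state by emitting an effect (respond, tool call, ...).
function engage(nextStateId: string): { kind: "respond"; stateId: string } {
  return { kind: "respond", stateId: nextStateId };
}
```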

## Signals and Effects

The engine communicates through two primitives.

**Signals** represent something that happened. Every input, regardless of source, is normalized into a typed signal before reaching the engine:

| Signal             | What It Represents                                                                                 |
| ------------------ | -------------------------------------------------------------------------------------------------- |
| **Utterance**      | The caller or user said something (text, from any source)                                          |
| **Emotion**        | An emotional state update from voice prosody, vocal bursts, text analysis, or conversation context |
| **Tool result**    | A tool execution completed with a result                                                           |
| **Silence**        | The caller has been silent beyond the configured threshold                                         |
| **Barge-in**       | The caller interrupted the agent mid-speech                                                        |
| **External event** | An injected event from an operator, surface submission, or external system                         |
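
As a rough illustration, the signal table maps naturally onto a discriminated union. The field names below are assumptions; only the signal kinds come from the table above.

```typescript
// Hypothetical shape of the engine's input signals; field names are illustrative.
type Signal =
  | { kind: "utterance"; text: string; source: "voice" | "sms" | "simulation" | "api" }
  | { kind: "emotion"; name: string; valence: number }
  | { kind: "tool_result"; toolName: string; result: unknown }
  | { kind: "silence"; durationMs: number }
  | { kind: "barge_in"; interruptedAtMs: number }
  | { kind: "external_event"; origin: "operator" | "surface" | "external_system"; payload: unknown };
```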

**Effects** represent something the engine wants to happen. The modality adapter decides how to execute each one:

| Effect        | Voice                                            | SMS                       | Simulation                  |
| ------------- | ------------------------------------------------ | ------------------------- | --------------------------- |
| **Respond**   | Stream through TTS with emotion-appropriate tone | Send as SMS message       | Capture in trace log        |
| **Filler**    | Play filler audio ("Let me check on that...")    | No-op                     | No-op                       |
| **Pause**     | Hold silence for empathetic beat                 | Delay before next message | Record pause duration       |
| **Tool call** | Execute tool, feed result back as signal         | Same                      | Same (branch-isolated data) |
| **Terminate** | Hang up after final speech                       | End session               | Return final state          |
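
Effects can be sketched the same way, with the adapter deciding what each one means on its channel. The effect kinds come from the table; the `VoiceOutput` methods are hypothetical stand-ins for a real voice adapter.

```typescript
// Hypothetical effect union and a voice-adapter dispatch; the VoiceOutput
// method names are assumptions, not the real adapter API.
type Effect =
  | { kind: "respond"; text: string }
  | { kind: "filler"; text: string }
  | { kind: "pause"; durationMs: number }
  | { kind: "tool_call"; toolName: string; args: Record<string, unknown> }
  | { kind: "terminate" };

interface VoiceOutput {
  streamTts(text: string): Promise<void>;
  playFiller(text: string): Promise<void>;
  holdSilence(ms: number): Promise<void>;
  hangUp(): Promise<void>;
}

async function executeOnVoice(effect: Effect, out: VoiceOutput): Promise<void> {
  switch (effect.kind) {
    case "respond":   await out.streamTts(effect.text); break;
    case "filler":    await out.playFiller(effect.text); break;
    case "pause":     await out.holdSilence(effect.durationMs); break;
    case "tool_call": /* execute the tool, feed the result back as a tool_result signal */ break;
    case "terminate": await out.hangUp(); break;
  }
}
```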

## The Pipeline

Each signal flows through three stages.

**Perceive.** The modality adapter converts raw input into typed signals. A voice adapter produces utterance signals from speech-to-text and emotion signals from prosody analysis. An SMS adapter produces utterance signals from message text. A simulation adapter injects both from test parameters.

**Reason.** The engine's core loop implements cut/navigate/engage at the conversation level:

1. **Navigate** - The context graph engine determines the current state, evaluates transition conditions, and selects the appropriate action.
2. **Engage** - The response generation model produces a reply, drawing on the agent's persona, current state context, dynamic behaviors, memory, patient data from the world model, and the emotional context described below.
3. **Execute** - If the model calls tools, the engine executes them, feeds results back as tool result signals, and re-engages. This loop continues until a final text response is produced.

**Act.** The engine emits effects. The modality adapter executes each one according to channel capabilities. For voice, the [voice timeline](https://docs.amigo.ai/channels/voice/audio-pipeline#voice-timeline) applies cut/navigate/engage within each turn to coordinate fillers, empathy pauses, and tool progress narration - the same three operations at a smaller scale.

The engine supports two processing modes. **Streaming mode** (voice) returns a prompt that the adapter streams through the language model and TTS pipeline in real time, minimizing time-to-first-audio. **Batch mode** (text, simulation, API) runs the full loop internally and returns completed effects. Both modes execute the same navigation, tool execution, and empathy classification logic.
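
A minimal sketch of the conversation-level Reason loop, reusing the `Signal` and `Effect` shapes from the sketches above. `navigateGraph`, `generateResponse`, and `runTool` are hypothetical stand-ins for the real components.

```typescript
// Sketch of the Reason stage: navigate once, then engage and execute tools
// in a loop until the model produces a final text response.
interface ModelOutput { text?: string; toolCall?: { name: string; args: unknown } }

async function reason(
  signal: Signal,
  navigateGraph: (s: Signal) => Promise<string>,
  generateResponse: (stateId: string, toolResults: unknown[]) => Promise<ModelOutput>,
  runTool: (name: string, args: unknown) => Promise<unknown>,
): Promise<Effect[]> {
  const stateId = await navigateGraph(signal);                    // 1. Navigate
  const toolResults: unknown[] = [];
  for (;;) {
    const output = await generateResponse(stateId, toolResults);  // 2. Engage
    if (output.toolCall) {                                        // 3. Execute, then re-engage
      toolResults.push(await runTool(output.toolCall.name, output.toolCall.args));
      continue;
    }
    return [{ kind: "respond", text: output.text ?? "" }];
  }
}
```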

## Emotional Adaptation

The engine injects emotional context into every prompt through two paths. This adaptation is modality-independent - text sessions receive the same steering when emotion data is available.

**Per-message annotations.** Every user message in the interaction log carries an emotion annotation: the detected emotion name and its valence. The response model sees these inline with the conversation history, giving it a turn-by-turn emotional trajectory. This lets the model distinguish between "okay" said with frustrated resignation and "okay" said with satisfied agreement - even in text transcripts where the words are identical.

**Session-level steering.** A summary of the caller's overall emotional state is injected into every prompt:

* **Dominant emotion and trend** - Is the caller improving, stable, or deteriorating?
* **Adaptation instructions** - Targeted guidance based on the caller's emotional quadrant (high-arousal negative callers need de-escalation; low-arousal negative callers need patience)
* **Behavioral signals** - Patterns like repeated interruptions, short response streaks, or extended silences that indicate disengagement or frustration independent of vocal emotion
* **Call-phase urgency** - After extended calls with deteriorating mood, the engine instructs the model to become more direct and resolution-focused
* **Coherence warnings** - When what the caller says and how they sound disagree, the engine flags the ambiguity so the model does not over-commit to a single interpretation

The combination gives the model both granular history (what happened on each turn) and strategic direction (what to do about the overall pattern).
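
As an illustration of what session-level steering might look like once rendered into a prompt, the sketch below assembles the five elements above into text. The field names, thresholds, and wording are assumptions, not the engine's actual values.

```typescript
// Hypothetical assembly of the session-level emotional steering block.
interface EmotionSummary {
  dominantEmotion: string;
  trend: "improving" | "stable" | "deteriorating";
  arousal: "high" | "low";
  valence: "positive" | "negative";
  behavioralSignals: string[];   // e.g. "repeated interruptions", "short response streak"
  callMinutes: number;
  speechSoundMismatch: boolean;  // words and prosody disagree
}

function steeringBlock(s: EmotionSummary): string {
  const lines = [`Dominant emotion: ${s.dominantEmotion} (${s.trend}).`];
  if (s.valence === "negative") {
    lines.push(s.arousal === "high"
      ? "Caller is escalated: de-escalate before problem-solving."
      : "Caller is withdrawn: be patient and invite input.");
  }
  if (s.behavioralSignals.length > 0) {
    lines.push(`Behavioral signals: ${s.behavioralSignals.join(", ")}.`);
  }
  if (s.callMinutes > 10 && s.trend === "deteriorating") {
    lines.push("Long call with declining mood: be direct and resolution-focused.");
  }
  if (s.speechSoundMismatch) {
    lines.push("Spoken content and tone disagree: avoid over-committing to one interpretation.");
  }
  return lines.join("\n");
}
```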

## Per-State Configuration (TurnPolicy)

Each context graph state can configure the pipeline independently. A medication verification state behaves differently from a general scheduling state - not because the reasoning logic changes, but because the state's turn policy tunes the pipeline for that context.

<figure><img src="https://3635224444-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvcLyiHRcwv7g83p6vxAd%2Fuploads%2Fgit-blob-35bf5e14898cf5081a0682e5336ca2f153eca963%2Fturn-policy-green.svg?alt=media" alt="TurnPolicy: per-state configuration of barge-in, safety, context strategy, tool availability, and STT sensitivity"><figcaption></figcaption></figure>

Five areas are configurable per state (a sketch of the combined policy follows the list):

* **Barge-in** - Enable or disable caller interruptions. A greeting state can suppress barge-in for a configurable shield duration so the agent's opening message always plays in full. A quick-answer state keeps barge-in enabled for snappy turn-taking.
* **Safety response** - What happens when a safety rule fires. Options: stay in the conversation and respond with empathy, suspend the agent and route to an operator, or log an alert without interrupting.
* **Context strategy** - How conversation history is managed as the call gets long. Full history, summarized history, or aggressive compaction. A degradation threshold triggers automatic downgrading when the conversation crosses a configured length.
* **Tool availability** - Which tools the agent can call in this state. Specific tools can be blocked entirely, or blocked after a configured number of turns to prevent loops.
* **STT sensitivity** (voice only) - End-of-turn thresholds and silence timeouts. Data collection states use higher thresholds and longer timeouts because callers pause between pieces of information. Quick-answer states use lower thresholds for faster responses.
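
A hypothetical policy shape covering the five areas, with a sample state configured for careful data collection. Field names and values are illustrative, not the platform's schema.

```typescript
// Hypothetical per-state turn policy; names and values are illustrative.
interface TurnPolicy {
  bargeIn: { enabled: boolean; shieldDurationMs?: number };
  safetyResponse: "respond_with_empathy" | "suspend_and_route_to_operator" | "log_only";
  contextStrategy: { mode: "full" | "summarize" | "compact"; degradeAfterTurns?: number };
  toolAvailability: { blocked?: string[]; blockAfterTurns?: Record<string, number> };
  sttSensitivity?: { endOfTurnThreshold: number; silenceTimeoutMs: number }; // voice only
}

// Example: a medication verification state that collects data slowly and carefully.
const medicationVerification: TurnPolicy = {
  bargeIn: { enabled: false, shieldDurationMs: 4000 },
  safetyResponse: "suspend_and_route_to_operator",
  contextStrategy: { mode: "full", degradeAfterTurns: 40 },
  toolAvailability: { blockAfterTurns: { save_medication: 3 } },
  sttSensitivity: { endOfTurnThreshold: 0.8, silenceTimeoutMs: 2500 },
};
```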

## Navigation Optimizations

Two optimizations reduce latency without changing reasoning behavior.

**Speculative navigation.** When the STT engine signals moderate confidence that the caller may have finished speaking, the engine fires the navigation step speculatively in the background. If the end-of-turn is confirmed and the transcript matches, the cached result saves 200-400ms. If the caller continues speaking, the speculative result is discarded with no impact on the conversation.

**Completion-gated navigation.** When a tool is configured with navigate-on-completion and succeeds, the engine re-runs navigation immediately - without waiting for the next user message. This enables state transitions driven by tool results. A state whose entire purpose is "save the appointment" transitions forward the moment the tool succeeds, rather than waiting for the caller to speak.
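
A sketch of the speculative path, assuming a confidence score from the STT engine and a `navigate` stand-in; the thresholds are illustrative. The cached result is only used when the final transcript matches the one the speculation was computed from.

```typescript
// Hypothetical speculative navigation: fire navigation early at moderate
// confidence, reuse the result only if the final transcript matches.
interface SpeculativeResult { transcript: string; nextState: Promise<string> }

let speculative: SpeculativeResult | null = null;

function onSttUpdate(
  transcript: string,
  endOfTurnConfidence: number,
  navigate: (t: string) => Promise<string>,
): void {
  if (endOfTurnConfidence >= 0.5 && endOfTurnConfidence < 0.9) {
    // Moderate confidence: start navigation in the background.
    speculative = { transcript, nextState: navigate(transcript) };
  }
}

async function onEndOfTurn(
  finalTranscript: string,
  navigate: (t: string) => Promise<string>,
): Promise<string> {
  if (speculative && speculative.transcript === finalTranscript) {
    return speculative.nextState;   // cache hit: the navigation call is already in flight
  }
  speculative = null;               // caller kept talking or transcript changed: discard
  return navigate(finalTranscript);
}
```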

## Graceful Degradation

Every intelligence layer has a fallback, so a failure in any single component degrades gracefully instead of dropping the conversation.

| Component              | Failure                               | Fallback                                                      |
| ---------------------- | ------------------------------------- | ------------------------------------------------------------- |
| **Emotion detection**  | Connection lost or consecutive errors | Continues with workspace-default emotional settings           |
| **Audio verification** | Correction service unavailable        | Agent reasons from raw STT output                             |
| **Navigation model**   | Timeout or error                      | Retries with a fallback model; filler speech covers the retry |
| **Context strategy**   | Token budget exceeded                 | Automatically downgrades from full to summarize to compact    |

A transient issue in any one system - emotion analysis, audio verification, the primary navigation model - never cascades into a dropped call.
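
The navigation-model row, for example, amounts to a try-with-fallback wrapper. A minimal sketch, assuming hypothetical `primary`, `fallback`, and `coverWithFiller` callbacks:

```typescript
// Illustrative fallback wrapper in the spirit of the table above.
async function navigateWithFallback(
  prompt: string,
  primary: (p: string) => Promise<string>,
  fallback: (p: string) => Promise<string>,
  coverWithFiller: () => void,
): Promise<string> {
  try {
    return await primary(prompt);
  } catch {
    coverWithFiller();         // filler speech covers the retry on voice
    return fallback(prompt);   // retry with the fallback model
  }
}
```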

## Voice Control Plane

Voice calls add a configuration layer that controls the agent's vocal identity and delivery. Configuration follows a three-level hierarchy where each level can override the one below it.

<figure><img src="https://3635224444-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvcLyiHRcwv7g83p6vxAd%2Fuploads%2Fgit-blob-7b5d00a714f6a4acc02267d86cee074c8738f787%2Fvoice-control-plane-blue.svg?alt=media" alt="Voice control plane: per-service config, workspace voice settings, and system defaults with automatic emotional adaptation"><figcaption></figcaption></figure>

| Level                        | What It Controls                                                                   | Example                                                                                            |
| ---------------------------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **Per-service voice config** | Filler style, barge-in sensitivity, response limits, voice timing, call forwarding | A triage service uses longer empathy holds; a scheduling service uses shorter transition deadlines |
| **Workspace voice settings** | Default voice, baseline tone, speed, language, domain vocabulary, sensitive topics | All services in a pediatric practice default to a warm, patient tone                               |
| **System defaults**          | Engineering fallback values                                                        | Calm tone, standard pacing                                                                         |

Each level inherits from the one below. A service only specifies the settings it wants to override.
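
Because each level only carries overrides, resolution can be a simple three-way merge. A sketch with hypothetical setting names:

```typescript
// Hypothetical three-level merge: a service only specifies overrides; everything
// else falls through to workspace settings and then system defaults.
interface VoiceSettings { voice?: string; tone?: string; speedWpm?: number; language?: string }

const systemDefaults = { voice: "default", tone: "calm", speedWpm: 150, language: "en-US" };

function resolveVoiceSettings(workspace: VoiceSettings, service: VoiceSettings) {
  return { ...systemDefaults, ...workspace, ...service };
}

// Example: a pediatric workspace sets a warm tone; one service only overrides pacing.
const resolved = resolveVoiceSettings({ tone: "warm" }, { speedWpm: 135 });
// => { voice: "default", tone: "warm", speedWpm: 135, language: "en-US" }
```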

When the emotion detection system identifies strong signals from the caller, the engine overrides the configured baseline automatically - always in the direction of more empathy, never less. A scheduling service configured with a cheerful baseline shifts to sympathetic when the caller sounds distressed. The baseline resumes when the signal subsides. See [Emotion Detection](https://docs.amigo.ai/channels/voice/emotion-detection) and the [Audio Pipeline](https://docs.amigo.ai/channels/voice/audio-pipeline) for how this works in practice.

## Concurrency

Every session - voice, text, simulation - runs as a single actor that processes signals sequentially. There are no locks, no concurrent handlers, no callback chains. Inbound events (caller speech, delivery receipts, surface submissions, operator actions, timeouts) all publish to one channel per session. The actor consumes them in order.

This eliminates race conditions by construction. When a patient sends two SMS messages in quick succession, or a surface submission arrives while the agent is mid-response, or an operator joins a call at the same moment the silence monitor fires - there is no contention. Every event waits its turn. The ordering is deterministic and replayable.
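
A minimal sketch of the per-session actor: one queue, one consumer, events handled strictly in arrival order. The event kinds and class shape are illustrative, not the platform's implementation.

```typescript
// Hypothetical per-session actor: every inbound event is published to one
// queue and processed sequentially, so there is nothing to lock.
type SessionEvent =
  | { kind: "caller_speech"; text: string }
  | { kind: "surface_submission"; payload: unknown }
  | { kind: "operator_action"; action: string }
  | { kind: "timeout"; timer: string };

class SessionActor {
  private queue: SessionEvent[] = [];
  private running = false;

  publish(event: SessionEvent): void {
    this.queue.push(event);
    if (!this.running) void this.drain();
  }

  private async drain(): Promise<void> {
    this.running = true;
    while (this.queue.length > 0) {
      const event = this.queue.shift()!;
      await this.handle(event);   // one event at a time: no races by construction
    }
    this.running = false;
  }

  private async handle(event: SessionEvent): Promise<void> {
    // Run the cut/navigate/engage pipeline for this event.
    console.log("processing", event.kind);
  }
}
```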

The same model applies within voice turns. Fillers, responses, empathy pauses, and tool progress narration are not separate subsystems competing for the audio stream. They are effects emitted by the same actor into a single timeline, scheduled by cut/navigate/engage at the turn scale. One actor, one timeline.

## Modality Adapters

Each adapter handles the channel-specific concerns that the reasoning engine does not touch:

| Adapter        | Signal Production                                                                                                                          | Effect Execution                                                                                                                                         |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Voice**      | STT produces utterance signals; prosody analysis produces emotion signals; silence and barge-in detectors produce their respective signals | Respond effects stream through TTS with emotion-adaptive delivery; fillers play audio; pauses hold silence; terminate effects hang up after final speech |
| **Text (SMS)** | Incoming messages produce utterance signals                                                                                                | Respond effects send SMS messages; terminate effects end the session                                                                                     |
| **Simulation** | Test parameters inject utterance and emotion signals                                                                                       | All effects are captured in a trace log; tool execution runs against branch-isolated data                                                                |

New modalities can be added by implementing an adapter that converts I/O to signals and executes effects. The reasoning engine requires no changes.
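
The adapter contract can be sketched as a small interface over the `Signal` and `Effect` shapes from the earlier sketches; the method names are assumptions.

```typescript
// Hypothetical adapter contract: a new modality only needs to turn channel I/O
// into signals and execute the effects that come back. The engine is untouched.
interface ModalityAdapter {
  // Perceive: convert raw channel input into typed signals for the engine.
  toSignals(rawInput: unknown): Signal[];
  // Act: execute each effect according to the channel's capabilities.
  execute(effect: Effect): Promise<void>;
}
```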

{% hint style="info" %}
**Related sections** - See [Context Graphs](https://docs.amigo.ai/agent/context-graphs) for how the engine navigates problem spaces, [Dynamic Behaviors](https://docs.amigo.ai/agent/context-graphs) for runtime adaptation rules, and [Voice Agent](https://docs.amigo.ai/channels/voice) for voice-specific pipeline details.
{% endhint %}

