# Performance Characteristics

Measured performance numbers for the major platform subsystems. These are operational characteristics, not theoretical limits.

## Voice Pipeline Latency

The voice pipeline has two primary latency numbers:

* **STT latency**: Sub-300ms from audio input to transcript text
* **Average response latency**: ~900ms from end of caller speech to start of agent audio

The 900ms gap is covered by [filler speech](https://docs.amigo.ai/channels/voice/audio-pipeline): the caller hears "Let me check on that" while the full response is generated.
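The filler-speech mechanism can be sketched as a race between response generation and a short timeout. This is an illustrative sketch only, not the platform's implementation; `FILLER_THRESHOLD_S` and the function names are assumptions:

```python
import asyncio

FILLER_THRESHOLD_S = 0.3  # hypothetical: play filler if no response within ~300ms


async def respond(generate_response, play_audio):
    """Mask response latency with filler speech (illustrative sketch)."""
    task = asyncio.ensure_future(generate_response())
    done, _ = await asyncio.wait({task}, timeout=FILLER_THRESHOLD_S)
    if not done:
        # Full response not ready yet: filler covers the ~900ms gap.
        await play_audio("Let me check on that.")
    # Play the full response once generation completes.
    await play_audio(await task)
```

The caller always hears something within the filler threshold, so the perceived latency is bounded even when generation takes the full ~900ms.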

### Per-Turn Timing Breakdown

Each conversational turn passes through five layers:

| Layer                  | What Happens                                                            |
| ---------------------- | ----------------------------------------------------------------------- |
| **STT processing**     | Audio converted to transcript text                                      |
| **Engine**             | Context graph navigation, dynamic behavior evaluation, memory retrieval |
| **Render**             | LLM generates response text with emotional context                      |
| **TTS generation**     | Text converted to speech audio with emotion parameters                  |
| **Transport delivery** | Audio delivered to the telephony layer                                  |
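To see where a slow turn spends its time, each layer can be wrapped in a timing context. A minimal sketch (the `TurnTimer` class is hypothetical; only the layer names come from the table above):

```python
import time
from contextlib import contextmanager


class TurnTimer:
    """Capture per-layer latency for one conversational turn (illustrative)."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def layer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed wall-clock time for this layer in milliseconds.
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0
```

Usage: `with timer.layer("stt"): ...`, then `with timer.layer("engine"): ...`, and so on; `timer.timings_ms` holds the per-layer breakdown for the turn.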

## Concurrency

The voice agent scales horizontally to handle large concurrent call volumes. There is no shared state between instances that would limit horizontal scaling.

## Connector Runner

The connector runner polls external data sources at configurable intervals and dispatches outbound writes in near-real-time. Each instance handles multiple concurrent data sources with coordination to prevent duplicate processing. Reconciliation runs periodically to catch any missed changes.
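One common way to coordinate multiple runner instances is a lease (lock with expiry) per data source, so only one instance processes a source at a time. The sketch below uses an in-memory dict as a stand-in for a shared store such as a database table; the class and method names are assumptions, not the platform's API:

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared per-source lease table (illustrative)."""

    def __init__(self):
        self._leases = {}  # source_id -> (owner, expiry_timestamp)

    def try_claim(self, source_id, owner, ttl_s=60):
        """Claim or renew the lease for a source; False if another owner holds it."""
        now = time.time()
        lease = self._leases.get(source_id)
        if lease is not None and lease[1] > now and lease[0] != owner:
            return False  # another instance holds an unexpired lease
        self._leases[source_id] = (owner, now + ttl_s)
        return True
```

An instance polls only the sources it can claim; expired leases are reclaimable, which is what lets reconciliation pick up sources whose owner died mid-poll.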

## Emotion Detection

| Parameter                        | Value                                                                       |
| -------------------------------- | --------------------------------------------------------------------------- |
| **Prosody models**               | Dual-model (categorical + dimensional) running in parallel per segment      |
| **Audio segment size**           | 2 seconds                                                                   |
| **Rolling window**               | Short window over most recent segments (tuned for real-time mood tracking)  |
| **Speaker normalization warmup** | ~10 seconds (5 segments) before per-caller baselines are meaningful         |
| **Context fusion**               | Applied after each turn; adds no measurable latency to the emotion pipeline |
| **Empathy tier classification**  | Rule-based, <100ms (no LLM call)                                            |
| **Circuit breaker recovery**     | Automatic recovery after consecutive failures                               |
| **Audio buffer**                 | Non-blocking, drops on overflow to maintain real-time processing            |
| **Text buffer**                  | Non-blocking, drops on overflow to maintain real-time processing            |
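The warmup and rolling-window behavior can be sketched as follows. The segment size and 5-segment warmup come from the table; the scoring, window size, and class name are assumptions for illustration:

```python
from collections import deque

WARMUP_SEGMENTS = 5  # ~10 seconds of 2s segments before baselines are meaningful


class MoodTracker:
    """Rolling-window mood over recent 2s segments, normalized against a
    per-caller baseline built up over the whole call (illustrative sketch)."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # short window for real-time tracking
        self.total = 0.0
        self.count = 0

    def add_segment(self, score):
        self.recent.append(score)
        self.total += score
        self.count += 1

    def normalized_mood(self):
        """Current mood relative to this caller's baseline; None during warmup."""
        if self.count < WARMUP_SEGMENTS:
            return None  # baseline not yet meaningful
        baseline = self.total / self.count
        current = sum(self.recent) / len(self.recent)
        return current - baseline
```

Returning `None` during warmup mirrors the table: per-caller baselines are not meaningful until roughly 10 seconds of audio have been observed.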

## End-of-Turn Detection

End-of-turn confidence thresholds are configurable per workspace. The thresholds balance responsiveness (responding quickly when the caller finishes) against interruption risk (cutting the caller off mid-sentence). Tuning depends on your patient population's speaking patterns.
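The responsiveness/interruption trade-off reduces to a simple decision rule. A minimal sketch, where both threshold values are hypothetical placeholders for the per-workspace settings:

```python
def end_of_turn(confidence, silence_ms, threshold=0.85, max_silence_ms=1200):
    """Declare end of turn when the model is confident the caller finished,
    or when silence has stretched long enough that waiting risks an awkward
    pause. Raising `threshold` lowers interruption risk but adds delay."""
    return confidence >= threshold or silence_ms >= max_silence_ms
```

A population that pauses mid-sentence (e.g. older callers) calls for a higher threshold or longer silence limit; fast back-and-forth conversations call for the opposite.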

## Post-Call Processing

After each call ends, the system runs batch re-transcription using a higher-accuracy model. This produces the canonical transcript that is used for data extraction, quality review, and compliance records. The re-transcription runs asynchronously and does not affect call latency.
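Keeping re-transcription off the call path is a standard producer/consumer pattern: the call-end handler only enqueues a job, and a background worker runs the higher-accuracy model. A sketch under that assumption (queue and function names are hypothetical):

```python
import queue
import threading

retranscribe_queue = queue.Queue()  # hypothetical background job queue


def on_call_ended(call_id):
    """Enqueue the call for batch re-transcription; never blocks the live call."""
    retranscribe_queue.put(call_id)


def retranscription_worker(transcribe, results):
    """Drain the queue with the higher-accuracy model; None is a stop signal."""
    while True:
        call_id = retranscribe_queue.get()
        if call_id is None:
            break
        results[call_id] = transcribe(call_id)  # canonical transcript
```

Because the worker runs on its own thread (or process) pool, re-transcription throughput can lag behind call volume without ever adding latency to live calls.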
