# Cost and Latency Optimization

Running LLM-powered agents at scale requires deliberate engineering to control both cost and latency. The platform uses three strategies: a three-layer prompt caching scheme, per-task model routing, and a latency budget that keeps real-time responses fast enough to feel natural. These optimizations apply across all modalities - voice, SMS, and simulation - though voice has the tightest latency constraints.

<figure><img src="https://3635224444-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvcLyiHRcwv7g83p6vxAd%2Fuploads%2Fgit-blob-332f2ba7ef25ff42bbd17cb2ead5b8162ad3f968%2Fprompt-caching-light.svg?alt=media" alt="Three-layer prompt caching: Static Prefix, Immutable History, Dynamic Suffix"><figcaption></figcaption></figure>

## Three-Layer Prompt Caching

The system splits every LLM prompt into three segments to maximize cache hit rates. Across all interaction types, 75-80% of input tokens are cache-eligible on average.

### Static Prefix (\~5-8K tokens)

Agent persona, core instructions, service description, capabilities, and behavioral guidelines. This segment is keyed by organization and cached across all conversations for that organization. Because it changes only on deployment, it stays in cache indefinitely. With thousands of interactions per month sharing the same prefix, every conversation benefits from a warm cache.

Only a small number of prefix variants exist per service (based on whether tools are available and which response mode is active), so cache fragmentation is negligible.
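As a rough illustration, the cache key for this segment can be thought of as the organization plus the active variant flags. The sketch below uses hypothetical names (`PrefixKey`, `build_static_prefix`) rather than the platform's actual code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixKey:
    """Cache key for the static prefix: one entry per organization and prefix variant."""
    org_id: str
    tools_enabled: bool   # whether tool definitions appear in the prefix
    response_mode: str    # e.g. "voice" or "sms"

def build_static_prefix(key: PrefixKey, persona: str, instructions: str) -> str:
    """Render the prefix deterministically so the same key always yields the same bytes."""
    sections = [persona, instructions]
    if key.tools_enabled:
        sections.append("## Available tools\n(tool definitions here)")
    sections.append(f"## Response mode: {key.response_mode}")
    return "\n\n".join(sections)
```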

### Immutable History

Conversation history with absolute timestamps and state-transition markers. This segment grows monotonically - once a turn is appended, it is never modified. Three design choices make this possible:

* **Absolute timestamps** instead of relative ("March 29 14:35" rather than "5 minutes ago"), so rendering a new turn does not change prior turns.
* **State-transition markers** instead of wrapper tags that get deleted and re-inserted when the agent moves between context graph states.
* **Continuation markers** instead of retroactive flags that would mutate older messages to indicate they are no longer the latest.

These signals are not lost - they are reconstructed in the dynamic suffix where they can change freely without invalidating the history cache. The result is that every prior turn remains a stable cache prefix, and the cache grows incrementally as the conversation progresses.
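A minimal sketch of what append-only rendering might look like, with hypothetical helper names; the key property is that earlier lines are never rewritten:

```python
from datetime import datetime

def render_turn(role: str, text: str, at: datetime) -> str:
    """Render one turn with an absolute timestamp so it never needs rewriting later."""
    stamp = at.strftime("%B %d %H:%M")   # "March 29 14:35" - stable forever
    return f"[{stamp}] {role}: {text}"

def append_state_transition(history: list[str], from_state: str, to_state: str, at: datetime) -> None:
    """Record a state change as a new marker line instead of editing wrapper tags."""
    stamp = at.strftime("%B %d %H:%M")
    history.append(f"[{stamp}] <state: {from_state} -> {to_state}>")

# The history only ever grows; every prior line remains a stable cache prefix.
history: list[str] = []
history.append(render_turn("caller", "I need to reschedule my appointment.", datetime(2025, 3, 29, 14, 35)))
append_state_transition(history, "greeting", "reschedule", datetime(2025, 3, 29, 14, 36))
```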

### Dynamic Suffix (\~2-3K tokens)

Currently activated behaviors, timestamp context, current state duration, current objective, action guidelines, and task instructions. This segment changes every turn and is never cached.

### Provider Support

The caching strategy relies on prefix caching: when a request begins with a token sequence the provider has already processed, those tokens do not need to be re-processed. Since the static prefix and immutable history form a consistent prefix, the cache hit rate stays high throughout the conversation.
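Conceptually, assembling a request is just a matter of ordering: the cacheable segments lead, the volatile segment trails. The function below is an illustrative sketch, not the platform's implementation:

```python
def assemble_prompt(static_prefix: str, immutable_history: list[str], dynamic_suffix: str) -> str:
    """Concatenate segments so the stable bytes lead the request.

    static_prefix     -- identical for every conversation in the organization
    immutable_history -- grows by appending; earlier turns are never touched
    dynamic_suffix    -- rebuilt each turn; deliberately placed last so it never
                         invalidates the cached portion in front of it
    """
    return "\n\n".join([static_prefix, "\n".join(immutable_history), dynamic_suffix])
```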

### Cost Impact

The static prefix is the largest segment and the most cacheable. By keeping it stable across turns, the system avoids re-processing thousands of tokens on every interaction. The immutable history design means the cache grows incrementally - each new turn adds to the history without disrupting what came before. Together, these layers achieve sub-linear cost scaling with conversation length - longer conversations benefit more because the cached prefix represents a larger fraction of the total prompt.
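As a back-of-the-envelope illustration, assume cached input tokens are billed at 10% of the uncached rate (a typical provider discount, not a figure from this page) and a ~6K static prefix with a ~2.5K dynamic suffix:

```python
# Illustrative only: assumes cached input tokens are billed at 10% of the uncached rate.
CACHED_DISCOUNT = 0.10

def relative_input_cost(prefix_tokens: int, history_tokens: int, suffix_tokens: int) -> float:
    """Input cost of one turn relative to the same prompt with no caching at all."""
    total = prefix_tokens + history_tokens + suffix_tokens
    billed = (prefix_tokens + history_tokens) * CACHED_DISCOUNT + suffix_tokens
    return billed / total

# Early in a call: 6K prefix, 1K history, 2.5K suffix -> ~0.34x the uncached cost
print(relative_input_cost(6_000, 1_000, 2_500))
# Late in a call: 6K prefix, 20K history, 2.5K suffix -> ~0.18x, so longer calls benefit more
print(relative_input_cost(6_000, 20_000, 2_500))
```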

## Per-Task Model Routing

Different parts of the pipeline use different models based on what the task demands:

| Task                                   | Model Selection                          | Why                                                                                |
| -------------------------------------- | ---------------------------------------- | ---------------------------------------------------------------------------------- |
| **Navigation and response generation** | Most capable model available             | These decisions directly affect conversation quality and clinical safety           |
| **Filler speech generation**           | Fast, cost-effective model               | Latency matters more than reasoning depth - the filler just needs to sound natural |
| **Post-call data review**              | Cost-effective model                     | No real-time constraint, so the system optimizes for cost                          |
| **Metric evaluation**                  | Parallel chunks on cost-effective models | Throughput matters - evaluations run concurrently across chunks of 20 metrics      |

This avoids paying for the most expensive model on every LLM call. The most capable model handles the work where reasoning depth matters. Everything else runs on models selected for speed or cost.
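A routing table of this shape can be as simple as a task-to-model map. The task names and model identifiers below are placeholders, not the platform's real configuration:

```python
from enum import Enum

class Task(Enum):
    NAVIGATION = "navigation"
    RESPONSE_GENERATION = "response_generation"
    FILLER_SPEECH = "filler_speech"
    POST_CALL_REVIEW = "post_call_review"
    METRIC_EVALUATION = "metric_evaluation"

# Placeholder identifiers; a real deployment maps these to concrete provider models.
MODEL_ROUTES: dict[Task, str] = {
    Task.NAVIGATION: "most-capable-model",
    Task.RESPONSE_GENERATION: "most-capable-model",
    Task.FILLER_SPEECH: "fast-small-model",
    Task.POST_CALL_REVIEW: "cost-effective-model",
    Task.METRIC_EVALUATION: "cost-effective-model",
}

def chunk_metrics(metrics: list[str], size: int = 20) -> list[list[str]]:
    """Split metrics into chunks of 20 so evaluations can run concurrently."""
    return [metrics[i:i + size] for i in range(0, len(metrics), size)]
```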

## Pipeline Latency Budget

For voice interactions, the pipeline targets sub-300ms STT latency and approximately 900ms total response latency. Three mechanisms keep the experience feeling fast:

**Filler speech** covers the gap between when the caller finishes speaking and when the agent's full response is ready. The caller hears "Let me check on that" or a similar phrase while the LLM generates the real response.
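One way to implement this pattern, sketched below with hypothetical `generate_response` and `play_filler` coroutines, is to race the real response against a short timer and only speak the filler if the response has not arrived yet:

```python
import asyncio

async def respond_with_filler(generate_response, play_filler, filler_delay_s: float = 0.4):
    """Start the real response immediately; only speak a filler if it is still pending."""
    response_task = asyncio.create_task(generate_response())
    try:
        # If the full response arrives within the delay, skip the filler entirely.
        return await asyncio.wait_for(asyncio.shield(response_task), timeout=filler_delay_s)
    except asyncio.TimeoutError:
        await play_filler("Let me check on that.")   # covers the gap while the LLM finishes
        return await response_task
```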

**Prompt caching** reduces LLM inference time by avoiding redundant input processing. The static prefix and prior conversation history are already in cache, so the model only processes the new dynamic suffix and the latest turn.

**Two-phase initialization** prewarms the agent during ring time, before the caller picks up. The agent's context is loaded, the static prefix is cached, and the system is ready to respond the moment the call connects. This ensures zero dead air on call start.
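A simplified sketch of the two phases, with placeholder loaders standing in for the real context-loading and cache warm-up steps:

```python
import asyncio

async def load_agent_context(org_id: str) -> dict:
    """Placeholder for loading persona, instructions, and service configuration."""
    await asyncio.sleep(0)  # stand-in for I/O
    return {"org_id": org_id}

async def warm_prompt_cache(context: dict) -> None:
    """Placeholder for the request that populates the static-prefix cache."""
    await asyncio.sleep(0)

class AgentSession:
    """Illustrative two-phase startup: heavy setup during ring, instant greeting on answer."""

    def __init__(self, org_id: str):
        self.org_id = org_id
        self._prewarm: asyncio.Task | None = None

    def on_ring(self) -> None:
        # Phase 1: load context and warm the cache while the phone is still ringing.
        self._prewarm = asyncio.create_task(self._setup())

    async def on_answer(self) -> str:
        # Phase 2: setup has normally finished by pickup, so there is no dead air.
        await self._prewarm
        return "Hi, thanks for calling."

    async def _setup(self) -> None:
        context = await load_agent_context(self.org_id)
        await warm_prompt_cache(context)
```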

{% hint style="info" %}
The combination of these three strategies means cost scales sub-linearly with conversation length. Longer conversations benefit more from caching because the static prefix and accumulated history represent a larger fraction of the total prompt.
{% endhint %}
