Cost and Latency Optimization
How prompt caching, model routing, and pipeline design keep costs low and responses fast.
Running LLM-powered voice agents at scale requires deliberate engineering to control both cost and latency. The platform uses three strategies: a three-layer prompt caching scheme, per-task model routing, and a latency budget that keeps voice responses fast enough to feel natural.
Three-Layer Prompt Caching
The system splits every LLM prompt into three segments to maximize cache hit rates:
Static Prefix (~5-8K tokens)
Agent persona, core instructions, service description, and capabilities. This segment is cached across interactions for the same organization. Because it rarely changes, it stays in cache for the lifetime of a deployment.
Immutable History
Conversation history with absolute timestamps and per-state markers. This segment is never mutated mid-conversation: new messages are only appended, so prior turns remain cache-eligible and the cache for earlier messages is never invalidated.
Dynamic Suffix (~2-3K tokens)
Currently activated behaviors, timestamp context, current objective, and action guidelines. This segment changes every turn and is never cached.
The static prefix is the largest segment and the most cacheable. By keeping it stable across turns, the system avoids re-processing thousands of tokens on every interaction. The immutable history design means the cache grows incrementally - each new turn adds to the history without disrupting what came before.
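The segment ordering above can be sketched in a few lines. This is an illustrative sketch, not the platform's actual API: the function name, segment contents, and timestamps are assumptions. The key point it shows is that prefix caches match on the longest common prefix, so the stable content must come first and the per-turn content last.

```python
def build_prompt(static_prefix: str, history: list[str], dynamic_suffix: str) -> str:
    """Assemble a cache-friendly prompt from the three segments.

    1. Static prefix: persona, instructions, service description (~5-8K tokens).
       Identical across turns, so it stays cached for the deployment's lifetime.
    2. Immutable history: prior turns are appended, never edited, so every
       previously cached prefix remains valid as the conversation grows.
    3. Dynamic suffix: active behaviors, current objective (~2-3K tokens).
       Changes every turn and is never cached.
    """
    return "\n\n".join([static_prefix, *history, dynamic_suffix])

# A new turn extends the history without touching earlier segments:
history = ["[10:00:01] Caller: Hi, I'd like to reschedule.",
           "[10:00:04] Agent: Sure, let me pull up your appointment."]
prompt = build_prompt("You are a scheduling agent...", history,
                      "Current objective: confirm new time slot.")
```

Because the join is purely append-only, every prompt in a conversation shares its entire prefix with the previous one except the final dynamic segment.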
Per-Task Model Routing
Different parts of the pipeline use different models based on what the task demands:
Navigation and response generation
Most capable model available
These decisions directly affect conversation quality and clinical safety
Filler speech generation
Fast, cost-effective model
Latency matters more than reasoning depth - the filler just needs to sound natural
Post-call data review
Cost-effective model
No real-time constraint, so the system optimizes for cost
Metric evaluation
Parallel chunks on cost-effective models
Throughput matters - evaluations run concurrently across chunks of 20 metrics
This avoids paying for the most expensive model on every LLM call. The most capable model handles the work where reasoning depth matters. Everything else runs on models selected for speed or cost.
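The routing decisions above amount to a small lookup table. A minimal sketch, with model identifiers and task labels that are assumptions rather than the platform's real names:

```python
# Per-task model routing (illustrative; model names and task labels are
# assumptions, not the platform's actual identifiers).
ROUTES = {
    "navigation": "most-capable-model",        # conversation quality and safety
    "response_generation": "most-capable-model",
    "filler_speech": "fast-small-model",       # latency over reasoning depth
    "post_call_review": "cheap-model",         # no real-time constraint
    "metric_evaluation": "cheap-model",        # parallel chunks of 20 metrics
}

def pick_model(task: str) -> str:
    # Unrecognized tasks fall back to the capable model: for clinical
    # conversations, failing expensive is safer than failing cheap.
    return ROUTES.get(task, "most-capable-model")
```

The fallback direction is a design choice: an unknown task is routed to the strongest model rather than the cheapest one.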
Pipeline Latency Budget
The voice pipeline targets sub-300ms STT latency and approximately 900ms total response latency. Three mechanisms keep the experience feeling fast:
Filler speech covers the gap between when the caller finishes speaking and when the agent's full response is ready. The caller hears "Let me check on that" or a similar phrase while the LLM generates the real response.
Prompt caching reduces LLM inference time by avoiding redundant input processing. The static prefix and prior conversation history are already in cache, so the model only processes the new dynamic suffix and the latest turn.
Two-phase initialization prewarms the agent during ring time, before the caller picks up. The agent's context is loaded, the static prefix is cached, and the system is ready to respond the moment the call connects. This ensures zero dead air on call start.
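The filler-speech mechanism is essentially a race: the full response starts generating immediately, and the filler plays while it is in flight. A minimal asyncio sketch under assumed timings (the function bodies, the 900ms sleep standing in for LLM inference, and the print standing in for TTS are all illustrative):

```python
import asyncio

async def generate_response() -> str:
    # Stand-in for LLM inference; ~900ms approximates the total
    # response-latency target mentioned above.
    await asyncio.sleep(0.9)
    return "Your appointment is confirmed for Tuesday at 3pm."

async def speak(text: str) -> None:
    print(text)  # stand-in for the TTS pipeline

async def respond() -> str:
    # Start the full response, then immediately cover the gap with
    # filler so the caller never hears dead air.
    task = asyncio.create_task(generate_response())
    await speak("Let me check on that.")
    return await task  # play the real response once it is ready
```

Running `asyncio.run(respond())` plays the filler first and returns the full response about 900ms later.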
The combination of these three strategies means cost scales sub-linearly with conversation length. Longer conversations benefit more from caching because the static prefix and accumulated history represent a larger fraction of the total prompt.
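The sub-linear scaling can be made concrete with back-of-envelope arithmetic. All token counts here are assumptions chosen to match the segment sizes above (~6K static prefix, ~2.5K dynamic suffix, ~150 tokens per turn):

```python
# Illustrative arithmetic: the fraction of the prompt served from cache
# grows with conversation length, because only the dynamic suffix is
# reprocessed each turn. Token counts are assumptions.
STATIC_PREFIX = 6000     # cached for the deployment's lifetime
DYNAMIC_SUFFIX = 2500    # reprocessed every turn
TOKENS_PER_TURN = 150    # appended to the immutable (cached) history

def cached_fraction(turns: int) -> float:
    cached = STATIC_PREFIX + turns * TOKENS_PER_TURN
    total = cached + DYNAMIC_SUFFIX
    return cached / total

print(f"turn 1:   {cached_fraction(1):.0%} cached")
print(f"turn 100: {cached_fraction(100):.0%} cached")
```

Under these assumptions the cached share rises from roughly 71% on the first turn to roughly 89% by turn 100, so the uncached (and most expensive) portion shrinks as a fraction of each call.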