Cost and Latency Optimization

How prompt caching, model routing, and pipeline design keep costs low and responses fast.

Running LLM-powered voice agents at scale requires deliberate engineering to control both cost and latency. The platform uses three strategies: a three-layer prompt caching scheme, per-task model routing, and a latency budget that keeps voice responses fast enough to feel natural.


Three-Layer Prompt Caching

The system splits every LLM prompt into three segments to maximize cache hit rates:

Static Prefix (~5-8K tokens)

Agent persona, core instructions, service description, and capabilities. This segment is cached across interactions for the same organization. Because it rarely changes, it stays in cache for the lifetime of a deployment.

Immutable History

Conversation history with absolute timestamps and per-state markers. This segment is never mutated mid-conversation: prior turns remain cache-eligible as new messages are appended, so adding a message never invalidates the cache for earlier ones.

Dynamic Suffix (~2-3K tokens)

Currently activated behaviors, timestamp context, current objective, and action guidelines. This segment changes every turn and is never cached.

The static prefix is the largest segment and the most cacheable. By keeping it stable across turns, the system avoids re-processing thousands of tokens on every interaction. The immutable history design means the cache grows incrementally: each new turn extends the history without disturbing what came before.
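The three-segment layout above can be sketched as an append-only prompt builder. This is a minimal illustration, not the platform's actual code; the `PromptBuilder` class and its method names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class PromptBuilder:
    """Assembles a prompt as [static prefix | immutable history | dynamic suffix]."""
    static_prefix: str  # persona, instructions, capabilities (cached long-term)
    history: list = field(default_factory=list)  # append-only turns (cache-eligible)

    def add_turn(self, role: str, text: str, ts: str) -> None:
        # Turns are only appended, never edited, so earlier history
        # stays byte-identical and remains a cache hit on later calls.
        self.history.append(f"[{ts}] {role}: {text}")

    def build(self, dynamic_suffix: str) -> str:
        # Cache-stable segments first; the per-turn suffix goes last
        # so it never invalidates the cached prefix or history.
        return "\n".join([self.static_prefix, *self.history, dynamic_suffix])
```

The key invariant is ordering: anything that changes every turn sits after everything that does not, so a prefix-based cache can reuse the maximal stable span.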

Per-Task Model Routing

Different parts of the pipeline use different models based on what the task demands:

| Task | Model Selection | Why |
| --- | --- | --- |
| Navigation and response generation | Most capable model available | These decisions directly affect conversation quality and clinical safety |
| Filler speech generation | Fast, cost-effective model | Latency matters more than reasoning depth; the filler just needs to sound natural |
| Post-call data review | Cost-effective model | No real-time constraint, so the system optimizes for cost |
| Metric evaluation | Parallel chunks on cost-effective models | Throughput matters; evaluations run concurrently across chunks of 20 metrics |

This avoids paying for the most expensive model on every LLM call. The most capable model handles the work where reasoning depth matters. Everything else runs on models selected for speed or cost.
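One simple way to implement this routing is a static task-to-model table with a safe default. The model tier names below are placeholders; the source only distinguishes "most capable" from "fast" and "cost-effective" models:

```python
# Hypothetical tier names standing in for real model identifiers.
MODEL_ROUTES = {
    "navigation": "capable-large",
    "response_generation": "capable-large",
    "filler_speech": "fast-small",        # latency-sensitive
    "post_call_review": "cheap-small",    # offline, cost-optimized
    "metric_evaluation": "cheap-small",   # parallel chunks of 20 metrics
}

def route_model(task: str) -> str:
    """Pick a model per task, defaulting to the capable tier for safety."""
    return MODEL_ROUTES.get(task, "capable-large")
```

Defaulting unknown tasks to the capable tier errs on the side of quality rather than cost, which matters when some routed calls affect clinical safety.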

Pipeline Latency Budget

The voice pipeline targets sub-300ms STT latency and approximately 900ms total response latency. Two mechanisms keep the experience feeling fast:

Filler speech covers the gap between when the caller finishes speaking and when the agent's full response is ready. The caller hears "Let me check on that" or a similar phrase while the LLM generates the real response.

Prompt caching reduces LLM inference time by avoiding redundant input processing. The static prefix and prior conversation history are already in cache, so the model only processes the new dynamic suffix and the latest turn.

Two-phase initialization prewarms the agent during ring time, before the caller picks up. The agent's context is loaded, the static prefix is cached, and the system is ready to respond the moment the call connects. This ensures zero dead air on call start.
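The filler-speech mechanism amounts to starting the slow LLM call first, then playing a short phrase while it runs. A minimal `asyncio` sketch, with `generate_response` and `speak` as stand-ins for the real LLM and TTS calls:

```python
import asyncio

async def generate_response(prompt: str) -> str:
    # Stand-in for the main LLM call (~900 ms in production).
    await asyncio.sleep(0.05)
    return f"answer to: {prompt}"

async def speak(text: str) -> None:
    # Stand-in for TTS playback of the filler phrase.
    await asyncio.sleep(0.01)

async def respond(prompt: str) -> str:
    # Start the slow LLM call immediately, then cover the gap with filler.
    task = asyncio.create_task(generate_response(prompt))
    if not task.done():
        await speak("Let me check on that.")  # caller hears this while waiting
    return await task
```

Because the filler plays concurrently with generation rather than before it, it adds no latency of its own; it only masks the wait that was already there.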


The combination of these three strategies means cost scales sub-linearly with conversation length. Longer conversations benefit more from caching because the static prefix and accumulated history represent a larger fraction of the total prompt.
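The sub-linear scaling claim can be made concrete with a small cost model. The discounted price for cached tokens below (10% of the uncached rate) is an assumption for illustration, not a figure from the source:

```python
def turn_cost(prefix_toks: int, history_toks: int, suffix_toks: int,
              price_per_tok: float = 1.0, cached_discount: float = 0.1) -> float:
    """Per-turn input cost when the prefix and history hit the cache.

    cached_discount is an assumed ratio of cached- to uncached-token price.
    """
    cached = (prefix_toks + history_toks) * price_per_tok * cached_discount
    fresh = suffix_toks * price_per_tok
    return cached + fresh

# Early in a call: small history. Late in a call: large history,
# but the growth is billed at the discounted cached rate.
early = turn_cost(prefix_toks=6000, history_toks=500, suffix_toks=2500)
late = turn_cost(prefix_toks=6000, history_toks=20000, suffix_toks=2500)
```

Under these assumed numbers, the late-call prompt is over 3x longer than the early one (28,500 vs. 9,000 tokens) but costs only about 1.6x as much, because the added tokens are all in the cached history.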
