Metric Store
Config-driven metric infrastructure - 42 built-in metrics across six categories, custom AI-powered metrics, per-metric latency tiers, and freshness SLAs across all channels.
Billing Meters
The platform meters usage across multiple dimensions for billing purposes. Each meter tracks a specific resource consumption signal and is aggregated per customer per billing period.
Production Meters
Voice Minutes (minutes) - Total duration of voice calls
Call Count (calls) - Number of voice calls
LLM Input Tokens (tokens) - Tokens sent to language models
LLM Output Tokens (tokens) - Tokens received from language models
SMS Messages (messages) - Outbound SMS messages sent
Call Recording Minutes (minutes) - Duration of recorded calls
Completed Calls (calls) - Calls that reached a terminal state
Quality-Weighted Calls (calls) - Calls weighted by quality score
Surface Submissions (submissions) - Completed surface form submissions
Message Count (messages) - User-facing messages sent by the agent
Action Count (actions) - Tool executions during conversations
Conversation Count (conversations) - Distinct conversation sessions
Simulation Meters
Every production meter has a corresponding simulation variant. Simulation usage is identified automatically and tracked separately, so production and testing costs are clearly distinguished. Simulation meters use the same units and aggregation logic as their production counterparts.
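As a rough illustration of per-customer, per-period aggregation with production and simulation kept apart, the sketch below sums usage events into separate totals. The `UsageEvent` shape, the meter key `voice_minutes`, and the period label format are assumptions made for this example, not the platform's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    # All field names here are hypothetical, chosen for illustration only.
    customer_id: str
    period: str          # billing period label, e.g. "2024-05" (assumed format)
    meter: str           # e.g. "voice_minutes" (hypothetical meter key)
    quantity: float
    is_simulation: bool


def aggregate_usage(events):
    """Sum usage per (customer, period, source, meter)."""
    totals = defaultdict(float)
    for event in events:
        source = "simulation" if event.is_simulation else "production"
        totals[(event.customer_id, event.period, source, event.meter)] += event.quantity
    return dict(totals)


events = [
    UsageEvent("cust_1", "2024-05", "voice_minutes", 3.5, False),
    UsageEvent("cust_1", "2024-05", "voice_minutes", 1.5, False),
    UsageEvent("cust_1", "2024-05", "voice_minutes", 2.0, True),
]
totals = aggregate_usage(events)
```

Because the source (production or simulation) is part of the aggregation key, the same units and summing logic apply to both while the totals never mix.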
Per-Conversation Billing Detail
For audit and dispute resolution, the platform provides a per-conversation billing breakdown. Each conversation record includes message count, action count, voice minutes, token usage, quality score, completion status, call direction, and the conversation time range. This eliminates the need to manually correlate individual events when investigating billing questions.
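One way to picture the per-conversation breakdown is as a record carrying the fields listed above. The sketch below is an assumed shape only (every field name is hypothetical), with a small helper that sums voice minutes so a meter total can be compared against its conversation-level detail.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ConversationBillingRecord:
    # Hypothetical field names mirroring the items listed in the text.
    conversation_id: str
    message_count: int
    action_count: int
    voice_minutes: float
    input_tokens: int
    output_tokens: int
    quality_score: Optional[float]
    completed: bool
    direction: str            # "inbound" or "outbound"
    started_at: datetime
    ended_at: datetime


def total_voice_minutes(records: List[ConversationBillingRecord]) -> float:
    """Sum voice minutes across conversation records for reconciliation."""
    return sum(r.voice_minutes for r in records)


records = [
    ConversationBillingRecord("conv_1", 12, 3, 2.5, 900, 400, 87.0, True, "inbound",
                              datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 3)),
    ConversationBillingRecord("conv_2", 6, 1, 1.5, 500, 250, None, False, "outbound",
                              datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 2)),
]
```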
Built-in Dashboards
The platform ships a default Universal Metric Store dashboard that provides immediate visibility into metric health without any configuration. The dashboard is available in the Developer Console under the metrics section and includes:
Summary row - Active metric count, total value points, scored events, average confidence, and last computation time.
Intensity heatmap - Daily averages of numerical and boolean metrics displayed as a heatmap, showing trends and gaps at a glance.
Type distribution - Pie chart showing the mix of numerical, categorical, and boolean metric values.
Source and scope breakdown - Bar chart of value volume grouped by source (production or simulation) and entity scope.
Staleness table - The least-recently-computed metrics, with time since last update, so teams can spot metrics that may have stopped refreshing.
Entity-scoped values - A detail table of the most recent metric values broken down by entity, service, run, and session.
The dashboard supports filters for time window (7 days, 30 days, 90 days, or 1 year), source (production, simulation, or all), scope (all, workspace aggregate, or entity scoped), and metric type (all, numerical, categorical, or boolean). Filters apply across all panels simultaneously.
The dashboard renders automatically once the workspace has metric data ingested - no additional setup is required.
Metrics are scoped by source (production or simulation) and entity (aggregate workspace-level or per-entity). This means simulation runs produce their own metric values that do not interfere with production metrics, and you can drill down from workspace-level aggregates to individual call or session scores.
Computed metrics are synchronized from the analytics warehouse to the operational database on a recurring schedule. The sync uses a targeted replace strategy: only metric keys present in the current batch are replaced, so metrics from other pipelines or sources are unaffected. This handles period granularity changes cleanly - for example, when daily metrics are replaced by hourly metrics for the same key, all old daily rows are removed and the new hourly rows are inserted in a single atomic operation. Readers never see an empty or partially-updated state during the sync.
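The targeted replace strategy can be sketched with a single database transaction: delete every row for the metric keys in the batch, then insert the new rows, and commit once. The example below uses Python's built-in `sqlite3` module with a hypothetical `metric_values` table; the platform's actual schema and database are not described in this document, and a real sync would presumably also scope by workspace.

```python
import sqlite3


def sync_metrics(conn, batch):
    """Replace all rows for the metric keys present in `batch` in one transaction."""
    keys = sorted({row["metric_key"] for row in batch})
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "DELETE FROM metric_values WHERE metric_key = ?",
            [(key,) for key in keys],
        )
        conn.executemany(
            "INSERT INTO metric_values (metric_key, granularity, period_start, value) "
            "VALUES (:metric_key, :granularity, :period_start, :value)",
            batch,
        )


# Hypothetical table and sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metric_values "
    "(metric_key TEXT, granularity TEXT, period_start TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO metric_values VALUES (?, ?, ?, ?)",
    [
        ("quality_score", "daily", "2024-05-01", 81.0),
        ("event_volume", "daily", "2024-05-01", 120.0),
    ],
)
conn.commit()

# Daily quality_score rows are replaced by hourly rows; event_volume is untouched.
sync_metrics(conn, [
    {"metric_key": "quality_score", "granularity": "hourly",
     "period_start": "2024-05-01T00:00", "value": 80.0},
    {"metric_key": "quality_score", "granularity": "hourly",
     "period_start": "2024-05-01T01:00", "value": 82.0},
])
rows = conn.execute(
    "SELECT metric_key, granularity FROM metric_values "
    "ORDER BY metric_key, period_start"
).fetchall()
```

Since the delete and insert share one transaction, a concurrent reader sees either the old daily rows or the new hourly rows, never an empty intermediate state.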
A companion freshness table tracks the last sync time and value count per metric key per workspace, so downstream consumers can verify data currency.
The metric store is the platform's unified analytics infrastructure. It computes, stores, and serves metrics across all channels - voice calls, text sessions, surface submissions, and data quality events - from a single config-driven pipeline. Metrics are defined through workspace settings, not code. Adding a new metric, changing an aggregation window, or retiring an unused metric is a configuration change that takes effect on the next pipeline refresh.
How It Works
The pipeline reads metric definitions from workspace settings, extracts values from source events using the configured extraction mode, aggregates them at the configured granularity, and writes results to the metric store. The query API serves the computed metrics to dashboards, alerting systems, and external consumers.
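The four stages above (read definitions, extract, aggregate, write) can be sketched in a few lines. This is a simplified in-memory model: the definition fields, event shape, and event type name are all assumptions for the example, and only a static field lookup with an average is shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric definition; real definitions live in workspace settings.
definitions = [
    {"key": "avg_duration", "event_type": "call.ended",
     "field": "duration_s", "granularity": "daily"},
]

events = [
    {"type": "call.ended", "ts": "2024-05-01T09:15:00", "data": {"duration_s": 60}},
    {"type": "call.ended", "ts": "2024-05-01T17:40:00", "data": {"duration_s": 120}},
    {"type": "sms.sent", "ts": "2024-05-01T18:00:00", "data": {}},
]


def run_pipeline(definitions, events):
    """Extract, bucket, and aggregate values for each metric definition."""
    store = {}
    for definition in definitions:
        buckets = defaultdict(list)
        for event in events:
            if event["type"] != definition["event_type"]:
                continue
            # ISO timestamps: first 10 characters are the day, first 13 the hour.
            width = 10 if definition["granularity"] == "daily" else 13
            buckets[event["ts"][:width]].append(event["data"][definition["field"]])
        for period, values in buckets.items():
            store[(definition["key"], period)] = mean(values)
    return store


store = run_pipeline(definitions, events)
```

Changing `granularity` to `"hourly"` in the definition changes the bucketing without touching the pipeline code, which is the point of keeping metric behavior in configuration.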
Every workspace ships with a set of pre-built dashboards that visualize the built-in metrics out of the box. These default dashboards cover voice quality trends, surface completion rates, data quality distribution, and cross-channel engagement - the operational views most deployments need from day one. Default dashboards use the same embeddable component as custom dashboards, so they can be extended, customized, or embedded in external tools.
Built-In Metrics
Every workspace ships with 42 pre-configured metrics across six categories. Built-in metrics cover the operational dimensions that most deployments need out of the box.
Voice Intelligence
Quality score (Numerical) - Composite call quality (0-100) based on latency, silence, barge-ins, loops, escalations, and tool failures
Duration (Numerical) - Average call duration in seconds
Escalation rate (Numerical) - Proportion of calls escalated to a human operator
Sentiment (Categorical) - Caller sentiment distribution (positive, negative, neutral, mixed)
Completion reason (Categorical) - How calls ended (completed, escalated, abandoned, timeout)
Risk level (Categorical) - Risk assessment distribution (low, medium, high, critical)
Response latency (Numerical) - Average time-to-first-byte for audio response
Tool failure rate (Numerical) - Proportion of tool calls that returned errors
Barge-in count (Numerical) - Average caller interruptions per call
Loop count (Numerical) - Average context graph state revisits per call
Silence ratio (Numerical) - Proportion of call time spent in silence
Surface Intelligence
Completion rate (Numerical) - Proportion of created surfaces that were submitted
Open rate (Numerical) - Proportion of delivered surfaces that were opened
Abandonment rate (Numerical) - Proportion of opened surfaces that were not submitted
Channel distribution (Categorical) - Delivery channel breakdown (SMS, WhatsApp, email, web, voice)
Time to complete (Numerical) - Average hours from surface creation to submission
Data Quality
Average confidence (Numerical) - Mean confidence score across all events in the workspace
Event volume (Numerical) - Total events ingested
Review approval rate (Numerical) - Proportion of reviewed events that were approved
Cross-Channel
Patient contact count (Numerical) - Total patient contacts across all channels
Patient response rate (Numerical) - Proportion of contacts that received a patient response
Standard Quality
Five universal metrics apply to every service regardless of use case. Six outcome metrics are scoped to specific product types (scheduling, outbound, coaching, intake, triage, support) and activate automatically when a service is tagged with the matching type.
Patient sentiment (Categorical; Universal) - Caller sentiment (positive, neutral, negative) assessed from the full interaction
Conversational naturalness (Numerical, 1-10; Universal) - How natural and human-like the conversation felt
Conciseness (Numerical, 1-10; Universal) - Whether the agent communicated efficiently without unnecessary repetition
Information accuracy (Boolean; Universal) - Whether facts and data referenced during the call were correct (null when no tool calls occurred)
Safety (Categorical; Universal) - Safety event classification (no event, handled, warning, critical)
Scheduling outcome (Categorical; Scheduling) - Whether the scheduling objective was achieved
Outbound outcome (Categorical; Outbound) - Whether the outbound call objective was achieved
Coaching outcome (Categorical; Coaching) - Whether the coaching objective was achieved
Intake outcome (Categorical; Intake) - Whether the intake objective was achieved
Triage outcome (Categorical; Triage) - Whether the triage objective was achieved
Support outcome (Categorical; Support) - Whether the support objective was achieved
Standard quality metrics use AI-powered evaluation against each call's intelligence data, running at batch cadence with a balanced model tier. The product-type outcome metrics only compute for calls on services tagged with the matching type, so a scheduling service will not produce triage outcomes.
Voice Quality Evaluation
Ten metrics come from the Voice Judge, each scoring a specific dimension of audio-native call quality on a 0.0-1.0 scale. The metrics aggregate per-call Voice Judge scores into workspace-level trends at daily granularity.
Pronunciation - Speech clarity and correctness
Pacing - Conversation tempo appropriateness
Clarity - Communication clarity
Warmth and Tone - Emotional responsiveness and empathy
Latency and Dead Air - Response timing and silence management
Filler and Silence - Appropriate use of filler speech and pauses
Interruption Handling - Barge-in detection and recovery
Audio Consistency - Audio quality and volume stability
Accent Quality - Pronunciation consistency across accents
Voice Identity - Voice persona consistency
Built-in metrics can be customized per workspace: you can adjust the aggregation window, change the freshness target, set valid ranges, or deactivate metrics you do not need. Built-in metric keys and types cannot be changed.
Custom Metrics
Workspaces can define up to 50 custom metrics through the settings API. Custom metrics use the same pipeline, dashboards, and query API as built-in metrics. No code changes are required - custom metrics are configuration that the pipeline reads on each refresh.
Custom metrics can evaluate two different data sources depending on what they measure:
Conversation metrics - Evaluate conversation summaries from call intelligence. These metrics run against the same post-call data used by the standard quality metrics, scoring conversations on dimensions you define.
Event metrics - Evaluate raw event data from the world model. These metrics run against individual events and extract or classify values from event payloads.
The platform automatically routes each custom metric to the correct data source based on its configuration. Both types produce results in the same metric store format and appear in the same dashboards and analytics surfaces.
Custom metrics are most useful for organization-specific quality criteria that the built-in metrics do not cover. Common examples:
Protocol adherence - Did the agent follow the required intake, triage, or scheduling protocol?
Topic classification - What was the primary reason for the call?
Empathy and tone - How empathetic or warm was the agent beyond what the built-in naturalness score captures?
Regulatory compliance - Did the agent provide required disclaimers or collect required consents?
Domain-specific outcomes - Did the coaching session address the patient's stated goals?
The recommended workflow for a new custom metric is: define it, test it against a few real conversations using the on-demand evaluate endpoint, iterate on the prompt until the scores match human judgment, then let the batch pipeline compute it across all conversations going forward.
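Before submitting definitions, it can help to sanity-check them locally. The sketch below validates a list of custom metric definitions against the constraints stated in this section (at most 50 custom metrics, three value types). The dictionary shape and every field name in it are assumptions for illustration; the actual settings API payload is described in the developer guide.

```python
MAX_CUSTOM_METRICS = 50                                  # limit stated in this section
VALUE_TYPES = {"numerical", "categorical", "boolean"}    # the three supported types

# Hypothetical definition; field names are illustrative, not the real API schema.
example = {
    "key": "protocol_adherence",
    "type": "boolean",
    "prompt": "Did the agent follow the required intake protocol?",
}


def validate_custom_metrics(metrics):
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if len(metrics) > MAX_CUSTOM_METRICS:
        errors.append(f"too many custom metrics: {len(metrics)} > {MAX_CUSTOM_METRICS}")
    seen = set()
    for metric in metrics:
        key = metric.get("key")
        if not key:
            errors.append("metric is missing a key")
            continue
        if key in seen:
            errors.append(f"duplicate key: {key}")
        seen.add(key)
        if metric.get("type") not in VALUE_TYPES:
            errors.append(f"{key}: unsupported type {metric.get('type')!r}")
        if metric.get("type") == "categorical" and not metric.get("allowed_values"):
            errors.append(f"{key}: categorical metrics need allowed_values")
    return errors
```

A check like this only catches structural mistakes. Whether the prompt produces scores that match human judgment still has to be tested against real conversations, as the workflow above describes.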
Metric Types
Three value types are supported:
Numerical - Float values (e.g., 0.0 - 100.0). Typical uses: scores, rates, durations, counts
Categorical - Predefined string values. Typical uses: classifications, status distributions, channel breakdowns
Boolean - True / false. Typical uses: pass/fail checks, presence detection
Extraction Modes
Each metric defines how its value is extracted from source events. Seven extraction modes cover the range from fast deterministic lookups to complex AI-powered analysis.
Static (cost: Free) - Extracts a value from a known field path in the event data. Typical uses: structured data (durations, counts, status codes, confidence scores)
AI classification (cost: Free, platform-managed) - Classifies event content into predefined categories. Typical uses: sentiment analysis, risk categorization, topic classification
AI extraction (cost: Free, platform-managed) - Extracts structured fields from unstructured text. Typical uses: pulling specific data points from transcripts or notes
AI sentiment (cost: Free, platform-managed) - Analyzes emotional tone of event content. Typical uses: caller satisfaction, conversation tone tracking
AI query (cost: varies by model tier) - Runs a custom prompt against event content. Typical uses: complex evaluation (clinical accuracy, protocol adherence, custom quality criteria)
Judge (cost: varies by model tier) - Runs a custom prompt with entity context variables substituted from the patient's projected state. Typical uses: evaluations that need patient-specific context, such as "Was the agent's advice consistent with the patient's {$.medications}?"
Ratio (cost: Free) - Computes the ratio between two event type counts. Typical uses: conversion rates, success rates, completion rates
Platform-managed AI modes (classification, extraction, sentiment) use the platform's built-in models at no additional cost. Custom AI queries use the model tier you select.
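Two of the modes above rest on the same idea of resolving a path against structured data: Static reads a field from the event, and Judge fills `{$.field}` placeholders from entity state before the prompt is evaluated. The sketch below shows one simplified way this could work. The dot-path resolver handles only nested dictionaries (it is not a full JSONPath implementation), and the fallback text `unknown` for missing values is an assumption.

```python
import re


def resolve_path(data, path):
    """Resolve a simplified '$.a.b' path against nested dicts; None if missing."""
    parts = (path[2:] if path.startswith("$.") else path).split(".")
    current = data
    for part in parts:
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Matches placeholders such as {$.medications} or {$.profile.age}.
_PLACEHOLDER = re.compile(r"\{(\$\.[A-Za-z0-9_.]+)\}")


def render_judge_prompt(template, entity_state):
    """Substitute each {$.path} placeholder with the value from entity state."""
    def substitute(match):
        value = resolve_path(entity_state, match.group(1))
        return "unknown" if value is None else str(value)
    return _PLACEHOLDER.sub(substitute, template)


# Static-style lookup on a hypothetical event payload.
event = {"call": {"duration_s": 184}}
duration = resolve_path(event, "$.call.duration_s")

# Judge-style substitution using the template quoted in this section.
prompt = render_judge_prompt(
    "Was the agent's advice consistent with the patient's {$.medications}?",
    {"medications": "metformin, lisinopril"},
)
```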
Model Tiers
For AI query and Judge extraction, you choose a model tier that balances cost and reasoning depth:
Free - Platform-managed extraction modes (classification, extraction, sentiment)
Fast - Simple classification, low-latency decisions
Balanced - Quality scoring, moderate reasoning
Max - Complex multi-step analysis, clinical accuracy evaluation
The platform resolves model tiers to appropriate compute resources at execution time. You do not select specific models - just the level of reasoning depth your metric requires.
Channel Scoping
Metrics can be scoped to specific channels without writing filters:
All - Every event across all channels
Voice - Voice call events only
Text - SMS and text session events only
Surface - Form submission events only
Inbound - Inbound interactions only
Outbound - Outbound interactions only
Channel scoping lets you define the same metric type (e.g., sentiment) separately for voice and text channels to track performance independently.
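Conceptually, each scope acts as a predicate over events. The sketch below models the six scopes as filter functions over a hypothetical event shape with `channel` and `direction` fields; the field names and channel values are assumptions, and the platform applies scoping from configuration rather than from user-written filters.

```python
# Hypothetical event fields: "channel" and "direction".
SCOPES = {
    "all": lambda e: True,
    "voice": lambda e: e["channel"] == "voice",
    "text": lambda e: e["channel"] in {"sms", "text"},
    "surface": lambda e: e["channel"] == "surface",
    "inbound": lambda e: e.get("direction") == "inbound",
    "outbound": lambda e: e.get("direction") == "outbound",
}


def in_scope(event, scope):
    """Return True when the event belongs to the named channel scope."""
    return SCOPES[scope](event)


events = [
    {"channel": "voice", "direction": "inbound", "sentiment": "positive"},
    {"channel": "sms", "direction": "outbound", "sentiment": "neutral"},
    {"channel": "surface", "sentiment": None},
]

# The same sentiment metric, tracked separately for voice and text.
voice_sentiment = [e["sentiment"] for e in events if in_scope(e, "voice")]
text_sentiment = [e["sentiment"] for e in events if in_scope(e, "text")]
```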
Latency Tiers
Each metric can be assigned to a latency tier that determines how quickly new values are available after the source event occurs.
Streaming (seconds) - Safety alerts, real-time dashboards, escalation triggers
Near-realtime (minutes) - Operational dashboards, shift-level reporting
Batch (hourly / daily) - Trend analysis, cost-sensitive metrics, historical reporting
Safety-critical metrics (escalation rate, risk level) typically run in the streaming tier. Nuanced quality metrics that require AI evaluation run in the batch tier where they have more compute time. Volume-based metrics (event counts, contact rates) can run in any tier depending on how quickly you need the data.
Aggregation Granularity
Metrics aggregate at either hourly or daily granularity:
Hourly - One data point per workspace per metric per hour. Useful for shift-level monitoring and intraday dashboards.
Daily - One data point per workspace per metric per day. Cheaper to compute and store. Sufficient for trend analysis and reporting.
Voice intelligence metrics default to hourly granularity. Data quality metrics default to daily.
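Granularity determines which period an event's value is assigned to. A minimal sketch of that bucketing is to truncate the event timestamp to the start of its hour or day. Bucketing in UTC is an assumption here; the document does not state which time zone the pipeline uses.

```python
from datetime import datetime, timezone


def period_start(ts: datetime, granularity: str) -> datetime:
    """Truncate a timestamp to the start of its hourly or daily period."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity!r}")


event_time = datetime(2024, 5, 1, 14, 37, 12, tzinfo=timezone.utc)
hour_bucket = period_start(event_time, "hourly")
day_bucket = period_start(event_time, "daily")
```

All events that share a bucket contribute to the same data point, so a day of hourly data yields up to 24 points per metric per workspace, while daily yields one.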
Freshness SLAs
Each metric has a configurable freshness target - the maximum acceptable time between the source event occurring and the computed metric value being available in the query API. The platform tracks freshness per metric and exposes it through a dedicated freshness endpoint.
The default freshness target is 60 minutes. You can tighten it to 5 minutes for critical metrics or relax it to 24 hours for batch reporting metrics.
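The freshness definition above reduces to a simple comparison: the lag between the source event and the moment the computed value became available, checked against the target. The function names below are hypothetical, and the sketch does not call the freshness endpoint; it only illustrates the arithmetic.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_FRESHNESS_TARGET = timedelta(minutes=60)  # default stated in this section


def freshness_lag(event_time: datetime, available_at: datetime) -> timedelta:
    """Time between the source event and the metric value becoming queryable."""
    return available_at - event_time


def meets_target(event_time, available_at, target=DEFAULT_FRESHNESS_TARGET) -> bool:
    """True when the observed lag is within the freshness target."""
    return freshness_lag(event_time, available_at) <= target


occurred = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
available = datetime(2024, 5, 1, 12, 45, tzinfo=timezone.utc)

within_default = meets_target(occurred, available)
within_tight = meets_target(occurred, available, target=timedelta(minutes=5))
```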
Aggregation Functions
Metrics support standard aggregation functions:
Count - Number of events matching the metric's filters
Sum - Total of extracted numerical values
Average - Mean of extracted numerical values
Min / Max - Extremes of extracted numerical values
Count distinct - Number of unique values
Ratio - Count of one event type divided by count of another
Rate - Proportion of events meeting a condition
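The aggregation functions listed above map onto ordinary reductions over the extracted values. The sketch below is one plain-Python reading of them. Returning `None` for a ratio with a zero denominator or a rate over no events is an assumption; the document does not say how the pipeline handles those cases.

```python
from statistics import mean

# Reductions over a list of extracted values (names are illustrative).
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "average": mean,
    "min": min,
    "max": max,
    "count_distinct": lambda values: len(set(values)),
}


def ratio(numerator_count, denominator_count):
    """Count of one event type divided by count of another; None if undefined."""
    if denominator_count == 0:
        return None
    return numerator_count / denominator_count


def rate(events, condition):
    """Proportion of events for which `condition` is true; None if no events."""
    if not events:
        return None
    return sum(1 for event in events if condition(event)) / len(events)


values = [2.0, 4.0, 4.0]
calls = [{"escalated": True}, {"escalated": False}, {"escalated": False}, {"escalated": False}]
escalation_rate = rate(calls, lambda call: call["escalated"])
```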
Data Quality
The metric pipeline enforces data quality at several levels:
Valid ranges - Numerical metrics can define minimum and maximum bounds. Values outside the range are dropped before aggregation.
Category constraints - Categorical metrics can define allowed values. Unrecognized categories are excluded.
Confidence weighting - Source event confidence scores are available for metrics that need confidence-aware aggregation.
Production filtering - Simulation, playground, and test data are excluded from production metrics automatically.
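The first two checks above amount to filtering values before they reach aggregation. The sketch below shows that filtering under the assumption that range bounds are inclusive; the document does not specify whether the minimum and maximum themselves count as valid.

```python
def apply_valid_range(values, minimum=None, maximum=None):
    """Drop numerical values outside [minimum, maximum] (bounds assumed inclusive)."""
    return [
        v for v in values
        if (minimum is None or v >= minimum) and (maximum is None or v <= maximum)
    ]


def apply_category_constraint(values, allowed):
    """Exclude categorical values that are not in the allowed set."""
    return [v for v in values if v in allowed]


# A 0-100 quality score with two out-of-range readings.
scores = apply_valid_range([87.0, -3.0, 100.0, 140.0], minimum=0.0, maximum=100.0)

# Sentiment labels with one unrecognized category.
labels = apply_category_constraint(
    ["positive", "neutral", "confused", "negative"],
    allowed={"positive", "neutral", "negative", "mixed"},
)
```

Dropping rather than clamping keeps a single corrupt reading from pulling an average toward a bound.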
Query API
Seven endpoints serve computed metrics:
List metrics - Latest value for each active metric in the workspace. Filterable by entity type and entity ID.
Get metric - Values for a specific metric with optional date range and entity type filtering.
Metric trend - Time-series data for a specific metric (configurable lookback, up to 365 days) with entity type scoping.
Freshness status - Per-metric freshness: when each metric was last computed, latest period covered, and data point count.
Metric catalog - All available metrics (built-in and custom) with their configuration: type, source, extraction mode, channel scope, unit, and granularity.
Evaluate metric - Run a metric definition against a specific call or simulation session without persisting results. Returns the computed value inline. Useful for testing metric definitions before deployment and previewing per-call scores.
Call metrics - Fetch the latest metric values for a specific call, providing a per-interaction metric snapshot for call detail views.
All endpoints support entity type scoping in addition to entity ID filtering, so metrics can be queried by the type of entity they apply to (patient, appointment, call) as well as a specific entity instance. Endpoints are workspace-scoped and respect Platform API authentication and permissions.
Post-Session Evaluation
In addition to operational metrics, the platform supports conversation-scoped LLM evaluations. After each conversation, an evaluation judge scores the interaction against your defined criteria. These post-session evaluations answer: "How well did the agent handle this specific conversation?"
Defined per organization with versioned evaluation criteria
Three scoring types: boolean (pass/fail), numerical (bounded range), categorical (predefined labels)
Each evaluation produces a value, justification, and transcript references
Evaluated post-session, during simulations, or on-demand via manual evaluation
Used for quality gates, simulation scoring, and human calibration workflows
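The properties listed above suggest a result record with a value, a justification, and transcript references, where the value must agree with the scoring type. The sketch below models that with hypothetical names; the real evaluation schema is not given in this document.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class EvaluationResult:
    # Hypothetical shape based on the properties listed in this section.
    criterion: str
    scoring_type: str                # "boolean", "numerical", or "categorical"
    value: Any
    justification: str
    transcript_refs: List[str] = field(default_factory=list)


def is_valid(result, bounds=None, labels=None):
    """Check that the value matches the scoring type's constraints."""
    value = result.value
    if result.scoring_type == "boolean":
        return isinstance(value, bool)
    if result.scoring_type == "numerical":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or bounds is None:
            return False
        low, high = bounds
        return low <= value <= high
    if result.scoring_type == "categorical":
        return labels is not None and value in labels
    return False


empathy = EvaluationResult(
    criterion="empathy",
    scoring_type="numerical",
    value=8,
    justification="Agent acknowledged the caller's concern before proceeding.",
    transcript_refs=["turn-4", "turn-7"],
)
```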
Post-session evaluation metrics are the right tool for conversation quality evaluation - scoring individual interactions against rubrics. See Metrics and Quality for how evaluation metrics work with simulations and deployment gates.
The metric store can incorporate post-session evaluation results as a source, so evaluation scores feed into aggregate quality trends.
Developer Guide - For metric settings API endpoints, custom metric creation, on-demand evaluation, and query parameters, see the Metric Store developer guide.

