Metric Store

Config-driven metric infrastructure - 42 built-in metrics across six categories, custom AI-powered metrics, per-metric latency tiers, and freshness SLAs across all channels.

Billing Meters

The platform meters usage across multiple dimensions for billing purposes. Each meter tracks a specific resource consumption signal and is aggregated per customer per billing period.

Production Meters

| Meter | Unit | What It Tracks |
| --- | --- | --- |
| Voice Minutes | minutes | Total duration of voice calls |
| Call Count | calls | Number of voice calls |
| LLM Input Tokens | tokens | Tokens sent to language models |
| LLM Output Tokens | tokens | Tokens received from language models |
| SMS Messages | messages | Outbound SMS messages sent |
| Call Recording Minutes | minutes | Duration of recorded calls |
| Completed Calls | calls | Calls that reached a terminal state |
| Quality-Weighted Calls | calls | Calls weighted by quality score |
| Surface Submissions | submissions | Completed surface form submissions |
| Message Count | messages | User-facing messages sent by the agent |
| Action Count | actions | Tool executions during conversations |
| Conversation Count | conversations | Distinct conversation sessions |

Simulation Meters

Every production meter has a corresponding simulation variant. Simulation usage is identified automatically and tracked separately, so production and testing costs are clearly distinguished. Simulation meters use the same units and aggregation logic as their production counterparts.
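As a rough illustration of per-customer, per-period aggregation with simulation usage kept apart, here is a minimal Python sketch. The `UsageEvent` fields, the meter names, and the `simulation_` prefix are assumptions made for the example, not the platform's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    # All field names here are hypothetical.
    customer_id: str
    meter: str           # e.g. "voice_minutes"
    quantity: float
    period: str          # billing period label, e.g. "2024-05"
    is_simulation: bool


def aggregate_usage(events):
    """Sum usage per (customer, period, meter), keeping simulation separate."""
    totals = defaultdict(float)
    for e in events:
        # Simulation usage is routed to a parallel meter with the same unit,
        # so it never mixes with production totals.
        meter = f"simulation_{e.meter}" if e.is_simulation else e.meter
        totals[(e.customer_id, e.period, meter)] += e.quantity
    return dict(totals)


events = [
    UsageEvent("acme", "voice_minutes", 3.5, "2024-05", False),
    UsageEvent("acme", "voice_minutes", 1.5, "2024-05", False),
    UsageEvent("acme", "voice_minutes", 2.0, "2024-05", True),
]
totals = aggregate_usage(events)
```

In this sketch the production and simulation variants share one aggregation path and differ only in the meter key they are written to.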

Per-Conversation Billing Detail

For audit and dispute resolution, the platform provides a per-conversation billing breakdown. Each conversation record includes message count, action count, voice minutes, token usage, quality score, completion status, call direction, and the conversation time range. This eliminates the need to manually correlate individual events when investigating billing questions.

Built-in Dashboards

The platform ships a default Universal Metric Store dashboard that provides immediate visibility into metric health without any configuration. The dashboard is available in the Developer Console under the metrics section and includes:

  • Summary row - Active metric count, total value points, scored events, average confidence, and last computation time.

  • Intensity heatmap - Daily averages of numerical and boolean metrics displayed as a heatmap, showing trends and gaps at a glance.

  • Type distribution - Pie chart showing the mix of numerical, categorical, and boolean metric values.

  • Source and scope breakdown - Bar chart of value volume grouped by source (production or simulation) and entity scope.

  • Staleness table - The least-recently-computed metrics, with time since last update, so teams can spot metrics that may have stopped refreshing.

  • Entity-scoped values - A detail table of the most recent metric values broken down by entity, service, run, and session.

The dashboard supports filters for time window (7 days, 30 days, 90 days, or 1 year), source (production, simulation, or all), scope (all, workspace aggregate, or entity scoped), and metric type (all, numerical, categorical, or boolean). Filters apply across all panels simultaneously.

The dashboard renders automatically once the workspace has metric data ingested - no additional setup is required.

Metrics are scoped by source (production or simulation) and entity (aggregate workspace-level or per-entity). This means simulation runs produce their own metric values that do not interfere with production metrics, and you can drill down from workspace-level aggregates to individual call or session scores.

Computed metrics are synchronized from the analytics warehouse to the operational database on a recurring schedule. The sync uses a targeted replace strategy: only metric keys present in the current batch are replaced, so metrics from other pipelines or sources are unaffected. This handles period granularity changes cleanly - for example, when daily metrics are replaced by hourly metrics for the same key, all old daily rows are removed and the new hourly rows are inserted in a single atomic operation. Readers never see an empty or partially-updated state during the sync.
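The targeted replace strategy can be sketched with Python's built-in `sqlite3` module. The table name `metric_values`, its columns, and the `targeted_replace` helper are assumptions for illustration; the platform's actual operational database and schema are not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout for synced metric values.
conn.execute(
    "CREATE TABLE metric_values ("
    "workspace TEXT, metric_key TEXT, granularity TEXT, period_start TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO metric_values VALUES (?, ?, ?, ?, ?)",
    [
        ("ws1", "quality_score", "daily", "2024-05-01", 81.0),
        ("ws1", "event_volume", "daily", "2024-05-01", 1200.0),
    ],
)
conn.commit()


def targeted_replace(conn, workspace, batch):
    """Replace only the metric keys present in `batch`, in one transaction."""
    keys = sorted({row["metric_key"] for row in batch})
    if not keys:
        return
    placeholders = ",".join("?" for _ in keys)
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "DELETE FROM metric_values "
            f"WHERE workspace = ? AND metric_key IN ({placeholders})",
            [workspace, *keys],
        )
        conn.executemany(
            "INSERT INTO metric_values VALUES (?, ?, ?, ?, ?)",
            [
                (workspace, r["metric_key"], r["granularity"], r["period_start"], r["value"])
                for r in batch
            ],
        )


# Daily rows for quality_score are swapped for hourly rows; event_volume is untouched.
targeted_replace(conn, "ws1", [
    {"metric_key": "quality_score", "granularity": "hourly",
     "period_start": "2024-05-01T00:00", "value": 80.0},
    {"metric_key": "quality_score", "granularity": "hourly",
     "period_start": "2024-05-01T01:00", "value": 82.0},
])
rows = conn.execute(
    "SELECT metric_key, granularity FROM metric_values ORDER BY metric_key, period_start"
).fetchall()
```

Because the delete and the inserts share one transaction, a failure part-way through leaves the previous rows in place rather than an empty key.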

A companion freshness table tracks the last sync time and value count per metric key per workspace, so downstream consumers can verify data currency.

The metric store is the platform's unified analytics infrastructure. It computes, stores, and serves metrics across all channels - voice calls, text sessions, surface submissions, and data quality events - from a single config-driven pipeline. Metrics are defined through workspace settings, not code. Adding a new metric, changing an aggregation window, or retiring an unused metric is a configuration change that takes effect on the next pipeline refresh.

How It Works

Figure: Metric store pipeline (event sources, config-driven extraction and aggregation, dashboards and query API)

The pipeline reads metric definitions from workspace settings, extracts values from source events using the configured extraction mode, aggregates them at the configured granularity, and writes results to the metric store. The query API serves the computed metrics to dashboards, alerting systems, and external consumers.

Every workspace ships with a set of pre-built dashboards that visualize the built-in metrics out of the box. These default dashboards cover voice quality trends, surface completion rates, data quality distribution, and cross-channel engagement - the operational views most deployments need from day one. Default dashboards use the same embeddable component as custom dashboards, so they can be extended, customized, or embedded in external tools.

Built-In Metrics

Every workspace ships with 42 pre-configured metrics across six categories. Built-in metrics cover the operational dimensions that most deployments need out of the box.

Voice Intelligence

| Metric | Type | What It Measures |
| --- | --- | --- |
| Quality score | Numerical | Composite call quality (0-100) based on latency, silence, barge-ins, loops, escalations, and tool failures |
| Duration | Numerical | Average call duration in seconds |
| Escalation rate | Numerical | Proportion of calls escalated to a human operator |
| Sentiment | Categorical | Caller sentiment distribution (positive, negative, neutral, mixed) |
| Completion reason | Categorical | How calls ended (completed, escalated, abandoned, timeout) |
| Risk level | Categorical | Risk assessment distribution (low, medium, high, critical) |
| Response latency | Numerical | Average time-to-first-byte for audio response |
| Tool failure rate | Numerical | Proportion of tool calls that returned errors |
| Barge-in count | Numerical | Average caller interruptions per call |
| Loop count | Numerical | Average context graph state revisits per call |
| Silence ratio | Numerical | Proportion of call time spent in silence |

Surface Intelligence

| Metric | Type | What It Measures |
| --- | --- | --- |
| Completion rate | Numerical | Proportion of created surfaces that were submitted |
| Open rate | Numerical | Proportion of delivered surfaces that were opened |
| Abandonment rate | Numerical | Proportion of opened surfaces that were not submitted |
| Channel distribution | Categorical | Delivery channel breakdown (SMS, WhatsApp, email, web, voice) |
| Time to complete | Numerical | Average hours from surface creation to submission |

Data Quality

| Metric | Type | What It Measures |
| --- | --- | --- |
| Average confidence | Numerical | Mean confidence score across all events in the workspace |
| Event volume | Numerical | Total events ingested |
| Review approval rate | Numerical | Proportion of reviewed events that were approved |

Cross-Channel

| Metric | Type | What It Measures |
| --- | --- | --- |
| Patient contact count | Numerical | Total patient contacts across all channels |
| Patient response rate | Numerical | Proportion of contacts that received a patient response |

Standard Quality

Five universal metrics apply to every service regardless of use case. Six outcome metrics are scoped to specific product types (scheduling, outbound, coaching, intake, triage, support) and activate automatically when a service is tagged with the matching type.

| Metric | Type | Scope | What It Measures |
| --- | --- | --- | --- |
| Patient sentiment | Categorical | Universal | Caller sentiment (positive, neutral, negative) assessed from the full interaction |
| Conversational naturalness | Numerical (1-10) | Universal | How natural and human-like the conversation felt |
| Conciseness | Numerical (1-10) | Universal | Whether the agent communicated efficiently without unnecessary repetition |
| Information accuracy | Boolean | Universal | Whether facts and data referenced during the call were correct (null when no tool calls occurred) |
| Safety | Categorical | Universal | Safety event classification (no event, handled, warning, critical) |
| Scheduling outcome | Categorical | Scheduling | Whether the scheduling objective was achieved |
| Outbound outcome | Categorical | Outbound | Whether the outbound call objective was achieved |
| Coaching outcome | Categorical | Coaching | Whether the coaching objective was achieved |
| Intake outcome | Categorical | Intake | Whether the intake objective was achieved |
| Triage outcome | Categorical | Triage | Whether the triage objective was achieved |
| Support outcome | Categorical | Support | Whether the support objective was achieved |

Standard quality metrics use AI-powered evaluation against each call's intelligence data, running at batch cadence with a balanced model tier. The product-type outcome metrics only compute for calls on services tagged with the matching type, so a scheduling service will not produce triage outcomes.
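The activation rule for product-type outcome metrics can be pictured as a simple tag match. This is a minimal sketch: the metric key spellings and the `active_standard_metrics` helper are hypothetical, and only the five-universal-plus-matching-outcome behavior comes from the description above.

```python
# Hypothetical key spellings for the five universal metrics.
UNIVERSAL_METRICS = [
    "patient_sentiment",
    "conversational_naturalness",
    "conciseness",
    "information_accuracy",
    "safety",
]

# Product types that have an outcome metric, as listed above.
OUTCOME_TYPES = {"scheduling", "outbound", "coaching", "intake", "triage", "support"}


def active_standard_metrics(service_tags):
    """Return the universal metrics plus one outcome metric per matching tag."""
    matching = sorted(OUTCOME_TYPES & set(service_tags))
    return UNIVERSAL_METRICS + [f"{t}_outcome" for t in matching]


scheduling_metrics = active_standard_metrics(["scheduling"])
```

A service tagged only as scheduling gets six metrics in this sketch: the five universal ones and `scheduling_outcome`.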

Voice Quality Evaluation

Ten metrics come from the Voice Judge, each scoring a specific dimension of audio-native call quality on a 0.0-1.0 scale. They aggregate per-call Voice Judge scores into workspace-level trends at daily granularity.

| Metric | What It Measures |
| --- | --- |
| Pronunciation | Speech clarity and correctness |
| Pacing | Conversation tempo appropriateness |
| Clarity | Communication clarity |
| Warmth and Tone | Emotional responsiveness and empathy |
| Latency and Dead Air | Response timing and silence management |
| Filler and Silence | Appropriate use of filler speech and pauses |
| Interruption Handling | Barge-in detection and recovery |
| Audio Consistency | Audio quality and volume stability |
| Accent Quality | Pronunciation consistency across accents |
| Voice Identity | Voice persona consistency |

Built-in metrics can be customized per workspace: you can adjust the aggregation window, change the freshness target, set valid ranges, or deactivate metrics you do not need. Built-in metric keys and types cannot be changed.
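One way to picture the customization rule is an override check that accepts the adjustable settings and rejects changes to key or type. The `MetricConfig` fields and the `apply_override` helper below are assumptions for illustration, not the settings API's actual shape.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class MetricConfig:
    # Hypothetical field names for a built-in metric's settings.
    key: str
    type: str
    aggregation_window: str
    freshness_minutes: int
    valid_range: Optional[Tuple[float, float]]
    active: bool


# Settings the text above describes as adjustable per workspace.
MUTABLE_FIELDS = {"aggregation_window", "freshness_minutes", "valid_range", "active"}


def apply_override(config, override):
    """Return a new config with the override applied; key and type are locked."""
    locked = set(override) - MUTABLE_FIELDS
    if locked:
        raise ValueError(f"cannot override: {sorted(locked)}")
    return replace(config, **override)


builtin = MetricConfig("quality_score", "numerical", "daily", 60, (0.0, 100.0), True)
tightened = apply_override(builtin, {"freshness_minutes": 5})
```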

Custom Metrics

Workspaces can define up to 50 custom metrics through the settings API. Custom metrics use the same pipeline, dashboards, and query API as built-in metrics. No code changes are required - custom metrics are configuration that the pipeline reads on each refresh.

Custom metrics can evaluate two different data sources depending on what they measure:

  • Conversation metrics - Evaluate conversation summaries from call intelligence. These metrics run against the same post-call data used by the standard quality metrics, scoring conversations on dimensions you define.

  • Event metrics - Evaluate raw event data from the world model. These metrics run against individual events and extract or classify values from event payloads.

The platform automatically routes each custom metric to the correct data source based on its configuration. Both types produce results in the same metric store format and appear in the same dashboards and analytics surfaces.
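A minimal sketch of source routing and the 50-metric limit follows. The `source` field, the data source labels, and both helper names are assumptions; the text above only says that routing is driven by the metric's configuration.

```python
MAX_CUSTOM_METRICS = 50  # workspace limit stated above

# Hypothetical mapping from a definition's `source` field to a data source label.
DATA_SOURCES = {
    "conversation": "call_intelligence_summaries",
    "event": "world_model_events",
}


def route_metric(definition):
    """Pick the data source a custom metric should be evaluated against."""
    source = definition.get("source")
    if source not in DATA_SOURCES:
        raise ValueError(f"unknown metric source: {source!r}")
    return DATA_SOURCES[source]


def add_custom_metric(registry, definition):
    """Register a definition, enforcing the per-workspace limit."""
    if definition["key"] not in registry and len(registry) >= MAX_CUSTOM_METRICS:
        raise ValueError("custom metric limit reached")
    route_metric(definition)  # validates the source before storing
    registry[definition["key"]] = definition


registry = {}
add_custom_metric(registry, {"key": "protocol_adherence", "source": "conversation"})
add_custom_metric(registry, {"key": "payload_topic", "source": "event"})
```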

Custom metrics are most useful for organization-specific quality criteria that the built-in metrics do not cover. Common examples:

  • Protocol adherence - Did the agent follow the required intake, triage, or scheduling protocol?

  • Topic classification - What was the primary reason for the call?

  • Empathy and tone - How empathetic or warm was the agent beyond what the built-in naturalness score captures?

  • Regulatory compliance - Did the agent provide required disclaimers or collect required consents?

  • Domain-specific outcomes - Did the coaching session address the patient's stated goals?

The recommended workflow for a new custom metric is: define it, test it against a few real conversations using the on-demand evaluate endpoint, iterate on the prompt until the scores match human judgment, then let the batch pipeline compute it across all conversations going forward.

Metric Types

Three value types are supported:

| Type | Values | Use Case |
| --- | --- | --- |
| Numerical | Float values (e.g., 0.0 - 100.0) | Scores, rates, durations, counts |
| Categorical | Predefined string values | Classifications, status distributions, channel breakdowns |
| Boolean | True / false | Pass/fail checks, presence detection |

Extraction Modes

Each metric defines how its value is extracted from source events. Seven extraction modes cover the range from fast deterministic lookups to complex AI-powered analysis.

| Mode | How It Works | Cost | Best For |
| --- | --- | --- | --- |
| Static | Extracts a value from a known field path in the event data | Free | Structured data: durations, counts, status codes, confidence scores |
| AI classification | Classifies event content into predefined categories | Free (platform-managed) | Sentiment analysis, risk categorization, topic classification |
| AI extraction | Extracts structured fields from unstructured text | Free (platform-managed) | Pulling specific data points from transcripts or notes |
| AI sentiment | Analyzes emotional tone of event content | Free (platform-managed) | Caller satisfaction, conversation tone tracking |
| AI query | Runs a custom prompt against event content | Varies by model tier | Complex evaluation: clinical accuracy, protocol adherence, custom quality criteria |
| Judge | Runs a custom prompt with entity context variables substituted from the patient's projected state | Varies by model tier | Evaluations that need patient-specific context: "Was the agent's advice consistent with the patient's {$.medications}?" |
| Ratio | Computes the ratio between two event type counts | Free | Conversion rates, success rates, completion rates |

Platform-managed AI modes (classification, extraction, sentiment) use the platform's built-in models at no additional cost. AI query and Judge metrics use the model tier you select.
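The deterministic parts of extraction can be sketched in a few lines of Python. The helpers below are illustrative assumptions: `get_path` handles only simple dotted paths, `render_judge_prompt` treats `{$.field}` placeholders as dotted lookups into the entity state, and `ratio` divides two counts. The platform's actual path syntax and substitution rules may differ.

```python
import re


def get_path(data, path):
    """Resolve a dotted path such as '$.call.duration_seconds' in nested dicts."""
    if path.startswith("$."):
        path = path[2:]
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # missing field: no value extracted
        current = current[part]
    return current


def render_judge_prompt(template, entity_state):
    """Substitute {$.field} placeholders with values from the entity state."""
    def substitute(match):
        value = get_path(entity_state, match.group(1))
        return "unknown" if value is None else str(value)

    return re.sub(r"\{(\$\.[\w.]+)\}", substitute, template)


def ratio(numerator_count, denominator_count):
    """Ratio of two event counts; None when the denominator is zero."""
    if denominator_count == 0:
        return None
    return numerator_count / denominator_count


event = {"call": {"duration_seconds": 182}}
duration = get_path(event, "$.call.duration_seconds")
prompt = render_judge_prompt(
    "Was the advice consistent with the patient's {$.medications}?",
    {"medications": "metformin"},
)
```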

Model Tiers

For the AI query and Judge extraction modes, you choose a model tier that balances cost and reasoning depth:

| Tier | Best For |
| --- | --- |
| Free | Platform-managed extraction modes (classification, extraction, sentiment) |
| Fast | Simple classification, low-latency decisions |
| Balanced | Quality scoring, moderate reasoning |
| Max | Complex multi-step analysis, clinical accuracy evaluation |

The platform resolves model tiers to appropriate compute resources at execution time. You do not select specific models - just the level of reasoning depth your metric requires.

Channel Scoping

Metrics can be scoped to specific channels without writing filters:

| Scope | What It Includes |
| --- | --- |
| All | Every event across all channels |
| Voice | Voice call events only |
| Text | SMS and text session events only |
| Surface | Form submission events only |
| Inbound | Inbound interactions only |
| Outbound | Outbound interactions only |

Channel scoping lets you define the same metric type (e.g., sentiment) separately for voice and text channels to track performance independently.
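Conceptually, a channel scope behaves like a predefined predicate over events. The sketch below assumes each event carries hypothetical `channel` and `direction` fields; the real event shape is not specified here.

```python
# One predicate per scope; field names and channel values are assumptions.
SCOPE_PREDICATES = {
    "all": lambda e: True,
    "voice": lambda e: e.get("channel") == "voice",
    "text": lambda e: e.get("channel") in {"sms", "text"},
    "surface": lambda e: e.get("channel") == "surface",
    "inbound": lambda e: e.get("direction") == "inbound",
    "outbound": lambda e: e.get("direction") == "outbound",
}


def events_in_scope(events, scope):
    """Keep only the events that fall inside the named channel scope."""
    predicate = SCOPE_PREDICATES[scope]
    return [e for e in events if predicate(e)]


events = [
    {"id": 1, "channel": "voice", "direction": "inbound"},
    {"id": 2, "channel": "sms", "direction": "outbound"},
    {"id": 3, "channel": "surface"},
]
voice_ids = [e["id"] for e in events_in_scope(events, "voice")]
text_ids = [e["id"] for e in events_in_scope(events, "text")]
```

Defining a sentiment metric twice, once with the voice scope and once with the text scope, would then amount to running the same extraction over two different filtered event sets.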

Latency Tiers

Each metric can be assigned to a latency tier that determines how quickly new values are available after the source event occurs.

| Tier | Latency | Best For |
| --- | --- | --- |
| Streaming | Seconds | Safety alerts, real-time dashboards, escalation triggers |
| Near-realtime | Minutes | Operational dashboards, shift-level reporting |
| Batch | Hourly / daily | Trend analysis, cost-sensitive metrics, historical reporting |

Safety-critical metrics (escalation rate, risk level) typically run in the streaming tier. Nuanced quality metrics that require AI evaluation run in the batch tier where they have more compute time. Volume-based metrics (event counts, contact rates) can run in any tier depending on how quickly you need the data.

Aggregation Granularity

Metrics aggregate at either hourly or daily granularity:

  • Hourly - One data point per workspace per metric per hour. Useful for shift-level monitoring and intraday dashboards.

  • Daily - One data point per workspace per metric per day. Cheaper to compute and store. Sufficient for trend analysis and reporting.

Voice intelligence metrics default to hourly granularity. Data quality metrics default to daily.
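The difference between the two granularities comes down to how timestamps are truncated before values are grouped. A minimal sketch, assuming UTC timestamps and a mean aggregation (both choices are assumptions for the example):

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its hourly or daily bucket."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def average_by_bucket(points, granularity):
    """Group (timestamp, value) pairs into buckets and average each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[bucket_start(ts, granularity)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


utc = timezone.utc
points = [
    (datetime(2024, 5, 1, 9, 5, tzinfo=utc), 80.0),
    (datetime(2024, 5, 1, 9, 50, tzinfo=utc), 90.0),
    (datetime(2024, 5, 1, 14, 0, tzinfo=utc), 70.0),
]
hourly = average_by_bucket(points, "hourly")
daily = average_by_bucket(points, "daily")
```

The same three values yield two hourly data points but a single daily one, which is why daily granularity is cheaper to compute and store.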

Freshness SLAs

Each metric has a configurable freshness target - the maximum acceptable time between the source event occurring and the computed metric value being available in the query API. The platform tracks freshness per metric and exposes it through a dedicated freshness endpoint.

The default freshness target is 60 minutes. You can tighten it to 5 minutes for critical metrics or relax it to 24 hours for batch reporting metrics.
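A freshness check reduces to comparing the lag between the source event and the value's availability against the target. The function names below are hypothetical; only the 60-minute default and the 5-minute example come from the text above.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TARGET = timedelta(minutes=60)  # default freshness target stated above


def freshness_lag(event_time, available_at):
    """Time between the source event and the metric value becoming queryable."""
    return available_at - event_time


def meets_target(event_time, available_at, target=DEFAULT_TARGET):
    """True when the observed lag is within the freshness target."""
    return freshness_lag(event_time, available_at) <= target


event_time = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
available_at = datetime(2024, 5, 1, 9, 20, tzinfo=timezone.utc)
within_default = meets_target(event_time, available_at)
within_critical = meets_target(event_time, available_at, target=timedelta(minutes=5))
```

Here a 20-minute lag satisfies the default target but would violate a 5-minute target set for a critical metric.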

Aggregation Functions

Metrics support standard aggregation functions:

| Function | What It Computes |
| --- | --- |
| Count | Number of events matching the metric's filters |
| Sum | Total of extracted numerical values |
| Average | Mean of extracted numerical values |
| Min / Max | Extremes of extracted numerical values |
| Count distinct | Number of unique values |
| Ratio | Count of one event type divided by count of another |
| Rate | Proportion of events meeting a condition |
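For reference, the value-based aggregation functions and the rate function can be written in a few lines of Python. The function keys and the choice to return `None` for empty inputs are assumptions made for this sketch.

```python
from statistics import fmean

# Aggregations over a list of extracted values; key names are hypothetical.
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "average": fmean,
    "min": min,
    "max": max,
    "count_distinct": lambda values: len(set(values)),
}


def aggregate(function, values):
    """Apply a named aggregation; undefined results on empty input become None."""
    if function not in AGGREGATIONS:
        raise ValueError(f"unknown aggregation: {function}")
    if not values and function in {"average", "min", "max"}:
        return None
    return AGGREGATIONS[function](values)


def rate(events, condition):
    """Proportion of events for which `condition` returns True."""
    if not events:
        return None
    return sum(1 for e in events if condition(e)) / len(events)


durations = [60.0, 120.0, 180.0]
calls = [{"escalated": True}, {"escalated": False}, {"escalated": False}, {"escalated": False}]
escalation_rate = rate(calls, lambda c: c["escalated"])
```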

Data Quality

The metric pipeline enforces data quality at several levels:

  • Valid ranges - Numerical metrics can define minimum and maximum bounds. Values outside the range are dropped before aggregation.

  • Category constraints - Categorical metrics can define allowed values. Unrecognized categories are excluded.

  • Confidence weighting - Source event confidence scores are available for metrics that need confidence-aware aggregation.

  • Production filtering - Simulation, playground, and test data are excluded from production metrics automatically.
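The four checks above can be sketched as small filter functions. The `source` field, the excluded source labels, and the weighted-mean formula are assumptions for illustration; the text does not specify how confidence-aware aggregation is computed.

```python
# Hypothetical labels for non-production data.
EXCLUDED_SOURCES = {"simulation", "playground", "test"}


def production_only(events):
    """Drop simulation, playground, and test events."""
    return [e for e in events if e.get("source") not in EXCLUDED_SOURCES]


def in_valid_range(values, low, high):
    """Drop numerical values outside the inclusive [low, high] range."""
    return [v for v in values if low <= v <= high]


def allowed_categories(values, allowed):
    """Drop categorical values that are not in the allowed set."""
    return [v for v in values if v in allowed]


def confidence_weighted_mean(pairs):
    """Mean of values weighted by confidence; pairs are (value, confidence)."""
    total_weight = sum(conf for _, conf in pairs)
    if total_weight == 0:
        return None
    return sum(value * conf for value, conf in pairs) / total_weight


scores = in_valid_range([50.0, 120.0, -3.0, 99.5], 0.0, 100.0)
labels = allowed_categories(
    ["positive", "n/a", "neutral"], {"positive", "neutral", "negative"}
)
weighted = confidence_weighted_mean([(100.0, 0.75), (0.0, 0.25)])
```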

Query API

Seven endpoints serve computed metrics:

| Endpoint | What It Returns |
| --- | --- |
| List metrics | Latest value for each active metric in the workspace. Filterable by entity type and entity ID. |
| Get metric | Values for a specific metric with optional date range and entity type filtering |
| Metric trend | Time-series data for a specific metric (configurable lookback, up to 365 days) with entity type scoping |
| Freshness status | Per-metric freshness: when each metric was last computed, latest period covered, and data point count |
| Metric catalog | All available metrics (built-in and custom) with their configuration: type, source, extraction mode, channel scope, unit, and granularity |
| Evaluate metric | Run a metric definition against a specific call or simulation session without persisting results. Returns the computed value inline. Useful for testing metric definitions before deployment and previewing per-call scores. |
| Call metrics | Fetch the latest metric values for a specific call, providing a per-interaction metric snapshot for call detail views |

All endpoints support entity type scoping in addition to entity ID filtering, so metrics can be queried by the type of entity they apply to (patient, appointment, call) as well as a specific entity instance. Endpoints are workspace-scoped and respect Platform API authentication and permissions.

Post-Session Evaluation

In addition to operational metrics, the platform supports conversation-scoped LLM evaluations. After each conversation, an evaluation judge scores the interaction against your defined criteria. These post-session evaluations answer: "How well did the agent handle this specific conversation?"

  • Defined per organization with versioned evaluation criteria

  • Three scoring types: boolean (pass/fail), numerical (bounded range), categorical (predefined labels)

  • Each evaluation produces a value, justification, and transcript references

  • Evaluated post-session, during simulations, or on-demand via manual evaluation

  • Used for quality gates, simulation scoring, and human calibration workflows
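The three scoring types and the shape of an evaluation result can be sketched as follows. The class and field names are hypothetical; only the three scoring types and the value, justification, and transcript reference outputs come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Criterion:
    name: str
    scoring_type: str                              # "boolean", "numerical", or "categorical"
    bounds: Optional[Tuple[float, float]] = None   # numerical only
    labels: Optional[List[str]] = None             # categorical only


@dataclass
class EvaluationResult:
    value: Union[bool, float, str]
    justification: str
    transcript_refs: List[str] = field(default_factory=list)


def is_valid(criterion, result):
    """Check that a result's value fits the criterion's scoring type."""
    v = result.value
    if criterion.scoring_type == "boolean":
        return isinstance(v, bool)
    if criterion.scoring_type == "numerical":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False
        low, high = criterion.bounds
        return low <= v <= high
    if criterion.scoring_type == "categorical":
        return v in criterion.labels
    return False


empathy = Criterion("empathy", "numerical", bounds=(1.0, 10.0))
result = EvaluationResult(8.0, "Agent acknowledged the caller's concern.", ["turn-12"])
```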

Post-session evaluation metrics are the right tool for conversation quality evaluation - scoring individual interactions against rubrics. See Metrics and Quality for how evaluation metrics work with simulations and deployment gates.

The metric store can incorporate post-session evaluation results as a source, so evaluation scores feed into aggregate quality trends.

Developer Guide - For metric settings API endpoints, custom metric creation, on-demand evaluation, and query parameters, see the Metric Store developer guide.
