# Metric Store

## Billing Meters

The platform meters usage across multiple dimensions for billing purposes. Each meter tracks a specific resource consumption signal and is aggregated per customer per billing period.

### Production Meters

| Meter                  | Unit          | What It Tracks                         |
| ---------------------- | ------------- | -------------------------------------- |
| Voice Minutes          | minutes       | Total duration of voice calls          |
| Call Count             | calls         | Number of voice calls                  |
| LLM Input Tokens       | tokens        | Tokens sent to language models         |
| LLM Output Tokens      | tokens        | Tokens received from language models   |
| SMS Messages           | messages      | Outbound SMS messages sent             |
| Call Recording Minutes | minutes       | Duration of recorded calls             |
| Completed Calls        | calls         | Calls that reached a terminal state    |
| Quality-Weighted Calls | calls         | Calls weighted by quality score        |
| Surface Submissions    | submissions   | Completed surface form submissions     |
| Message Count          | messages      | User-facing messages sent by the agent |
| Action Count           | actions       | Tool executions during conversations   |
| Conversation Count     | conversations | Distinct conversation sessions         |

### Simulation Meters

Every production meter has a corresponding simulation variant. Simulation usage is identified automatically and tracked separately, so production and testing costs are clearly distinguished. Simulation meters use the same units and aggregation logic as their production counterparts.

### Per-Conversation Billing Detail

For audit and dispute resolution, the platform provides a per-conversation billing breakdown. Each conversation record includes message count, action count, voice minutes, token usage, quality score, completion status, call direction, and the conversation time range. This eliminates the need to manually correlate individual events when investigating billing questions.
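
As a worked illustration, each meter maps directly to a billable line item, so a disputed invoice can be reconciled conversation by conversation rather than event by event. The field names and unit prices in this sketch are hypothetical, not the platform's billing schema or rates.

```python
# Illustrative only: field names and unit prices are assumptions,
# not the platform's actual billing schema or rates.
conversation = {
    "voice_minutes": 4.2,
    "llm_input_tokens": 3150,
    "llm_output_tokens": 820,
    "message_count": 12,
    "action_count": 3,
}

unit_prices = {  # hypothetical per-unit rates
    "voice_minutes": 0.05,
    "llm_input_tokens": 0.000002,
    "llm_output_tokens": 0.000008,
    "message_count": 0.001,
    "action_count": 0.002,
}

# One line item per meter; the conversation total is the sum.
line_items = {meter: conversation[meter] * price for meter, price in unit_prices.items()}
print(line_items, round(sum(line_items.values()), 4))
```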

## Built-in Dashboards

The platform ships a default Universal Metric Store dashboard that provides immediate visibility into metric health without any configuration. The dashboard is available in the Developer Console under the metrics section and includes:

* **Summary row** - Active metric count, total value points, scored events, average confidence, and last computation time.
* **Intensity heatmap** - Daily averages of numerical and boolean metrics displayed as a heatmap, showing trends and gaps at a glance.
* **Type distribution** - Pie chart showing the mix of numerical, categorical, and boolean metric values.
* **Source and scope breakdown** - Bar chart of value volume grouped by source (production or simulation) and entity scope.
* **Staleness table** - The least-recently-computed metrics, with time since last update, so teams can spot metrics that may have stopped refreshing.
* **Entity-scoped values** - A detail table of the most recent metric values broken down by entity, service, run, and session.

The dashboard supports filters for time window (7 days, 30 days, 90 days, or 1 year), source (production, simulation, or all), scope (all, workspace aggregate, or entity scoped), and metric type (all, numerical, categorical, or boolean). Filters apply across all panels simultaneously.

The dashboard renders automatically once the workspace has ingested metric data - no additional setup is required.

{% hint style="info" %}
Metrics are scoped by **source** (production or simulation) and **entity** (aggregate workspace-level or per-entity). This means simulation runs produce their own metric values that do not interfere with production metrics, and you can drill down from workspace-level aggregates to individual call or session scores.
{% endhint %}

Computed metrics are synchronized from the analytics warehouse to the operational database on a recurring schedule. The sync uses a targeted replace strategy: only metric keys present in the current batch are replaced, so metrics from other pipelines or sources are unaffected. This handles period granularity changes cleanly - for example, when daily metrics are replaced by hourly metrics for the same key, all old daily rows are removed and the new hourly rows are inserted in a single atomic operation. Readers never see an empty or partially updated state during the sync.

A companion freshness table tracks the last sync time and value count per metric key per workspace, so downstream consumers can verify data currency.
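
A minimal sketch of the targeted replace, assuming a Postgres-style operational store. The table names (`metric_values`, `metric_freshness`) and columns are illustrative, not the platform's schema; what matters is that the delete and insert for each metric key share one transaction, which is why readers never observe an empty or half-written state.

```python
import psycopg2  # assumption: a Postgres-compatible operational database

def sync_metric_batch(conn, workspace_id: str, batch: dict[str, list[tuple]]) -> None:
    """Replace only the metric keys present in this batch; other keys are untouched."""
    with conn:  # one transaction: commits on success, rolls back on error
        with conn.cursor() as cur:
            for metric_key, rows in batch.items():
                # Deleting every old row for the key is what makes a daily -> hourly
                # granularity change clean: stale daily rows vanish in the same
                # transaction that inserts the new hourly rows.
                cur.execute(
                    "DELETE FROM metric_values WHERE workspace_id = %s AND metric_key = %s",
                    (workspace_id, metric_key),
                )
                cur.executemany(
                    "INSERT INTO metric_values (workspace_id, metric_key, period_start, value)"
                    " VALUES (%s, %s, %s, %s)",
                    [(workspace_id, metric_key, period, value) for period, value in rows],
                )
                # Companion freshness record: last sync time and value count per key.
                # Assumes a unique index on (workspace_id, metric_key).
                cur.execute(
                    "INSERT INTO metric_freshness (workspace_id, metric_key, synced_at, value_count)"
                    " VALUES (%s, %s, NOW(), %s)"
                    " ON CONFLICT (workspace_id, metric_key)"
                    " DO UPDATE SET synced_at = NOW(), value_count = EXCLUDED.value_count",
                    (workspace_id, metric_key, len(rows)),
                )

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=ops")  # illustrative DSN
    sync_metric_batch(conn, "ws_123", {"quality_score": [("2024-01-01T00:00", 87.5)]})
```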

The metric store is the platform's unified analytics infrastructure. It computes, stores, and serves metrics across all channels - voice calls, text sessions, surface submissions, and data quality events - from a single config-driven pipeline. Metrics are defined through workspace settings, not code. Adding a new metric, changing an aggregation window, or retiring an unused metric is a configuration change that takes effect on the next pipeline refresh.

## How It Works

<figure><img src="/files/2HyuuH2TxC0wZl94wsIp" alt="Metric store pipeline: event sources, config-driven extraction and aggregation, dashboards and query API"><figcaption></figcaption></figure>

The pipeline reads metric definitions from workspace settings, extracts values from source events using the configured extraction mode, aggregates them at the configured granularity, and writes results to the metric store. The query API serves the computed metrics to dashboards, alerting systems, and external consumers.
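
A toy sketch of that loop, limited to static field-path extraction with a couple of invented metric definitions. The real pipeline's internals are not documented here, so treat every name below as illustrative.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class MetricDefinition:      # illustrative subset of a metric config
    key: str
    field_path: str          # static extraction: dotted path into the event payload
    aggregation: str         # "average", "sum", or "count"
    active: bool = True

def extract_static(event: dict, path: str):
    """Static mode: walk a known field path in the event payload."""
    value = event
    for part in path.split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value

def refresh(definitions: list[MetricDefinition], events: list[dict]) -> dict:
    """One config-driven refresh: extract, filter, aggregate."""
    results = {}
    for metric in definitions:
        if not metric.active:
            continue  # retiring a metric is a config change, not a code change
        values = [v for e in events if (v := extract_static(e, metric.field_path)) is not None]
        if metric.aggregation == "average":
            results[metric.key] = mean(values) if values else None
        elif metric.aggregation == "sum":
            results[metric.key] = sum(values)
        else:
            results[metric.key] = len(values)
    return results

# Example: two metric definitions applied to the same event stream.
events = [{"call": {"duration_s": 210, "quality": 88}},
          {"call": {"duration_s": 95, "quality": 74}}]
defs = [MetricDefinition("duration_avg", "call.duration_s", "average"),
        MetricDefinition("quality_avg", "call.quality", "average")]
print(refresh(defs, events))  # {'duration_avg': 152.5, 'quality_avg': 81.0}
```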

Every workspace ships with a set of pre-built dashboards that visualize the built-in metrics out of the box. These default dashboards cover voice quality trends, surface completion rates, data quality distribution, and cross-channel engagement - the operational views most deployments need from day one. Default dashboards use the same embeddable component as custom dashboards, so they can be extended, customized, or embedded in external tools.

## Built-in Metrics

Every workspace ships with 42 pre-configured metrics across six categories. Built-in metrics cover the operational dimensions that most deployments need out of the box.

### Voice Intelligence

| Metric                | Type        | What It Measures                                                                                           |
| --------------------- | ----------- | ---------------------------------------------------------------------------------------------------------- |
| **Quality score**     | Numerical   | Composite call quality (0-100) based on latency, silence, barge-ins, loops, escalations, and tool failures |
| **Duration**          | Numerical   | Average call duration in seconds                                                                           |
| **Escalation rate**   | Numerical   | Proportion of calls escalated to a human operator                                                          |
| **Sentiment**         | Categorical | Caller sentiment distribution (positive, negative, neutral, mixed)                                         |
| **Completion reason** | Categorical | How calls ended (completed, escalated, abandoned, timeout)                                                 |
| **Risk level**        | Categorical | Risk assessment distribution (low, medium, high, critical)                                                 |
| **Response latency**  | Numerical   | Average time-to-first-byte for audio response                                                              |
| **Tool failure rate** | Numerical   | Proportion of tool calls that returned errors                                                              |
| **Barge-in count**    | Numerical   | Average caller interruptions per call                                                                      |
| **Loop count**        | Numerical   | Average context graph state revisits per call                                                              |
| **Silence ratio**     | Numerical   | Proportion of call time spent in silence                                                                   |

### Surface Intelligence

| Metric                   | Type        | What It Measures                                              |
| ------------------------ | ----------- | ------------------------------------------------------------- |
| **Completion rate**      | Numerical   | Proportion of created surfaces that were submitted            |
| **Open rate**            | Numerical   | Proportion of delivered surfaces that were opened             |
| **Abandonment rate**     | Numerical   | Proportion of opened surfaces that were not submitted         |
| **Channel distribution** | Categorical | Delivery channel breakdown (SMS, WhatsApp, email, web, voice) |
| **Time to complete**     | Numerical   | Average hours from surface creation to submission             |

### Data Quality

| Metric                   | Type      | What It Measures                                         |
| ------------------------ | --------- | -------------------------------------------------------- |
| **Average confidence**   | Numerical | Mean confidence score across all events in the workspace |
| **Event volume**         | Numerical | Total events ingested                                    |
| **Review approval rate** | Numerical | Proportion of reviewed events that were approved         |

### Cross-Channel

| Metric                    | Type      | What It Measures                                        |
| ------------------------- | --------- | ------------------------------------------------------- |
| **Patient contact count** | Numerical | Total patient contacts across all channels              |
| **Patient response rate** | Numerical | Proportion of contacts that received a patient response |

### Standard Quality

Five universal metrics apply to every service regardless of use case. Six outcome metrics are scoped to specific product types (scheduling, outbound, coaching, intake, triage, support) and activate automatically when a service is tagged with the matching type.

| Metric                         | Type             | Scope      | What It Measures                                                                                  |
| ------------------------------ | ---------------- | ---------- | ------------------------------------------------------------------------------------------------- |
| **Patient sentiment**          | Categorical      | Universal  | Caller sentiment (positive, neutral, negative) assessed from the full interaction                 |
| **Conversational naturalness** | Numerical (1-10) | Universal  | How natural and human-like the conversation felt                                                  |
| **Conciseness**                | Numerical (1-10) | Universal  | Whether the agent communicated efficiently without unnecessary repetition                         |
| **Information accuracy**       | Boolean          | Universal  | Whether facts and data referenced during the call were correct (null when no tool calls occurred) |
| **Safety**                     | Categorical      | Universal  | Safety event classification (no event, handled, warning, critical)                                |
| **Scheduling outcome**         | Categorical      | Scheduling | Whether the scheduling objective was achieved                                                     |
| **Outbound outcome**           | Categorical      | Outbound   | Whether the outbound call objective was achieved                                                  |
| **Coaching outcome**           | Categorical      | Coaching   | Whether the coaching objective was achieved                                                       |
| **Intake outcome**             | Categorical      | Intake     | Whether the intake objective was achieved                                                         |
| **Triage outcome**             | Categorical      | Triage     | Whether the triage objective was achieved                                                         |
| **Support outcome**            | Categorical      | Support    | Whether the support objective was achieved                                                        |

Standard quality metrics use AI-powered evaluation against each call's intelligence data, running at batch cadence with a balanced model tier. The product-type outcome metrics only compute for calls on services tagged with the matching type, so a scheduling service will not produce triage outcomes.

### Voice Quality Evaluation

Ten metrics come from the [Voice Judge](/intelligence-and-analytics/intelligence.md#voice-judge), each scoring a specific dimension of audio-native call quality on a 0.0-1.0 scale. They aggregate per-call Voice Judge scores into workspace-level trends at daily granularity.

| Metric                    | What It Measures                            |
| ------------------------- | ------------------------------------------- |
| **Pronunciation**         | Speech clarity and correctness              |
| **Pacing**                | Conversation tempo appropriateness          |
| **Clarity**               | Communication clarity                       |
| **Warmth and Tone**       | Emotional responsiveness and empathy        |
| **Latency and Dead Air**  | Response timing and silence management      |
| **Filler and Silence**    | Appropriate use of filler speech and pauses |
| **Interruption Handling** | Barge-in detection and recovery             |
| **Audio Consistency**     | Audio quality and volume stability          |
| **Accent Quality**        | Pronunciation consistency across accents    |
| **Voice Identity**        | Voice persona consistency                   |

Built-in metrics can be customized per workspace: you can adjust the aggregation window, change the freshness target, set valid ranges, or deactivate metrics you do not need. Built-in metric keys and types cannot be changed.

## Custom Metrics

Workspaces can define up to 50 custom metrics through the settings API. Custom metrics use the same pipeline, dashboards, and query API as built-in metrics. No code changes are required - custom metrics are configuration that the pipeline reads on each refresh.

Custom metrics can evaluate two different data sources depending on what they measure:

* **Conversation metrics** - Evaluate conversation summaries from call intelligence. These metrics run against the same post-call data used by the standard quality metrics, scoring conversations on dimensions you define.
* **Event metrics** - Evaluate raw event data from the world model. These metrics run against individual events and extract or classify values from event payloads.

The platform automatically routes each custom metric to the correct data source based on its configuration. Both types produce results in the same metric store format and appear in the same dashboards and analytics surfaces.
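
A hedged sketch of defining a custom conversation metric through the settings API. The host, endpoint path, and field names are assumptions pieced together from this page's descriptions; the [developer guide](https://docs.amigo.ai/developer-guide/platform-api/metric-store) has the authoritative schema.

```python
import requests

# Assumed endpoint and payload shape - illustrative, not the authoritative contract.
payload = {
    "key": "protocol_adherence",
    "type": "boolean",                # numerical | categorical | boolean
    "data_source": "conversation",    # conversation summaries vs. raw events
    "extraction_mode": "ai_query",
    "prompt": "Did the agent follow the required intake protocol? Answer true or false.",
    "model_tier": "balanced",
    "channel_scope": "voice",
    "granularity": "daily",
    "latency_tier": "batch",
}

resp = requests.post(
    "https://api.example.com/v1/workspaces/{workspace_id}/settings/metrics",
    json=payload,
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
```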

Custom metrics are most useful for organization-specific quality criteria that the built-in metrics do not cover. Common examples:

* **Protocol adherence** - Did the agent follow the required intake, triage, or scheduling protocol?
* **Topic classification** - What was the primary reason for the call?
* **Empathy and tone** - How empathetic or warm was the agent beyond what the built-in naturalness score captures?
* **Regulatory compliance** - Did the agent provide required disclaimers or collect required consents?
* **Domain-specific outcomes** - Did the coaching session address the patient's stated goals?

The recommended workflow for a new custom metric is: define it, test it against a few real conversations using the on-demand evaluate endpoint, iterate on the prompt until the scores match human judgment, then let the batch pipeline compute it across all conversations going forward.
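
For the testing step, a hedged sketch of calling the on-demand evaluate endpoint against a single call. The path and response fields are assumptions; nothing is persisted (see the Evaluate metric row in the Query API table below).

```python
import requests

# Assumed endpoint shape - illustrative only.
resp = requests.post(
    "https://api.example.com/v1/workspaces/{workspace_id}/metrics/evaluate",
    json={"metric_key": "protocol_adherence", "call_id": "<call-id>"},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
# Compare the returned value against human judgment before enabling batch compute.
print(resp.json().get("value"))
```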

### Metric Types

Three value types are supported:

| Type            | Values                           | Use Case                                                  |
| --------------- | -------------------------------- | --------------------------------------------------------- |
| **Numerical**   | Float values (e.g., 0.0 - 100.0) | Scores, rates, durations, counts                          |
| **Categorical** | Predefined string values         | Classifications, status distributions, channel breakdowns |
| **Boolean**     | True / false                     | Pass/fail checks, presence detection                      |

### Extraction Modes

Each metric defines how its value is extracted from source events. Seven extraction modes cover the range from fast deterministic lookups to complex AI-powered analysis.

| Mode                  | How It Works                                                                                      | Cost                    | Best For                                                                                                                |
| --------------------- | ------------------------------------------------------------------------------------------------- | ----------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Static**            | Extracts a value from a known field path in the event data                                        | Free                    | Structured data: durations, counts, status codes, confidence scores                                                     |
| **AI classification** | Classifies event content into predefined categories                                               | Free (platform-managed) | Sentiment analysis, risk categorization, topic classification                                                           |
| **AI extraction**     | Extracts structured fields from unstructured text                                                 | Free (platform-managed) | Pulling specific data points from transcripts or notes                                                                  |
| **AI sentiment**      | Analyzes emotional tone of event content                                                          | Free (platform-managed) | Caller satisfaction, conversation tone tracking                                                                         |
| **AI query**          | Runs a custom prompt against event content                                                        | Varies by model tier    | Complex evaluation: clinical accuracy, protocol adherence, custom quality criteria                                      |
| **Judge**             | Runs a custom prompt with entity context variables substituted from the patient's projected state | Varies by model tier    | Evaluations that need patient-specific context: "Was the agent's advice consistent with the patient's {$.medications}?" |
| **Ratio**             | Computes the ratio between two event type counts                                                  | Free                    | Conversion rates, success rates, completion rates                                                                       |

Platform-managed AI modes (classification, extraction, sentiment) use the platform's built-in models at no additional cost. Custom AI queries use the model tier you select.
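
To make the judge mode concrete, here is an illustrative version of the context substitution it implies: placeholders like `{$.medications}` resolve against the patient's projected state before the prompt reaches the model. The platform does this server-side; the helper below is only a sketch of the idea.

```python
import re

def render_judge_prompt(template: str, entity_state: dict) -> str:
    """Replace {$.field} placeholders with values from the entity's projected state."""
    def resolve(match: re.Match) -> str:
        value = entity_state.get(match.group(1), "")
        return ", ".join(value) if isinstance(value, list) else str(value)
    return re.sub(r"\{\$\.(\w+)\}", resolve, template)

template = "Was the agent's advice consistent with the patient's {$.medications}?"
state = {"medications": ["metformin", "lisinopril"]}
print(render_judge_prompt(template, state))
# -> Was the agent's advice consistent with the patient's metformin, lisinopril?
```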

### Model Tiers

For AI query extraction, you choose a model tier that balances cost and reasoning depth:

| Tier         | Best For                                                                  |
| ------------ | ------------------------------------------------------------------------- |
| **Free**     | Platform-managed extraction modes (classification, extraction, sentiment) |
| **Fast**     | Simple classification, low-latency decisions                              |
| **Balanced** | Quality scoring, moderate reasoning                                       |
| **Max**      | Complex multi-step analysis, clinical accuracy evaluation                 |

The platform resolves model tiers to appropriate compute resources at execution time. You do not select specific models - just the level of reasoning depth your metric requires.

### Channel Scoping

Metrics can be scoped to specific channels without writing filters:

| Scope        | What It Includes                 |
| ------------ | -------------------------------- |
| **All**      | Every event across all channels  |
| **Voice**    | Voice call events only           |
| **Text**     | SMS and text session events only |
| **Surface**  | Form submission events only      |
| **Inbound**  | Inbound interactions only        |
| **Outbound** | Outbound interactions only       |

Channel scoping lets you define the same metric type (e.g., sentiment) separately for voice and text channels to track performance independently.
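
Concretely, that is two metric definitions differing only in key and channel scope - a sketch using the same illustrative field names as the custom-metric example above:

```python
# Two definitions of the same metric type, scoped per channel (illustrative fields).
voice_sentiment = {
    "key": "sentiment_voice",
    "type": "categorical",
    "extraction_mode": "ai_sentiment",
    "channel_scope": "voice",
}
text_sentiment = {**voice_sentiment, "key": "sentiment_text", "channel_scope": "text"}
```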

## Latency Tiers

Each metric can be assigned to a latency tier that determines how quickly new values are available after the source event occurs.

| Tier              | Latency        | Best For                                                     |
| ----------------- | -------------- | ------------------------------------------------------------ |
| **Streaming**     | Seconds        | Safety alerts, real-time dashboards, escalation triggers     |
| **Near-realtime** | Minutes        | Operational dashboards, shift-level reporting                |
| **Batch**         | Hourly / daily | Trend analysis, cost-sensitive metrics, historical reporting |

Safety-critical metrics (escalation rate, risk level) typically run in the streaming tier. Nuanced quality metrics that require AI evaluation run in the batch tier where they have more compute time. Volume-based metrics (event counts, contact rates) can run in any tier depending on how quickly you need the data.

### Aggregation Granularity

Metrics aggregate at either hourly or daily granularity:

* **Hourly** - One data point per workspace per metric per hour. Useful for shift-level monitoring and intraday dashboards.
* **Daily** - One data point per workspace per metric per day. Cheaper to compute and store. Sufficient for trend analysis and reporting.

Voice intelligence metrics default to hourly granularity. Data quality metrics default to daily.

### Freshness SLAs

Each metric has a configurable freshness target - the maximum acceptable time between the source event occurring and the computed metric value being available in the query API. The platform tracks freshness per metric and exposes it through a dedicated freshness endpoint.

The default freshness target is 60 minutes. You can tighten it to 5 minutes for critical metrics or relax it to 24 hours for batch reporting metrics.
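
Latency tier, aggregation granularity, and freshness target are all per-metric configuration. A sketch with the same illustrative field names as above, tightening a safety-critical metric:

```python
# Illustrative per-metric delivery configuration.
escalation_rate = {
    "key": "escalation_rate",
    "latency_tier": "streaming",     # safety-critical: values within seconds
    "granularity": "hourly",
    "freshness_target_minutes": 5,   # tightened from the 60-minute default
}
```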

## Aggregation Functions

Metrics support standard aggregation functions:

| Function           | What It Computes                                    |
| ------------------ | --------------------------------------------------- |
| **Count**          | Number of events matching the metric's filters      |
| **Sum**            | Total of extracted numerical values                 |
| **Average**        | Mean of extracted numerical values                  |
| **Min / Max**      | Extremes of extracted numerical values              |
| **Count distinct** | Number of unique values                             |
| **Ratio**          | Count of one event type divided by count of another |
| **Rate**           | Proportion of events meeting a condition            |
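
Ratio and rate are easy to conflate: a ratio divides the counts of two event types, while a rate is the proportion of a single event stream meeting a condition. A small illustration:

```python
# Ratio: count of one event type over count of another.
surfaces_created, surfaces_submitted = 200, 140
completion_ratio = surfaces_submitted / surfaces_created  # 0.7

# Rate: proportion of events in one stream meeting a condition.
calls = [{"escalated": False}, {"escalated": True}, {"escalated": False}, {"escalated": False}]
escalation_rate = sum(c["escalated"] for c in calls) / len(calls)  # 0.25
```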

## Data Quality

The metric pipeline enforces data quality at several levels:

* **Valid ranges** - Numerical metrics can define minimum and maximum bounds. Values outside the range are dropped before aggregation.
* **Category constraints** - Categorical metrics can define allowed values. Unrecognized categories are excluded.
* **Confidence weighting** - Source event confidence scores are available for metrics that need confidence-aware aggregation.
* **Production filtering** - Simulation, playground, and test data are excluded from production metrics automatically.
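
A compact sketch of the first two checks as they would apply before aggregation, with illustrative bounds and categories:

```python
def clean_numerical(values, lo=0.0, hi=100.0):
    """Drop values outside the metric's configured valid range."""
    return [v for v in values if lo <= v <= hi]

def clean_categorical(values, allowed=("positive", "neutral", "negative")):
    """Exclude categories the metric does not declare."""
    return [v for v in values if v in allowed]

print(clean_numerical([12.0, 87.5, 140.0, -3.0]))  # [12.0, 87.5]
print(clean_categorical(["positive", "unknown"]))  # ['positive']
```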

## Query API

Seven endpoints serve computed metrics:

| Endpoint             | What It Returns                                                                                                                                                                                                              |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **List metrics**     | Latest value for each active metric in the workspace. Filterable by entity type and entity ID.                                                                                                                               |
| **Get metric**       | Values for a specific metric with optional date range and entity type filtering                                                                                                                                              |
| **Metric trend**     | Time-series data for a specific metric (configurable lookback, up to 365 days) with entity type scoping                                                                                                                      |
| **Freshness status** | Per-metric freshness: when each metric was last computed, latest period covered, and data point count                                                                                                                        |
| **Metric catalog**   | All available metrics (built-in and custom) with their configuration: type, source, extraction mode, channel scope, unit, and granularity                                                                                    |
| **Evaluate metric**  | Run a metric definition against a specific call or simulation session without persisting results. Returns the computed value inline. Useful for testing metric definitions before deployment and previewing per-call scores. |
| **Call metrics**     | Fetch the latest metric values for a specific call, providing a per-interaction metric snapshot for call detail views                                                                                                        |

All endpoints support entity type scoping in addition to entity ID filtering, so metrics can be queried by the type of entity they apply to (patient, appointment, call) as well as a specific entity instance. Endpoints are workspace-scoped and respect Platform API authentication and permissions.
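
A hedged sketch of pulling a metric trend with entity type scoping. The path and parameter names are modeled on the table above and are assumptions; see the developer guide for the authoritative contract.

```python
import requests

# Assumed trend endpoint - illustrative path and parameters.
resp = requests.get(
    "https://api.example.com/v1/workspaces/{workspace_id}/metrics/quality_score/trend",
    params={"lookback_days": 30, "source": "production", "entity_type": "call"},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
for point in resp.json().get("points", []):
    print(point.get("period"), point.get("value"))
```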

## Post-Session Evaluation

In addition to operational metrics, the platform supports **conversation-scoped LLM evaluations**. After each conversation, an evaluation judge scores the interaction against your defined criteria. These post-session evaluations answer: "How well did the agent handle this specific conversation?"

* Defined per organization with versioned evaluation criteria
* Three scoring types: boolean (pass/fail), numerical (bounded range), categorical (predefined labels)
* Each evaluation produces a value, justification, and transcript references
* Evaluated post-session, during simulations, or on-demand via manual evaluation
* Used for quality gates, simulation scoring, and human calibration workflows

Post-session evaluation metrics are the right tool for **conversation quality evaluation** - scoring individual interactions against rubrics. See [Metrics and Quality](/testing/testing/metrics.md) for how evaluation metrics work with simulations and deployment gates.

The metric store can incorporate post-session evaluation results as a source, so evaluation scores feed into aggregate quality trends.

{% hint style="info" %}
**Developer Guide** - For metric settings API endpoints, custom metric creation, on-demand evaluation, and query parameters, see the [Metric Store developer guide](https://docs.amigo.ai/developer-guide/platform-api/metric-store).
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.amigo.ai/intelligence-and-analytics/metric-store.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
