# Call Intelligence and Analytics

Every interaction - voice and text - produces a structured analytical breakdown covering emotion, risk, latency, safety, and outcome quality, all computed automatically. The result is a dataset that covers your entire operation, not just the calls someone happened to listen to.

<figure><img src="/files/BNp6hKpCVOaC8CtKzz2c" alt="Call intelligence pipeline: real-time profiles and decision trace, post-call quality scoring and trace analysis, debugging with execution and prompt traces"><figcaption></figcaption></figure>

## Three Layers of Quality Analysis

```mermaid
flowchart LR
    I[Live Interaction] --> L1[Layer 1: Real-Time\n7 structured profiles]
    I --> L2[Layer 2: Post-Interaction\n5 quality dimensions]
    I --> L3[Layer 3: Voice Judge\n10 audio-native dimensions]
    L1 --> QS[Composite Quality\nScore 0-100]
    L2 --> QS
    L3 --> QS
    QS --> D[Dashboards +\nTrend Analysis]
    QS --> A[Alerts +\nThresholds]
```

Analysis happens in three passes. The first runs during the interaction. The second and third run after it ends.

### Layer 1: Real-Time Intelligence

Seven structured profiles are computed while the interaction is still in progress.

| Profile                   | What It Captures                                                                                                        |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Emotion**               | Dominant emotion, valence and arousal averages, peak negative moment, emotional shifts over time, final emotional trend |
| **Risk**                  | Composite risk score with contributing signals identified                                                               |
| **Latency**               | Response time averages and percentiles, time-to-first-response, silence ratio                                           |
| **Conversation dynamics** | Turn count, states visited, loop count, interruption count, completion reason                                           |
| **Tool performance**      | Success and failure counts per tool invoked during the interaction                                                      |
| **Safety**                | Rule matches and escalation triggers fired                                                                              |
| **Operator involvement**  | Whether a human connected, time to connect, and resolution outcome                                                      |

These profiles are available before the interaction ends. Monitoring dashboards and alerting rules can act on them in real time.
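
As a concrete illustration, here is a minimal TypeScript sketch of how two of these profiles might look as typed payloads. The field names are illustrative assumptions, not the platform's actual schema:

```typescript
// Hypothetical shapes for two of the seven real-time profiles.
// All field names here are assumptions for illustration only.
interface EmotionProfile {
  dominantEmotion: string;                    // e.g. "frustrated"
  valenceAvg: number;                         // -1.0 (negative) to 1.0 (positive)
  arousalAvg: number;                         // 0.0 (calm) to 1.0 (activated)
  peakNegativeMoment?: { atSeconds: number; emotion: string };
  shifts: Array<{ atSeconds: number; from: string; to: string }>;
  finalTrend: "improving" | "stable" | "deteriorating";
}

interface LatencyProfile {
  avgResponseMs: number;
  p95ResponseMs: number;
  timeToFirstResponseMs: number;
  silenceRatio: number;                       // fraction of the call spent in silence
}
```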

### Layer 2: Post-Interaction Quality Scoring

After the interaction ends, a second pass scores quality across five dimensions on a 1-5 scale.

| Dimension                | What It Measures                                                       |
| ------------------------ | ---------------------------------------------------------------------- |
| **Task completion**      | Did the agent accomplish what the caller needed?                       |
| **Information accuracy** | Were facts correct? Did the agent act on accurate data?                |
| **Conversation flow**    | Natural pacing, no awkward pauses or repetitions                       |
| **Error recovery**       | Did the agent recover gracefully from confusion or unexpected input?   |
| **Caller experience**    | Overall experience based on tone, engagement, and interaction patterns |

Each interaction also receives an **outcome classification**: succeeded, partially succeeded, failed, or abandoned.

## Composite Quality Score

The five dimension scores feed into a single composite quality score on a 0-100 scale. The score starts at 100 and deducts points for specific quality signals: high latency, excessive silence, interruptions, agent loops, escalations, and tool failures.

Interactions are tiered based on this score:

| Tier          | Meaning                                                        |
| ------------- | -------------------------------------------------------------- |
| **Excellent** | No significant quality issues detected                         |
| **Good**      | Minor issues that did not affect the outcome                   |
| **Fair**      | Noticeable issues that may have affected the caller experience |
| **Poor**      | Significant issues requiring review                            |

The composite score is the primary metric for tracking quality over time and comparing performance across agents, configurations, and time periods. It is designed for dashboard filtering and trend analysis - you can filter calls by tier to focus review time on the interactions that need it.
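
The exact penalty weights and tier boundaries are not published here, but the mechanics can be sketched. The following is a minimal illustration of the penalty-based model, with assumed weights and boundaries:

```typescript
// A minimal sketch of the penalty-based composite score and tiering.
// The penalty weights and tier boundaries are illustrative assumptions;
// the platform's actual values may differ.
interface QualitySignals {
  highLatencyEvents: number;
  excessiveSilenceEvents: number;
  interruptions: number;
  agentLoops: number;
  escalations: number;
  toolFailures: number;
}

type Tier = "excellent" | "good" | "fair" | "poor";

function compositeScore(s: QualitySignals): number {
  let score = 100;
  score -= s.highLatencyEvents * 5;      // assumed weight
  score -= s.excessiveSilenceEvents * 5; // assumed weight
  score -= s.interruptions * 3;          // assumed weight
  score -= s.agentLoops * 8;             // assumed weight
  score -= s.escalations * 10;           // assumed weight
  score -= s.toolFailures * 6;           // assumed weight
  return Math.max(0, score);
}

function tierFor(score: number): Tier {
  if (score >= 90) return "excellent";   // assumed boundaries
  if (score >= 75) return "good";
  if (score >= 50) return "fair";
  return "poor";
}
```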

## Key Moment Extraction

The system automatically identifies notable events - moments of elevated risk, emotional shifts, escalation triggers, tool failures - and tags them with timestamps.

Reviewers jump directly to what matters instead of listening to entire recordings. When a call scores poorly, the key moments tell you exactly where things went wrong.

## Transcription Accuracy Feedback

Quality analysis feeds corrections back into transcription. When the scoring pass identifies likely transcription errors - a medical term misheard, a name consistently misspelled - it updates the speech recognition configuration.

Transcription accuracy improves over time for your specific vocabulary: medical terminology, provider names, local street names, insurance plan names. No manual tuning required.

## Quality Trends

Individual call scores are useful for reviewing specific interactions. Trends across thousands of calls are where you get operational visibility.

Analytics show quality score distribution, escalation rates, and per-component breakdowns over configurable date ranges. Period-over-period comparison lets you measure whether a configuration change actually improved quality or made things worse.

This is the feedback loop that drives continuous improvement. You make a change to the agent configuration. You wait for enough calls to accumulate. You compare the quality distribution before and after. The data tells you whether the change helped, hurt, or had no measurable effect. Without this, configuration changes are guesswork.

## Analytics

Beyond per-call intelligence, the platform provides workspace-level analytics covering call quality, data quality, pipeline health, and entity resolution. These metrics give operations teams visibility into how data flows through the system and where attention is needed.

All analytics support date range filtering, time bucketing (hourly, daily, weekly), and optional service-level filtering. Results power the developer console dashboards and are available to any user with read access to the workspace.

### Call Quality Trends

These views aggregate call intelligence data across all completed calls in a workspace.

| View                     | What It Shows                                                                                        |
| ------------------------ | ---------------------------------------------------------------------------------------------------- |
| **Call quality**         | Quality score trends (avg, p50, p95), distribution by tier, escalation rate, call volume             |
| **Emotion trends**       | Dominant emotion distribution across calls, valence/arousal trends over time, per-emotion frequency  |
| **Safety trends**        | Escalation frequency over time, risk level distribution, safety rule match counts                    |
| **Latency**              | p50/p95/p99 latency by component (engine response, audio time-to-first-byte, navigation, render)     |
| **Tool performance**     | Per-tool success/failure rates, failure trends, invocation counts and average duration               |
| **Operator performance** | Escalation rate trends, quality comparison (escalated vs non-escalated calls), operator connect time |

Advanced analytics support percentile breakdowns (p50/p95/p99) for duration and quality scores, time series trends with p95 latency, and breakdowns by service and call direction (inbound/outbound).

Period-over-period comparison lets you pick any two date ranges and see absolute and percentage change for each KPI. When you update an agent's context graph, change a prompt, or modify an escalation rule, you can compare the week before and after to see exactly what changed. This is the simplest way to answer "did that change help?" with data instead of intuition.
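
A hypothetical sketch of that workflow against the analytics API - the endpoint path and response fields (such as `avgQualityScore`) are assumptions, not the documented contract:

```typescript
// Compare average quality for the week before and after a config change.
// Endpoint path and response fields are hypothetical.
async function compareQuality(apiBase: string, token: string): Promise<void> {
  const query = (from: string, to: string) =>
    fetch(`${apiBase}/analytics/call-quality?from=${from}&to=${to}`, {
      headers: { Authorization: `Bearer ${token}` },
    }).then((r) => r.json());

  // Week before and week after a configuration change on 2024-06-10.
  const before = await query("2024-06-03", "2024-06-09");
  const after = await query("2024-06-10", "2024-06-16");

  const delta = after.avgQualityScore - before.avgQualityScore;
  const pct = (delta / before.avgQualityScore) * 100;
  console.log(`Avg quality change: ${delta >= 0 ? "+" : ""}${pct.toFixed(1)}%`);
}
```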

### Call Detail and Historical Access

Call detail is available for the full retention period of your workspace. When a call is no longer held in the real-time voice layer, the platform reconstructs its detail from durable storage - combining persisted call state with post-call intelligence records. This means dashboards, analytics queries, and API consumers can access call metadata, quality scores, conversation summaries, and escalation data long after the call completes, without gaps caused by real-time layer turnover.

## Call Intelligence

Call intelligence events are emitted for both production calls and simulation sessions. Simulation events carry a source tag that distinguishes them from production data and include the simulation run identifier, so metrics can be grouped by run. The metric evaluation pipeline deduplicates events per session and run before scoring, so retried or replayed completions do not inflate metric counts.

Simulation-originated intelligence is persisted to the platform's analytical data store alongside production data, so metric evaluation, dashboards, and quality scoring work identically whether the conversation was a live call or a simulated test run. Simulation results therefore feed directly into the same metric pipelines used for production quality monitoring.

### Profiles

Each voice call also produces a structured intelligence summary computed from session state at call end. These summaries capture operational telemetry that async quality scoring (which runs on recordings) cannot see: real-time emotion trajectories, engine response latency, tool invocation counts, and safety rule matches. The composite quality score (0-100) from this summary uses the same penalty-based model described above and is the primary metric for dashboard filtering.

### Call Statistics

For voice deployments, call analytics track volume, duration, and daily patterns. These metrics help operations teams identify capacity trends (peak calling hours, seasonal volume changes) and spot anomalies (sudden drops in call volume that might indicate a routing issue).

* **Call volume** - Total calls over configurable windows (30-90 days), broken down by service and direction (inbound/outbound)
* **Duration distribution** - How long calls last, useful for identifying calls that are too short (abandoned) or too long (stuck in loops)
* **Daily breakdown** - Per-day call counts for trend analysis
* **Service breakdown** - Volume by agent service, so you can compare usage across scheduling, care coordination, and other workflows

### Data Quality Dashboard

The data quality dashboard tracks confidence distribution across all events in the workspace. Every piece of data that enters the world model carries a confidence score, and the dashboard shows how that distribution looks across your entire dataset:

| Bucket             | Confidence Range | What It Means                                                  |
| ------------------ | ---------------- | -------------------------------------------------------------- |
| **Rejected**       | 0.0              | Events that failed review or were explicitly contradicted      |
| **Raw**            | 0.1-0.3          | Unverified data from agent inference or initial extraction     |
| **Uncertain**      | 0.4-0.5          | Voice-extracted data awaiting review                           |
| **Verified**       | 0.6-0.7          | Data that passed automated review                              |
| **Human-approved** | 0.8-0.95         | Data approved by a human reviewer                              |
| **Authoritative**  | 1.0              | Data from authoritative system integrations (direct EHR feeds) |
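
Bucketing follows directly from the ranges in the table. A small sketch:

```typescript
// Map a confidence score to its dashboard bucket.
// Boundaries follow the table above.
type ConfidenceBucket =
  | "rejected"
  | "raw"
  | "uncertain"
  | "verified"
  | "human-approved"
  | "authoritative";

function bucketFor(confidence: number): ConfidenceBucket {
  if (confidence === 0) return "rejected";
  if (confidence <= 0.3) return "raw";
  if (confidence <= 0.5) return "uncertain";
  if (confidence <= 0.7) return "verified";
  if (confidence < 1.0) return "human-approved";
  return "authoritative";
}
```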

The dashboard also shows confidence breakdown by data source, so you can see which sources produce the most reliable data and which generate the most review queue items. A daily confidence timeseries shows low-confidence and high-confidence event counts over time, making it easy to spot trends after configuration changes.

Review pipeline metrics track how the automated and human review stages are performing:

* **Auto-approved** - Events that passed automated review without human involvement
* **Auto-verified** - Events verified by the automated review judge
* **Rejected** - Events that failed review
* **Pending review** - Events waiting in the human review queue
* **Human-approved** - Events approved by an operator
* **Corrected** - Events where an operator provided corrected data
* **Review rate** - Percentage of events that required any form of review

### Pipeline Health

Pipeline health metrics provide real-time visibility into the [connector runner's](/data/connectors-and-ehr.md) operational state. These are the metrics you check when something feels wrong - data is stale, surfaces are not getting filled, or outbound sync is backed up.

* **Overall status** - Healthy, degraded, or starting - with active poll count and total event/entity counts
* **Per-source connection health** - Whether each data source is reachable and polling successfully, with last poll time, duration, and event counts
* **Loop states** - Current state of each background process (entity resolution, review, outbound sync, reconciliation)
* **Outbound sync status** - Per-sink breakdown of synced, failed, and pending events
* **Throughput time series** - Event ingestion over time, bucketed by hour or day, filterable by source

Sources are automatically marked unhealthy after consecutive poll failures and recover when polling succeeds again. The dashboard degrades gracefully - if the connector runner is temporarily unavailable, stored metrics (event counts, sync history, review stats) remain available without live loop status.
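
The health transition can be sketched as a simple counter; the failure threshold of 3 here is an illustrative assumption, not the platform's actual value:

```typescript
// A source is marked unhealthy after N consecutive poll failures and
// recovers on the next successful poll. Threshold is an assumption.
const FAILURE_THRESHOLD = 3;

interface SourceHealth {
  consecutiveFailures: number;
  healthy: boolean;
}

function recordPoll(h: SourceHealth, succeeded: boolean): SourceHealth {
  if (succeeded) return { consecutiveFailures: 0, healthy: true };
  const failures = h.consecutiveFailures + 1;
  return { consecutiveFailures: failures, healthy: failures < FAILURE_THRESHOLD };
}
```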

Additional pipeline detail is available per source:

* **Entity resolution metrics** - Total merges, recent merge activity, and resolution loop status
* **Review pipeline metrics** - Queue depth, pending items by priority, approval/rejection counts, and average review time
* **Outbound sync detail** - Per-sink event counts, failure reasons, and retry status

### Command Center Dashboard

The command center provides a single-pane workspace health view that aggregates metrics from across the platform into four sections. This is the "is everything OK?" dashboard - one screen that tells an operations team whether voice calls are flowing, data pipelines are healthy, data quality is acceptable, and identity systems are functioning.

| Section          | Key Metrics                                                                                |
| ---------------- | ------------------------------------------------------------------------------------------ |
| **Voice**        | Active calls, escalated calls, calls today, average quality score, escalation rate         |
| **Pipeline**     | Source health counts (healthy/degraded/failing), events last hour, outbound pending/failed |
| **Data Quality** | Pending reviews, 7-day approval rate, average confidence, total entities, recent merges    |
| **Identity**     | Active API keys, active sessions, failed auth attempts, locked accounts, MFA coverage      |

Each section fails independently - if one data source is unavailable, the other sections still return results with a degraded indicator. The response includes a list of degraded sections so dashboards can show partial data with appropriate warnings.
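
A hypothetical sketch of consuming that response shape (the field names are assumptions):

```typescript
// Each section is optional because it may fail independently; the
// degraded list tells the UI which sections are showing partial data.
interface CommandCenterResponse {
  voice?: Record<string, number>;
  pipeline?: Record<string, number>;
  dataQuality?: Record<string, number>;
  identity?: Record<string, number>;
  degradedSections: string[];
}

function renderWarnings(res: CommandCenterResponse): string[] {
  return res.degradedSections.map(
    (s) => `Section "${s}" is showing partial or stale data`,
  );
}
```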

Alerts are derived from threshold checks on the aggregated metrics: escalation rate above threshold, failing data sources, low approval rate, high outbound failure count. This gives operations teams a single view to answer "is anything broken right now?" without drilling into individual dashboards.

## Entity Intelligence

The [world model](/data/world-model.md) stores entities (patients, providers, appointments, medications) and their relationships. Four capabilities provide visibility into this entity data:

| Capability              | What It Shows                                                                                       |
| ----------------------- | --------------------------------------------------------------------------------------------------- |
| **Relationship graph**  | One-level graph of all edges (same\_as, related\_to) from an entity, with connected entity metadata |
| **Data provenance**     | Full lineage for an entity - contributing data sources, confidence history, merge events            |
| **Duplicate detection** | Suspected duplicates sorted by confidence, filterable by entity type                                |
| **Entity search**       | Search entities by name with filters for type, source, and minimum confidence                       |

These tools let operations teams audit how the world model arrived at a particular entity state and catch duplicate records before they cause downstream issues.

### Entity Narrative Briefs

Entity narrative briefs are AI-generated summaries that synthesize a patient's world model data into a structured narrative. The platform reads the last 90 days of events for a patient, distills them into a Markdown narrative and a structured JSON representation, and persists the result as a world model event for fast retrieval.

Each brief includes:

* **Evidence pointers** - the event IDs that informed the narrative, so every claim in the brief is traceable back to source data
* **Confidence score** - a 0.0-1.0 score reflecting the reliability of the underlying data
* **Event count** - how many events went into the synthesis
* **Prompt version** - the version of the generation prompt used, for traceability as the summarization logic evolves
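
Taken together, a brief might deserialize into a shape like this sketch (field names are illustrative assumptions):

```typescript
// Hypothetical shape for an entity narrative brief, following the
// fields listed above. Names are assumptions for illustration.
interface EntityBrief {
  narrativeMarkdown: string;   // the Markdown narrative
  structured: unknown;         // the structured JSON representation
  evidenceEventIds: string[];  // event IDs that informed the narrative
  confidence: number;          // 0.0 - 1.0 reliability of underlying data
  eventCount: number;          // events that went into the synthesis
  promptVersion: string;       // generation prompt version, for traceability
}
```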

Briefs are cached after generation. Cache misses fall through to the event store transparently. When no brief has been generated yet, the API returns a consistent empty shape rather than an error, so UIs can render a "not yet generated" state without special error handling.

Briefs are scoped to patient entities today. Cohort, territory, and workspace-level briefs are planned for a future release.

### Relationship Graph

The relationship graph shows one level of connections from any entity. Each edge carries a relationship type (same\_as for duplicates, related\_to for associations like patient-to-provider) and the connected entity's metadata. This is useful for understanding how entities relate to each other and for verifying that entity resolution has correctly linked records.

### Data Provenance

Data provenance traces the full lineage of an entity: which data sources contributed, how confidence changed over time, and which merge events combined records. When a patient record has conflicting information (two different phone numbers from two different sources), provenance shows exactly where each value came from and why the current value was chosen.

### Duplicate Detection

Duplicate detection surfaces suspected duplicate entities - records that the entity resolution system has flagged as likely referring to the same real-world person or object. Results are sorted by confidence and filterable by entity type, so operations teams can prioritize high-confidence duplicates for review first.

### Entity Search

Entity search lets you find entities by name with filters for type, source, and minimum confidence. This is the starting point for most investigations - find the entity, then use the relationship graph and provenance tools to understand its state.

### Platform Health Monitoring

Two dedicated observability endpoints provide real-time visibility into data pipeline health without querying the analytics warehouse:

* **Connector health** - per-source event health for every data source writing into the workspace. Each source entry includes events ingested in the last hour and last 24 hours, mean events per minute, the last ingested timestamp, and a freshness category (`fresh`, `stale`, `quiet`, or `never`). Operators can spot a stalled connector at a glance.
* **Loop latency** - measures end-to-end time from data ingestion (a world model event landing) to agent action (the agent acting on that data). Returns an overall median latency, total pair count, and an hourly sparkline (up to 7 days) with per-hour pair counts and median latency. This is the metric that tells you whether the data-to-action loop is fast enough for real-time clinical workflows.
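
As a sketch, the response shapes might look like the following; fields beyond those documented above are assumptions:

```typescript
// Hypothetical response shapes for the two observability endpoints.
type Freshness = "fresh" | "stale" | "quiet" | "never";

interface ConnectorHealthEntry {
  source: string;
  eventsLastHour: number;
  eventsLast24h: number;
  meanEventsPerMinute: number;
  lastIngestedAt: string | null; // ISO timestamp, null if never ingested
  freshness: Freshness;
}

interface LoopLatency {
  overallMedianMs: number;       // ingestion-to-action median
  totalPairs: number;            // event/action pairs measured
  hourly: Array<{ hour: string; pairs: number; medianMs: number }>;
}
```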

These pair with the pipeline health section above. Pipeline health tells you whether connectors are running. Platform health monitoring tells you whether the data they produce is reaching agents in time.

## Embeddable Dashboards

Analytics dashboards can be embedded outside the Developer Console using a reusable dashboard component. Dashboards are workspace-scoped resources with a slug identifier, title, and description. The platform serves dashboard definitions (chart configurations, data queries) through the API, and the embeddable component renders them with interactive charting.

This supports two use cases:

* **Developer Console integration** - the console's built-in dashboard pages render the same embeddable component, so custom dashboards appear alongside the standard analytics views
* **External embedding** - the same component can be embedded in customer-facing portals or internal tools, authenticated through either API tokens or backend-for-frontend proxy authentication

Dashboards are managed through the platform API (create, list, get by slug). Each dashboard definition specifies its chart type, data source query, and layout. The rendering component handles data fetching, charting, and responsive layout independently.

## Surface Analytics

[Surfaces](/channels/surfaces.md) are agent-generated data collection forms delivered to patients via SMS, WhatsApp, email, or web. Four analytics views provide closed-loop intelligence on how surfaces perform, enabling agents and gap scanners to optimize surface design based on actual outcomes:

| Metric                    | What It Shows                                                                                                                                                           |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Completion rates**      | Overall completion rate, trend over time, and breakdown by source (mid-call agent, gap scanner, manual). Identifies whether surfaces are actually being filled out.     |
| **Channel effectiveness** | Per-channel (SMS, email, WhatsApp, web, voice) completion rate and average time-to-complete. Shows which delivery method works best for your patient population.        |
| **Field abandonment**     | Which specific fields cause patients to stop filling out a surface. Drop-off rate and save rate per field, so you can identify confusing or unnecessary fields.         |
| **Per-entity history**    | Surface history for a specific patient: completion stats, preferred channel, and recent surfaces. Useful for choosing the right channel and avoiding over-solicitation. |

All surface analytics support date range filtering (default 30 days, max 90). The field abandonment data is particularly actionable - if 40% of patients drop off at a specific question, that question is either confusing, unnecessary, or too sensitive for the delivery channel. Removing or rewording it directly improves completion rates.

## Event Distribution

Event analytics show how data enters the system and what kinds of data are flowing. This is useful for two things: verifying that integrations are working (confirming that a connector is producing expected event volumes) and understanding the data mix in the workspace.

Two views are available:

* **By type** - Event counts per entity type (patient, appointment, practitioner, insurance, medication, etc.). Shows what the system knows about and highlights gaps - if you expect appointment data but see zero appointment events, something is misconfigured.
* **By source** - Event counts per data source (EHR sync, voice extraction, manual entry, surface submission, etc.). Shows where data is coming from and how the mix changes over time. A healthy workspace typically has authoritative EHR data as the largest source, with voice-extracted and surface-submitted data filling gaps.

These distributions are useful during initial integration setup (to verify connectors are producing data) and ongoing operations (to spot when a source goes silent).

## Real-Time Event Stream

Two streaming endpoints deliver real-time events without polling.

**Workspace SSE stream** - Server-Sent Events covering call lifecycle (`call.started`, `call.ended`, `call.escalated`), surface lifecycle (created through review approval - 11 event types), pipeline status (`sync_completed`, `error`), and operator actions (registration, status changes, call join/leave, mode switches). Heartbeat comments keep the connection alive, and `Last-Event-ID` enables automatic reconnection with replay of missed events.

**Observer WebSocket** - Per-call stream for real-time call monitoring: user and agent transcripts, tool execution start/completion, call forwarding resolution, and speaker mute state.

Both streams use typed discriminated unions - every event type maps to a distinct payload schema in the OpenAPI specification. SDKs generated from the spec produce type-safe union types automatically, so consumers can switch on the event type and get compile-time guarantees on the payload structure.
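
A minimal sketch of a browser-side SSE consumer. The stream URL and payload fields such as `callId` are assumptions; `EventSource` resends `Last-Event-ID` automatically on reconnect, which is what enables replay:

```typescript
// Subscribe to the workspace SSE stream. URL is a hypothetical placeholder.
const stream = new EventSource("/v1/workspaces/ws_123/events");

// Helper: parse the JSON payload of a named SSE event.
function on(type: string, handler: (data: any) => void): void {
  stream.addEventListener(type, (e) =>
    handler(JSON.parse((e as MessageEvent).data)),
  );
}

on("call.started", (d) => console.log("call started", d.callId));
on("call.escalated", (d) => console.warn("escalation", d.callId));
on("call.ended", (d) => console.log("call ended", d.callId));
```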

## Decision Trace

Every conversation turn is classified with structured signal and effect types, creating an explainability layer over agent reasoning. Each turn carries a `signal_kind` (what triggered the agent's response - caller speech, tool result, state transition, barge-in) and an `effect_kind` (what the agent did - spoke, invoked a tool, navigated to a new state, escalated). This classification is computed at read time from existing turn data, requiring no additional storage.

Decision events are emitted at the end of each call for analytics queries, producing three derived metrics: turns per call, tool invocation rate, and state transition rate. These metrics feed into the metric store and can be tracked over time to understand how agent behavior changes across configuration updates.
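
Illustrative types for the classification, using the example variants named above (the actual enumerations may be broader):

```typescript
// Signal/effect variants are taken from the examples in the text.
type SignalKind = "caller_speech" | "tool_result" | "state_transition" | "barge_in";
type EffectKind = "spoke" | "tool_invocation" | "navigation" | "escalation";

interface DecisionTurn {
  signalKind: SignalKind; // what triggered the agent's response
  effectKind: EffectKind; // what the agent did
}

// The three derived per-call metrics named above.
function decisionMetrics(turns: DecisionTurn[]) {
  const rate = (k: EffectKind) =>
    turns.filter((t) => t.effectKind === k).length / Math.max(turns.length, 1);
  return {
    turnsPerCall: turns.length,
    toolInvocationRate: rate("tool_invocation"),
    stateTransitionRate: rate("navigation"),
  };
}
```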

## Agent Trace APIs

Two debugging endpoints provide full visibility into agent reasoning for any completed call:

* **Execution trace** - Per-turn log with the action taken, tools invoked, state transitions, detected emotions, inner reasoning thoughts, and response latency. Includes barge-in details showing which agent speech was interrupted and what utterances were discarded.
* **Prompt trace** - Full LLM prompts (system prompt, conversation history, tool definitions) for each turn. These are emitted per-turn as events, making them queryable for analysis across calls.

Equivalent endpoints exist for simulation sessions, so the same debugging tools work during testing.

## Call Trace Analysis

An audio-native analysis endpoint provides deep intelligence on completed calls. Unlike the real-time profiles (which analyze structured data during the call), trace analysis operates on the full audio recording after the call ends and returns:

* **Emotional arc** - How the caller's emotional state evolved throughout the conversation
* **Key decision moments** - Critical points with causal attribution explaining why the agent made each decision
* **Component attribution** - Root cause analysis identifying which subsystem caused each issue (speech-to-text, navigation model, emotion detector, state machine, tool executor, prompt logic, or turn taking), with structural evidence from the execution trace
* **Counterfactual analysis** - What would have happened if the agent had taken a different action
* **Coaching recommendations** - Specific suggestions for improving agent configuration
* **Interaction dynamics** - Patterns in conversational flow, turn-taking, and engagement

Component attribution turns "the agent gave a wrong answer" into "the navigation model failed to process the caller's response despite high transcription confidence." This makes it actionable - you know which component to tune instead of guessing.

Analysis is computed asynchronously after call completion. The API returns a pending status while processing is in progress.

## Call Playback Timeline

The call playback timeline provides a canonical, lane-based visualization of everything that happened during a call. Every event - speech, tool calls, escalations, fillers, silence - is organized into parallel lanes with actor attribution, creating a synchronized multi-track view of the interaction.

<figure><img src="/files/aNYdJUo5DlaXO5ZusB0k" alt="Call playback timeline: parallel lanes for agent, caller, tool, and system events with shared timebase and actor attribution"><figcaption></figcaption></figure>

### Lanes and Actors

Each timeline is divided into lanes - parallel tracks that group events by actor. The standard lanes are:

| Lane         | Actor Kind | What It Contains                                 |
| ------------ | ---------- | ------------------------------------------------ |
| **Agent**    | agent      | Agent speech segments, fillers, greetings        |
| **Caller**   | human      | Caller speech, barge-in events                   |
| **Operator** | operator   | Operator speech after escalation join            |
| **System**   | system     | State transitions, emotion shifts, safety events |
| **Tool**     | tool       | Tool invocations with start/end timing           |

Every segment carries an actor with a kind (agent, human, operator, system, tool), a role (agent, caller, operator, runtime, state, tool), and a display label. This attribution lets UIs render each event in the correct lane and identify who or what caused it.

### Segment Types

Segments are typed to distinguish different kinds of activity within each lane:

* **Speech**: `agent_speech`, `caller_speech`, `operator_speech` - audio segments with transcript text
* **Interaction**: `barge_in`, `filler_hesitation`, `filler_phrase`, `backchannel` - conversational micro-events
* **Execution**: `tool_call`, `state_transition`, `escalation` - system actions with metadata
* **Silence**: `silence`, `hold` - gaps in activity, distinguished by intent
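
A hypothetical sketch of a segment, combining the type and actor variants above; field names beyond those lists are assumptions:

```typescript
// Segment and actor variants follow the lists in this section.
type SegmentType =
  | "agent_speech" | "caller_speech" | "operator_speech"
  | "barge_in" | "filler_hesitation" | "filler_phrase" | "backchannel"
  | "tool_call" | "state_transition" | "escalation"
  | "silence" | "hold";

interface Actor {
  kind: "agent" | "human" | "operator" | "system" | "tool";
  role: "agent" | "caller" | "operator" | "runtime" | "state" | "tool";
  label: string; // display label for the lane
}

interface TimelineSegment {
  type: SegmentType;
  actor: Actor;
  startSeconds: number; // relative to the shared timebase
  endSeconds: number;
  transcript?: string;  // present for speech segments
}
```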

### Shared Timebase

All segments reference a shared timebase that anchors the timeline to a canonical reference point - either the media start, the call start, or a synthetic origin for multi-call timelines. Start and end times are expressed in seconds relative to this timebase, so all lanes are synchronized regardless of when each participant joined the call.

For calls that span multiple legs (transfers, conference joins), the timebase provides a unified coordinate system across the full interaction, and segments carry parent call references so the UI can show the complete timeline as a single continuous view.

The call timeline is accessible through the Platform API alongside call detail and call intelligence endpoints. See the [Audio Pipeline](/channels/voice/audio-pipeline.md) for how voice timing parameters (filler cadence, end-of-turn detection, barge-in thresholds) shape the raw timeline events.

## Voice Judge

The Voice Judge is an audio-native quality evaluator that scores every call across ten dimensions. Unlike the post-interaction quality scoring (which analyzes structured session data), the Voice Judge operates directly on the stereo call recording and produces per-dimension scores with structured evidence.

### Ten Evaluation Dimensions

| Dimension           | What It Measures                                                           |
| ------------------- | -------------------------------------------------------------------------- |
| **Pronunciation**   | Clarity and correctness of speech, including medical terminology           |
| **Pacing**          | Conversation tempo - too fast, too slow, or appropriately varied           |
| **Empathy**         | Emotional responsiveness and acknowledgment of caller feelings             |
| **Listening**       | Active listening signals - appropriate pauses, acknowledgments, follow-ups |
| **Clarity**         | How clearly the agent communicates information and instructions            |
| **Professionalism** | Tone, register, and adherence to professional communication standards      |
| **Task completion** | Whether the agent accomplished the caller's objective                      |
| **Error handling**  | Recovery from misunderstandings, corrections, and unexpected inputs        |
| **Safety**          | Adherence to safety protocols and appropriate escalation                   |
| **Overall**         | Holistic quality assessment across all dimensions                          |

Each dimension produces a score (0.0 to 1.0), a severity level (none, flag, warning, or critical), and an evidence string citing specific moments from the audio.
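
As a sketch, each per-dimension result might look like this, following the score/severity/evidence structure just described:

```typescript
// Illustrative shape for a per-dimension Voice Judge result.
interface DimensionResult {
  dimension: string;                                  // e.g. "empathy"
  score: number;                                      // 0.0 - 1.0
  severity: "none" | "flag" | "warning" | "critical";
  evidence: string; // citation of specific moments from the audio
}
```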

### How It Works

The Voice Judge runs on a scheduled cadence, processing calls that have completed since the last run. Short calls (under 5 seconds) are skipped to avoid scoring dropped connections and test pings. Each call's stereo audio is evaluated against a structured rubric, and results are persisted for retrieval.

Results are available through two paths:

* **Per-call detail** - a dedicated endpoint returns recent scores for a specific service, with the full evidence JSON for UI drill-down into individual calls
* **Aggregate metrics** - the ten dimensions are registered as built-in metrics in the [Metric Store](/intelligence-and-analytics/metric-store.md), so workspace-level trends flow through the same dashboard and alerting infrastructure as all other metrics

The per-call endpoint is designed for investigation ("what went wrong on this specific call?"). The aggregate metrics are designed for monitoring ("is pronunciation quality trending down this week?").

## Insights Agent

The Insights Agent is a conversational interface for exploring workspace data. Rather than navigating dashboards and filtering tables, operators ask questions in natural language and receive structured analysis with visualizations.

The agent streams responses in real time - reasoning steps, tool invocations, and data queries are visible as they execute, so operators can follow the analysis as it unfolds. When the agent queries the platform's analytics endpoints or metric store, the results appear inline as formatted tables and charts.

Typical queries:

* "How did call quality change after we updated the scheduling context graph last Tuesday?"
* "Which surface fields have the highest abandonment rates this month?"
* "Show me the calls that scored below 60 this week and what went wrong"
* "Compare escalation rates between the downtown and suburban locations"

The Insights Agent uses the same tool infrastructure as the voice and text agents - it calls the workspace's analytics and metric store endpoints to answer questions, so results always reflect live data rather than cached summaries. Visualizations use the full Plotly grammar - bar charts, line charts, heatmaps, scatter plots, annotations, and subplots - rendered inline alongside the analysis. The same charting format powers the platform's embeddable dashboards, so insights visualizations are consistent with dashboard views.

## Population Health Analytics

The platform provides population health analytics surfaces that visualize patient risk at both individual and district levels. These surfaces include:

* **Patient topology** - scatter plot visualization of patient populations with predicted risk scores, cluster assignments, and demographic attributes. Useful for identifying high-risk cohorts and emerging patterns.
* **District-level metrics** - per-district observed vs predicted disease incidence, primary care capacity gaps, and unmet demand scoring for resource allocation decisions.
* **Anomaly alerts** - emerging health anomalies with causal decomposition, recommended actions, and projections of what happens if no action is taken. Alerts are prioritized by severity and include supporting statistical context.
* **Forecast fans** - time-series forecasts with confidence intervals across multiple scenarios (baseline, with policy intervention, observational). Used for both observational monitoring and intervention planning.
* **Positive signals** - headline metrics highlighting positive trends across the population, providing a balanced view alongside alerts and anomalies.

All population health data is workspace-scoped and accessible through the Platform API.

## Production Health Monitoring

The Developer Console provides workspace-level and per-agent production health monitoring. The workspace home page displays a fleet overview grid where each agent card shows its current production status (Live, Degraded, or Paused), pass rate, escalation rate, and recent call volume. A workspace pulse strip above the grid summarizes active agents, call volume, active calls, average quality, escalation rate, and tool success rate with period-over-period deltas.

Each agent links to a dedicated production detail page with KPI cards, trend charts for call quality, latency, caller emotion distribution, and tool performance, and a filterable recent calls list. A period selector lets you view 24-hour, 7-day, or 30-day windows.

Production status is computed automatically:

* **Live** - The agent is deployed, receiving traffic, and metrics are within healthy thresholds.
* **Degraded** - The pass rate has dropped below 80%, the escalation rate exceeds twice the baseline for the equivalent prior period, or the agent received zero calls in the selected window.
* **Paused** - The agent deployment is inactive.
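
A minimal sketch of those rules; the 80% floor and 2x baseline come from the list above, while the input shape is an assumption:

```typescript
type Status = "live" | "degraded" | "paused";

interface AgentWindow {
  deployed: boolean;
  passRate: number;               // percent, 0-100
  escalationRate: number;         // selected window
  baselineEscalationRate: number; // equivalent prior period
  callCount: number;
}

function productionStatus(a: AgentWindow): Status {
  if (!a.deployed) return "paused";
  if (
    a.passRate < 80 ||
    a.escalationRate > 2 * a.baselineEscalationRate ||
    a.callCount === 0
  ) {
    return "degraded";
  }
  return "live";
}
```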

The workspace home page also includes a pipeline and data quality section showing ingest source health, event throughput, outbound sync queue status, entity resolution confidence, deduplication activity, and review backlog.

## Metric Store

For workspace-level operational metrics beyond the analytics described above - including built-in metrics, custom AI-powered metrics, per-metric latency tiers, and cross-channel analytics - see the [Metric Store](/intelligence-and-analytics/metric-store.md). The metric store provides a catalog of pre-built and custom metrics that can be evaluated against any interaction, supporting both real-time scoring and batch analysis across your full interaction history.

{% hint style="info" %}
**Developer Guide** - For the full analytics endpoint set and metric store details, see [Analytics](https://docs.amigo.ai/developer-guide/platform-api/analytics) in the developer guide.
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.amigo.ai/intelligence-and-analytics/intelligence.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
