# Metrics and Quality

Metrics measure the quality of agent conversations. You define what dimensions matter to your organization, and the platform evaluates every conversation against those dimensions. Metrics provide the scoring layer that tells you whether your agent is meeting its goals.

## What Metrics Measure

Metrics evaluate agent performance across the dimensions that determine success in your domain. Common categories include:

**Conversation quality**: Was the response clear? Did the agent understand the user's intent? Was the information complete and accurate?

**Safety adherence**: Did the agent stay within its defined scope? Did it escalate when appropriate? Were there any safety boundary violations?

**Goal completion**: Did the agent accomplish what the user needed? Was the appointment scheduled, the question answered, the referral made?

**Domain-specific dimensions**: In healthcare, this includes clinical accuracy, protocol adherence, empathy, and risk disclosure completeness. Each organization defines the dimensions that reflect its standards.

## Three Evaluation Sources

Metrics can be generated from three different sources, each serving a different purpose.

### Post-Session Evaluation

After every production conversation, the platform automatically evaluates the interaction against your configured metrics. This happens without any manual intervention.

Post-session evaluation provides continuous quality monitoring across all production traffic. It is the primary source for tracking performance trends and detecting drift.

{% hint style="info" %}
Post-session metrics are generated by evaluation judges that assess the conversation transcript against your defined criteria. These judges receive more computational resources than the agent itself, allowing them to reason carefully about whether responses met your standards.
{% endhint %}
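
The platform's judge implementation is internal, but the pattern it describes is the familiar LLM-as-judge loop: give an evaluator model the transcript and the metric's rubric, and ask for a structured score. The sketch below is a minimal illustration of that pattern, not the platform's actual judge; `complete` stands in for whatever LLM completion call you have available, and the `metric` dict shape is hypothetical.

```python
import json

def judge_conversation(transcript: str, metric: dict, complete) -> dict:
    """Score one transcript against one metric's rubric.

    `metric` is a hypothetical dict with "name" and "criteria" keys;
    `complete` is a stand-in for an LLM completion function.
    """
    prompt = (
        "You are a quality evaluator. Score the conversation below "
        "against this rubric.\n\n"
        f"Metric: {metric['name']}\n"
        f"Criteria: {metric['criteria']}\n\n"
        f"Conversation:\n{transcript}\n\n"
        'Reply with JSON only: {"score": <0-100>, "rationale": "<one sentence>"}'
    )
    # A production judge would validate the response and retry on bad JSON.
    return json.loads(complete(prompt))
```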

### Simulation Evaluation

When you run simulations, the platform scores each simulated conversation against your metrics. This gives you metric data for synthetic interactions before they reach real users.

Simulation evaluation serves two purposes:

* **Pre-deployment validation**: Confirm that metric scores meet your thresholds before promoting a new configuration.
* **Comparative analysis**: Run the same test set against two configurations and compare metric scores to understand the impact of a change (see the sketch after this list).
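
For comparative analysis, the essential operation is lining up per-metric scores from two runs of the same test set. A minimal sketch, assuming each run's results arrive as a mapping from metric name to a list of per-conversation scores (a hypothetical shape, not the platform's export format):

```python
from statistics import mean

def compare_configs(results_a: dict, results_b: dict) -> None:
    """Print the change in mean score per metric between two runs.

    Each argument maps metric name -> list of per-conversation scores,
    e.g. {"Empathy score": [82, 75, 90], ...}.
    """
    for metric in sorted(results_a):
        a, b = mean(results_a[metric]), mean(results_b[metric])
        print(f"{metric}: {a:.1f} -> {b:.1f} ({b - a:+.1f})")
```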

### Manual Evaluation

Human reviewers can score conversations against your metrics through manual review workflows. This is useful for:

* **Calibrating automated metrics**: Compare human scores against automated scores to verify that your metrics capture what they should (see the sketch after this list).
* **High-stakes review**: Flag conversations that need human judgment, such as those involving clinical decisions or safety escalations.
* **Discovering new dimensions**: Human reviewers sometimes notice quality issues that existing metrics do not capture, which can inform new metric definitions.
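
Calibration comes down to comparing paired scores for the same conversations. A minimal sketch, assuming you have collected (human, automated) score pairs on a 0-100 scale; the pairing itself is whatever your review workflow produces:

```python
from statistics import mean

def calibration_report(pairs: list[tuple[int, int]], tolerance: int = 10) -> None:
    """Summarize agreement between human and automated scores.

    `pairs` holds (human_score, automated_score) tuples for the same
    conversations. Large or growing divergence suggests the evaluation
    criteria need updating.
    """
    diffs = [abs(human - auto) for human, auto in pairs]
    within = sum(d <= tolerance for d in diffs) / len(diffs)
    print(f"Mean absolute difference: {mean(diffs):.1f} points")
    print(f"Agreement within {tolerance} points: {within:.0%}")
```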

## Multi-Objective Scoring

Healthcare and other high-stakes domains require that the agent performs well across multiple dimensions at the same time. An agent that is clinically accurate but lacks empathy fails to deliver a good outcome. An agent that is empathetic but misses a safety escalation is dangerous.

Multi-objective scoring evaluates conversations against all configured metrics simultaneously. A conversation passes quality gates only when it meets thresholds across every required dimension.

**Example: Post-discharge follow-up quality gates**

| Metric                     | Threshold | Type                  |
| -------------------------- | --------- | --------------------- |
| Clinical accuracy          | 99%       | Hard gate (must pass) |
| Safety escalation accuracy | 100%      | Hard gate (must pass) |
| Protocol adherence         | 95%       | Hard gate (must pass) |
| Empathy score              | 80%       | Soft target           |
| Response completeness      | 90%       | Soft target           |

Hard gates are non-negotiable. A conversation that fails any hard gate is flagged regardless of how well it scores on other dimensions. Soft targets are goals that inform improvement priorities but do not block deployment on their own.
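
The gating rule itself is simple to state precisely. The sketch below illustrates the hard-gate/soft-target logic using the example table above; the dictionary shapes are hypothetical, not the platform's configuration format:

```python
def apply_quality_gates(scores: dict, gates: dict) -> dict:
    """Evaluate one conversation's metric scores against quality gates.

    `scores` maps metric name -> score (0-100); `gates` maps metric
    name -> (threshold, is_hard_gate).
    """
    failed_hard = [name for name, (threshold, hard) in gates.items()
                   if hard and scores.get(name, 0) < threshold]
    missed_soft = [name for name, (threshold, hard) in gates.items()
                   if not hard and scores.get(name, 0) < threshold]
    return {
        "passes": not failed_hard,           # any hard-gate failure flags the conversation
        "failed_hard_gates": failed_hard,
        "missed_soft_targets": missed_soft,  # inform priorities, don't block
    }

gates = {
    "Clinical accuracy": (99, True),
    "Safety escalation accuracy": (100, True),
    "Protocol adherence": (95, True),
    "Empathy score": (80, False),
    "Response completeness": (90, False),
}
scores = {"Clinical accuracy": 100, "Safety escalation accuracy": 100,
          "Protocol adherence": 96, "Empathy score": 74,
          "Response completeness": 92}
print(apply_quality_gates(scores, gates))
# {'passes': True, 'failed_hard_gates': [], 'missed_soft_targets': ['Empathy score']}
```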

## Configuring Metrics

Each metric definition includes the following (a configuration sketch follows the list):

* **Name and description**: What the metric measures.
* **Evaluation criteria**: The specific rubric or checklist the evaluator uses to score the conversation.
* **Scoring method**: Pass/fail for binary requirements, or a numeric scale (typically 0-100) for graded dimensions.
* **Threshold**: The minimum acceptable score for this metric.
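
Putting those fields together, a definition might look like the following sketch. The field names are illustrative only; see the Metrics API reference linked at the bottom of this page for the actual schema:

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    # Illustrative fields, not the API's exact schema.
    name: str                 # what the metric measures
    description: str
    evaluation_criteria: str  # rubric or checklist the evaluator follows
    scoring_method: str       # "pass_fail" or "scale_0_100"
    threshold: float          # minimum acceptable score

empathy = MetricDefinition(
    name="Empathy score",
    description="Whether the agent acknowledged the user's emotional state.",
    evaluation_criteria="Check for acknowledgement, validation, and appropriate tone.",
    scoring_method="scale_0_100",
    threshold=80,
)
```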

{% hint style="warning" %}
Start with a small set of high-impact metrics rather than trying to measure everything. A focused set of 5-10 well-calibrated metrics provides more actionable insight than 50 loosely defined ones.
{% endhint %}

## Evaluation Metrics vs Operational Metrics

Evaluation metrics (this page) score individual conversations against rubrics: "How well did the agent handle this interaction?" They are the quality layer for testing, simulation, and deployment gates.

The [Metric Store](https://docs.amigo.ai/intelligence-and-analytics/metric-store) serves a different purpose: workspace-level operational analytics across all channels. It computes aggregate metrics (call volume, escalation rates, surface completion, data quality) from a config-driven pipeline with per-metric latency tiers and AI-powered extraction. Evaluation metric results can feed into the metric store as a source, so individual conversation scores aggregate into quality trends over time.

For operational dashboards, alerting, and cross-channel analytics, see [Metric Store](https://docs.amigo.ai/intelligence-and-analytics/metric-store).

## Using Metrics Strategically

**Separate safety metrics from quality metrics.** Safety metrics (escalation accuracy, scope adherence, privacy compliance) should have 100% targets and serve as hard deployment gates. Quality metrics (empathy, clarity, completeness) should have realistic targets that guide improvement.

**Review metric trends, not just individual scores.** A single low empathy score might be an outlier. A downward trend over two weeks signals a real issue.
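
As a trivial illustration of the difference, a rolling average over daily mean scores smooths out single-conversation outliers so that a sustained decline stands out (the input shape is hypothetical):

```python
from statistics import mean

def rolling_trend(daily_means: list[float], window: int = 7) -> list[float]:
    """Smooth a chronological series of daily mean scores.

    A dip in one day's mean may be noise; a multi-week decline in the
    smoothed series signals a real issue.
    """
    return [mean(daily_means[max(0, i - window + 1): i + 1])
            for i in range(len(daily_means))]
```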

**Calibrate regularly.** Have human reviewers score a sample of conversations and compare against automated scores. If automated and human scores diverge, update your evaluation criteria.

{% hint style="info" %}
**For Developers**: See the [Metrics API reference](https://docs.amigo.ai/developer-guide/core-api/metrics) for creating and managing metrics programmatically.
{% endhint %}
