# Drift Detection

Drift is the gradual divergence between how an agent performs and what is expected of it. An agent that performs well at launch can slowly fall out of step as the world changes around it: user populations shift, new types of questions emerge, or upstream data sources change. Drift detection catches these problems before they affect outcomes.

## What Causes Drift

Drift happens for several reasons, and the cause determines the right response.

**Input drift**: The types of conversations your agent handles change over time. A scheduling agent that was tested against routine appointment requests starts receiving complex multi-provider coordination calls. The agent was never tested against these scenarios, and its performance on them is unpredictable.

**Performance drift**: The agent's measured performance shifts on one or more metrics. Accuracy may improve while response times degrade. Safety adherence may hold steady while empathy scores decline. These shifts often follow configuration changes or upstream model updates.

**Requirement drift**: What counts as "good enough" changes. Clinical guidelines update. Patient expectations shift. Regulatory requirements evolve. The agent has not gotten worse, but the bar has moved.

{% @mermaid/diagram content="flowchart TD
prod\["Production Interactions"] --> metrics\["Metric Scores\n(per interaction)"]
metrics --> cohort\["Cohort Comparison\n(before vs after)"]
metrics --> blueprint\["Blueprint Comparison\n(expected vs actual)"]
cohort --> alert{Significant\ndegradation?}
blueprint --> alert
alert -->|Yes| action\["Alert + Rollback\nor Re-simulate"]
alert -->|No| prod" %}

## How the Platform Detects Drift

The platform uses two primary mechanisms to detect drift.

### Metric Tracking Across Cohorts

The platform continuously evaluates production conversations against your configured metrics. It groups conversations into time-based cohorts (daily, weekly, or custom windows) and tracks how metric scores change across cohorts.
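
For intuition, here is a minimal sketch of cohort bucketing in Python, assuming each evaluated interaction is a dict with a timezone-aware `timestamp` and a `scores` map from metric name to value (the record shape and function are illustrative, not the platform's API):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

def cohort_means(interactions, window_days=7):
    """Bucket interactions into fixed time windows and average each metric."""
    cohorts = defaultdict(lambda: defaultdict(list))
    epoch = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary bucketing anchor
    for item in interactions:
        # Integer index of the window this interaction falls into.
        bucket = (item["timestamp"] - epoch) // timedelta(days=window_days)
        for metric, score in item["scores"].items():
            cohorts[bucket][metric].append(score)
    return {
        bucket: {metric: mean(scores) for metric, scores in by_metric.items()}
        for bucket, by_metric in sorted(cohorts.items())
    }
```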

When a metric trend crosses a threshold, the platform generates an alert. For example:

* Empathy score drops from 88% to 82% over two weeks
* Safety escalation accuracy falls below 99.5% in the current week
* Response completeness degrades for a specific user segment

Cohort-level tracking distinguishes real trends from noise. A single bad conversation does not trigger an alert. A sustained downward trend does.
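
One common way to encode that trend-versus-noise distinction is to alert only when several consecutive cohorts fall below baseline. A sketch, with illustrative thresholds:

```python
def drift_alert(cohort_scores, baseline, tolerance=0.03, sustain=3):
    """Flag drift only when the last `sustain` cohorts all sit below
    baseline by more than `tolerance`; a single bad window is ignored."""
    recent = cohort_scores[-sustain:]
    if len(recent) < sustain:
        return False  # not enough history to call it a trend
    return all(score < baseline - tolerance for score in recent)

# Empathy baseline 0.88; the last three weekly cohorts averaged
# 0.84, 0.83, 0.82 -- a sustained drop, so this flags drift.
assert drift_alert([0.88, 0.87, 0.84, 0.83, 0.82], baseline=0.88)
```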

### Blueprint Comparison

The platform compares the expected behavior distribution (from your simulations and test sets) against the actual behavior distribution observed in production.

If your simulations predict that the agent escalates in 15% of post-discharge calls, but production data shows escalation in only 8%, that gap signals drift. Either the agent's escalation logic has degraded, or the production population differs from your test personas in ways that affect escalation rates.
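
A sketch of how such a gap can be tested for significance, using a one-sample proportion z-test against the simulated rate (the function and threshold are illustrative; the platform's actual comparison may differ):

```python
import math

def proportion_drift(expected_rate, observed_count, total, alpha=0.01):
    """Two-sided one-sample z-test: is the observed rate consistent
    with the rate predicted by simulation?"""
    p_hat = observed_count / total
    se = math.sqrt(expected_rate * (1 - expected_rate) / total)
    z = (p_hat - expected_rate) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return p_value < alpha, p_hat, p_value

# Simulations predict 15% escalation; production shows 40 of 500 calls (8%).
drifted, rate, p = proportion_drift(0.15, observed_count=40, total=500)
print(drifted, rate, round(p, 6))  # True 0.08 1.2e-05
```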

Blueprint comparison catches drift that metric scores alone might miss. An agent can maintain good average scores while silently failing on a specific scenario type that has become more common in production.

## What Happens When Drift Is Detected

The platform supports a graduated response to drift based on severity.

### Alerts

All detected drift triggers an alert. Alerts include:

* Which metrics are affected and by how much
* The time window over which the change occurred
* Affected conversation cohorts or user segments
* Comparison against baseline and simulation expectations

Alerts give your team the information needed to investigate and decide on a response.
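
The platform's alert schema is not shown here, but a hypothetical payload carrying the fields above might look like:

```python
from dataclasses import dataclass, field

@dataclass
class DriftAlert:
    """Illustrative payload mirroring the fields listed above."""
    metric: str      # which metric degraded
    baseline: float  # baseline / simulation expectation
    observed: float  # current cohort value
    window: str      # time window over which the change occurred
    segments: list[str] = field(default_factory=list)  # affected cohorts/segments

alert = DriftAlert(
    metric="empathy",
    baseline=0.88,
    observed=0.82,
    window="last 2 weeks",
    segments=["post-discharge callers"],
)
```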

### Promotion Gates

Drift status can serve as a gate on version set promotions. If active drift is detected in production, the platform can block promotion of new configurations until the drift is investigated and resolved.

This prevents compounding problems. Promoting a new configuration while existing drift is unresolved makes it harder to isolate the root cause.
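
As a sketch, a gate check might refuse promotion while any alert remains open (names and signature are hypothetical):

```python
def can_promote(version_set: str, unresolved_drift: list[str]) -> bool:
    """Gate check: refuse promotion while any drift alert is unresolved.
    `unresolved_drift` holds the metric names with active alerts."""
    if unresolved_drift:
        raise RuntimeError(
            f"Promotion of {version_set!r} blocked: "
            f"unresolved drift on {', '.join(unresolved_drift)}"
        )
    return True

# can_promote("triage-agent-v12", unresolved_drift=["empathy"])  # raises
```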

### Automatic Rollback Triggers

For safety-critical metrics, you can configure automatic rollback triggers. If a metric drops below a hard floor, the platform reverts to the last known good configuration without waiting for manual intervention.

{% hint style="warning" %}
Automatic rollback is appropriate for safety metrics with clear thresholds (such as escalation accuracy falling below 99%). It is not appropriate for quality metrics where temporary dips may be acceptable.
{% endhint %}
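
A sketch of what trigger configuration could look like, with hard floors only on safety-critical metrics (keys and values are illustrative, not the platform's schema):

```python
# Hard floors on safety-critical metrics trigger rollback; quality
# metrics only ever alert, since temporary dips may be acceptable.
ROLLBACK_TRIGGERS = {
    "escalation_accuracy": {"floor": 0.99,  "action": "rollback"},
    "safety_adherence":    {"floor": 0.995, "action": "rollback"},
    "empathy":             {"floor": 0.80,  "action": "alert_only"},
}

def check_floor(metric, score, triggers=ROLLBACK_TRIGGERS):
    """Return the configured action if `score` breaches its floor, else None."""
    rule = triggers.get(metric)
    if rule and score < rule["floor"]:
        return rule["action"]  # "rollback" -> revert to last known good config
    return None
```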

## Responding to Drift

When drift is detected, follow this process:

1. **Investigate the cause.** Review the affected conversations and metric breakdowns. Determine whether the drift is caused by input changes, agent behavior changes, or requirement changes.
2. **Update your simulations.** If input drift is the cause, add new personas and scenarios that reflect the changed production distribution. If requirement drift is the cause, update your metric thresholds and success criteria.
3. **Re-verify before promoting.** Run your updated test sets against the current configuration. If the agent no longer passes, iterate on the configuration before re-deploying.
4. **Monitor the response.** After deploying a fix, watch the affected metrics to confirm that the drift has been resolved and that the fix did not introduce new issues.

{% hint style="info" %}
Drift is not always a problem. Sometimes input drift reveals new use cases that your agent should support. In those cases, the right response is to expand your test coverage and improve the agent, not to revert to old behavior.
{% endhint %}

## Setting Up Drift Detection

To get value from drift detection, you need:

* **Baseline metrics** from simulations or an initial production period
* **Monitoring windows** appropriate for your traffic volume (daily cohorts for high-volume agents, weekly for lower-volume ones)
* **Alert thresholds** calibrated to distinguish real trends from normal variance
* **Rollback thresholds** for safety-critical metrics where immediate response is required

Start with your highest-priority metrics and expand monitoring as you gain confidence in the system.
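
Tying these pieces together, an illustrative starting configuration might look like the following (this is a sketch of the concepts above, not the platform's actual configuration format):

```python
# Illustrative starting point: keys and values are assumptions for the
# concepts above, not the platform's actual configuration schema.
drift_config = {
    "baseline": "simulation_run_2024_06",  # where baseline metrics come from
    "window": "weekly",                    # monitoring window for this traffic volume
    "alerts": {
        "empathy":               {"min_delta": 0.03, "sustain_cohorts": 3},
        "response_completeness": {"min_delta": 0.05, "sustain_cohorts": 2},
    },
    "rollback": {
        "escalation_accuracy": {"floor": 0.99},  # safety-critical hard floor
    },
}
```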
