Operators and Escalation

Human operators monitor and take over live calls through a conference-based architecture with zero disruption to callers.

Operators are the humans who supervise a voice call. They can monitor calls silently, take over the conversation when needed, and hand control back to the agent when the situation is resolved. The operator system is built on the same conference architecture as the voice agent itself: the operator joins the existing conference as a third participant, alongside the caller and the AI agent.

This is not a call transfer. The caller stays on the same call. There is no hold music, no reconnection, no disruption. The operator simply appears in the conference.

Two Modes

Operators work in one of two modes at any given time and can switch between them instantly.

Listen Mode

The operator hears the full conversation but is muted at the telephony level. The caller does not know the operator is present. The AI agent continues handling the conversation normally.

Listen mode is used for:

  • Quality monitoring of live calls

  • Observing how the agent handles specific scenarios

  • Waiting for the right moment to intervene if escalation is needed

Takeover Mode

The operator is unmuted and speaks directly with the caller. The AI agent's audio output is suppressed (its speaker is muted), but its processing loop continues running in the background. When the operator finishes and switches back to listen mode or leaves the call, the agent resumes immediately with full context of what happened during the takeover. There is no re-initialization or context loss.

During takeover, the operator's speech is captured through a dedicated per-participant STT stream and recorded as operator turns in the transcript. This means the complete call record includes everything the operator said, not just the agent and caller portions.
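The two modes can be pictured as a pair of mute flags that flip together while the agent's processing loop keeps running. The following is a minimal sketch, not the platform's implementation: the `Conference` class, its field names, and the choice to ignore operator speech in listen mode are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    LISTEN = "listen"
    TAKEOVER = "takeover"


@dataclass
class Conference:
    """Hypothetical in-memory model of one call's operator state."""
    operator_mode: Mode = Mode.LISTEN
    operator_muted: bool = True        # listen mode: muted at the telephony level
    agent_output_muted: bool = False   # agent speaks normally in listen mode
    agent_loop_running: bool = True    # a mode switch never stops the loop
    transcript: list = field(default_factory=list)

    def set_mode(self, mode: Mode) -> None:
        takeover = mode is Mode.TAKEOVER
        self.operator_mode = mode
        self.operator_muted = not takeover
        self.agent_output_muted = takeover
        # agent_loop_running is deliberately left untouched, so the agent
        # keeps its context and can resume without re-initialization.

    def record_operator_speech(self, text: str) -> None:
        # Assumption: only takeover speech becomes an operator turn,
        # because the operator is muted in listen mode.
        if self.operator_mode is Mode.TAKEOVER:
            self.transcript.append(("operator", text))
```

Switching to takeover and back only flips the two mute flags; the transcript keeps the operator turns recorded in between.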

Connection Methods

Operators connect to calls through one of two methods.

Phone (PSTN)

The platform dials the operator's phone number. When the operator answers, they are added to the conference. This method works from any phone and requires no special software.

  • Higher latency due to the PSTN round trip

  • Best for remote operators or situations where a desktop is not available

Browser (WebRTC)

The operator connects through a web browser using the voice SDK. Audio travels directly over WebRTC, bypassing the phone network entirely.

  • Lower latency than PSTN

  • Best for operators working at a desktop with a headset

The browser connection flow:

  1. Request an access token via the API

  2. Register the operator for the call

  3. The frontend application connects using the voice SDK with the provided token and connection parameters

  4. Browser audio joins the conference directly
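The first two steps of this flow could look roughly like the sketch below. It runs against a stub so it stays self-contained; `StubOperatorApi`, `request_access_token`, `register_operator`, and `prepare_browser_join` are hypothetical names, not the platform's real API. Steps 3 and 4 happen in the frontend with the voice SDK and are not shown.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserConnection:
    token: str
    params: dict


class StubOperatorApi:
    """Stand-in for the real API; method names are hypothetical."""

    def __init__(self) -> None:
        self.registered = {}  # call_id -> operator_id

    def request_access_token(self, operator_id: str) -> str:
        # A real token would be issued by the platform; this is a placeholder.
        return f"tok_{operator_id}_{secrets.token_hex(4)}"

    def register_operator(self, call_id: str, operator_id: str) -> dict:
        self.registered[call_id] = operator_id
        return {"call_id": call_id, "operator_id": operator_id}


def prepare_browser_join(api, call_id: str, operator_id: str) -> BrowserConnection:
    token = api.request_access_token(operator_id)          # step 1
    params = api.register_operator(call_id, operator_id)   # step 2
    # Steps 3-4: the frontend passes token and params to the voice SDK,
    # and the browser audio then joins the conference.
    return BrowserConnection(token=token, params=params)
```

The returned `BrowserConnection` bundles what the frontend needs in step 3: the token and the connection parameters.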

One Operator Per Call

Only one operator can be active on a call at a time. If a second operator attempts to join the same call, they receive a conflict error. The same operator joining the same call again receives the cached response (the join is idempotent).
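A minimal sketch of this rule, assuming a simple in-memory registry keyed by call ID. `JoinRegistry`, `OperatorConflict`, and the shape of the response dictionary are illustrative names, not part of the documented API.

```python
class OperatorConflict(Exception):
    """Raised when a different operator already holds the call."""


class JoinRegistry:
    def __init__(self) -> None:
        self._active = {}  # call_id -> (operator_id, cached response)
        self._seq = 0

    def join(self, call_id: str, operator_id: str) -> dict:
        existing = self._active.get(call_id)
        if existing is not None:
            holder, response = existing
            if holder == operator_id:
                return response  # idempotent: same operator gets the cached response
            raise OperatorConflict(f"call {call_id} already has an operator")
        self._seq += 1
        response = {"call_id": call_id, "operator_id": operator_id, "join_seq": self._seq}
        self._active[call_id] = (operator_id, response)
        return response

    def leave(self, call_id: str, operator_id: str) -> None:
        # Only the operator holding the call can release it.
        if self._active.get(call_id, (None, None))[0] == operator_id:
            del self._active[call_id]
```

Calling `join` twice with the same operator returns the same response object, while a second operator gets `OperatorConflict` until the first one leaves.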

Escalation Triggers

Operators do not need to monitor every call manually. The platform can escalate calls to operators automatically based on:

  • Safety rules - A monitor concept detects content that requires human review (see Monitoring)

  • Patient request - The caller explicitly asks to speak with a person

  • Agent uncertainty - The agent's confidence in its understanding drops below a configured threshold

When an escalation triggers, the operator receives a notification with the call context: who the caller is, what the conversation has covered so far, and why the escalation was triggered.
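The three triggers and the notification payload can be sketched as follows. This is an illustration under assumptions: the field names, the order in which triggers are checked, the reason strings, and the default threshold of 0.5 are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnSignals:
    safety_flagged: bool          # a safety rule matched
    caller_asked_for_human: bool  # the caller asked to speak with a person
    agent_confidence: float       # assumed to be in the range 0.0 to 1.0


def escalation_reason(signals: TurnSignals, threshold: float = 0.5) -> Optional[str]:
    # Assumption: safety is checked first, then explicit requests, then confidence.
    if signals.safety_flagged:
        return "safety_rule"
    if signals.caller_asked_for_human:
        return "patient_request"
    if signals.agent_confidence < threshold:
        return "agent_uncertainty"
    return None


def build_notification(caller: str, summary: str, signals: TurnSignals,
                       threshold: float = 0.5) -> Optional[dict]:
    reason = escalation_reason(signals, threshold)
    if reason is None:
        return None  # nothing to escalate
    # Context the operator receives: who, what so far, and why.
    return {"caller": caller, "summary": summary, "reason": reason}
```

A notification is only built when one of the triggers fires, and it always carries the reason alongside the caller and conversation summary.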

Speaker Resolution

With three participants in a conference (caller, agent, operator), the system must determine who is speaking at any given moment. The priority chain is:

  1. Operator in takeover mode - Highest priority. Agent audio is suppressed.

  2. Caller - Barge-in detection applies. If the caller speaks during agent output, the agent stops.

  3. Agent - Speaks when neither the operator nor the caller is active.

This priority chain ensures that a human speaker always takes precedence over the agent's output: an operator in takeover mode suppresses the agent entirely, and a caller who starts speaking interrupts it.
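The priority chain reduces to a short function. The sketch below assumes three boolean inputs and treats takeover mode itself, rather than detected operator speech, as the highest-priority condition; the names are illustrative.

```python
from enum import Enum


class Speaker(Enum):
    OPERATOR = "operator"
    CALLER = "caller"
    AGENT = "agent"
    NOBODY = "nobody"


def resolve_speaker(operator_in_takeover: bool,
                    caller_speaking: bool,
                    agent_has_output: bool) -> Speaker:
    if operator_in_takeover:
        return Speaker.OPERATOR   # 1. agent audio is suppressed
    if caller_speaking:
        return Speaker.CALLER     # 2. barge-in: the agent stops
    if agent_has_output:
        return Speaker.AGENT      # 3. nobody human is active
    return Speaker.NOBODY


def agent_may_speak(operator_in_takeover: bool, caller_speaking: bool) -> bool:
    # The agent is audible only when it is the resolved speaker.
    return resolve_speaker(operator_in_takeover, caller_speaking, True) is Speaker.AGENT
```

For example, `resolve_speaker(False, True, True)` resolves to the caller even though the agent has output queued, which is the barge-in case.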

Dashboard and Performance

Operators register with a profile that includes their name, skills, connection method (phone or browser), and role. Their status is tracked in real time: offline, available, on-call, busy, or unavailable.

The operator dashboard provides:

  • Active call list - Currently escalated calls with context summaries

  • Escalation statistics - Volume and type of escalations over time

  • Performance metrics - Total escalations handled and average handle time per operator

  • Audit log - Complete history of operator actions (join, mode switch, leave) for compliance

Browser-based operators connect via WebRTC. Phone-based operators receive a call to their registered number. Both see the same dashboard and have the same listen/takeover/leave controls.

Operator Guidance

Operators in listen mode can send text guidance to the agent without taking over the call. The guidance is injected into the active session and the agent processes it as an instructional event - interrupting its current speech to act on the guidance immediately.

This is useful when an operator sees the conversation going in the wrong direction and wants to steer the agent without the caller knowing a human intervened. For example, an operator monitoring a scheduling call could send "Ask for their insurance ID before confirming the appointment" and the agent would work that into its next response naturally.

Guidance messages are distinct from external events. External events carry factual information ("The appointment has been confirmed") and queue behind the agent's current speech. Guidance carries instructions ("Ask about their insurance") and interrupts the agent's current speech because instructions are time-sensitive.

Both event types flow through the same injection system. The architecture is cross-pod, meaning injection works regardless of which server is handling the call.
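The difference between the two event types can be sketched with a single injection entry point. `AgentSession`, the `kind` strings, and the decision to drain queued events right after a guidance interrupt are assumptions for illustration; cross-pod routing is out of scope here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    """Hypothetical model of one active session's event handling."""
    speaking: bool = False
    pending: deque = field(default_factory=deque)  # events waiting for speech to end
    handled: list = field(default_factory=list)    # events the agent has processed

    def inject(self, kind: str, text: str) -> None:
        if kind == "guidance":
            self.speaking = False               # interrupt the current speech
            self.handled.append((kind, text))   # act on the instruction first
            self._drain()                       # assumption: queued events follow
        elif kind == "external_event":
            if self.speaking:
                self.pending.append((kind, text))  # queue behind current speech
            else:
                self.handled.append((kind, text))
        else:
            raise ValueError(f"unknown event kind: {kind}")

    def finish_speaking(self) -> None:
        self.speaking = False
        self._drain()

    def _drain(self) -> None:
        while self.pending:
            self.handled.append(self.pending.popleft())
```

With the agent mid-sentence, an external event waits in `pending`, while a guidance message stops the speech and is handled ahead of it.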

Deferred Transfer

When the agent initiates a call transfer (for example, forwarding to a clinic's front desk), the transfer is deferred until the agent's goodbye message finishes playing. This prevents the caller from being redirected mid-sentence. If the caller speaks during the goodbye (barge-in), the transfer is cancelled and the conversation continues. If an operator joins the call during this window, the transfer is also cancelled.
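The deferral window behaves like a small state machine: a pending transfer either executes when the goodbye finishes or is cancelled by a barge-in or an operator join. A minimal sketch, with class and method names chosen for illustration:

```python
from enum import Enum


class TransferState(Enum):
    PENDING = "pending"
    EXECUTED = "executed"
    CANCELLED = "cancelled"


class DeferredTransfer:
    def __init__(self, target: str) -> None:
        self.target = target
        self.state = TransferState.PENDING  # waiting for the goodbye to finish

    def on_goodbye_finished(self) -> None:
        if self.state is TransferState.PENDING:
            self.state = TransferState.EXECUTED

    def on_barge_in(self) -> None:
        self._cancel()  # caller spoke during the goodbye

    def on_operator_joined(self) -> None:
        self._cancel()  # an operator joined during the window

    def _cancel(self) -> None:
        # Only a pending transfer can be cancelled; later events are ignored.
        if self.state is TransferState.PENDING:
            self.state = TransferState.CANCELLED
```

Once a transfer is cancelled, a later `on_goodbye_finished` call has no effect, so the conversation continues on the original call.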

Developer Guide - For API endpoints, SDK examples, and integration details, see Operators in the developer guide.
