Voice Judge

Retrieve audio-native voice quality scores for your voice agent calls.

The Voice Judge evaluates voice agent calls directly from audio recordings, scoring each call across 10 quality dimensions. Scores measure the voice experience (pronunciation, clarity, pacing, interruption handling) independently of conversational logic or agent prompting.

Voice Judge results are produced by a scheduled batch evaluation job. Once a call has been scored, its results are available per service through the API.

Endpoints

List Recent Voice Judge Results

Latency: 500ms-2s. This endpoint reads from the analytics warehouse, not the primary database.

GET /v1/{workspace_id}/services/{service_id}/voice-judge/recent

Returns the most recent per-call voice quality scores for a service, ordered newest first.

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| workspace_id | string (UUID) | Workspace identifier |
| service_id | string (UUID) | Service identifier |

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 20 | Max rows to return. Min 1, max 100. |
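As a minimal sketch of calling this endpoint, the Python below builds the request URL and checks `limit` against the documented range before sending. The base URL `https://api.example.com` and the bearer-token `Authorization` header are assumptions for illustration; this page does not specify the host or the authentication scheme, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL: the real host is not given on this page.
BASE_URL = "https://api.example.com"


def recent_results_url(workspace_id: str, service_id: str, limit: int = 20) -> str:
    """Build the request URL, enforcing the documented limit range (1-100)."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    path = (
        f"/v1/{urllib.parse.quote(workspace_id, safe='')}"
        f"/services/{urllib.parse.quote(service_id, safe='')}"
        "/voice-judge/recent"
    )
    return f"{BASE_URL}{path}?{urllib.parse.urlencode({'limit': limit})}"


def fetch_recent_results(workspace_id, service_id, api_token, limit=20, timeout=10.0):
    """Send the GET request and return the decoded JSON body as a dict."""
    request = urllib.request.Request(
        recent_results_url(workspace_id, service_id, limit),
        headers={
            # Assumed auth scheme; replace with whatever your workspace uses.
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Because the endpoint reads from the analytics warehouse, a client timeout comfortably above the stated 500ms-2s latency (10 seconds here) is a reasonable starting point.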

Response

{
  "service_id": "string",
  "count": 2,
  "items": [
    {
      "call_sid": "string",
      "call_entity_id": "string | null",
      "service_id": "string | null",
      "latency_dead_air_score": 0.9,
      "pronunciation_score": 1.0,
      "clarity_score": 0.25,
      "filler_silence_score": 0.5,
      "interruption_handling_score": 1.0,
      "audio_consistency_score": 1.0,
      "pacing_score": 1.0,
      "warmth_tone_score": 1.0,
      "accent_quality_score": 1.0,
      "voice_identity_score": 1.0,
      "overall_score": 0.865,
      "critical_count": 0,
      "flag_count": 1,
      "warning_count": 1,
      "judge_json": "string | null",
      "computed_at": "2026-05-15T12:00:00Z"
    }
  ]
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| service_id | string | The service these results belong to |
| count | integer | Number of items returned |
| items | array | List of voice judge result rows |

Voice Judge Result Row

| Field | Type | Description |
| --- | --- | --- |
| call_sid | string | Call identifier |
| call_entity_id | string or null | Entity identifier for the call |
| service_id | string or null | Service identifier |
| latency_dead_air_score | number or null | Response latency and dead air score (P0). 0.0-1.0 |
| pronunciation_score | number or null | Pronunciation accuracy score (P0). 0.0-1.0 |
| clarity_score | number or null | Speech clarity and intelligibility score (P0). 0.0-1.0 |
| filler_silence_score | number or null | Filler and silence management score (P1). 0.0-1.0 |
| interruption_handling_score | number or null | Barge-in and recovery score (P1). 0.0-1.0 |
| audio_consistency_score | number or null | Audio consistency score (P1). 0.0-1.0 |
| pacing_score | number or null | Speech rate and pausing score (P2). 0.0-1.0 |
| warmth_tone_score | number or null | Emotional tone appropriateness score (P2). 0.0-1.0 |
| accent_quality_score | number or null | Language and accent match score (P2). 0.0-1.0 |
| voice_identity_score | number or null | Score for voice consistency across the call (P2). 0.0-1.0 |
| overall_score | number or null | Composite score (arithmetic mean of dimension scores). 0.0-1.0 |
| critical_count | integer or null | Number of dimensions with Critical severity |
| flag_count | integer or null | Number of dimensions with Flag severity |
| warning_count | integer or null | Number of dimensions with Warning severity |
| judge_json | string or null | Raw judge output with per-dimension evidence quotes and severity. Opaque string for UI drill-in. |
| computed_at | string (ISO 8601) or null | When the evaluation was computed |
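To illustrate how `overall_score` relates to the dimension scores, the sketch below recomputes the arithmetic mean from a result row. Skipping null dimensions and rounding to three decimals are assumptions made here; this page does not say how the service treats nulls or rounding, and `mean_dimension_score` is a hypothetical helper name.

```python
from statistics import fmean

# The ten dimension score fields listed in the result row table.
DIMENSION_FIELDS = (
    "latency_dead_air_score",
    "pronunciation_score",
    "clarity_score",
    "filler_silence_score",
    "interruption_handling_score",
    "audio_consistency_score",
    "pacing_score",
    "warmth_tone_score",
    "accent_quality_score",
    "voice_identity_score",
)


def mean_dimension_score(row: dict):
    """Arithmetic mean of the non-null dimension scores, or None if all are null."""
    # Assumption: null dimensions are left out of the mean.
    scores = [row[f] for f in DIMENSION_FIELDS if row.get(f) is not None]
    return round(fmean(scores), 3) if scores else None
```

For the sample response above, the ten scores sum to 8.65, so the mean is 0.865, which matches the `overall_score` shown.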

Score Interpretation

All dimension scores range from 0.0 to 1.0:

| Score Range | Severity | Meaning |
| --- | --- | --- |
| 0.75 - 1.0 | None | Meets the bar |
| 0.5 - 0.74 | Warning | Minor quality pattern detected |
| 0.25 - 0.49 | Flag | Notable quality issue |
| 0.0 - 0.24 | Critical | Significant quality problem |
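The ranges above can be applied client-side to label individual dimensions. The sketch below treats the lower bound of each range (0.75, 0.5, 0.25) as an inclusive threshold, which is an assumption: the table does not say how a value such as 0.745 is classified. The function names are hypothetical.

```python
from collections import Counter

DIMENSION_FIELDS = (
    "latency_dead_air_score",
    "pronunciation_score",
    "clarity_score",
    "filler_silence_score",
    "interruption_handling_score",
    "audio_consistency_score",
    "pacing_score",
    "warmth_tone_score",
    "accent_quality_score",
    "voice_identity_score",
)


def severity(score: float) -> str:
    """Map a 0.0-1.0 score to its severity label (lower bounds assumed inclusive)."""
    if score >= 0.75:
        return "None"
    if score >= 0.5:
        return "Warning"
    if score >= 0.25:
        return "Flag"
    return "Critical"


def severity_counts(row: dict) -> dict:
    """Count dimensions per severity, ignoring null scores."""
    counts = Counter(
        severity(row[f]) for f in DIMENSION_FIELDS if row.get(f) is not None
    )
    return {
        "critical_count": counts["Critical"],
        "flag_count": counts["Flag"],
        "warning_count": counts["Warning"],
    }
```

Applied to the sample response, `clarity_score` of 0.25 is a Flag and `filler_silence_score` of 0.5 is a Warning, which agrees with the `flag_count` and `warning_count` of 1 in that row.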

Error Responses

| Status | Description |
| --- | --- |
| 404 | Service not found in this workspace |
| 503 | Analytics warehouse not configured or transiently unavailable |

A 200 response with an empty items list means no calls have been evaluated yet for this service. A 503 response means the analytics warehouse is not configured or is temporarily unavailable. For transient failures, retry after a short delay.
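One way to handle these cases in a client is sketched below: retry only on 503 with a growing delay, and let other errors such as 404 propagate immediately. The attempt count and delays are illustrative choices, not values recommended by this API, and `call_with_retry` is a hypothetical helper that wraps any zero-argument function performing the request with `urllib`.

```python
import time
import urllib.error


def call_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on HTTP 503 with exponential backoff.

    fetch is any zero-argument callable that performs the request and
    raises urllib.error.HTTPError on a non-2xx status.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Only 503 is described as transient; re-raise everything else,
            # and re-raise the 503 too once the attempts are used up.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def has_results(body: dict) -> bool:
    """False when the service has no evaluated calls yet (empty items list)."""
    return bool(body.get("items"))
```

If the 503 persists across retries, the warehouse may simply not be configured for the workspace, in which case retrying will not help.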

Dimensions

The Voice Judge evaluates 10 dimensions, grouped by priority:

P0 - Critical Quality

  • Latency and Dead Air - Response latency between turns. Flags prolonged silence (>3s between turns) and extended processing waits without verbal acknowledgment.

  • Pronunciation - Correct pronunciation of medical terms, drug names, dates, numbers, and patient names. Critical on any factual read-back error.

  • Clarity - Speech intelligibility and clean audio output. Critical on garbled or unintelligible speech.

P1 - Important Quality

  • Filler and Silence Management - Graceful handling of processing pauses. Verbal acknowledgment before a pause, no dead air during the wait, no repeated filler phrases, and filler that matches the result being delivered.

  • Interruption Handling - Clean barge-in behavior. Agent stops when the caller speaks, no false triggers on background noise, smooth recovery after being interrupted.

  • Audio Consistency - Absence of volume spikes, pitch anomalies, mid-word cutoffs, or inconsistent voice timbre across turns.

P2 - Quality Polish

  • Pacing - Conversational speech rate with appropriate pauses between pieces of information and slower delivery for sensitive content.

  • Warmth and Tone - Emotional appropriateness matched to the caller's state. Flags flat affect, tonal mismatches, or inappropriate emotional tone.

  • Accent and Language Quality - Language and accent match to the caller. Critical on wrong language delivery.

  • Voice Identity - Consistent agent voice and persona across the entire call.
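The priority grouping above can be used to triage a result row, for example by surfacing the weakest dimension in each tier. The sketch below maps each score field to its priority as listed in this section; the helper name `lowest_by_priority` and the idea of triaging this way are illustrative assumptions, not part of the API.

```python
# Priority tier for each dimension score field, as grouped above.
PRIORITY_BY_FIELD = {
    "latency_dead_air_score": "P0",
    "pronunciation_score": "P0",
    "clarity_score": "P0",
    "filler_silence_score": "P1",
    "interruption_handling_score": "P1",
    "audio_consistency_score": "P1",
    "pacing_score": "P2",
    "warmth_tone_score": "P2",
    "accent_quality_score": "P2",
    "voice_identity_score": "P2",
}


def lowest_by_priority(row: dict) -> dict:
    """Return {priority: (field, score)} for the lowest non-null score per tier."""
    lowest = {}
    for field, priority in PRIORITY_BY_FIELD.items():
        score = row.get(field)
        if score is None:
            continue  # null scores carry no information, so skip them
        # Strict comparison keeps the first field seen when scores tie.
        if priority not in lowest or score < lowest[priority][1]:
            lowest[priority] = (field, score)
    return lowest
```

A reviewer would typically look at the P0 entry first, since those dimensions cover the most serious quality problems.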
