# Voice Configuration

Every voice service has a `voice_config` object that controls how the real-time pipeline behaves during calls. It covers five categories: latency tuning, filler behavior, response length limits, barge-in sensitivity, and tool access. If you leave `voice_config` null, the service inherits balanced defaults from the workspace and environment.

The design principle is straightforward: each field is optional. A null field means "use the default from the layer above." You only set the fields you want to override.

## Field Reference

### Latency

These fields control how quickly the agent responds after the caller stops speaking.

| Field                 | Type                           | Default         | Description                                                                                                                                                                                                                                                                                                |
| --------------------- | ------------------------------ | --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tts_model`           | `"sonic-turbo"` or `"sonic-3"` | `"sonic-turbo"` | TTS model. `sonic-turbo` targets \~40ms time-to-first-audio and is best for snappy conversations. `sonic-3` targets \~90ms but produces higher-quality speech with better prosody on longer utterances.                                                                                                    |
| `max_buffer_delay_ms` | `int` (200-1000)               | `500`           | How long the pipeline buffers generated text before sending it to TTS. Lower values make the agent start speaking sooner, but may produce choppier speech because each TTS chunk has less context.                                                                                                         |
| `eager_eot_threshold` | `float` (0.0-1.0)              | inherited       | End-of-turn confidence threshold: the confidence the system needs before it decides the caller has finished talking. A value of 0.8 means "start responding when 80% confident the caller is done." Lower values make the system more aggressive, which reduces latency but increases the chance of cutting the caller off mid-sentence. Higher values wait for more certainty and respond later. |
| `eot_timeout_ms`      | `int`                          | inherited       | Hard silence timeout in milliseconds. If the caller stops speaking for this long, the system forces an end-of-turn regardless of the confidence score. Useful as a safety net for the confidence-based detection.                                                                                          |
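
The interaction between `eager_eot_threshold` and `eot_timeout_ms` can be pictured with the sketch below. The function name, its inputs, and the idea that the detector exposes a single confidence score alongside a running silence duration are assumptions made for illustration; this is not the pipeline's actual implementation, and the numbers in the usage lines are arbitrary.

```python
def should_end_turn(confidence: float, silence_ms: int,
                    eager_eot_threshold: float, eot_timeout_ms: int) -> bool:
    """Return True when the agent should start responding (illustrative only)."""
    # Confidence path: the detector is sure enough that the caller is done.
    if confidence >= eager_eot_threshold:
        return True
    # Safety net: enough silence forces end-of-turn regardless of confidence.
    return silence_ms >= eot_timeout_ms


# Hypothetical values: threshold 0.8, hard timeout 1500 ms.
print(should_end_turn(0.85, 300, 0.8, 1500))   # confident enough
print(should_end_turn(0.50, 300, 0.8, 1500))   # keep listening
print(should_end_turn(0.50, 1500, 0.8, 1500))  # timeout forces end-of-turn
```

Under this model, lowering the threshold lets the first branch fire earlier, which is where both the latency gain and the risk of cutting the caller off come from.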

### Fillers

Fillers are the short sounds or phrases the agent produces while the LLM generates the full response. They fill what would otherwise be dead air.

| Field                  | Type                                       | Default         | Description                                                                                                                                                                                                                                                           |
| ---------------------- | ------------------------------------------ | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `filler_style`         | `"backchannel"`, `"phrase"`, or `"silent"` | `"backchannel"` | Controls what kind of filler the agent uses. `backchannel` produces short acknowledgments ("Mm", "Yeah", "Mhm"). `phrase` produces longer fillers ("Let me check on that"). `silent` disables fillers entirely. The empathy system can override this at higher tiers. |
| `filler_vocabulary`    | `list[str]`                                | null            | Custom filler words that replace the built-in vocabulary. If you set `filler_style` to `backchannel` and provide `["Sure", "Got it", "Okay"]`, those become the only backchannels the agent uses.                                                                     |
| `backchannel_delay_ms` | `int` (400-1000)                           | `400`           | How many milliseconds the pipeline waits before playing a backchannel when navigation is skipped (the caller said something that doesn't require a state change). Lower values make the agent feel more responsive; higher values feel more deliberate.               |
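
As a rough illustration of how `filler_style` and `filler_vocabulary` combine, the sketch below picks a filler for one turn. The `pick_filler` helper is hypothetical, the built-in lists contain only the examples quoted in the table (the real vocabulary may be larger), and it assumes a custom vocabulary applies to any non-silent style and that an empty list falls back to the built-ins.

```python
import random
from typing import Optional, Sequence

# Examples quoted in the field descriptions; the real built-in lists may differ.
BUILT_IN = {
    "backchannel": ["Mm", "Yeah", "Mhm"],
    "phrase": ["Let me check on that"],
}


def pick_filler(filler_style: str = "backchannel",
                filler_vocabulary: Optional[Sequence[str]] = None,
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Choose a filler for one turn, or None when fillers are disabled."""
    if filler_style == "silent":
        return None
    if filler_style not in BUILT_IN:
        raise ValueError(f"unknown filler_style: {filler_style!r}")
    # Assumption: a custom vocabulary fully replaces the built-in one.
    vocabulary = list(filler_vocabulary) if filler_vocabulary else BUILT_IN[filler_style]
    chooser = rng if rng is not None else random
    return chooser.choice(vocabulary)


print(pick_filler("backchannel", ["Sure", "Got it", "Okay"]))
print(pick_filler("silent"))
```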

### Response

Hard caps on how much the agent says per turn. The pipeline enforces these mechanically by truncating the output; they are not just instructions in the prompt.

| Field                    | Type  | Default          | Description                                                                                                                                                        |
| ------------------------ | ----- | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `max_response_sentences` | `int` | null (unlimited) | Maximum sentences per response. Set to 1 or 2 for voice services where brevity matters. The pipeline counts sentences by punctuation and cuts off after the limit. |
| `max_response_words`     | `int` | null (unlimited) | Maximum words per response. Same mechanical enforcement. Useful when you want finer control than sentence-level truncation.                                        |
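
To make "mechanical enforcement" concrete, here is a minimal sketch of punctuation-based truncation. It is an assumption about how such a cap could work, not the pipeline's code: the `cap_response` helper is hypothetical, and the sentence splitter is deliberately naive (an abbreviation such as "Dr." would count as a sentence end).

```python
import re
from typing import Optional

# Split on whitespace that follows ., ! or ? (naive sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def cap_response(text: str,
                 max_response_sentences: Optional[int] = None,
                 max_response_words: Optional[int] = None) -> str:
    """Truncate text to the configured caps; None means unlimited."""
    if max_response_sentences is not None:
        sentences = _SENTENCE_END.split(text.strip())
        text = " ".join(sentences[:max_response_sentences])
    if max_response_words is not None:
        words = text.split()
        text = " ".join(words[:max_response_words])
    return text


reply = "Sure. I can help! What day works for you?"
print(cap_response(reply, max_response_sentences=2))  # Sure. I can help!
print(cap_response(reply, max_response_words=3))      # Sure. I can
```

The second call shows why a word cap is the blunter tool: it can stop mid-sentence, whereas a sentence cap always ends on punctuation.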

### Barge-in

Barge-in is when the caller starts speaking while the agent is still talking. These fields control how sensitive that detection is.

| Field                   | Type    | Default | Description                                                                                                                                                                                                                                                                                                                         |
| ----------------------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `barge_in_min_speech_s` | `float` | `0.8`   | Minimum duration of caller speech (in seconds) before a barge-in triggers. The default of 0.8s was found through simulation testing to be the sweet spot. Lower values (e.g. 0.5) cause false barge-ins from breath sounds and micro-utterances like "um". Higher values (e.g. 1.2) make the agent feel like it ignores the caller. |
| `barge_in_cooldown_s`   | `float` | `1.5`   | Cooldown period after a barge-in before another one can fire. Prevents rapid-fire interruptions where the caller and agent keep cutting each other off.                                                                                                                                                                             |
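
The two barge-in fields act as a pair of gates, which the sketch below models. `BargeInGate` is a hypothetical class, and it assumes the cooldown is measured from the moment the previous barge-in fired; the real pipeline may measure it differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BargeInGate:
    barge_in_min_speech_s: float = 0.8
    barge_in_cooldown_s: float = 1.5
    _last_fired_at: Optional[float] = None

    def should_fire(self, speech_duration_s: float, now_s: float) -> bool:
        """Decide whether caller speech should interrupt the agent."""
        # Gate 1: ignore breaths and micro-utterances shorter than the minimum.
        if speech_duration_s < self.barge_in_min_speech_s:
            return False
        # Gate 2: suppress a second barge-in while the cooldown is running.
        if (self._last_fired_at is not None
                and now_s - self._last_fired_at < self.barge_in_cooldown_s):
            return False
        self._last_fired_at = now_s
        return True


gate = BargeInGate()
print(gate.should_fire(0.5, now_s=10.0))  # False: too short ("um")
print(gate.should_fire(0.9, now_s=10.0))  # True: real interruption
print(gate.should_fire(1.0, now_s=11.0))  # False: still in cooldown
print(gate.should_fire(1.0, now_s=12.0))  # True: cooldown elapsed
```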

### Tools

| Field                  | Type   | Default | Description                                                                                                                                                                                                                                                            |
| ---------------------- | ------ | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `forward_call_enabled` | `bool` | `false` | Whether the `forward_call` tool is available during conversations. This is opt-in because call forwarding has cost and compliance implications. Even when enabled, the context graph's TurnPolicy can further gate forwarding (e.g. `block_forward_call_after_turns`). |
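
The types and ranges in the tables above can be checked on the client before a request is sent. The sketch below is one way to do that; `validate_voice_config` is a hypothetical helper, not part of any SDK, and it assumes the documented ranges are inclusive at both ends (the presets use 200 and 400, so at least the lower bounds are accepted values).

```python
from typing import Any, Dict, List

TTS_MODELS = {"sonic-turbo", "sonic-3"}
FILLER_STYLES = {"backchannel", "phrase", "silent"}
INT_RANGES = {
    "max_buffer_delay_ms": (200, 1000),
    "backchannel_delay_ms": (400, 1000),
}


def validate_voice_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []
    for field, value in config.items():
        if value is None:
            continue  # null means "inherit", so there is nothing to check
        if field == "tts_model" and value not in TTS_MODELS:
            errors.append(f"tts_model must be one of {sorted(TTS_MODELS)}")
        elif field == "filler_style" and value not in FILLER_STYLES:
            errors.append(f"filler_style must be one of {sorted(FILLER_STYLES)}")
        elif field in INT_RANGES:
            low, high = INT_RANGES[field]
            if not low <= value <= high:
                errors.append(f"{field} must be between {low} and {high}")
        elif field == "eager_eot_threshold" and not 0.0 <= value <= 1.0:
            errors.append("eager_eot_threshold must be between 0.0 and 1.0")
    return errors


print(validate_voice_config({"max_buffer_delay_ms": 100, "eager_eot_threshold": 1.2}))
```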

## Named Presets

Three presets cover common configurations. A preset sets multiple fields at once to a tested combination.

| Preset              | TTS Model     | Buffer Delay | Filler Style  | Backchannel Delay | Response Cap | Best For                                                                                        |
| ------------------- | ------------- | ------------ | ------------- | ----------------- | ------------ | ----------------------------------------------------------------------------------------------- |
| `ultra_low_latency` | `sonic-turbo` | 200ms        | `backchannel` | 400ms             | 1 sentence   | Quick intake, high-volume triage, yes/no flows                                                  |
| `balanced`          | (default)     | 500ms        | (default)     | (default)         | (unlimited)  | General-purpose conversations                                                                   |
| `quality`           | `sonic-3`     | 500ms        | `phrase`      | (default)         | (unlimited)  | Sensitive topics, empathetic conversations where natural speech quality matters more than speed |

Presets are a starting point. You can apply a preset and then override individual fields on top of it.
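
Layering overrides on a preset amounts to a dictionary merge, as the sketch below shows. The `PRESETS` mapping is transcribed from the table above and lists only the fields each preset sets explicitly; cells marked "(default)" or "(unlimited)" are left out so they keep inheriting. The `preset_with_overrides` helper is hypothetical.

```python
from typing import Any, Dict

# Transcribed from the preset table; unlisted fields keep their defaults.
PRESETS: Dict[str, Dict[str, Any]] = {
    "ultra_low_latency": {
        "tts_model": "sonic-turbo",
        "max_buffer_delay_ms": 200,
        "filler_style": "backchannel",
        "backchannel_delay_ms": 400,
        "max_response_sentences": 1,
    },
    "balanced": {"max_buffer_delay_ms": 500},
    "quality": {
        "tts_model": "sonic-3",
        "max_buffer_delay_ms": 500,
        "filler_style": "phrase",
    },
}


def preset_with_overrides(preset: str, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Start from a preset, then let explicit overrides win field by field."""
    return {**PRESETS[preset], **overrides}


# "quality" speech, but capped at two sentences per turn.
print(preset_with_overrides("quality", {"max_response_sentences": 2}))
```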

## API Usage

Voice config is a field on the service model. You set it when creating or updating a service.

### Setting voice config via PUT

```
PUT /v1/{workspace_id}/services/{service_id}
```

```json
{
  "voice_config": {
    "tts_model": "sonic-turbo",
    "max_buffer_delay_ms": 200,
    "filler_style": "backchannel",
    "backchannel_delay_ms": 400,
    "max_response_sentences": 1,
    "barge_in_min_speech_s": 0.8,
    "forward_call_enabled": false
  }
}
```

The `voice_config` object you send is merged with the service's existing config, so you can update individual fields without resending the full object:

```json
{
  "voice_config": {
    "filler_style": "silent"
  }
}
```

This changes only the filler style. All other voice config fields keep their current values.
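
Because the update is a merge, a client only needs to send the fields that actually change. The sketch below computes such a minimal request body from the config returned by a previous read. `build_patch` is a hypothetical helper, and the sketch does not cover how the API treats an explicit `null` in the body, which is not specified here.

```python
import json
from typing import Any, Dict


def build_patch(current: Dict[str, Any], desired: Dict[str, Any]) -> Dict[str, Any]:
    """Return a request body containing only the fields whose values differ."""
    changed = {key: value for key, value in desired.items()
               if current.get(key) != value}
    return {"voice_config": changed}


current = {"filler_style": "backchannel", "max_response_sentences": 1}
desired = {"filler_style": "silent", "max_response_sentences": 1}
print(json.dumps(build_patch(current, desired)))
# {"voice_config": {"filler_style": "silent"}}
```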

### Reading current config

The service response includes the full `voice_config` object:

```
GET /v1/{workspace_id}/services/{service_id}
```

```json
{
  "id": "svc_abc123",
  "name": "Intake Line",
  "voice_config": {
    "tts_model": "sonic-turbo",
    "max_buffer_delay_ms": 200,
    "filler_style": "backchannel",
    "filler_vocabulary": null,
    "backchannel_delay_ms": 400,
    "eager_eot_threshold": null,
    "eot_timeout_ms": null,
    "max_response_sentences": 1,
    "max_response_words": null,
    "barge_in_min_speech_s": 0.8,
    "barge_in_cooldown_s": null,
    "forward_call_enabled": false
  }
}
```

Fields that are `null` mean "inherit from the layer above" (see [Configuration Cascade](#configuration-cascade) below).

## CLI Usage

The `forge` CLI provides a shorthand for voice config operations.

### Apply a preset

```bash
forge platform service voice-config <service_id> --preset ultra_low_latency
```

This sets all the fields in the preset at once.

### Set individual fields

```bash
forge platform service voice-config <service_id> \
  --body '{"filler_style": "silent", "max_response_sentences": 1}'
```

### Read current config

```bash
forge platform service voice-config <service_id> --get
```

## Configuration Cascade

Voice behavior resolves through a four-layer cascade. Each layer can override the one above it: for any given setting, the most specific layer that provides a non-null value wins.

```
Workspace Voice Settings
  └── Service Voice Config (voice_config on the service model)
      └── Context Graph State (TurnPolicy on the current state)
          └── Action-level (per-action overrides via channel_overrides)
```

Here is what each layer controls:

**Layer 1: Workspace Voice Settings** set the baseline for all services in a workspace: voice ID, tone, speed, volume, language, keyterms, and sensitive topics. They are configured via `GET/PUT /v1/{workspace_id}/voice-settings`. See [Workspaces](https://docs.amigo.ai/developer-guide/platform-api/workspaces#voice-settings).

**Layer 2: Service Voice Config** (`voice_config` on the service) overrides workspace defaults for pipeline behavior. This is where you set latency tuning, filler style, response limits, and barge-in sensitivity. A null field means "use whatever the workspace says."

**Layer 3: Context Graph TurnPolicy** can override behavior per-state. For example, a "crisis" state might set `safety_response: stay_empathize` and `block_forward_call: true`. TurnPolicy fields include `barge_in_enabled`, `greeting_shield_s`, `safety_response`, `context_strategy`, `degradation_threshold`, and `block_forward_call`. See [Voice Simulation](https://docs.amigo.ai/developer-guide/platform-api/platform-api/voice-simulation) for how these were tuned.

**Layer 4: Channel Overrides** on individual actions or states can override the objective, action guidelines, and filler suppression for specific channels (e.g. SMS vs voice).

In practice, most services only need Layer 1 (workspace) and Layer 2 (service voice config). Layers 3 and 4 are for advanced context graph tuning.
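
The cascade can be sketched as a lookup that walks from the most specific layer back up to the workspace. This is a simplification for illustration: it assumes each layer can be flattened into a dictionary keyed by the same field names, whereas in practice the layers expose different sets of fields. The `resolve` function and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


def resolve(field: str, layers: Sequence[Optional[Dict[str, Any]]]) -> Any:
    """Resolve one field; `layers` is ordered from workspace to action level."""
    # Walk from the most specific layer upward and stop at the first value set.
    for layer in reversed(layers):
        if layer is not None and layer.get(field) is not None:
            return layer[field]
    return None  # no layer sets this field


layers = [
    {"filler_style": "backchannel", "max_buffer_delay_ms": 500},  # workspace
    {"filler_style": None, "max_buffer_delay_ms": 200},           # service
    None,                                                          # state: nothing set
    {"filler_style": "silent"},                                   # action level
]
print(resolve("filler_style", layers))         # silent
print(resolve("max_buffer_delay_ms", layers))  # 200
```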

## Nav-Selected Emotion

During each turn, the navigation LLM selects an emotion alongside its state transition. The output format is:

```
CODE,V,EMOTION,FILLER
```

For example: `a0,0,sympathetic,Mm`

| Part      | Meaning                                                                                                                                                                   |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `CODE`    | State transition code from the context graph (`a0` = action 0, `e1` = exit condition 1)                                                                                   |
| `V`       | Audio verification flag. `0` = normal. `1` = verify the caller's audio against the transcript (expensive, use sparingly for structured data like dates and phone numbers) |
| `EMOTION` | Voice tone applied to the TTS provider for the entire response. The filler and main response share the same emotional tone.                                               |
| `FILLER`  | Short phrase spoken while the engage LLM generates the full response. Optional but recommended.                                                                           |
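
For debugging or log analysis it can help to parse this line into named parts. The parser below is a sketch under assumptions: `NavDecision` and `parse_nav_output` are hypothetical names, an omitted filler is assumed to appear as a missing or empty fourth part, and any commas after the third one are treated as part of the filler text.

```python
from typing import NamedTuple, Optional


class NavDecision(NamedTuple):
    code: str
    verify_audio: bool
    emotion: str
    filler: Optional[str]


def parse_nav_output(line: str) -> NavDecision:
    """Parse a `CODE,V,EMOTION,FILLER` line; FILLER may be absent."""
    # Split at most three times so a filler containing commas stays intact.
    parts = line.strip().split(",", 3)
    if len(parts) < 3:
        raise ValueError(f"expected at least CODE,V,EMOTION, got {line!r}")
    code, flag, emotion = (part.strip() for part in parts[:3])
    if flag not in ("0", "1"):
        raise ValueError(f"V must be 0 or 1, got {flag!r}")
    filler = parts[3].strip() if len(parts) == 4 else ""
    return NavDecision(code, flag == "1", emotion, filler or None)


print(parse_nav_output("a0,0,sympathetic,Mm"))
# NavDecision(code='a0', verify_audio=False, emotion='sympathetic', filler='Mm')
```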

### Available Emotions

The navigation LLM chooses from eight emotions based on the caller's emotional state and conversation context:

| Emotion        | When to use                                                                |
| -------------- | -------------------------------------------------------------------------- |
| `friendly`     | Default warmth. General conversations, greetings, confirmations.           |
| `sympathetic`  | Caller is frustrated, upset, or sharing bad news.                          |
| `calm`         | Sensitive topics, de-escalation, anxious callers.                          |
| `enthusiastic` | Positive outcomes, good news, caller is excited.                           |
| `serious`      | Important information, disclaimers, medical details.                       |
| `cheerful`     | Light conversations, small talk, closing on a positive note.               |
| `curious`      | Asking clarifying questions, exploring a topic.                            |
| `content`      | Neutral satisfaction. Wrapping up, confirmations after a good interaction. |

Because emotion is set at the TTS context level (not injected as SSML markup mid-sentence), the entire utterance has consistent prosody. This avoids the uncanny shifts that happened when SSML tags changed emotion partway through a sentence.

The empathy system can override the nav-selected emotion at higher tiers. At Tier 2 (full empathy), the system forces warmth regardless of what the navigator picked. At Tier 3 (hold space), filler is suppressed entirely and silence replaces speech.
