# Voice

Amigo supports two voice modes. Choose the one that matches your UX and latency needs:

| Mode                            | Transport     | Best for                                        | Latency       | Notes                                          |
| ------------------------------- | ------------- | ----------------------------------------------- | ------------- | ---------------------------------------------- |
| **Voice Notes (HTTP)**          | HTTP + NDJSON | Asynchronous push-to-talk, in-app voice replies | Low-to-medium | Upload a short clip; receive streamed TTS back |
| **Real-time Voice (WebSocket)** | WebSocket     | Natural, full-duplex conversations              | Very low      | Bidirectional audio with VAD and interruption  |

For real-time details, see [Real-time Voice (WebSocket)](conversations-realtime.md).

{% hint style="warning" %}
**Phone-based voice**: these endpoints are for text-channel voice (push-to-talk notes and WebSocket streaming). For enterprise phone-based inbound and outbound calls with emotion detection and EHR integration, see [Platform API: Voice Agent](https://docs.amigo.ai/developer-guide/platform-api/platform-api/voice-agent).
{% endhint %}

## Voice Mode Comparison

{% @mermaid/diagram content="%%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 30, "rankSpacing": 40}, "theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
Start{Choose Voice Mode}

Start -->|Asynchronous<br/>Push-to-talk| HTTP[Voice Notes - HTTP]
Start -->|Real-time<br/>Full-duplex| WS[Real-time Voice - WebSocket]

subgraph HTTP_Flow["HTTP Voice Notes Flow"]
    H1[Record Audio Clip] --> H2[Upload to API]
    H2 --> H3[Process & TTS]
    H3 --> H4[Stream Audio Back]
end

subgraph WS_Flow["WebSocket Real-time Flow"]
    W1[Open Connection] --> W2[Bidirectional Audio Stream]
    W2 --> W3[VAD & Interruption Support]
    W3 --> W2
end

HTTP --> HTTP_Flow
WS --> WS_Flow

style HTTP fill:#D4E2E7,stroke:#083241,color:#100F0F,stroke-width:2px
style WS fill:#F0DDD9,stroke:#AA412A,color:#100F0F,stroke-width:2px
style Start fill:#DDE3DB,stroke:#2c3827,color:#100F0F,stroke-width:2px" %}

## Voice Notes (HTTP)

Treat each `/interact` call as an asynchronous voice-note exchange, not a full-duplex call.

### Request Essentials

1. Encode microphone audio as `WAV` (PCM) or `FLAC`.
2. POST as `recorded_message` with `request_format=voice`.
3. Set `response_format=voice` and choose `Accept`:
   * `audio/mpeg` (MP3): efficient for mobile playback
   * `audio/wav` (PCM): simple decoding, good for short clips
4. Read the NDJSON stream. `new-message` events contain base64 audio chunks.

> TypeScript SDK note: voice over HTTP is not yet supported in the TS SDK. Use direct API calls.
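
The request essentials above can be sketched with a direct `fetch` call. This is an illustrative sketch, not a definitive client: passing `request_format`/`response_format` as query parameters and the bearer-token header are assumptions, and `drainNdjson` is a hypothetical helper; consult the API Reference section for exact parameter placement.

```typescript
type InteractEvent = { type: string; message?: string; [k: string]: unknown };

// Split an accumulating text buffer into complete NDJSON events plus the
// trailing partial line (kept back until the next network chunk arrives).
function drainNdjson(buffer: string): { events: InteractEvent[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element may be an incomplete line
  const events = lines
    .filter((l) => l.trim().length > 0)
    .map((l) => JSON.parse(l) as InteractEvent);
  return { events, rest };
}

async function sendVoiceNote(org: string, conversationId: string, clip: Blob, token: string) {
  const form = new FormData();
  form.append("recorded_message", clip);

  const res = await fetch(
    `https://api.amigo.ai/v1/${org}/conversation/${conversationId}/interact` +
      `?request_format=voice&response_format=voice`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${token}`, Accept: "audio/mpeg" },
      body: form,
    },
  );

  // Read the NDJSON response stream incrementally.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { events, rest } = drainNdjson(buffer);
    buffer = rest;
    for (const evt of events) {
      if (evt.type === "new-message" && typeof evt.message === "string") {
        // evt.message is a base64 audio chunk; decode and queue for playback
      }
    }
  }
}
```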

### Sequence Diagram: Voice Note Exchange

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
autonumber
participant C as Customer System
participant A as Amigo REST API

C->>A: POST /v1/{org}/conversation/<br/>{conversation_id}/interact
Note over C,A: request_format=voice<br/>response_format=voice<br/>Accept: audio/wav | audio/mpeg<br/>Body: recorded_message (audio clip)
A-->>C: 200 OK (NDJSON stream)
loop NDJSON events
A-->>C: new-message (base64 audio chunk)
end
A-->>C: interaction-complete { interaction_id }" %}

### API Reference

{% openapi src="https://api.amigo.ai/v1/openapi.json" path="/v1/{organization}/conversation/{conversation_id}/interact" method="post" %}
https://api.amigo.ai/v1/openapi.json
{% endopenapi %}

### Minimal Client Handling (browser-friendly)

```ts
// `evt` is one parsed NDJSON event from the /interact response stream
if (evt.type === "new-message" && typeof evt.message === "string" && evt.message) {
  const bytes = Uint8Array.from(atob(evt.message), (c) => c.charCodeAt(0));
  playAudio(bytes.buffer); // your audio player implementation
}
```

### Tips

* Keep uploads short (a few seconds) for responsive turn-taking.
* Accumulate audio chunks from `new-message` into a single buffer for smooth playback.
* Use `interaction-complete` as the boundary between turns.
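
The accumulation tip above can be sketched as a small helper (an illustrative sketch, not part of any SDK): push each `new-message` chunk as it arrives, then flush the combined buffer to your player when `interaction-complete` marks the end of the turn.

```typescript
class AudioAccumulator {
  private chunks: Uint8Array[] = [];

  // Decode one base64 chunk from a `new-message` event and store it.
  push(base64Chunk: string): void {
    const bin = atob(base64Chunk);
    const bytes = new Uint8Array(bin.length);
    for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
    this.chunks.push(bytes);
  }

  // Concatenate everything received this turn and reset for the next one.
  flush(): Uint8Array {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    this.chunks = [];
    return out;
  }
}
```

A single contiguous buffer avoids the clicks and gaps that per-chunk playback can introduce.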

### Managing Perceived Latency

During voice interactions, the agent handles perceived latency automatically using **audio fillers** when operations take longer than expected.

#### How Audio Fillers Work

When an agent operation (decision-making, tool execution, or analysis) exceeds its configured timeout threshold, the system automatically:

1. Detects that the delay threshold has been exceeded (typically 2-10 seconds).
2. Selects a contextual audio filler phrase from the configured options.
3. Streams the pre-generated audio to maintain conversation flow.
4. Continues processing while the filler plays.

#### Example Flow

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
autonumber
participant User
participant API
participant Agent

User->>API: Voice request (audio)
API->>Agent: Process request
Note over Agent: Tool execution begins
Note over Agent: 2 seconds pass...
Agent-->>API: ActionTooLongEvent
API-->>User: Audio filler: "Let me look that up..."
Note over Agent: Tool completes
Agent-->>API: Response ready
API-->>User: Agent response (audio)" %}

#### Common Filler Scenarios

| Scenario                         | Example Fillers                                                      |
| -------------------------------- | -------------------------------------------------------------------- |
| **Designated Tool (end-to-end)** | "I'm looking that up for you...", "Searching now..."                 |
| **Decision-Making**              | "Let me think about that...", "Just a moment..."                     |
| **Reflection**                   | "Let me consider this carefully...", "Analyzing that information..." |
| **Helper Tools**                 | "Checking that...", "One moment...", "Let me verify..."              |

#### Benefits

* **Reduces perceived wait time** by providing active feedback.
* **Keeps conversation natural** instead of awkward silence.
* **Improves user experience** with contextual acknowledgments.
* **Automatic and transparent**: no client-side changes needed.

#### Handling in Code

Audio fillers arrive as `current-agent-action` events with type `action-too-long`:

```typescript
if (evt.type === "current-agent-action" && evt.action.type === "action-too-long") {
  // Audio filler contains base64 PCM audio (or text fallback)
  const audioFiller = evt.action.filler;
  playAudio(base64ToBytes(audioFiller));
}
```
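
The `base64ToBytes` helper used above is not defined in this document; a minimal browser-friendly sketch (in Node you could use `Buffer.from(b64, "base64")` instead):

```typescript
// Decode a base64 string into raw bytes for the audio player.
function base64ToBytes(b64: string): Uint8Array {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}
```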

{% hint style="info" %}
**Configuration**\
Audio fillers are configured in your service's **Context Graph** (API field: `service_hierarchical_state_machine`). Each Context Graph state type and tool can have custom filler phrases and timeout thresholds. See [Conversations: Events](https://docs.amigo.ai/developer-guide/classic-api/core-api/conversations-events#managing-perceived-latency-with-audio-fillers) for detailed configuration options.
{% endhint %}

{% hint style="warning" %}
**Best Practice: Keep `audio_filler_triggered_after` Close to Zero**

Set the delay to a very small value such as `0.0001` seconds (0.1 ms). Any delay adds directly to perceived latency, and since most operations complete quickly, a longer threshold hurts the majority of interactions. The schema requires a value `> 0`, but values below 1 ms are effectively instantaneous.
{% endhint %}

## Real-time Voice (WebSocket)

For low-latency, natural conversation with VAD and barge-in, use Real-time Voice (WebSocket). It supports continuous upstream audio, interruption handling, and streaming TTS.
