# Voice

Amigo supports two voice modes. Choose the one that matches your UX and latency needs:

| Mode                            | Transport     | Best for                                        | Latency       | Notes                                          |
| ------------------------------- | ------------- | ----------------------------------------------- | ------------- | ---------------------------------------------- |
| **Voice Notes (HTTP)**          | HTTP + NDJSON | Asynchronous push-to-talk, in-app voice replies | Low-to-medium | Upload a short clip; receive streamed TTS back |
| **Real-time Voice (WebSocket)** | WebSocket     | Natural, full-duplex conversations              | Very low      | Bidirectional audio with VAD and interruption  |

For real-time details, see [Real-time Voice (WebSocket)](conversations-realtime.md).

{% hint style="warning" %}
**Phone-based voice**: these endpoints are for text-channel voice (push-to-talk notes and WebSocket streaming). For enterprise phone-based inbound and outbound calls with emotion detection and EHR integration, see [Platform API: Voice Agent](https://docs.amigo.ai/developer-guide/platform-api/platform-api/voice-agent).
{% endhint %}

## Voice Mode Comparison

{% @mermaid/diagram content="%%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 30, "rankSpacing": 40}, "theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
Start{Choose Voice Mode}

Start -->|Asynchronous<br/>Push-to-talk| HTTP[Voice Notes - HTTP]
Start -->|Real-time<br/>Full-duplex| WS[Real-time Voice - WebSocket]

subgraph HTTP_Flow["HTTP Voice Notes Flow"]
    H1[Record Audio Clip] --> H2[Upload to API]
    H2 --> H3[Process & TTS]
    H3 --> H4[Stream Audio Back]
end

subgraph WS_Flow["WebSocket Real-time Flow"]
    W1[Open Connection] --> W2[Bidirectional Audio Stream]
    W2 --> W3[VAD & Interruption Support]
    W3 --> W2
end

HTTP --> HTTP_Flow
WS --> WS_Flow

style HTTP fill:#D4E2E7,stroke:#083241,color:#100F0F,stroke-width:2px
style WS fill:#F0DDD9,stroke:#AA412A,color:#100F0F,stroke-width:2px
style Start fill:#DDE3DB,stroke:#2c3827,color:#100F0F,stroke-width:2px" %}

## Voice Notes (HTTP)

Treat each `/interact` call as an asynchronous voice-note exchange, not a full-duplex call.

### Request Essentials

1. Encode microphone audio as `WAV` (PCM) or `FLAC`.
2. POST as `recorded_message` with `request_format=voice`.
3. Set `response_format=voice` and choose `Accept`:
   * `audio/mpeg` (MP3): efficient for mobile playback
   * `audio/wav` (PCM): simple decoding, good for short clips
4. Read the NDJSON stream. `new-message` events contain base64 audio chunks.

> TypeScript SDK note: voice over HTTP is not yet supported in the TS SDK. Use direct API calls.
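
The request essentials above can be sketched with a direct `fetch` call. This is an illustrative sketch, not a definitive client: passing `request_format`/`response_format` as query parameters and the bearer-token header are assumptions, and `drainNdjson` is a hypothetical helper; consult the API Reference section for exact parameter placement.

```typescript
type InteractEvent = { type: string; message?: string; [k: string]: unknown };

// Split an accumulating text buffer into complete NDJSON events plus the
// trailing partial line (kept back until the next network chunk arrives).
function drainNdjson(buffer: string): { events: InteractEvent[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element may be an incomplete line
  const events = lines
    .filter((l) => l.trim().length > 0)
    .map((l) => JSON.parse(l) as InteractEvent);
  return { events, rest };
}

async function sendVoiceNote(org: string, conversationId: string, clip: Blob, token: string) {
  const form = new FormData();
  form.append("recorded_message", clip);

  const res = await fetch(
    `https://api.amigo.ai/v1/${org}/conversation/${conversationId}/interact` +
      `?request_format=voice&response_format=voice`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${token}`, Accept: "audio/mpeg" },
      body: form,
    },
  );

  // Read the NDJSON response stream incrementally.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { events, rest } = drainNdjson(buffer);
    buffer = rest;
    for (const evt of events) {
      if (evt.type === "new-message" && typeof evt.message === "string") {
        // evt.message is a base64 audio chunk; decode and queue for playback
      }
    }
  }
}
```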

### Sequence Diagram: Voice Note Exchange

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
autonumber
participant C as Customer System
participant A as Amigo REST API

C->>A: POST /v1/{org}/conversation/<br/>{conversation_id}/interact
Note over C,A: request_format=voice<br/>response_format=voice<br/>Accept: audio/wav | audio/mpeg<br/>Body: recorded_message (audio clip)
A-->>C: 200 OK (NDJSON stream)
loop NDJSON events
A-->>C: new-message (base64 audio chunk)
end
A-->>C: interaction-complete { interaction_id }" %}

### API Reference

{% openapi src="https://api.amigo.ai/v1/openapi.json" path="/v1/{organization}/conversation/{conversation_id}/interact" method="post" %}
https://api.amigo.ai/v1/openapi.json
{% endopenapi %}

### Minimal Client Handling (browser-friendly)

```ts
// `evt` is one parsed NDJSON event from the /interact response stream
if (evt.type === "new-message" && typeof evt.message === "string" && evt.message) {
  const bytes = Uint8Array.from(atob(evt.message), (c) => c.charCodeAt(0));
  playAudio(bytes.buffer); // your audio player implementation
}
```

### Tips

* Keep uploads short (a few seconds) for responsive turn-taking.
* Accumulate audio chunks from `new-message` into a single buffer for smooth playback.
* Use `interaction-complete` as the boundary between turns.
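
The accumulation tip above can be sketched as a small helper (an illustrative sketch, not part of any SDK): push each `new-message` chunk as it arrives, then flush the combined buffer to your player when `interaction-complete` marks the end of the turn.

```typescript
class AudioAccumulator {
  private chunks: Uint8Array[] = [];

  // Decode one base64 chunk from a `new-message` event and store it.
  push(base64Chunk: string): void {
    const bin = atob(base64Chunk);
    const bytes = new Uint8Array(bin.length);
    for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
    this.chunks.push(bytes);
  }

  // Concatenate everything received this turn and reset for the next one.
  flush(): Uint8Array {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    this.chunks = [];
    return out;
  }
}
```

A single contiguous buffer avoids the clicks and gaps that per-chunk playback can introduce.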

### Managing Perceived Latency

During voice interactions, the agent handles perceived latency automatically using **audio fillers** when operations take longer than expected.

#### How Audio Fillers Work

When an agent operation (decision-making, tool execution, or analysis) exceeds its configured timeout threshold, the system automatically:

1. Detects that the delay threshold has been exceeded (typically 2-10 seconds).
2. Selects a contextual audio filler phrase from the configured options.
3. Streams the pre-generated audio to maintain conversation flow.
4. Continues processing while the filler plays.

#### Example Flow

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"actorBkg": "#083241", "actorTextColor": "#FFFFFF", "actorBorder": "#083241", "signalColor": "#575452", "signalTextColor": "#100F0F", "labelBoxBkgColor": "#F1EAE7", "labelBoxBorderColor": "#D7D2D0", "labelTextColor": "#100F0F", "loopTextColor": "#100F0F", "noteBkgColor": "#F1EAE7", "noteBorderColor": "#D7D2D0", "noteTextColor": "#100F0F", "activationBkgColor": "#E8E2EB", "activationBorderColor": "#083241", "altSectionBkgColor": "#F1EAE7", "altSectionColor": "#100F0F"}}}%%
sequenceDiagram
autonumber
participant User
participant API
participant Agent

User->>API: Voice request (audio)
API->>Agent: Process request
Note over Agent: Tool execution begins
Note over Agent: 2 seconds pass...
Agent-->>API: ActionTooLongEvent
API-->>User: Audio filler: "Let me look that up..."
Note over Agent: Tool completes
Agent-->>API: Response ready
API-->>User: Agent response (audio)" %}

#### Common Filler Scenarios

| Scenario                         | Example Fillers                                                      |
| -------------------------------- | -------------------------------------------------------------------- |
| **Designated Tool (end-to-end)** | "I'm looking that up for you...", "Searching now..."                 |
| **Decision-Making**              | "Let me think about that...", "Just a moment..."                     |
| **Reflection**                   | "Let me consider this carefully...", "Analyzing that information..." |
| **Helper Tools**                 | "Checking that...", "One moment...", "Let me verify..."              |

#### Benefits

* **Reduces perceived wait time** by providing active feedback.
* **Keeps conversation natural** instead of awkward silence.
* **Improves user experience** with contextual acknowledgments.
* **Automatic and transparent**: no client-side changes needed.

#### Handling in Code

Audio fillers arrive as `current-agent-action` events with type `action-too-long`:

```typescript
if (evt.type === "current-agent-action" && evt.action.type === "action-too-long") {
  // Audio filler contains base64 PCM audio (or text fallback)
  const audioFiller = evt.action.filler;
  playAudio(base64ToBytes(audioFiller));
}
```
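
The `base64ToBytes` helper used above is not defined in this document; a minimal browser-friendly sketch (in Node you could use `Buffer.from(b64, "base64")` instead):

```typescript
// Decode a base64 string into raw bytes for the audio player.
function base64ToBytes(b64: string): Uint8Array {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}
```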

{% hint style="info" %}
**Configuration**\
Audio fillers are configured in your service's **Context Graph** (API field: `service_hierarchical_state_machine`). Each Context Graph state type and tool can have custom filler phrases and timeout thresholds. See [Conversations: Events](https://docs.amigo.ai/developer-guide/classic-api/core-api/conversations-events#managing-perceived-latency-with-audio-fillers) for detailed configuration options.
{% endhint %}

{% hint style="warning" %}
**Best Practice: Keep `audio_filler_triggered_after` Close to Zero**

Set the delay to a very small value such as `0.0001` seconds (0.1 ms). Any delay adds directly to perceived latency, and since most operations complete quickly, a longer threshold hurts the majority of interactions. The schema requires a value `> 0`, but values below 1 ms are effectively instantaneous.
{% endhint %}

## Real-time Voice (WebSocket)

For low-latency, natural conversation with VAD and barge-in, use Real-time Voice (WebSocket). It supports continuous upstream audio, interruption handling, and streaming TTS.
