Voice

Amigo supports two voice modes. Choose the one that matches your UX and latency needs:

Voice Mode Comparison

| Mode | Transport | Best for | Latency | Notes |
| --- | --- | --- | --- | --- |
| Voice Notes (HTTP) | HTTP + NDJSON | Asynchronous push-to-talk, in-app voice replies | Low-to-medium | Upload a short clip; receive streamed TTS back |
| Real-time Voice (WebSocket) | WebSocket | Natural, full-duplex conversations | Very low | Bidirectional audio with VAD and interruption |

See Real-time Voice (WebSocket) (conversations-realtime.md) for real-time details.

Voice Notes (HTTP)

Treat each /interact call as an asynchronous voice-note exchange, not a full-duplex call.

Request Essentials

  1. Encode microphone audio as WAV (PCM) or FLAC.

  2. POST as recorded_message with request_format=voice.

  3. Set response_format=voice and choose Accept:

    • audio/mpeg (MP3): efficient for mobile playback

    • audio/wav (PCM): simple decoding, good for short clips

  4. Read the NDJSON stream; new-message events contain base64 audio chunks.
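Step 4 hinges on correctly splitting the NDJSON stream into events: lines arrive in arbitrary chunk boundaries, so a partial trailing line must be carried over to the next chunk. A minimal sketch (the helper name `parseNdjson` and the `InteractEvent` shape are illustrative, not part of the Amigo API):

```typescript
// One event per line; a chunk may end mid-line, so we return the leftover
// partial line for the caller to prepend to the next chunk.
interface InteractEvent {
  type: string;
  [key: string]: unknown;
}

function parseNdjson(
  carry: string,
  chunk: string,
): { events: InteractEvent[]; rest: string } {
  const lines = (carry + chunk).split("\n");
  const rest = lines.pop() ?? ""; // last element may be incomplete
  const events = lines
    .filter((l) => l.trim() !== "")
    .map((l) => JSON.parse(l) as InteractEvent);
  return { events, rest };
}
```

In a real client you would call this from a `ReadableStream` read loop, decoding each chunk with `TextDecoder` before feeding it in.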

TypeScript SDK note: Voice via HTTP is not yet supported in the TS SDK; use direct API calls.
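Until the TS SDK supports voice over HTTP, requests go directly to the endpoint documented below. A sketch of building the request URL from the path and query parameters in the API reference (the base URL and token handling are illustrative):

```typescript
// Builds the /interact URL from the documented path and query parameters.
// Pass the result to fetch() with the recorded audio as the body and an
// Authorization header, then stream the NDJSON response.
function interactUrl(
  base: string,
  organization: string,
  conversationId: string,
  requestFormat: "text" | "voice",
  responseFormat: "text" | "voice",
): string {
  const params = new URLSearchParams({
    request_format: requestFormat,
    response_format: responseFormat,
  });
  return `${base}/v1/${organization}/conversation/${conversationId}/interact?${params}`;
}
```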

Sequence Diagram: Voice Note Exchange

API Reference

Interact with a conversation

post

Send a new user message to the conversation. The endpoint will perform analysis and generate an agent message in response.

If request_format is text, the request body must be multipart/form-data with exactly one form field called recorded_message containing the UTF-8 encoded bytes of the text message. If request_format is voice, the entire request body must be the bytes of the voice recording in audio/wav or audio/mpeg (MP3) format. The body can be sent as a stream, and the endpoint will start processing chunks as they're received, which reduces latency.

A UserMessageAvailableEvent is the first event in the response; it includes the user message if it was sent as text, or the transcribed message if it was sent as voice. A series of CurrentAgentActionEvents follows, indicating steps in the agent's thinking process. The agent message is then generated sequentially in pieces, with each piece sent as a NewMessageEvent. After all the pieces are sent, an InteractionCompleteEvent is sent. The response might end here, or, if the conversation ends automatically (for instance, because the user message indicates the user wants to end the session), an EndSessionEvent is emitted, the conversation is marked as finished, and the post-conversation analysis is initiated asynchronously. The connection then terminates.
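The turn lifecycle above reduces naturally to a small state machine. A sketch — the event type strings "new-message" and "interaction-complete" appear in the examples on this page; the remaining kebab-case names and the field names beyond `type` are assumptions:

```typescript
// Accumulates agent-message pieces for one turn and marks the turn done
// on interaction-complete or error.
type TurnEvent =
  | { type: "user-message-available"; message?: string }
  | { type: "current-agent-action"; action?: string }
  | { type: "new-message"; message: string }
  | { type: "interaction-complete"; full_message?: string }
  | { type: "end-session" }
  | { type: "error"; detail?: string };

interface TurnState {
  pieces: string[];
  done: boolean;
  error?: string;
}

function reduceTurn(state: TurnState, evt: TurnEvent): TurnState {
  switch (evt.type) {
    case "new-message":
      return { ...state, pieces: [...state.pieces, evt.message] };
    case "interaction-complete":
      return { ...state, done: true };
    case "error":
      return { ...state, done: true, error: evt.detail ?? "unknown error" };
    default:
      // user-message-available, agent actions, end-session: no turn change
      return state;
  }
}
```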

Any further action on the conversation is only allowed after the connection is terminated.

A 200 status code doesn't indicate successful completion of this endpoint, because the status code is transmitted before the stream starts. At any point during the stream, an ErrorEvent might be sent, indicating that an error has occurred; the connection is closed immediately afterward.

This endpoint can only be called on a conversation that has started but not finished.

Permissions

This endpoint requires the following permissions:

  • User:UpdateUserInfo on the user who started the conversation.
  • Conversation:InteractWithConversation on the conversation.

This endpoint may be impacted by the following permissions:

  • CurrentAgentActionEvents are only emitted if the authenticated user has the Conversation:GetInteractionInsights permission.
Authorizations

Path parameters

  • conversation_id (string, required): The identifier of the conversation to send a message to. Pattern: ^[a-f0-9]{24}$
  • organization (string, required)

Query parameters

  • request_format (string · enum, required): The format in which the user message is delivered to the server. Possible values: text, voice
  • response_format (string · enum, required): The format of the response that will be sent to the user. Possible values: text, voice
  • current_agent_action_type (string, optional): A regex for filtering the type of the current agent action to return. By default, all are returned. If you don't want to receive any events, set this to a regex that matches nothing, for instance ^$. Default: ^.*$
  • request_audio_config (optional, nullable): Configuration for the user message audio. Required only if request_format is set to voice.
  • audio_format (string · enum, optional, nullable): The format of the audio response, if response_format is set to voice.

Header parameters

  • content-type (string, required): The content type of the request body, which must be multipart/form-data followed by a boundary. Pattern: ^multipart\/form-data; boundary=.+$
  • x-mongo-cluster-name (string, optional, nullable): The Mongo cluster name to perform this request in. This is usually not needed unless the organization does not exist yet in the Amigo organization infra config database.
  • Sec-WebSocket-Protocol (string[], optional): Default: []
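Two of the parameters above are regex-valued; a quick local sketch of how they behave (patterns copied verbatim from this reference, helper names illustrative):

```typescript
// The required shape of the content-type header for this endpoint.
const contentTypePattern = /^multipart\/form-data; boundary=.+$/;

// A current_agent_action_type value that matches nothing, suppressing
// all CurrentAgentActionEvents.
const matchNothing = /^$/;

function isValidContentType(header: string): boolean {
  return contentTypePattern.test(header);
}
```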
Responses

200

Succeeded. The response is a stream of events in JSON format separated by newlines (application/x-ndjson). The server transmits each event as soon as it is available, so the client should handle events as they arrive and keep listening until the server closes the connection. Each line is one of the event types described above (UserMessageAvailableEvent, CurrentAgentActionEvent, NewMessageEvent, InteractionCompleteEvent, EndSessionEvent, or ErrorEvent).
Example request (the boundary value is a placeholder; the content-type header must match the pattern above):

```http
POST /v1/{organization}/conversation/{conversation_id}/interact?request_format=text&response_format=text HTTP/1.1
Host: api.amigo.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-ORG-ID: YOUR_API_KEY
content-type: multipart/form-data; boundary=----boundary
Accept: */*
```

Example event from the response stream:

```json
{
  "type": "interaction-complete",
  "message_id": "text",
  "interaction_id": "text",
  "full_message": "text",
  "conversation_completed": true
}
```

Minimal Client Handling (browser-friendly)

```typescript
// evt is one parsed event from the NDJSON response stream
if (evt.type === "new-message" && typeof evt.message === "string" && evt.message) {
  const bytes = Uint8Array.from(atob(evt.message), (c) => c.charCodeAt(0));
  playAudio(bytes.buffer); // your audio player implementation
}
```

Tips

  • Keep uploads short (a few seconds) for responsive turn-taking.

  • Accumulate audio chunks from new-message to a single buffer for smooth playback.

  • Use interaction-complete as the boundary between turns.
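The second tip (accumulating chunks into a single buffer) can be sketched with standard APIs only; `atob` is available in browsers and Node 16+, and the helper names are illustrative:

```typescript
// Decode one base64 audio chunk from a new-message event into raw bytes.
function decodeBase64(b64: string): Uint8Array {
  return Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
}

// Concatenate all decoded chunks into one contiguous buffer for playback.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```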

Real-time Voice (WebSocket)

For low-latency, natural conversation with VAD and barge-in, use Real-time Voice (WebSocket). It supports continuous upstream audio, interruption handling, and streaming TTS.
