Voice
Amigo supports two voice modes. Choose the one that matches your UX and latency needs:
Voice Mode Comparison

| Mode | Transport | Best for | Latency | Summary |
| --- | --- | --- | --- | --- |
| Voice Notes (HTTP) | HTTP + NDJSON | Asynchronous push-to-talk, in-app voice replies | Low-to-medium | Upload a short clip; receive streamed TTS back |
| Real-time Voice (WebSocket) | WebSocket | Natural, full-duplex conversations | Very low | Bidirectional audio with VAD and interruption |

See real-time details in Real-time Voice (WebSocket) (conversations-realtime.md).
Voice Notes (HTTP)
Treat each `/interact` call as an asynchronous voice-note exchange, not a full-duplex call.
Request Essentials
- Encode microphone audio as WAV (PCM) or FLAC.
- POST it as `recorded_message` with `request_format=voice`.
- Set `response_format=voice` and choose an `Accept` type:
  - `audio/mpeg` (MP3): efficient for mobile playback
  - `audio/wav` (PCM): simple decoding, good for short clips
- Read the NDJSON stream; `new-message` events contain base64 audio chunks.
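The NDJSON response arrives as arbitrary byte chunks, so the client needs to carry partial trailing lines between reads before parsing. A minimal sketch of such a parser; the field names (`type`, `message`) follow the examples on this page and should be checked against the live API:

```typescript
// Minimal NDJSON stream parser with carry-over for partial lines.
// Field names (type, message) are taken from this page's examples.
type AmigoEvent = { type: string; message?: string };

class NdjsonParser {
  private carry = "";

  // Feed one decoded chunk of the response body; returns complete events.
  push(chunk: string): AmigoEvent[] {
    this.carry += chunk;
    const lines = this.carry.split("\n");
    this.carry = lines.pop() ?? ""; // keep any partial trailing line for the next chunk
    return lines
      .filter((l) => l.trim().length > 0)
      .map((l) => JSON.parse(l) as AmigoEvent);
  }
}
```

Feed it from a `ReadableStream` reader (with a `TextDecoder`) and dispatch each returned event as it completes.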
TypeScript SDK note: Voice via HTTP is not yet supported in the TS SDK; use direct API calls.
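Since the TS SDK does not yet cover voice over HTTP, the request can be assembled by hand. A sketch of building the URL and headers for a voice-note call; the host, header names, and query parameters follow the example request later on this page and may differ in your deployment:

```typescript
// Build the pieces of a direct /interact voice call (no SDK involved).
// Host and header names mirror this page's example request; verify against
// your own deployment before use.
function buildInteractRequest(opts: {
  org: string;
  conversationId: string;
  token: string;
  apiKey: string;
}): { url: string; headers: Record<string, string> } {
  const url =
    `https://api.amigo.ai/v1/${opts.org}/conversation/${opts.conversationId}` +
    `/interact?request_format=voice&response_format=voice`;
  return {
    url,
    headers: {
      Authorization: `Bearer ${opts.token}`,
      "X-ORG-ID": opts.apiKey,
      Accept: "audio/mpeg", // or "audio/wav" for simpler decoding
    },
  };
}

// Usage (not run here):
// fetch(url, { method: "POST", headers, body: wavBytes })
```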
Sequence Diagram: Voice Note Exchange
API Reference
Send a new user message to the conversation. The endpoint will perform analysis and generate an agent message in response.
If `request_format` is `text`, the request body must be `multipart/form-data` with precisely one form field, `recorded_message`, containing the UTF-8 encoded bytes of the text message. If `request_format` is `voice`, the entire request body must be the bytes of the voice recording in `audio/wav` or `audio/mpeg` (MP3) format. The body can be sent as a stream, and the endpoint will start processing chunks as they're received, which reduces latency.
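For text mode, the single-field multipart body can be assembled by hand (in browsers, `FormData` does this for you). A minimal sketch; the boundary string is arbitrary, and the field name `recorded_message` comes from the description above:

```typescript
// Assemble a single-field multipart/form-data body for a text-mode /interact call.
// The boundary value is arbitrary; it just must not appear in the payload.
function buildTextInteractBody(
  message: string,
  boundary: string
): { contentType: string; body: string } {
  const body =
    `--${boundary}\r\n` +
    `Content-Disposition: form-data; name="recorded_message"\r\n\r\n` +
    `${message}\r\n` +
    `--${boundary}--\r\n`;
  return { contentType: `multipart/form-data; boundary=${boundary}`, body };
}
```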
A `UserMessageAvailableEvent` will be the first event in the response; it includes the user message if it was sent as text, or the transcribed message if it was sent as voice. A series of `CurrentAgentActionEvent`s follows, indicating steps in the agent's thinking process. The agent message is then generated sequentially in pieces, with each piece sent as a `NewMessageEvent` in the response. After all the pieces are sent, an `InteractionCompleteEvent` is sent. The response might end here, or, if the conversation ends automatically (for instance, because the user message indicates the user wants to end the session), an `EndSessionEvent` is emitted, the conversation is marked as finished, and the post-conversation analysis is initiated asynchronously. The connection will then terminate. Any further action on the conversation is allowed only after the connection has terminated.
A 200 status code doesn't indicate successful completion of this endpoint, because the status code is transmitted before the stream starts. At any point during the stream, an `ErrorEvent` might be sent, indicating that an error has occurred. The connection will be closed immediately afterward.
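The event sequence above can be folded into a final transcript per turn. A sketch of such a reducer; the wire-format type strings follow the kebab-case names shown in this page's examples (`new-message`, `interaction-complete`) and are assumed, not verified, for the other events:

```typescript
// Fold one turn's event stream into the assembled message.
// Wire names for user-message-available, end-session, and error events are
// assumed kebab-case, matching the examples on this page.
type InteractEvent =
  | { type: "user-message-available"; message: string }
  | { type: "current-agent-action"; action: string }
  | { type: "new-message"; message: string }
  | { type: "interaction-complete"; full_message: string; conversation_completed: boolean }
  | { type: "end-session" }
  | { type: "error"; error: string };

function reduceTurn(events: InteractEvent[]): {
  pieces: string[];
  done: boolean;
  error?: string;
} {
  const pieces: string[] = [];
  let done = false;
  let error: string | undefined;
  for (const ev of events) {
    if (ev.type === "new-message") pieces.push(ev.message);
    else if (ev.type === "interaction-complete") done = true;
    else if (ev.type === "error") {
      error = ev.error;
      break; // the server closes the connection after an error
    }
  }
  return { pieces, done, error };
}
```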
This endpoint can only be called on a conversation that has started but not finished.
Permissions
This endpoint requires the following permissions:
- `User:UpdateUserInfo` on the user who started the conversation.
- `Conversation:InteractWithConversation` on the conversation.
This endpoint may be impacted by the following permissions:
- `CurrentAgentActionEvent`s are only emitted if the authenticated user has the `Conversation:GetInteractionInsights` permission.
The identifier of the conversation to send a message to.
^[a-f0-9]{24}$
The format in which the user message is delivered to the server.
The format of the response that will be sent to the user.
A regex for filtering the type of the current agent action to return. By default, all are returned. If you don't want to receive any events, set this to a regex that matches nothing, for instance `^$`.
^.*$
Configuration for the user message audio. This is only required if `request_format` is set to `voice`.

The format of the audio response, if `response_format` is set to `voice`.

The content type of the request body, which must be `multipart/form-data` followed by a boundary.

^multipart\/form-data; boundary=.+$
The Mongo cluster name to perform this request in. This is usually not needed unless the organization does not exist yet in the Amigo organization infra config database.
[]
Succeeded. The response will be a stream of events in JSON format separated by newlines. The server will transmit an event as soon as one is available, so the client should respond to the events as soon as one arrives, and keep listening until the server closes the connection.
The user message is empty, or the preferred language does not support voice transcription or response, or the `audio_format` field is not set when voice output is requested, or the timestamps for external event messages are not in the past, or the timestamps for external event messages are inconsistent with the conversation.
Invalid authorization credentials.
Missing required permissions.
Specified organization or conversation is not found.
The specified conversation is already finished, or a related operation is in progress.
The format of the supplied audio file is not supported.
A request path parameter failed validation.
The user has exceeded the rate limit of 15 requests per minute for this endpoint.
The service is going through temporary maintenance.
```http
POST /v1/{organization}/conversation/{conversation_id}/interact?request_format=text&response_format=text HTTP/1.1
Host: api.amigo.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-ORG-ID: YOUR_API_KEY
Content-Type: multipart/form-data; boundary=----boundary
Accept: */*
```
```json
{
  "type": "interaction-complete",
  "message_id": "text",
  "interaction_id": "text",
  "full_message": "text",
  "conversation_completed": true
}
```
Minimal Client Handling (browser-friendly)
```typescript
// evt is one parsed event from the NDJSON response stream
if (evt.type === "new-message" && typeof evt.message === "string" && evt.message) {
  const bytes = Uint8Array.from(atob(evt.message), (c) => c.charCodeAt(0));
  playAudio(bytes.buffer); // your audio player implementation
}
```
Tips
- Keep uploads short (a few seconds) for responsive turn-taking.
- Accumulate audio chunks from `new-message` into a single buffer for smooth playback.
- Use `interaction-complete` as the boundary between turns.
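Accumulating chunks into one contiguous buffer avoids gaps between decoded pieces. A minimal sketch for Node (in browsers, decode with `atob` as in the snippet above):

```typescript
// Concatenate base64 audio chunks from new-message events into a single
// contiguous buffer for gapless playback. Node-flavoured (uses Buffer).
function concatAudioChunks(base64Chunks: string[]): Uint8Array {
  const parts = base64Chunks.map((c) => Uint8Array.from(Buffer.from(c, "base64")));
  const total = parts.reduce((n, p) => n + p.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}
```

Hand the resulting buffer to your audio decoder once `interaction-complete` arrives, or feed it incrementally if your player supports streaming input.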
Real-time Voice (WebSocket)
For low-latency, natural conversation with VAD and barge-in, use Real-time Voice (WebSocket). It supports continuous upstream audio, interruption handling, and streaming TTS.