Text Sessions
Multi-turn text conversations over SMS, WhatsApp, and WebSocket - same reasoning engine, context graphs, and safety boundaries as voice calls.
Text sessions support two connection modes: a request-response HTTP mode for simple integrations, and a persistent WebSocket mode for low-latency conversational experiences.
HTTP Mode
The HTTP mode sends one message per request and receives the agent's response in the reply. This is the simplest integration path and works well for asynchronous workflows or server-to-server integrations where latency is less critical.
WebSocket Mode
The WebSocket mode opens a single persistent connection for the entire conversation. Instead of a separate HTTP round-trip for each turn, the client sends messages and receives streamed agent events over one connection. This reduces per-turn latency significantly compared to the HTTP mode.
WebSocket sessions are authenticated using a subprotocol header that carries the API key or access token. The connection URL accepts the service and entity as query parameters. Once connected, the server sends a session metadata event with the resolved conversation ID, followed by the agent's auto-greeting.
The client sends user messages as JSON frames. The server responds with a stream of turn events for each message - the same event types used in the streaming HTTP path. If the agent invokes tools during a turn, tool execution events are included in the stream when tool events are enabled.
The connection enforces an idle timeout. If no messages are sent within the timeout window, the server closes the connection with a notification. Clients can reconnect and resume the conversation using the same entity and service identifiers - the platform resolves the existing conversation thread automatically.
Error conditions during a turn (upstream failures, timeouts, malformed input) are reported as error events on the WebSocket rather than closing the connection, so the session remains usable after transient failures.
The text-stream WebSocket delivers typed frames for session lifecycle, agent messages, and tool call activity. Each frame carries a type field that identifies the event, so consumers can switch on frame type without maintaining local schema definitions. See the Developer Guide for the full frame reference.
WhatsApp Text Conversations
The platform supports text-based WhatsApp conversations where the agent receives and responds with text messages. Each WhatsApp number can be configured with a response modality - either voice note or text - controlling how the agent replies on that number.
When a number is configured for text responses:
Inbound text messages are processed through the text turn pipeline, and the agent replies with text.
Inbound voice notes on the same number receive a brief reply directing the user to send text instead.
When a number is configured for voice note responses (the existing behavior):
Inbound voice notes are processed through the voice turn pipeline, and the agent replies with audio.
Inbound text messages on the same number receive a brief reply directing the user to send a voice note instead.
Text turns use the same session continuity and agent pipeline as voice turns. Conversation state persists across turns for the same speaker, so multi-turn text conversations maintain full context. Concurrent messages from the same speaker are serialized - if a second message arrives while the first is still being processed, it is rejected to prevent race conditions in the conversation state.
Playground Testing
The Developer Console provides a unified playground for testing agents across multiple interaction modes from a single interface. Available modes depend on the service's channel type:
Text Chat - Turn-by-turn simulation that auto-starts when the service is ready. Useful for iterating on agent logic without external dependencies.
REST API - Sends messages through the production REST conversations endpoint, creating a durable conversation and submitting each message as a turn. An inspector panel shows the full request and response for every API exchange, including timing, so you can verify exactly what your integrations will see.
WebSocket - Real-time text streaming over a persistent WebSocket connection. Messages and agent responses appear as they are generated. The connection automatically reconnects (up to five attempts with exponential backoff) if interrupted.
Voice Call - Browser-based test call via WebSocket audio (available for voice-enabled services).
All modes share the Context Graph visualization and event timeline panels, and each mode supports an optional identifier (Caller ID, Entity ID, or test profile phone number) that is persisted across sessions.
Text conversations and voice calls are now accessible through a single unified conversations resource in the API. See the Developer Guide for endpoint details.
Text sessions let patients interact with the agent over SMS, WhatsApp, or WebSocket. The agent runs the same context graphs, calls the same tools, and follows the same safety rules regardless of channel. What changes is how messages get in and out - not how the agent reasons.
Everything you configure for voice - persona, context graphs, dynamic behaviors, memory, clinical tools - works identically over text. No separate bot, no reduced feature set.
Conversation Persistence
Every text conversation is stored as a single value with three parts:
Plan - A natural-language summary of where the conversation stands, written by the agent during compression. Plans are plain language, not structured data, so they stay readable across platform versions without migration.
Turns - Up to 200 verbatim messages between patient and agent, each carrying a role, text, and timestamp.
Cursor - A channel-specific position marker for resuming exactly where the conversation left off.
A conversation can be active, frozen, or closed. These are not separate operations - they are states derived from the conversation value itself.
Freezing and Resumption
When a conversation goes quiet, the platform freezes it. An LLM compresses the turn history into a plan - a paragraph that captures the participants, what has been promised, where things stand, what should happen next, and what questions are still open. The plan and recent verbatim turns are saved together.
When a new message arrives on a frozen conversation, the platform thaws it: loads the plan and last five turns, feeds them into the reasoning engine, and picks up where things left off. The agent honors every prior commitment. It does not greet the patient again - it simply responds to whatever they said. From the patient's side, the conversation never stopped.
Channels
SMS
Patients text the agent's phone number. The platform creates a session automatically, loads the patient's context from the world model, and the agent responds. Outbound SMS checks consent before sending and respects TCPA-compliant quiet hours.
WhatsApp
Patients message the agent's WhatsApp number. The conversation lives in the same WhatsApp thread, and session continuity is keyed on the patient's phone number with E.164 normalization. WhatsApp also supports voice notes - see Audio Input below.
WebSocket
A persistent bidirectional connection for web and mobile apps. A client opens a WebSocket, sends JSON messages, and receives agent responses with typing indicators while the agent composes. When the WebSocket disconnects, the conversation freezes. When the client reconnects with the same conversation ID, it thaws.
The WebSocket runs the same actor as SMS and WhatsApp. Client messages push signals to the actor's queue, the actor runs them through the reasoning engine, and responses stream back as JSON frames. The agent does not know which transport delivered the message.
The public WebSocket endpoint is workspace-scoped and authenticates via a Sec-WebSocket-Protocol subprotocol header so credentials never appear in URLs, browser history, or proxy logs. Resuming a session from a different device or transport is a matter of reconnecting with the same conversation ID.
REST
A synchronous REST endpoint runs the same reasoning pipeline as the WebSocket and returns the agent's response in one HTTP call. Use REST when the calling system prefers a standard request-response model, when the network path between client and platform does not allow long-lived connections, or when the integration is server-to-server.
The same endpoint also supports token-by-token streaming via Server-Sent Events: pass Accept: text/event-stream and the server emits a typed event sequence (token, tool_call_started, tool_call_completed, thinking, message, done) as the turn unfolds. SSE streaming gives consumer-facing UIs the same incremental feel as the WebSocket without a persistent connection, while keeping the simpler request-scoped lifecycle and authentication of REST. Concurrency, lock semantics, and persistence are identical to the synchronous JSON path - if the client disconnects mid-stream, the partial response is still saved and the next turn is unblocked.
See the Conversations API reference for the full WebSocket wire protocol, REST endpoint reference, the SSE event schema, connection parameters, error codes, and code examples.
Channel Timing
SMS
Phone number pair
30 minutes
2 hours
Phone number (E.164)
1 hour
24 hours
WebSocket
API-authenticated
5 minutes
1 hour
SMS and WhatsApp use phone-number-keyed sessions with residency windows tuned for asynchronous messaging. WebSocket uses API-authenticated sessions with shorter timeouts because web chat is interactive - if the patient walks away, the session should freeze quickly rather than holding resources.
How Sessions Start
Inbound SMS. A patient texts the agent's number. A session is created automatically with full patient context from the world model.
Inbound WhatsApp. A patient messages the agent's WhatsApp number. The platform validates the message, resolves the phone number to a workspace and service, and creates or resumes a session.
WebSocket connection. A client opens a WebSocket with workspace credentials, a service ID, and optionally a conversation ID to resume a prior session.
Gap scanner. When the platform detects missing patient data - an incomplete intake form, a lapsed screening - it can start a text conversation to collect it. If the data fits a structured format, the agent delivers a surface (web form) inline and the patient fills it out without leaving the conversation.
Outbound API. Your systems trigger text sessions directly - appointment reminders, care follow-ups, post-visit check-ins. See Outbound Patterns.
How It Works
The agent greets the patient, then runs a multi-turn conversation. Under the hood, the reasoning engine navigates context graph states and transitions exactly as it would on a phone call. Tools execute identically. Data writes to the world model at conversation confidence. The post-session review pipeline runs the same way.
Every external event - an inbound message, a delivery status update, a surface submission, a timeout - becomes a signal pushed to the session actor's queue. The actor processes signals one at a time through three steps:
Cut - Is this signal a turn boundary? Inbound messages and surface submissions are. Delivery status updates are not.
Navigate - The reasoning engine determines what to do: respond, call a tool, escalate, or complete.
Engage - The chosen action executes: send a text reply, wait for input, or end the session.
Processing signals sequentially eliminates race conditions. When a patient sends two messages in quick succession or a surface submission arrives while the agent is mid-response, everything is serialized and handled in order.
Message Coalescing
On SMS and WhatsApp, patients often send multiple messages in quick succession - splitting a thought across two texts, or adding a correction right after hitting send. Without coalescing, each message would trigger a separate agent response, creating a stuttering back-and-forth.
The platform handles this by draining accumulated messages after each turn. When the actor finishes responding to one message, it checks whether additional messages arrived during processing. If they did, it joins them into a single turn before running the next reasoning cycle. A patient who sends "I need an appointment" and then "for next Tuesday" a few seconds later gets one coherent response instead of two. Up to 10 messages can be coalesced into a single turn.
Coalescing is enabled for SMS and WhatsApp, where input cannot be gated. WebSocket sessions do not coalesce because the client controls send timing. Non-message signals (surface submissions, delivery status updates) are never coalesced - they are processed individually in their original order.
WebSocket. Each message is processed as a separate turn. The agent sees prior turns as conversation context, so responses remain coherent even without coalescing.
REST API. Only one turn can be in flight per conversation at a time. If a second turn request arrives while the first is still processing, it returns 409 Conflict. Wait for the current turn to complete before sending the next one. If a client timeout fires before the server finishes, the turn still runs to completion - the state is saved and the conversation freezes normally. The next GET shows the completed turn.
Cross-transport exclusion. A conversation can only have one active session at a time, regardless of transport. If a WebSocket session is active for a conversation, a REST turn request for the same conversation returns 409. Read operations (listing, detail) are never blocked - only message sends contend. Different conversations can process turns simultaneously.
Audio Input
Text sessions can accept audio recordings instead of typed text. Audio is transcribed server-side and fed through the same reasoning pipeline.
WhatsApp Voice Notes
When a patient sends a voice note on WhatsApp, the platform transcribes it, runs the full reasoning engine pipeline, synthesizes the agent's reply as spoken audio, and returns it as a voice note in the same thread. Patients hear a natural spoken response without making a phone call.
Switching between text and voice notes in the same thread preserves full conversation context. Concurrent voice notes from the same patient are serialized to prevent race conditions. Phone numbers are E.164-normalized so international numbers work correctly regardless of how the messaging provider formats them.
Web Audio
Web chat clients can also send audio recordings. Unlike WhatsApp voice notes, the agent responds with text rather than audio - audio input is a convenience for web chat, not a full voice channel.
Both audio modes are distinct from phone calls, which handle real-time bidirectional audio with barge-in, emotion detection, and filler speech.
What Is Different from Voice
Text sessions share the full reasoning stack with phone. A few things work differently.
No audio pipeline. No real-time speech-to-text, text-to-speech, filler speech, or barge-in. Messages arrive as text (or transcribed audio) and leave as text. WhatsApp voice notes are the exception - they go through transcription and synthesis but still operate turn-by-turn.
Text-only emotion detection. Voice calls analyze both language and vocal tone. Text sessions use language analysis only. The agent still adapts its tone based on detected sentiment, but has fewer signals to work with.
No operator join. Voice calls support conference-style operator takeover. Text sessions do not - operators can monitor in real time but do not join as participants. Escalation routes to a separate channel.
Asynchronous pacing. Patients respond when they can. Timeout windows range from minutes to days depending on the use case. An appointment confirmation might expire after two hours. A care gap outreach might stay open for a week.
Longer responses. Reading is faster than listening. The agent can include more detail per message - medication instructions, prep steps, follow-up lists - without the awkwardness of reading a paragraph aloud.
Session compression. Long conversations automatically compress older history to stay within model context limits. Clinical facts and conversation state are preserved while token count shrinks.
SMS Consent
All outbound SMS checks consent before sending. If a patient has opted out (by texting STOP or through another opt-out mechanism), outbound messages to that number are blocked. Consent is tracked per patient per workspace. Inbound messages from opted-out patients still create sessions, since the patient initiated contact.
Quiet Hours
Outbound text sessions respect TCPA-compliant quiet hours. No messages go out during restricted windows. Quiet hours are configurable per workspace to match local regulations and patient expectations.
Surface Delivery
When the gap scanner starts a text session to collect missing data, it can deliver a surface inline. Instead of dropping a bare link, the agent explains what data is needed and why, answers questions, and then shares the form link within the conversation thread.
When to Use Which Channel
Appointment reminders and confirmations
SMS or WhatsApp
Data collection (insurance, intake forms)
Text with surface
Complex scheduling with multiple options
Voice
Sensitive clinical conversations
Voice
Post-visit follow-up and care instructions
SMS or WhatsApp
Urgent outreach requiring immediate response
Voice
Asynchronous communication
SMS or WhatsApp
Speaking without a live call
WhatsApp voice notes
Regions where WhatsApp is the dominant messenger
Patient portal or mobile app chat
WebSocket
Internal tools with agent interaction
WebSocket
Backend integrations
REST or WebSocket
Most deployments use multiple channels. The choice comes down to urgency, complexity, patient preference, and regional messaging norms.
Last updated
Was this helpful?

