# Data & World Model

The Platform API provides access to a **unified, event-sourced world model** that aggregates entities, events, and relationships from all connected systems - EHRs, voice conversations, manual imports, and AI enrichment. Every piece of data flows through a single event store, and entity state is always a computed projection of events.

{% hint style="info" %}
**Different from Classic API data access** - The Classic API provides [SQL access](https://docs.amigo.ai/developer-guide/classic-api/data-access) to organization-scoped relational tables. The Platform API uses an event-sourced world model that unifies data from multiple sources (EHR, voice, imports) with confidence scoring.
{% endhint %}

## Design Thesis

{% hint style="info" %}
Agents read from and write to the world model directly during conversations. Data arrives from any source, agents structure and extend it, and new entity types require no schema migration. The world model evolves with what agents discover.
{% endhint %}

The world model is designed around a clear separation of **foundational invariants** (constraints that hold regardless of how capable AI models become) and **implementation** (strategies that work given current model constraints but will evolve as capabilities grow):

| Category           | Examples                                                                                             | Durable as Models Improve?                                                                                      |
| ------------------ | ---------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| **Foundational**   | Event sourcing, confidence-based truth, projection as pure function, open schema                     | Yes - temporal reasoning, audit trails, and conflict resolution remain necessary regardless of model capability |
| **Implementation** | Ambient/queried/extracted channels, session-scoped caches, tool interfaces, pre-computed projections | Evolves - as context windows and reasoning improve, more data becomes ambient and fewer tools are needed        |

**The design test for every architectural decision**: *"Would this still make sense with dramatically more capable models?"* If yes, it's foundational. If no, it's implementation - necessary now, but designed to evolve gracefully.

## Foundational Invariants

These are constraints that hold regardless of agent capability:

### 1. Events Are the Only Source of Truth

Entity state is ALWAYS a computed projection from events. There is no direct mutation path. The only way to change entity state is: `insert event → recompute projection`. This eliminates:

* **Concurrent mutation data loss** - two agents updating the same record, last write wins, first write silently lost
* **Two-source-of-truth confusion** - truth is always in the events, never in the projection
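The `insert event → recompute projection` path above can be sketched in a few lines. This is a hypothetical in-memory model for illustration (the class and method names are not the platform's actual API):

```python
from dataclasses import dataclass

@dataclass
class Event:
    entity_id: str
    event_type: str
    data: dict

class EventStore:
    def __init__(self):
        self._events: list[Event] = []        # append-only log
        self._projections: dict[str, dict] = {}

    def insert(self, event: Event) -> None:
        self._events.append(event)            # events are never edited in place
        self._recompute(event.entity_id)      # projection is derived, not mutated

    def _recompute(self, entity_id: str) -> None:
        # Fold all events for this entity into a state dict.
        state: dict = {}
        for e in self._events:
            if e.entity_id == entity_id:
                state.update(e.data)
        self._projections[entity_id] = state

    def state_of(self, entity_id: str) -> dict:
        return self._projections.get(entity_id, {})

store = EventStore()
store.insert(Event("pat-1", "patient.created", {"name": "Ana"}))
store.insert(Event("pat-1", "patient.updated", {"phone": "555-0100"}))
assert store.state_of("pat-1") == {"name": "Ana", "phone": "555-0100"}
```

Because every write goes through `insert`, there is no code path where two writers can silently overwrite each other's state: both events land in the log, and the projection reflects both.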

### 2. Events Are Append-Only and Immutable

Once written, an event is never modified. New information creates a new event that `supersedes` the old one. The old event remains for:

* **Audit** - "why did we believe X?"
* **Temporal debugging** - "what did we know at 2:03pm?"
* **Counterfactual reasoning** - "what if this hadn't happened?"
* **Undo** - "roll back this decision"

### 3. Entity State Is a Pure Function of Events

`state = f(events)`. Same events → same state. The projection function is deterministic. Multiple agents can trigger recomputation concurrently - the result is always consistent because it reads all current events and writes the result atomically.

### 4. Confidence Resolves Conflicts, Not Timestamps

When two sources disagree (e.g., a voice transcription vs. an EHR API), the projection takes the highest-confidence data. Confidence is a **source-class ranking**, not a subjective score. Within the same confidence class, most-recent wins.
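The resolution rule can be expressed as a small pure function. A hedged sketch (the field names and the fold are assumptions; the real projection is richer), which also demonstrates invariant 3 - same events in, same state out:

```python
def project(events):
    # Sort so the winning value is applied last: ascending confidence,
    # then ascending sequence number (recency) as the tiebreak.
    ordered = sorted(events, key=lambda e: (e["confidence"], e["seq"]))
    state = {}
    for e in ordered:
        state.update(e["data"])
    return state

events = [
    {"seq": 1, "confidence": 0.3, "data": {"phone": "555-0100"}},  # voice transcript
    {"seq": 2, "confidence": 1.0, "data": {"phone": "555-0199"}},  # EHR API
    {"seq": 3, "confidence": 0.3, "data": {"phone": "555-0111"}},  # later transcript
]
assert project(events)["phone"] == "555-0199"   # highest confidence wins
# Pure function: input order doesn't matter, only the events themselves.
assert project(list(reversed(events))) == project(events)
```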

### 5. Open Schema

`entity_type` and `event_type` are free-form text, not enums. An agent discovering a new kind of entity or observation doesn't need a migration. **The schema adapts to what intelligence discovers, not the other way around.**

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Sources["Data Sources (Any Format, Any Domain)"]
    S1["EHR Systems\n(FHIR, proprietary APIs)"]
    S2["Voice Agent\n(call events, clinical writes)"]
    S3["FHIR Import\n(bundles, bulk data)"]
    S4["AI Enrichment\n(LLM analysis, embeddings)"]
    S5["Browser Agents\n(portal scraping, UI automation)"]
    S6["Managed Connectors\n(CDC, webhooks, REST)"]
    S7["Agent-Produced\n(derived insights,\nrelationship discovery)"]
end

subgraph EventStore["Universal Event Store (append-only, immutable)"]
    E["Events\n• confidence-scored\n• source-attributed\n• temporally ordered\n• supersedes chain"]
end

subgraph Projections["Computed Projections (deterministic)"]
    P1["Patient\nState"]
    P2["Appointment\nState"]
    P3["Practitioner\nState"]
    P4["Location\nState"]
    P5["Call\nState"]
    P6["Operator\nState"]
    P7["Outbound Task\nState"]
    P8["... any type"]
end

subgraph Entities["Entity Registry"]
    EN["Entities\n(projected state +\nvector embeddings)"]
    EG["Entity Graph\n(directed relationships)"]
end

subgraph Consumers["Consumers"]
    VA["Voice Agent\n(real-time reads)"]
    CR["Connector Runner\n(outbound sync)"]
    AN["Analytics\n(cross-workspace)"]
    AG["Autonomous Agents\n(reasoning + action)"]
end

Sources --> E
E -->|"projection\n(confidence DESC,\nrecency tiebreak)"| Projections
Projections --> EN
EN --- EG
EN --> Consumers" %}

## Event Sourcing Model

Every fact in the system - a patient's name, an appointment booking, an insurance verification, an agent-produced insight - is stored as an immutable **event** with a confidence score:

| Field           | Description                                                                                     |
| --------------- | ----------------------------------------------------------------------------------------------- |
| `entity_type`   | What kind of entity - free-form text (patient, appointment, practitioner, location, call, etc.) |
| `event_type`    | What happened - free-form text (patient.created, appointment.booked, coverage.verified, etc.)   |
| `data`          | Event payload (structured data, FHIR resources, agent observations, etc.)                       |
| `confidence`    | Reliability score (0.0-1.0) - determines projection priority and sync eligibility               |
| `source`        | Origin system (`voice_agent`, `connector_runner`, `ehr_sync`, `transcript_extraction`, etc.)             |
| `is_current`    | Whether this is the latest version (managed by storage layer on supersede)                               |
| `effective_at`  | When the fact was true in the real world                                                                 |
| `review_status` | Pipeline state: `pending`, `auto_approved`, `llm_approved`, `human_approved`, `rejected`, `corrected`    |
| `embedding`     | Vector embedding for semantic search (auto-generated in background)                             |

### Confidence as Trust Architecture

Every event carries a confidence score that determines its reliability, downstream behavior, and sync eligibility. This is the core trust architecture for a system where autonomous agents act on noisy real-world data:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Sources["Data Sources"]
    A["Authoritative API\n(EHR system)"]
    B["Human correction"]
    C["Human approval"]
    D["AI verification\n(LLM judge)"]
    E["Voice agent\n(noisy audio)"]
    F["Agent inference"]
    G["Rejected/junk"]
end

subgraph Confidence["Confidence Scale"]
    C1["1.0 - Authoritative"]
    C2["0.9 - Human-approved"]
    C3["0.7 - AI-verified"]
    C4["0.5 - Uncertain"]
    C5["0.3 - Pending review"]
    C6["0.0 - Rejected"]
end

subgraph Gate["Sync Eligibility"]
    SY["≥ Verified → Syncs to EHR"]
    NO["< Verified → Held"]
end

A --> C1
B --> C1
C --> C2
D --> C3
E --> C5
F --> C4
G --> C6

C1 & C2 & C3 --> SY
C4 & C5 & C6 --> NO" %}

| Level              | Confidence | Meaning                                              | Syncs to EHR? | Source Examples                |
| ------------------ | ---------- | ---------------------------------------------------- | ------------- | ------------------------------ |
| **Authoritative**  | 1.0        | From source system or human-corrected                | Yes (auto)    | EHR API, human correction      |
| **Human-approved** | 0.9        | Confirmed by a human operator or reviewer            | Yes (auto)    | Review queue approval          |
| **Verified**       | 0.7        | AI-verified via cross-reference and coherence checks | Yes (auto)    | LLM judge pass                 |
| **Uncertain**      | 0.5        | Partially corroborated                               | No            | Agent inference, partial match |
| **Pending**        | 0.3        | Awaiting automated review                            | No            | Raw voice agent write          |
| **Rejected**       | 0.0        | Discarded or explicitly rejected                     | Never         | Junk call, hallucination       |

Confidence scores **resolve conflicts** in projections - if two events disagree about a patient's phone number, the higher-confidence one wins. Among equal confidence, the most recent one wins.

**Why continuous confidence, not binary staging**: Binary (draft/production) forces a hard cutover. With continuous confidence, different consumers set their own thresholds: the voice agent trusts its own writes at any confidence (session-scoped), the world model API serves data above a minimum threshold, outbound sync requires verified+, and researchers can query everything. One table, many views.
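The "one table, many views" idea reduces to each consumer filtering the same event stream at its own threshold. A sketch under stated assumptions - the consumer names and threshold values mirror the table above but are illustrative, not configuration the platform exposes:

```python
# Each consumer reads the same events at its own minimum confidence.
THRESHOLDS = {
    "voice_agent": 0.0,      # trusts its own session-scoped writes
    "world_model_api": 0.5,  # serves data above a minimum threshold
    "outbound_sync": 0.7,    # verified and above is eligible to sync
    "research": 0.0,         # queries everything
}

def view(events, consumer):
    return [e for e in events if e["confidence"] >= THRESHOLDS[consumer]]

events = [
    {"type": "coverage.verified", "confidence": 0.7},
    {"type": "patient.updated", "confidence": 0.3},
]
assert len(view(events, "outbound_sync")) == 1  # only the verified event syncs
assert len(view(events, "voice_agent")) == 2    # the agent sees both
```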

See [Voice Agent - Multi-Stage Verification](https://docs.amigo.ai/developer-guide/platform-api/voice-agent#multi-stage-verification) for how voice agent writes progress through the confidence pipeline.

### Supersedes Chain

When a new event replaces an older one (e.g., updating a patient's phone number), the new event references the old one via `supersedes`. The storage layer sets `is_current=false` on the superseded event. Projections only consider current events, ensuring state is always up-to-date while maintaining the full history for audit.
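A minimal sketch of the supersede flow described above (the event shape and helper are hypothetical; the real storage layer does this transactionally):

```python
def supersede(log, new_event):
    # Retire the superseded event - it stays in the log, never deleted.
    for e in log:
        if e["id"] == new_event.get("supersedes"):
            e["is_current"] = False
    log.append({**new_event, "is_current": True})

log = [{"id": "ev-1", "data": {"phone": "555-0100"}, "is_current": True}]
supersede(log, {"id": "ev-2", "supersedes": "ev-1",
                "data": {"phone": "555-0199"}})

# Projections consider only current events...
assert [e["id"] for e in log if e["is_current"]] == ["ev-2"]
# ...while the full history remains for audit.
assert len(log) == 2
```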

## Three Channels of Data Flow

The voice agent (and other consumers) interact with the world model through three distinct channels. This architecture optimizes how AI models work with the world model given current context and reasoning constraints - as models improve, more data naturally shifts toward the ambient channel:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph WM["World Model"]
    EV["Events"]
    EN["Entities"]
end

subgraph Ambient["Channel 1: Ambient (pushed)"]
    A1["Patient state → system prompt"]
    A2["Location context → system prompt"]
    A3["Related entities → system prompt"]
    A4["Refreshed mid-call on data change"]
end

subgraph Queried["Channel 2: Queried (pulled)"]
    Q1["find_patient(name, DOB)"]
    Q2["find_slots(date, type)"]
    Q3["find_appointments(patient)"]
    Q4["semantic_search(query)"]
end

subgraph Extracted["Channel 3: Extracted (captured)"]
    E1["Insurance details from speech"]
    E2["Contact info from conversation"]
    E3["Preferences mentioned in passing"]
    E4["Written at moderate confidence"]
end

EN -->|"Pushed into\nsystem prompt"| Ambient
Queried <-->|"Tool calls\n↔ results"| EN
Extracted -->|"Implicit capture\n→ events"| EV

subgraph LLM["LLM Context"]
    CTX["Ambient context +\ntool results +\nconversation history"]
end

Ambient --> CTX
Queried --> CTX" %}

### Channel 1: Ambient (Pushed)

Data the LLM should always have without asking. Injected into the system prompt and refreshed as the conversation evolves.

**Design principle: ambient over queried.** If the LLM will almost certainly need this data, push it into context. Don't make it ask. This reduces tool calls, lowers latency, and makes conversations more natural.

**What's ambient today:**

* Caller identity + patient state (resolved at session start from phone number)
* Context graph definition (loaded once)
* Workspace and service config
* Related entities (upcoming appointments, recent encounters)
* Location context (from inbound phone number → workspace → location mapping)

### Channel 2: Queried (Pulled)

Data that can't be ambient because the search space is too large. The LLM asks for it via tool calls.

**Key design**: Queried tools return human-readable results. Slot search returns doctor names and times, not template IDs. The system caches scheduling internals. The LLM gets exactly what a receptionist would see on their screen.

### Channel 3: Extracted (Captured)

Data the LLM expresses through conversation that the system captures without explicit tool calls. This is the deepest architectural change - it eliminates the mode switch where the LLM stops being a conversationalist and becomes a database operator.

The LLM says: *"So that's Guardian vision insurance, member ID 12345678, and you're the primary subscriber. I've recorded that."* A parallel extraction listener captures the structured data and writes it to the world model at moderate confidence.

**Why this is hard**: Extraction reliability. If the system misparses a member ID, it writes bad data. **Mitigation**: Extracted data is written at moderate confidence (below verified threshold). Explicit write tools remain for high-stakes data. Extraction is a complement to tools, not a replacement.
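The extraction listener can be sketched as a parser that turns an utterance into a moderate-confidence event. Everything here is illustrative - the regex, event shape, and the 0.5 confidence value are assumptions, not the production extraction pipeline:

```python
import re

def extract_member_id(utterance):
    # Naive pattern for the sketch; real extraction is model-driven.
    m = re.search(r"member ID (\d{8})", utterance)
    if not m:
        return None
    return {
        "event_type": "coverage.captured",
        "data": {"member_id": m.group(1)},
        "confidence": 0.5,                  # moderate: below verified (0.7)
        "source": "transcript_extraction",
    }

event = extract_member_id(
    "So that's Guardian vision insurance, member ID 12345678, "
    "and you're the primary subscriber. I've recorded that."
)
assert event["data"]["member_id"] == "12345678"
assert event["confidence"] < 0.7    # held from outbound sync until reviewed
```

Writing below the verified threshold is exactly the mitigation described above: a misparse can enter the world model, but it cannot reach an external system without passing review.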

## Entity Types

The world model supports an **open-ended** set of entity types - `entity_type` is free-form text, not an enum. No schema migration is needed to add new types. An agent discovering a new kind of entity creates events with the new type, and the system handles it.

| Entity Type     | What It Represents                   | Common Roles / Key State Fields                                                                                                                                                                                                         |
| --------------- | ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `person`        | Any individual in the system         | Patient (demographics, conditions, medications, insurance), practitioner (profile, NPI, specialty), operator (availability, performance). A single person entity can hold multiple roles - the projection detects roles from event data |
| `organization`  | A company or healthcare organization | Practice, payer, employer. Name, contacts, identifiers                                                                                                                                                                                  |
| `place`         | A physical site                      | Clinic, hospital, office. Address, phone, hours, status                                                                                                                                                                                 |
| `appointment`   | A scheduled visit                    | Status, participants, times, cancellation, confirmation                                                                                                                                                                                 |
| `deal`          | A business opportunity or engagement | Status, pipeline stage, value, contacts                                                                                                                                                                                                 |
| `call`          | A voice conversation                 | Context, escalation history, safety state, human segments, audit                                                                                                                                                                        |
| `outbound_task` | A scheduled outreach task            | Lifecycle, retry state, scheduling, dispatch history                                                                                                                                                                                    |

{% hint style="info" %}
FHIR resource types on events (Patient, Practitioner, Location, etc.) remain unchanged - the ontological type applies at the entity level, while FHIR types describe the data format of individual events.
{% endhint %}

## Entity Graph

Entities are connected by directed relationship edges, discovered automatically from data:

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
P["Person\n(patient role)"] -->|"participant"| A["Appointment"]
A -->|"practitioner"| PR["Person\n(practitioner role)"]
A -->|"location"| L["Place"]
P -->|"primary_care"| PR
P -->|"coverage"| INS["Insurance"]
C["Call"] -->|"caller"| P
C -->|"escalation"| OP["Person\n(operator role)"]
OT["Outbound Task"] -->|"patient"| P
OT -->|"call"| C
D["Deal"] -->|"contact"| P
D -->|"organization"| ORG["Organization"]" %}

Relationships are discovered from:

1. **Reference parsing** - Automatically extracted during entity resolution (e.g., Appointment.participant → Patient)
2. **Explicit events** - `relationship.established` events created by connectors or agents
3. **Agent discovery** - Autonomous agents identify relationships through data analysis

Relationships are themselves events - discoverable, versioned, and confidence-scored.
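Concretely, an edge can be written as an ordinary event. The payload below is an illustrative shape, not the exact schema - only the `relationship.established` event type comes from this page:

```python
# A directed edge stored as an event: versioned, source-attributed, and
# confidence-scored like any other write. Field names are assumptions.
relationship = {
    "event_type": "relationship.established",
    "data": {
        "from_entity": "appointment/apt-1",
        "edge": "participant",          # directed: appointment -> person
        "to_entity": "person/pat-1",
    },
    "confidence": 1.0,                  # parsed from an authoritative reference
    "source": "entity_resolution",
}
assert relationship["data"]["edge"] == "participant"
```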

## Agent-Produced Knowledge (The Closed Loop)

Agents analyze incoming data and produce derived events - insights, correlations, predictions, corrections. These derived events live in the same event table as raw data, linked to their source evidence. Other agents read and act on them, creating a feedback loop.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
A["Agent analyzes\npatient data"] --> B["Derived insight:\n'Monday 2-4pm callers\nhave 3x no-show rate'"]
B --> C["Insight stored\nas event"]
C --> D["Voice agent reads\nduring Monday 3pm call"]
D --> E["Proactively offers to\nconfirm appointment"]
E --> F["Confirmation event\nstored"]
F --> G["Future agent measures\nif intervention reduced\nno-shows"]
G --> A" %}

## Two-Tier Vector Search

The world model uses semantic vector search at two tiers for different latency and scale requirements:

| Tier          | Purpose                     | Latency  | Scale                  | Used By                                                                                |
| ------------- | --------------------------- | -------- | ---------------------- | -------------------------------------------------------------------------------------- |
| **Hot path**  | Per-workspace, real-time    | Sub-50ms | Thousands of entities  | Voice agents during calls - patient lookup, related events, context retrieval          |
| **Cold path** | Cross-workspace, analytical | Seconds  | Billions of embeddings | Autonomous agents for population analytics, pattern discovery, cross-domain similarity |

**Embedding pipeline**: Embeddings are generated on every event and entity write via background tasks - zero added latency to the write path. Multiple embedding providers support different modalities:

| Provider Type      | Dimensions | Capabilities                                                                                              |
| ------------------ | ---------- | --------------------------------------------------------------------------------------------------------- |
| **Text embedding** | 1536       | Default - text content from events and entities                                                           |
| **Regional text**  | 768        | Region-local processing for compliance workloads                                                          |
| **Multimodal**     | 768-3072   | Text + audio + image - captures paralinguistic features (tone, urgency, emotion) that transcription loses |

Entity state is serialized to structured text before embedding. Events are embedded with metadata (type, subtype, source, entity context) for richer semantic matching. Voice call audio can be embedded directly via multimodal providers for semantic search over conversations.

## Data Sources

External data feeds that push or pull data into the workspace:

| Strategy     | Behavior                             | Example                 |
| ------------ | ------------------------------------ | ----------------------- |
| `continuous` | Polled on a frequent interval        | EHR patient updates     |
| `scheduled`  | Polled on a cron schedule            | Daily reporting         |
| `manual`     | Only synced via explicit API trigger | One-time imports        |
| `webhook`    | Receives push events (no polling)    | Real-time notifications |

Each data source supports status checks, sync history, and health monitoring. The [connector runner](https://docs.amigo.ai/developer-guide/platform-api/connector-runner) manages the full lifecycle of data source polling and sync.

The data sources list endpoint accepts repeatable `source_type` query parameters for filtering by multiple source types in a single request (e.g., `?source_type=ehr&source_type=smart_fhir`).
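A repeated query parameter can be built with the standard library. The base URL and workspace ID below are placeholders; only the query-string shape is the point:

```python
from urllib.parse import urlencode

# A sequence of (key, value) pairs yields a repeatable parameter.
params = [("source_type", "ehr"), ("source_type", "smart_fhir")]
query = urlencode(params)
url = f"https://api.example.com/v1/ws-123/data-sources?{query}"

assert query == "source_type=ehr&source_type=smart_fhir"
```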

### Self-Service Secret Provisioning

For `smart_fhir` data sources, the create and update endpoints support inline secret provisioning. Pass `private_key_value` in the `connection_config` and the API automatically provisions the key to secure storage, storing only an SSM path reference in the data source record. API responses redact all secret fields (`*_value`, `*_pem`, `*_secret`, `*_password`) and mask SSM paths. This eliminates the need for manual secret provisioning when connecting SMART-compliant EHR systems.

## Unification Rules

Rules that map and transform data from external sources into the workspace's unified entity model. Define field mappings, source-to-target type conversions, and activation state per rule. When no rules match, raw records are stored as `raw_record` events - data is never lost, just not yet unified.

## Data Quality Pipeline

Data quality is a **pipeline, not a gate**. Every piece of data has a confidence level that the system continuously improves through multiple passes. Different consumers read at different confidence thresholds.

{% @mermaid/diagram content="%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#D4E2E7", "primaryTextColor": "#100F0F", "primaryBorderColor": "#083241", "lineColor": "#575452", "textColor": "#100F0F", "clusterBkg": "#F1EAE7", "clusterBorder": "#D7D2D0"}}}%%
flowchart TB
subgraph Write["Data Capture"]
    VA["Voice agent writes\n(confidence: pending)"]
    CR["Connector runner writes\n(confidence: authoritative)"]
    EX["Transcript extraction\n(confidence: moderate)"]
end

subgraph Review["Automated Review Pipeline"]
    CL["Call Classifier\n(junk filter)"]
    J1["Per-Event LLM Judge\n(cross-reference transcript\n+ entity state)"]
    J2["Session Coherence\n(narrative consistency)"]
end

subgraph Decisions["Outcomes"]
    AP["✅ Verified\n(sync-eligible)"]
    FL["⚠️ Flagged\n(→ review queue)"]
    RJ["❌ Rejected\n(discarded)"]
    AC["🔧 Auto-corrected\n(formatting fixes)"]
end

subgraph Human["Human Review"]
    HR["Operator reviews\nflagged items"]
    HRD{"Decision"}
    HA["Approve → high confidence"]
    HC["Correct → new event supersedes"]
    HRJ["Reject → confidence 0"]
end

subgraph Sync["Outbound Sync"]
    GATE{"Confidence\n≥ verified?"}
    OUT["Sync to external system\n(via handler registry)"]
    HOLD["Hold"]
end

VA --> CL
CL -->|"Junk"| RJ
CL -->|"Real"| J1
J1 -->|"Valid"| AP
J1 -->|"Correctable"| AC --> AP
J1 -->|"Uncertain"| FL
J1 -->|"Bad"| RJ
AP --> J2
J2 -->|"Coherent"| GATE
J2 -->|"Contradictions"| FL
FL --> HR --> HRD
HRD -->|"Approve"| HA --> GATE
HRD -->|"Correct"| HC --> GATE
HRD -->|"Reject"| HRJ
GATE -->|"Yes"| OUT
GATE -->|"No"| HOLD
CR --> GATE" %}

**Why three automated review stages, not one**: Per-event review catches data-level errors (wrong phone format, impossible DOB, name doesn't match transcript). Session-level review catches narrative-level errors (contradictions between events, discussed insurance but no coverage event recorded). Call classification catches junk calls before any clinical review runs. These are different kinds of errors requiring different analysis approaches and context windows.

The [connector runner](https://docs.amigo.ai/developer-guide/platform-api/connector-runner) enforces a **multi-layer confidence gate** before syncing to external systems - ensuring all data is verified before reaching production EHR systems.

## Generic Data Query

A structured query endpoint provides flexible table access with REST-style filter syntax:

```
GET /v1/{workspace_id}/query/{schema}/{table}
```

**Filter operators**: `eq`, `neq`, `gt`, `gte`, `lt`, `lte`, `like`, `ilike`, `in`, `is`

**Other parameters**: `select=col1,col2`, `order=col.asc`, `limit=100` (max), `offset=0`, `semantic=<query>` (cosine similarity search via vector embeddings)

**Security**: Column whitelist per table, hidden sensitive columns, workspace isolation enforced server-side, maximum 100 rows per request.
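A hedged sketch of composing a request against this endpoint, assuming a PostgREST-style `column=operator.value` filter encoding (the exact shape isn't specified here, so treat the filter values as illustrative; the workspace, schema, and table names are placeholders):

```python
from urllib.parse import urlencode

params = {
    "select": "id,name,created_at",
    "status": "eq.active",          # filter operator applied to a column
    "created_at": "gte.2024-01-01",
    "order": "created_at.asc",
    "limit": "100",                 # server-enforced maximum rows
}
url = f"/v1/ws-123/query/public/patients?{urlencode(params)}"

assert "status=eq.active" in url
assert "limit=100" in url
```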

## API Reference

* [Data Sources](https://docs.amigo.ai/api-reference/readme/platform/data-sources)
* [Unification Rules](https://docs.amigo.ai/api-reference/readme/platform/unification-rules)
* [World Model](https://docs.amigo.ai/api-reference/readme/platform/world)
