
# Customer Data Intake

The intake system provides a structured pipeline for receiving, validating, and cataloging customer data uploads. It supports two entry points: a console interface for workspace operators and an external integration path for automated systems.

## Console Intake

Workspace operators upload files through the console interface using standard workspace session authentication. The console provides three surfaces:

* **Files list** - Browse uploaded files with sorting, filtering by status, and search by filename. Each file shows its dataset, schema version, status, size, and upload timestamp. Files that fail validation display an inline error reason.
* **File upload** - Upload a file to a registered dataset. The platform validates the file against the dataset's schema contract synchronously and assigns a terminal status (success or failed) immediately on upload. Byte-identical re-uploads to the same dataset are deduplicated - the platform returns the existing file rather than creating a new version.
* **File download** - Download the original uploaded file bytes by file ID.

## Schema Registry

Before files can be uploaded to a dataset, a schema contract must be registered. A schema defines the expected structure of uploaded files:

* **Dataset name** - A unique identifier for the dataset within the workspace.
* **File type** - The expected file format (currently CSV).
* **Primary key** - One or more columns that uniquely identify a row. Primary key columns cannot be empty in uploaded files.
* **Field definitions** - A list of columns with their expected data types. Supported types: string, integer, float, boolean, date (ISO format), and datetime (ISO format).
* **Size limits** - Optional maximum file size in megabytes.

Schemas are managed through two console surfaces:

* **Schemas list** - Browse registered schemas with sorting and search by name. Each entry shows the dataset name, schema version, field count, file type, and size limits.
* **Create Schema** - Register a new schema by specifying the dataset name, file type, primary key columns, and field definitions. The platform validates that all primary key columns exist in the field list and that all field types are recognized. Dataset names must be unique within a workspace.
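
The registration checks above can be sketched as follows. This is a minimal illustration, not the platform's implementation: `SchemaContract`, `schema_errors`, and every field name here are hypothetical, and the rule that a primary key needs at least one column is inferred from "one or more columns".

```python
from dataclasses import dataclass
from typing import Optional

# Types listed in the Schema Registry section.
SUPPORTED_TYPES = {"string", "integer", "float", "boolean", "date", "datetime"}


@dataclass(frozen=True)
class SchemaContract:
    """Hypothetical in-memory shape of a schema contract."""
    dataset_name: str
    file_type: str
    primary_key: tuple[str, ...]
    fields: dict[str, str]              # column name -> declared type
    max_size_mb: Optional[int] = None   # optional size limit


def schema_errors(schema: SchemaContract, existing_names: set[str]) -> list[str]:
    """Return registration problems; an empty list means the schema is acceptable."""
    errors = []
    if schema.dataset_name in existing_names:
        errors.append(f"dataset name already registered: {schema.dataset_name}")
    if not schema.primary_key:
        errors.append("primary key must name at least one column")
    for column in schema.primary_key:
        if column not in schema.fields:
            errors.append(f"primary key column not in field list: {column}")
    for column, declared in schema.fields.items():
        if declared not in SUPPORTED_TYPES:
            errors.append(f"unrecognized type for {column}: {declared}")
    return errors


roster = SchemaContract(
    dataset_name="patient_roster",
    file_type="csv",
    primary_key=("patient_id",),
    fields={"patient_id": "string", "date_of_birth": "date", "active": "boolean"},
)
print(schema_errors(roster, existing_names=set()))
```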

## Upload Validation

When a file is uploaded through the console, the platform validates it synchronously against the dataset's registered schema:

1. **Format check** - The file must be valid UTF-8 text with a header row.
2. **Column check** - Every field defined in the schema must appear as a column in the file (after any column mapping).
3. **Type and key check** - Each cell is checked against its column's declared type. Primary key columns must not be empty.

Validation stops at the first violation and reports a human-readable error reason. Files that pass validation are marked as successful; files that fail are marked as failed, and the error reason is shown in the files list.
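
The three checks can be sketched as a single function that returns the first error reason. This is a sketch under stated assumptions: `first_violation` is a hypothetical name, column mapping is omitted, booleans are assumed to be the literals `true` and `false`, and empty cells in non-key columns are assumed to be allowed. The platform's exact rules may differ.

```python
import csv
import io
from datetime import date, datetime
from typing import Optional


def _parses(parser, value: str) -> bool:
    """True when parser(value) succeeds without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


# One predicate per supported type; the boolean literals are an assumption.
CHECKS = {
    "string": lambda v: True,
    "integer": lambda v: _parses(int, v),
    "float": lambda v: _parses(float, v),
    "boolean": lambda v: v.lower() in ("true", "false"),
    "date": lambda v: _parses(date.fromisoformat, v),
    "datetime": lambda v: _parses(datetime.fromisoformat, v),
}


def first_violation(raw: bytes, fields: dict[str, str], primary_key: list[str]) -> Optional[str]:
    """Return the first error reason, or None when the file passes."""
    # 1. Format check: valid UTF-8 text with a header row.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "file is not valid UTF-8"
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if not reader.fieldnames:
        return "file has no header row"
    # 2. Column check: every schema field must be present.
    for column in fields:
        if column not in reader.fieldnames:
            return f"missing column: {column}"
    # 3. Type and key check, stopping at the first bad cell.
    for row in reader:
        for column, declared in fields.items():
            value = row[column] or ""
            if value == "":
                if column in primary_key:
                    return f"line {reader.line_num}: primary key column {column} is empty"
                continue  # assumption: empty non-key cells are allowed
            if not CHECKS[declared](value):
                return f"line {reader.line_num}: column {column} expected {declared}, got {value!r}"
    return None


print(first_violation(b"id,dob\n1,2020-01-31\n", {"id": "integer", "dob": "date"}, ["id"]))
```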

## Versioning

Each uploaded file receives a monotonically increasing version number within its dataset. Concurrent uploads to the same dataset each receive a distinct version. Version allocation is atomic - no two files in the same dataset share a version number.
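
To illustrate the property, here is a minimal in-process sketch: a lock-guarded counter per dataset. It is not the platform's mechanism, which this page does not describe (a database-backed service would more likely rely on a sequence or a row lock); `VersionAllocator` is a hypothetical name.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class VersionAllocator:
    """Hands out strictly increasing version numbers per dataset."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: dict[str, int] = defaultdict(int)

    def next_version(self, dataset: str) -> int:
        # The read-increment-return sequence runs under one lock,
        # so two concurrent callers can never observe the same value.
        with self._lock:
            self._latest[dataset] += 1
            return self._latest[dataset]


allocator = VersionAllocator()
with ThreadPoolExecutor(max_workers=8) as pool:
    versions = list(pool.map(lambda _: allocator.next_version("patients"), range(100)))

# Every concurrent upload received a distinct version.
print(len(set(versions)) == len(versions))
```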

## Security

All uploaded files are scanned for malware before processing. If the scanner is unavailable, uploads are rejected (fail-closed). File access is scoped to the caller's workspace through row-level security.
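
Fail-closed behavior means that the absence of a scan verdict is treated the same as a bad verdict. A minimal sketch, assuming a hypothetical `ScannerUnavailable` error and a scanner passed in as a plain callable (the toy scanners below are stand-ins, not a real engine):

```python
from typing import Callable


class ScannerUnavailable(Exception):
    """Hypothetical error raised when the scanner cannot be reached."""


def admit_upload(content: bytes, scan: Callable[[bytes], bool]) -> bool:
    """Return True only when the scanner ran and reported the content clean."""
    try:
        return scan(content)
    except ScannerUnavailable:
        # Fail closed: no verdict means the upload is rejected.
        return False


def offline_scanner(content: bytes) -> bool:
    # Simulates an outage.
    raise ScannerUnavailable("scanner offline")


def toy_scanner(content: bytes) -> bool:
    # Stand-in for a real engine: flags content containing the marker "EICAR".
    return b"EICAR" not in content


print(admit_upload(b"id,name\n1,Ada\n", toy_scanner))
print(admit_upload(b"id,name\n1,Ada\n", offline_scanner))
```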

## External Integration Path

Automated systems can upload files through a separate HMAC-authenticated endpoint designed for streaming binary uploads from approved customer integrations. This path uses per-customer secrets for authentication rather than workspace session tokens.

## Shareable Upload Links

For customers who need to upload files without API credentials or technical setup, operators can generate shareable upload links. Each link maps to a specific workspace and customer, and grants the holder access to a drag-and-drop upload page - no login required.

### How It Works

1. **Generate a link** - An operator creates an upload link for a specific customer through the API. The link has a configurable expiration (up to 30 days) and a maximum upload count (up to 10,000 files).
2. **Share the URL** - The generated URL points to a self-contained upload page. Send it to the customer via email, chat, or any other channel.
3. **Customer uploads files** - The customer opens the link in any browser and drags files onto the page, or clicks to browse. No account, API key, or technical knowledge required.
4. **Files land in the intake pipeline** - Uploaded files follow the same intake pipeline as API-submitted files, including metadata tracking and audit logging.

### Link Lifecycle

Upload links have four possible states:

| Status        | Meaning                                   |
| ------------- | ----------------------------------------- |
| **Active**    | Link is valid and accepting uploads       |
| **Expired**   | The link's expiration time has passed     |
| **Revoked**   | An operator manually revoked the link     |
| **Exhausted** | The link reached its maximum upload count |

Operators can revoke links at any time. Revoked and expired links show a clear error message to anyone who tries to use them.
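
The four states can be derived from a link's stored attributes. The sketch below is illustrative only: `UploadLink`, `create_link`, and `link_status` are hypothetical names, and the precedence used when several conditions hold at once (revoked, then expired, then exhausted) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

MAX_TTL = timedelta(days=30)   # expiration limit stated above
MAX_UPLOADS = 10_000           # upload count limit stated above


class LinkStatus(Enum):
    ACTIVE = "active"
    EXPIRED = "expired"
    REVOKED = "revoked"
    EXHAUSTED = "exhausted"


@dataclass
class UploadLink:
    expires_at: datetime
    max_uploads: int
    upload_count: int = 0
    revoked: bool = False


def create_link(now: datetime, ttl: timedelta, max_uploads: int) -> UploadLink:
    """Reject requests outside the documented limits."""
    if not timedelta(0) < ttl <= MAX_TTL:
        raise ValueError("expiration must be positive and at most 30 days")
    if not 1 <= max_uploads <= MAX_UPLOADS:
        raise ValueError("maximum upload count must be between 1 and 10,000")
    return UploadLink(expires_at=now + ttl, max_uploads=max_uploads)


def link_status(link: UploadLink, now: datetime) -> LinkStatus:
    # Assumed precedence: revoked, then expired, then exhausted.
    if link.revoked:
        return LinkStatus.REVOKED
    if now >= link.expires_at:
        return LinkStatus.EXPIRED
    if link.upload_count >= link.max_uploads:
        return LinkStatus.EXHAUSTED
    return LinkStatus.ACTIVE


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
demo_link = create_link(created, ttl=timedelta(days=7), max_uploads=50)
print(link_status(demo_link, created + timedelta(days=1)))
```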

### Supported File Types

The upload page accepts PDF, Word, PowerPoint, CSV, JPEG, and PNG files of up to 100 MB each. Each uploaded file is validated against its declared content type using magic byte detection - if the file content does not match the declared type, the upload is rejected.

CSV support enables bulk data intake workflows where customers upload structured data (patient rosters, appointment lists, insurance records) through the same shareable link mechanism used for document uploads.
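
Magic byte detection compares the first bytes of a file with the signature expected for its declared content type. The sketch below uses well-known signatures for the modern formats only; `matches_declared_type` is a hypothetical name, legacy `.doc` and `.ppt` files are not covered, and the CSV rule is an assumption because CSV has no signature of its own.

```python
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"

# Leading bytes for each declared content type.
MAGIC = {
    "application/pdf": (b"%PDF-",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/jpeg": (b"\xff\xd8\xff",),
    DOCX: (b"PK\x03\x04",),   # Office Open XML files are ZIP containers
    PPTX: (b"PK\x03\x04",),
}


def matches_declared_type(head: bytes, declared: str) -> bool:
    """Check the first bytes of a file against its declared content type."""
    if declared == "text/csv":
        # Assumption: CSV has no magic number, so only reject content
        # that is obviously binary (contains a NUL byte).
        return b"\x00" not in head
    prefixes = MAGIC.get(declared)
    if prefixes is None:
        return False  # unknown content type: reject
    return head.startswith(prefixes)


print(matches_declared_type(b"%PDF-1.7\n", "application/pdf"))
print(matches_declared_type(b"%PDF-1.7\n", "image/png"))
```

A ZIP signature alone cannot tell a Word document from a PowerPoint file, so a stricter check would also inspect the archive's contents.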

### Download

Operators can download files that were submitted through intake links. Downloads are scoped through the intake link - an operator can only download uploads visible on the corresponding upload listing. If the underlying file has been removed (for example, after a right-to-be-forgotten deletion), the download returns the same not-found error as an upload that never existed, which prevents information leakage about deleted records. Downloads are logged as PHI access events in the audit trail.

### Duplicate Detection

When an uploaded file has the same content hash as a previously uploaded file in the same workspace, the upload response includes the ID and timestamp of the original upload. The duplicate file is still accepted and stored - this is informational only, letting integrators and the upload UI surface duplicate warnings without blocking the upload.
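
A minimal sketch of this informational behavior, assuming SHA-256 as the content hash and treating the earliest matching upload as "the original" (both are assumptions; `WorkspaceUploads` and `UploadRecord` are hypothetical names):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class UploadRecord:
    upload_id: str
    sha256: str
    uploaded_at: datetime
    duplicate_of: Optional[str] = None               # ID of the original upload
    duplicate_uploaded_at: Optional[datetime] = None


class WorkspaceUploads:
    """Stores every upload and annotates repeats instead of rejecting them."""

    def __init__(self) -> None:
        self._records: list[UploadRecord] = []
        self._first_by_hash: dict[str, UploadRecord] = {}

    def __len__(self) -> int:
        return len(self._records)

    def accept(self, upload_id: str, content: bytes, now: datetime) -> UploadRecord:
        digest = hashlib.sha256(content).hexdigest()
        original = self._first_by_hash.get(digest)
        record = UploadRecord(
            upload_id=upload_id,
            sha256=digest,
            uploaded_at=now,
            duplicate_of=original.upload_id if original else None,
            duplicate_uploaded_at=original.uploaded_at if original else None,
        )
        self._records.append(record)                     # stored either way
        self._first_by_hash.setdefault(digest, record)   # remember the first copy only
        return record


store = WorkspaceUploads()
day_one = datetime(2024, 1, 1, tzinfo=timezone.utc)
day_two = datetime(2024, 1, 2, tzinfo=timezone.utc)
store.accept("upload-1", b"roster contents", day_one)
repeat = store.accept("upload-2", b"roster contents", day_two)
print(repeat.duplicate_of, len(store))
```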

### Security

The link token itself is the authentication mechanism - no API key is exposed to the customer. Links are scoped to a single workspace and customer, time-limited, and usage-limited. The upload page enforces strict content security policies and does not embed third-party resources.

## API-Based Intake

For integrations that need programmatic upload (automated pipelines, EHR exports, partner systems), the intake API accepts files directly with HMAC-signed authentication. Each file is authenticated, checksum-verified, written to a per-customer storage path, logged to a tamper-evident audit row, and projected into a signal event that downstream pipelines can react to.

This complements the [connector system](/data/connectors-and-ehr.md), which pulls data from EHR platforms, FHIR stores, and CRMs on a schedule. Intake covers the cases connectors cannot: bulk historical loads, one-off document drops from partners who have no queryable API, and upstream systems that prefer pushing data on their own cadence rather than exposing an endpoint for Amigo to poll.

## When to Use Intake vs a Connector

| Scenario                                                        | Use                                                                |
| --------------------------------------------------------------- | ------------------------------------------------------------------ |
| The source system has a queryable API (FHIR, SMART, REST, CRM)  | [Connector](/data/connectors-and-ehr.md)                           |
| A partner drops files into shared cloud storage on a schedule   | [File Drop connector](/data/connectors-and-ehr.md#connector-types) |
| The source system prefers to push bulk PHI on its own cadence   | **Intake**                                                         |
| An operations team needs to backfill historical documents       | **Intake**                                                         |
| A referral partner sends clinical summaries as discrete uploads | **Intake**                                                         |

Intake does not replace connectors. Most workspaces will run both: connectors for the steady-state sync, intake for bulk and ad-hoc pushes.

## Supported File Types

A single intake call accepts one file at a time. Supported formats include:

* Clinical documents (PDF, CDA / CCDA)
* FHIR bundles (JSON, NDJSON)
* Structured exports (CSV, NDJSON)
* Arbitrary binary payloads (attachments, scanned forms)

There is no hard format restriction at the transport layer - parsing and normalization happen downstream, and the intake channel itself is format-agnostic. The supported format list reflects what downstream parsing currently understands.

## How an Upload Works

```mermaid
flowchart LR
    A[Customer System] -- HMAC-signed stream --> B[Intake Endpoint]
    B -- verify signature --> B
    B -- stream bytes --> C[Workspace Storage]
    B -- write audit row --> D[Intake Audit Log]
    C --> E[Signal Event]
    D --> E
    E --> F[Downstream Pipelines]
    F --> G[(World Model)]
```

1. **Sign and send.** The customer integration computes a SHA-256 hash of the file body and signs a canonical request string with a per-customer HMAC secret. The file bytes are sent as the raw request body so large payloads stream without buffering.
2. **Authenticate.** The endpoint verifies the API key, the workspace binding, the HMAC signature, and the freshness of the request timestamp, which must fall within a short window. Any mismatch causes the upload to be rejected before any bytes are retained.
3. **Stream and hash.** Bytes are streamed into a per-customer path in workspace storage. The SHA-256 is computed as the body flows through; if the client's hash disagrees with what actually arrived, the partial object is deleted and the caller receives a validation error.
4. **Log and emit.** A row is written to the workspace's intake audit log with the file identifier, customer slug, storage path, size, hash, and timestamp. The upload is then projected as a signal event that downstream pipelines can subscribe to.
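
Steps 1 and 3 can be sketched with the standard library. The canonical request layout, the HMAC-SHA256 choice, and the field names below are assumptions made for illustration; this page does not specify the real wire format, so treat `canonical_string`, `sign_upload`, and `stream_and_hash` as hypothetical.

```python
import hashlib
import hmac
import io
from typing import BinaryIO, Iterable


def canonical_string(customer_slug: str, timestamp: int, body_sha256: str) -> str:
    # Hypothetical layout: the real canonical request format is not documented here.
    return f"{customer_slug}\n{timestamp}\n{body_sha256}"


def sign_upload(secret: bytes, customer_slug: str, body: bytes, timestamp: int) -> dict[str, str]:
    """Step 1 (client): hash the body, then HMAC the canonical string."""
    body_sha256 = hashlib.sha256(body).hexdigest()
    message = canonical_string(customer_slug, timestamp, body_sha256).encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {
        "customer": customer_slug,
        "timestamp": str(timestamp),
        "sha256": body_sha256,
        "signature": signature,
    }


def stream_and_hash(chunks: Iterable[bytes], sink: BinaryIO, claimed_sha256: str) -> bool:
    """Step 3 (server): hash while writing; report whether the bytes match the claim.

    On False, the caller is expected to delete the partial object.
    """
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
        sink.write(chunk)
    return hmac.compare_digest(digest.hexdigest(), claimed_sha256)


fields = sign_upload(b"per-customer-secret", "acme-clinic", b"hello world", timestamp=1700000000)
stored = io.BytesIO()
print(stream_and_hash([b"hello ", b"world"], stored, fields["sha256"]))
```

Hashing chunk by chunk keeps memory use constant regardless of file size, which is what allows large payloads to stream without buffering.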

## Authentication

Intake stacks two layers of authentication:

* **API key.** The request carries a workspace API key. This scopes the caller to a workspace and enforces role-based permissions.
* **Per-customer HMAC.** On top of the API key, every upload is signed with a secret that belongs to a specific customer slug. The slug identifies which upstream entity the file belongs to, and the HMAC proves the request came from that entity and has not been replayed or tampered with.

The two-layer design lets a single workspace accept uploads from multiple upstream systems without giving any one of them access to the others' secrets. Rotating a customer's HMAC secret revokes uploads signed with the old secret without affecting the workspace's API key.
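
The per-customer layer can be sketched as a server-side check. Everything specific here is an assumption: the canonical string layout, HMAC-SHA256, the 300-second freshness window (the page only says "short"), and the name `verify_signature`. API key validation is omitted.

```python
import hashlib
import hmac

FRESHNESS_SECONDS = 300  # assumed value; the actual window is not documented here


def verify_signature(secrets: dict[str, bytes], customer_slug: str, timestamp: int,
                     body_sha256: str, signature: str, now: int) -> bool:
    """Check one customer's HMAC without consulting any other customer's secret."""
    secret = secrets.get(customer_slug)
    if secret is None:
        return False
    if abs(now - timestamp) > FRESHNESS_SECONDS:
        return False  # stale or future-dated request: possible replay
    # Hypothetical canonical string; must match what the client signed.
    message = f"{customer_slug}\n{timestamp}\n{body_sha256}".encode("utf-8")
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


customer_secrets = {"acme-clinic": b"secret-a", "globex-labs": b"secret-b"}
body_hash = hashlib.sha256(b"payload").hexdigest()
signed = hmac.new(b"secret-a", f"acme-clinic\n1000\n{body_hash}".encode("utf-8"),
                  hashlib.sha256).hexdigest()

print(verify_signature(customer_secrets, "acme-clinic", 1000, body_hash, signed, now=1010))
customer_secrets["acme-clinic"] = b"rotated"   # rotation invalidates the old signature
print(verify_signature(customer_secrets, "acme-clinic", 1000, body_hash, signed, now=1010))
```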

## Audit Log

Every accepted upload writes an immutable row containing:

* Unique upload identifier
* Workspace and customer slug
* Storage path the bytes were written to
* Original filename and content type
* SHA-256 and size in bytes
* Actor identifier of the API key that submitted the request
* Received-at timestamp
* Scan status and any scan findings
* Processed-at timestamp and processing error, if any

The log is the authoritative record of what was received from whom and when. It is queryable through the same data-access surface as the rest of the workspace, and it is retained according to the workspace's compliance policy.
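
For readers mapping the fields above to a data structure, here is a hypothetical row shape as a frozen dataclass. The field names and types are illustrative assumptions, not the platform's schema, and the sample values are made up.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class IntakeAuditRow:
    """Hypothetical shape of one audit row; frozen so it cannot be edited in place."""
    upload_id: str
    workspace: str
    customer_slug: str
    storage_path: str
    filename: str
    content_type: str
    sha256: str
    size_bytes: int
    actor_id: str
    received_at: datetime
    scan_status: Optional[str] = None
    scan_findings: tuple[str, ...] = ()
    processed_at: Optional[datetime] = None
    processing_error: Optional[str] = None


row = IntakeAuditRow(
    upload_id="upl_001",
    workspace="example-workspace",
    customer_slug="acme-clinic",
    storage_path="acme-clinic/upl_001",
    filename="roster.csv",
    content_type="text/csv",
    sha256="0" * 64,   # placeholder digest
    size_bytes=2048,
    actor_id="key_123",
    received_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)

try:
    row.size_bytes = 0  # any in-place edit is refused
except FrozenInstanceError:
    print("audit rows cannot be modified")
```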

## Downstream Projection

Uploads do not enter the world model directly. They first land in storage and the audit log, then emit a signal event that downstream pipelines pick up. This design preserves the world model's invariant that every fact is sourced from an event with a known provenance, confidence, and timestamp.

What a downstream pipeline does with the event depends on the file type and the workspace's configuration:

* A CCDA or FHIR bundle can be parsed into patient, condition, encounter, and observation events and resolved against existing world model entities.
* A CSV export can be mapped field-by-field into structured events by a workspace-specific transform.
* A scanned form or PDF can be routed into a document-understanding pipeline before any structured events are emitted.

The intake channel itself does not prescribe a parser. Parsing and entity resolution are handled by the same pipelines that process data from connectors, so intake files benefit from the same unification, deduplication, and conflict-resolution behavior as data arriving from an EHR poll.
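
One way to picture the projection is a signal event routed by file kind. The event fields, the file-kind labels, and the pipeline names below are all hypothetical, and real routing depends on the workspace's configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SignalEvent:
    """Hypothetical event emitted after an upload is stored and logged."""
    upload_id: str
    customer_slug: str
    file_kind: str        # illustrative labels: "ccda", "fhir_bundle", "csv", "pdf"
    storage_path: str
    received_at: datetime


# Illustrative routing table mirroring the three cases listed above.
ROUTES = {
    "ccda": "clinical_parser",
    "fhir_bundle": "clinical_parser",
    "csv": "workspace_transform",
    "pdf": "document_understanding",
    "scanned_form": "document_understanding",
}


def route(event: SignalEvent) -> str:
    # Assumption: files with no matching pipeline stay in storage and the audit log.
    return ROUTES.get(event.file_kind, "unrouted")


event = SignalEvent(
    upload_id="upl_001",
    customer_slug="acme-clinic",
    file_kind="fhir_bundle",
    storage_path="acme-clinic/upl_001",
    received_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(route(event))
```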

## Compliance Posture

The intake path is built to HITRUST and HIPAA requirements:

* Transport is TLS-only and authenticated at two layers.
* Storage is workspace-isolated - bytes written for one workspace are never visible to another.
* The audit log is append-only at the application layer and protected by row-level security at the database layer.
* Uploads are retained according to the workspace's data residency and retention policy.

## Availability

Streaming ingest, audit logging, signal-event projection, and file download are live. Malware scanning and built-in document parsing for the full format list are on the roadmap and will ship without changes to the upload contract.

