DynamoDB Single-Table Design for Multi-Tenant SaaS

The DynamoDB Table Behind Autowired.ai — Single-Table Design for Multi-Tenant SaaS

DynamoDB single-table design has a reputation for being hard to learn and easy to get wrong. That reputation is deserved.

The AWS documentation makes it sound mechanical: define your access patterns, map them to partition keys and sort keys, add GSIs where needed, and done. What it doesn't tell you is that the decisions you make before your table has a single item are effectively permanent. Changing a partition key structure in a table with production data is a migration with backfill, dual-write phases, and cutover risk. You don't iterate on the DynamoDB schema the way you iterate on application code.

This post is the data model behind Autowired.ai – the specific key patterns, the three GSIs and why each one exists, and the decisions I'd make differently if I started over.

The Domain and Why Single-Table Made Sense

The entity hierarchy for Autowired.ai:

Tenant
  └── Project
       └── Workflow (extraction schema + confidence thresholds)
            └── Batch (submitted document set)
                 └── Document (file + extraction result)

User (linked to Tenant)

Most operations are tenant-scoped. The access patterns are predictable and unlikely to change fundamentally — list projects for a tenant, list workflows for a project sorted by date, get a batch by ID, list documents for a batch, and filter by processing status.

A single-table design made sense here for the same reason it makes sense for most serverless SaaS at this scale: one table, one set of CloudWatch metrics, one backup configuration, one IAM policy. PAY_PER_REQUEST billing means you're not pre-provisioning capacity across multiple tables and paying for idle throughput on the ones that don't get hit often.

The tradeoff is real, though: no ad hoc queries, no schema evolution without migration, and a data model that only makes sense if you document it.

Primary Key Design: Tenant Isolation by Structure

The PK/SK pattern follows the entity hierarchy directly:

Entity       PK                                    SK
──────────────────────────────────────────────────────────────────────
Tenant       TENANT#<tenantId>                     METADATA
User         USER#<userId>                         PROFILE
User↔Tenant  TENANT#<tenantId>                     USER#<userId>
Project      TENANT#<tenantId>                     PROJECT#<projectId>
Workflow     TENANT#<tenantId>#PROJECT#<projectId> WORKFLOW#<workflowId>
Batch        TENANT#<tenantId>#PROJECT#<projectId> BATCH#<batchId>
Document     TENANT#<tenantId>#BATCH#<batchId>     DOCUMENT#<documentId>

The compound PK (TENANT##PROJECT#) is the decision that matters most here.

Tenant isolation is structural, not advisory. Because tenantId is embedded in every PK, a DynamoDB query physically cannot return results across tenant boundaries. There's no middleware to configure, no application-layer filter to apply consistently, and no trust that every code path remembered to add the right FilterExpression. The key structure enforces it.

An API handler listing workflows must supply both tenantId (from the authenticated session context via Clerk) and projectId (from the request path). The query:

const result = await dynamodb.query({
  TableName: TABLE_NAME,
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
  ExpressionAttributeValues: {
    ":pk": `TENANT#${tenantId}#PROJECT#${projectId}`,
    ":prefix": "WORKFLOW#",
  },
});

Returns only that tenant-project's workflows. There's no way to accidentally return another tenant's data without constructing the wrong PK — which requires the wrong tenantId to be in the session, not a missing filter clause.

Each workflow also stores confidenceThreshold and reviewThreshold — the two values that control document status transitions in the processing pipeline. A document whose extraction confidence falls below confidenceThreshold is FAILED. One that falls between reviewThreshold and confidenceThreshold is REVIEW_REQUIRED. These thresholds are per workflow, stored in the workflow item, and passed through to Step Functions at batch execution time.

Three GSIs, Three Distinct Problems

The main table's PK structure handles hierarchical queries well. For everything else, there are GSIs. Each one exists for a specific reason.

GSI1 — Email Lookup + Date-Sorted Workflow Listing

GSI1 solves two access patterns that the primary key can't:

User lookup by email. Clerk manages auth, but there are operations — invitations, notifications, and admin lookups — where you need to find a user record by email, not by Clerk's userId. The PK has users keyed by userId. GSI1 maps email to GSI1PK and USER# _to _GSI1SK, giving O(1) lookup by email.

Workflows sorted by last updated date. The frontend shows workflows in reverse chronological order. The PK query can return all workflows for a project but can't sort them. GSI1 maps TENANT##PROJECT# to GSI1PK and an ISO 8601 timestamp to GSI1SK, enabling:

const result = await dynamodb.query({
  TableName: TABLE_NAME,
  IndexName: "GSI1",
  KeyConditionExpression: "GSI1PK = :pk",
  ExpressionAttributeValues: {
    ":pk": `TENANT#${tenantId}#PROJECT#${projectId}`,
  },
  ScanIndexForward: false, // most recently updated first
});

One GSI serving two patterns works here because the two item types use different GSI1PK formats — there's no collision. The downside: it's not obvious why GSI1 exists when you read the CDK definition without comments. Document the patterns each GSI serves, not just the attribute names.

GSI2 — Status-Based Filtering (Processing Pipeline)

GSI2 exists entirely for the processing pipeline — the Step Functions state machine and Lambdas that update document status.

The pipeline needs to ask, "What documents in this batch are in PROCESSING status?" and "What batches for this tenant are FAILED?" Neither query is possible from the main table's PK structure.

GSI2 composite keys:

For documents:  GSI2PK = TENANT#<tenantId>#BATCH#<batchId>#STATUS#<status>
For batches:    GSI2PK = TENANT#<tenantId>#STATUS#<status>
                GSI2SK = <ISO 8601 timestamp>

REVIEW_REQUIRED is a first-class status here — not a variant of FAILED. It represents documents where the Bedrock extraction completed successfully but the combined confidence score fell below the reviewThreshold. These need human review, not reprocessing. Conflating them with FAILED would make it impossible to route them correctly.

The sparse index pattern: Only documents in interesting statuses (PROCESSING, FAILED, REVIEW_REQUIRED) write GSI2 attributes. Completed items have their GSI2 attributes removed after a retention window. Items that don't write GSI2 attributes simply don't appear in the index.

This keeps GSI2 lean. In a system processing thousands of documents per day, the vast majority will be in terminal SUCCEEDED status. If every document item included GSI2 attributes, the index would be orders of magnitude larger than necessary. Sparse index design means the GSI contains only the items you'd actually query.

GSI3 — Direct Batch Lookup by batchId

GSI3 exists to solve a coupling problem.

The Step Functions state machine receives a batchId in its execution input. That's it. It doesn't receive tenantId or projectId — which means it can't construct the primary key TENANT##PROJECT# needed to query the main table.

Without GSI3, the state machine would need to carry the full PK context in every execution payload, coupling the Step Functions definition to the DynamoDB key structure. Any change to the key structure would require updating the state machine input format.

GSI3 maps batchId to GSI3PK with a fixed GSI3SK = "BATCH" for point lookups:

const result = await dynamodb.query({
  TableName: TABLE_NAME,
  IndexName: "GSI3",
  KeyConditionExpression: "GSI3PK = :batchId AND GSI3SK = :sk",
  ExpressionAttributeValues: {
    ":batchId": batchId,
    ":sk": "BATCH",
  },
});

The state machine calls this lookup once at initialization, gets the full batch record (including tenantId, projectId, workflowId, thresholds), and then uses the main table PK for all subsequent operations.

This pattern — a GSI that gives a downstream consumer its native lookup identifier rather than forcing it to understand the data model — comes up repeatedly in event-driven architectures. The processing pipeline is a consumer of DynamoDB, not an owner of the key structure.

The Diagram

Operational Patterns Worth Mentioning

Extraction results in DynamoDB. Each Document item holds the full extraction output — field values, per-field confidence scores, and the Bedrock verification result. DynamoDB's 400KB item limit covers most extraction results. For documents with many fields or long extracted values that approach the limit, the payload goes to S3 and the Document item holds the reference key. DynamoDB remains the authoritative record for status and metadata; S3 handles the overflow.

Conditional writes for idempotency. S3 event notifications are at-least-once delivery. The S3IngestionLambda uses attribute_not_exists(PK) on every PutItem — if the document record already exists, the write fails silently and the Lambda exits cleanly. No duplicate records, no reprocessing.

await dynamodb.putItem({
  TableName: TABLE_NAME,
  Item: documentRecord,
  ConditionExpression: "attribute_not_exists(PK)",
});

TransactWriteItems for batch initialisation. Creating a batch involves writing the batch record, writing all document records, and updating the project's batch count. TransactWriteItems ensures all writes succeed or none do. A partial write – a batch record with no documents or documents with no parent batch – would leave the pipeline in an inconsistent state that's hard to recover from.

PITR is non-optional. Point-in-time recovery is enabled. DynamoDB holds the authoritative record for extraction results, batch status, workflow configuration, and tenant data. A bug in the processing pipeline corrupting document records is recoverable within 35 days with PITR. Without it, it's not recoverable at all.

TTL for ephemeral data. Temporary processing state and short-lived tokens use the ttl attribute. DynamoDB deletes expired items within ~48 hours — eventually consistent, not guaranteed exact. For compliance use cases requiring exact deletion on a schedule, a Lambda on a cron or explicit deletes are more appropriate than TTL alone.

Lessons Learned

Define all access patterns before touching a key. Not 80% — all of them. Access patterns you discover after the fact require GSIs, which have write amplification costs. Access patterns you can't fit into any GSI require scans.

Encode tenant ownership in the PK. Tenant isolation via key structure is more reliable than any application-layer enforcement. The query physically cannot cross tenant boundaries if tenantId is in every partition key.

Sparse indexes are a feature. Only write GSI attributes on items that need to appear in that index. For a processing pipeline where most items are in terminal SUCCEEDED status, keeping active-status items in a sparse index is a significant operational win.

The DynamoDB schema is a public API contract. Treat the first design with the same care you'd give a public API. You won't get to iterate on it cheaply.

DynamoDB Single-Table Design for Multi-Tenant SaaS

Tags

Author

Stats

Published

You Might Also Like

Sharding Hot Partitions in DynamoDB: Why Your Single-Partition Log Table Will Break at Scale

Fix N+1 Trigger Patterns Where Lambda Functions Hammer the Same DynamoDB Partition Key

17 Access Patterns, Zero Scans, One DynamoDB Table: Single-Table Design for a 37-Endpoint SaaS

DynamoDB Streams and Lambda for Real-Time Threat Detection: The Event Pipeline DynamoDB Was Built For

I built a million-scale World Cup prediction game on DynamoDB + v0 in a weekend

I Tried AWS Blocks on a Real Amplify Gen2 Project — Local DynamoDB, No AWS Account, 1-Second Loops