System Design: Document Store — Schema-Flexible Storage, Indexing, and Consistency Trade-offs

What is a Document Store?

A document store (MongoDB, DynamoDB, Couchbase, Firestore) stores JSON-like documents with flexible schemas — each document can have different fields. Key properties: documents are self-contained (no joins required), queries are on document fields rather than table columns, schema changes don’t require migrations (new fields can be added to new documents without changing old ones). Use cases: product catalogs (each product category has different attributes), user profiles (users have heterogeneous settings), content management (articles, posts, comments with varying metadata). When NOT to use: financial transactions requiring ACID across multiple entities (use RDBMS), data with many complex relationships (graph database or RDBMS with joins is more natural).

Data Modeling

The central question in document modeling: embed vs. reference. Embed: store related data inside the parent document. Use when: the data is always accessed together, the embedded data is small and bounded in size (a product’s images array), the embedded data doesn’t need to be queried independently. Reference: store a foreign ID, resolve separately. Use when: the referenced data is large or unbounded (a user’s order history — embed the latest 10, reference the rest), the referenced data is shared across many documents (a category referenced by thousands of products — embed the category name, but reference for category-level queries), or the referenced data changes frequently (updating an embedded copy in thousands of documents is expensive). Anti-pattern: arrays that grow without bound (a user document with all posts embedded). As the array grows, document size increases until it hits the 16MB MongoDB document limit and read/write performance degrades.

Indexing Strategy

Document stores support secondary indexes on any field. Index types (MongoDB example): single field (index on user.email), compound index (user.country, user.created_at — supports queries filtering by country and sorting by date), multikey index (index on an array field — each array element becomes an index entry), text index (full-text search on string fields), geospatial index (2dsphere for lat/lng queries). Index selectivity: high-cardinality fields (email, user_id) are more useful as indexes than low-cardinality fields (status, boolean flags). Partial indexes: index only documents matching a filter (e.g., index only ACTIVE users). Reduces index size when most documents are in a low-cardinality state (e.g., 90% of orders are COMPLETED). Covered queries: a query is “covered” if all its fields are in the index — no document fetch required, just index scan. Very fast.

Consistency and Transactions

Document stores trade relational consistency for flexibility and scale. MongoDB 4.0+: multi-document ACID transactions (similar to RDBMS). Use sparingly — cross-document transactions hurt throughput and should be avoided by embedding when possible. Atomic operations within a single document: always atomic in all document stores. Design to keep related state in one document to leverage single-document atomicity. DynamoDB: supports transactional writes across up to 25 items in one TransactWriteItems call. Consistency levels: eventual consistency (read from any replica — fast, may return stale data) vs strong consistency (read from the primary — slower, always current). Choose based on use case: user profile reads can be eventually consistent; payment status checks need strong consistency.

Sharding and Horizontal Scale

Document stores scale horizontally via sharding (partitioning data across nodes). Shard key selection is critical: High cardinality: many distinct values to distribute writes evenly. Write distribution: avoid monotonically increasing shard keys (like timestamps) — all writes go to one shard (hotspot). Good shard keys: hashed user_id (uniform write distribution), (region, user_id) compound key (range queries within a region). Bad shard key: timestamp (hotspot), status (low cardinality — all ACTIVE documents on one shard). Chunk migration: MongoDB automatically moves data chunks between shards as data grows. This rebalancing has overhead — schedule during off-peak hours or pre-split chunks before high-write events. Shard key is immutable: once a document is inserted with a shard key value, it cannot be changed. Choose carefully at schema design time.

Asked at: Databricks Interview Guide

Asked at: Netflix Interview Guide

Asked at: Stripe Interview Guide

Asked at: Shopify Interview Guide

Scroll to Top