What is a Document Store?
A document store (e.g., MongoDB, DynamoDB, Couchbase, Firestore) stores JSON-like documents with flexible schemas — each document in a collection can have different fields. Key properties: documents are self-contained (no joins required), queries run on document fields rather than table columns, and schema changes don’t require up-front migrations (new fields can be added to new documents without altering old ones). Use cases: product catalogs (each product category has different attributes), user profiles (users have heterogeneous settings), content management (articles, posts, comments with varying metadata). When NOT to use: financial transactions requiring ACID across multiple entities (use an RDBMS), or data with many complex relationships (a graph database, or an RDBMS with joins, is more natural).
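As a minimal sketch of the flexible-schema idea, plain Python dicts can stand in for documents (the collection, field names, and `find` helper here are illustrative, not a real driver API):

```python
# Sketch: heterogeneous documents living in one "collection".
# Field names (sku, attrs, category) are illustrative, not a real schema.
catalog = [
    {"sku": "TV-100", "category": "electronics",
     "attrs": {"screen_in": 55, "resolution": "4K"}},
    {"sku": "SHIRT-7", "category": "clothing",
     "attrs": {"size": "M", "fabric": "cotton"}},  # different fields, same collection
]

def find(collection, **filters):
    """Match documents on top-level fields, i.e., query on document fields."""
    return [d for d in collection if all(d.get(k) == v for k, v in filters.items())]

print(find(catalog, category="clothing")[0]["sku"])  # SHIRT-7
```

Note that no schema declaration was needed to give the two documents different `attrs` shapes; the query API is the same either way.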
Data Modeling
The central question in document modeling is embed vs. reference. Embed: store related data inside the parent document. Use when the data is always accessed together, the embedded data is small and bounded in size (a product’s images array), and the embedded data doesn’t need to be queried independently. Reference: store a foreign ID and resolve it separately. Use when the referenced data is large or unbounded (a user’s order history — embed the latest 10, reference the rest), the referenced data is shared across many documents (a category referenced by thousands of products — embed the category name, but reference for category-level queries), or the referenced data changes frequently (updating an embedded copy in thousands of documents is expensive). Anti-pattern: arrays that grow without bound (a user document with all posts embedded). As the array grows, document size increases until it hits MongoDB’s 16 MB BSON document limit, and read/write performance degrades long before that.
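The "embed the latest 10, reference the rest" pattern can be sketched in plain Python (the `EMBED_LIMIT` constant, field names, and `add_order` helper are hypothetical, shown only to illustrate keeping an embedded array bounded):

```python
# Sketch: embed only the newest N order IDs in the user document; the full
# history lives in a separate orders collection and is fetched by reference.
EMBED_LIMIT = 10

def add_order(user_doc, order_id):
    """Prepend the newest order ID and cap the embedded array's size."""
    recent = user_doc.setdefault("recent_orders", [])
    recent.insert(0, order_id)   # newest first
    del recent[EMBED_LIMIT:]     # bounded: older orders stay referenced, not embedded
    return user_doc

user = {"_id": "u1"}
for i in range(15):
    add_order(user, f"order-{i}")
print(len(user["recent_orders"]), user["recent_orders"][0])  # 10 order-14
```

The document stays a fixed size no matter how many orders the user accumulates, which is exactly what the unbounded-array anti-pattern violates.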
Indexing Strategy
Document stores support secondary indexes on any field. Index types (MongoDB examples): single field (index on user.email), compound index (user.country, user.created_at — supports queries filtering by country and sorting by date), multikey index (index on an array field — each array element becomes an index entry), text index (full-text search on string fields), geospatial index (2dsphere for lat/lng queries). Index selectivity: high-cardinality fields (email, user_id) make more useful indexes than low-cardinality fields (status, boolean flags). Partial indexes: index only documents matching a filter (e.g., index only ACTIVE users); this shrinks the index when most documents sit in one low-cardinality state (e.g., 90% of orders are COMPLETED). Covered queries: a query is “covered” if every field it filters on and returns is in the index — no document fetch is required, just an index scan, which is very fast.
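The covered-query idea can be illustrated with a dict standing in for a compound index (all names here are illustrative; a real store such as MongoDB maintains these structures internally):

```python
# Sketch: a hand-rolled secondary index on (country, status) that also carries
# the email field, so the query below is answered from the index alone.
users = {
    1: {"email": "a@x.com", "country": "US", "status": "ACTIVE"},
    2: {"email": "b@y.com", "country": "DE", "status": "ACTIVE"},
    3: {"email": "c@z.com", "country": "US", "status": "INACTIVE"},
}

index = {}
for _id, doc in users.items():
    index.setdefault((doc["country"], doc["status"]), []).append(
        {"_id": _id, "email": doc["email"]})

# "Covered" read: every requested field is in the index entry itself,
# so no lookup into the users store is needed.
covered = index.get(("US", "ACTIVE"), [])
print([e["email"] for e in covered])  # ['a@x.com']
```

The saving is the skipped document fetch: the answer is assembled from index entries alone, which is why covered queries are so fast in practice.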
Consistency and Transactions
Document stores trade relational consistency for flexibility and scale. MongoDB 4.0+ supports multi-document ACID transactions (similar to an RDBMS; sharded-cluster transactions arrived in 4.2). Use them sparingly — cross-document transactions hurt throughput and can often be avoided by embedding related data instead. Atomic operations within a single document are always atomic in all document stores, so design schemas to keep related state in one document and leverage single-document atomicity. DynamoDB supports transactional writes across up to 100 items in one TransactWriteItems call (the limit was 25 before late 2022). Consistency levels: eventual consistency (read from any replica — fast, may return stale data) vs. strong consistency (read from the primary — slower, always current). Choose based on use case: user profile reads can be eventually consistent; payment status checks need strong consistency.
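Single-document atomicity can be sketched with an optimistic version check, standing in for a conditional write (the in-memory `store`, `version` field, and `update_if_version` helper are hypothetical, not a real driver API):

```python
# Sketch: compare-and-set on one document. Because status and paid live in the
# same document, one conditional write updates both together or not at all.
def update_if_version(store, doc_id, expected_version, changes):
    """Apply changes only if the version matches (like a conditional write)."""
    doc = store[doc_id]
    if doc["version"] != expected_version:
        return False  # a concurrent writer won; the caller retries
    doc.update(changes)
    doc["version"] += 1
    return True

store = {"order-1": {"version": 1, "status": "PENDING", "paid": False}}
ok = update_if_version(store, "order-1", 1, {"status": "COMPLETED", "paid": True})
print(ok, store["order-1"]["status"])  # True COMPLETED
```

Had status and paid been split across two documents, keeping them in step would have required a cross-document transaction, which is what the text advises avoiding where possible.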
Sharding and Horizontal Scale
Document stores scale horizontally via sharding (partitioning data across nodes), and shard key selection is critical. High cardinality: many distinct values to distribute writes evenly. Write distribution: avoid monotonically increasing shard keys (like timestamps) — all writes go to one shard (a hotspot). Good shard keys: hashed user_id (uniform write distribution), a (region, user_id) compound key (range queries within a region). Bad shard keys: timestamp (hotspot), status (low cardinality — all ACTIVE documents land on the same shard). Chunk migration: MongoDB automatically moves data chunks between shards as data grows. This rebalancing has overhead — schedule it during off-peak hours or pre-split chunks before high-write events. Shard key changes are expensive: historically the shard key was immutable after insert; MongoDB 4.2+ allows updating a document’s shard key value and 5.0+ adds reshardCollection, but resharding a large collection is costly, so choose carefully at schema design time.
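The hashed-key vs. monotonic-key contrast can be demonstrated in a few lines (the shard count, md5 choice, and range boundaries are illustrative, not how any particular store hashes internally):

```python
# Sketch: write distribution under a hashed key vs. a monotonically
# increasing key with range-based sharding.
import hashlib
from collections import Counter

SHARDS = 4

def shard_for(key: str) -> int:
    """Stable hash -> shard number (md5 used here just for determinism)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SHARDS

hashed = Counter(shard_for(f"user-{i}") for i in range(10_000))
# Roughly even: each of the 4 shards receives about 2,500 writes.

def range_shard(ts: int, boundaries=(2_500, 5_000, 7_500)) -> int:
    """Range sharding of a monotonic key: count how many boundaries it passes."""
    return sum(ts >= b for b in boundaries)

# Every "recent" write (the last 1,000 keys) lands on the top shard: a hotspot.
print({range_shard(t) for t in range(9_000, 10_000)})  # {3}
```

This is the concrete reason timestamps make bad shard keys: the hash spreads load, while the monotonic key funnels all current traffic to whichever shard owns the highest range.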
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "When should you choose a document store over a relational database?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Choose a document store when: (1) Schema is heterogeneous — different documents in the same collection have different fields (product catalog where electronics have different attributes than clothing). (2) Data is naturally hierarchical and read together — a blog post with its embedded comments and tags is a better fit as one document than normalized across 3 tables. (3) Schema evolves rapidly — early-stage products with frequent attribute additions benefit from schemaless flexibility. (4) Horizontal write scaling is required — document stores shard more naturally than relational databases. (5) The application reads one entity at a time (by ID) more than it joins across entities. Choose relational when: strong ACID transactions across multiple entities are required (financial systems), data has complex many-to-many relationships best expressed with joins, ad-hoc analytics require flexible aggregation across columns, or the data is highly normalized and stable. Many modern systems use both: relational for the transactional core, a document store for flexible product or user attributes."
      }
    },
    {
      "@type": "Question",
      "name": "How does MongoDB’s aggregation pipeline work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "MongoDB’s aggregation pipeline processes documents through a sequence of stages, each transforming the data. Common stages: $match (filter documents, like WHERE), $group (aggregate by a field, like GROUP BY + aggregate functions), $project (reshape documents, include/exclude/compute fields), $sort, $limit, $skip (pagination), $lookup (left join with another collection), $unwind (deconstruct an array field into multiple documents). Example: find total revenue by country for completed orders: db.orders.aggregate([{$match: {status: 'COMPLETED'}}, {$group: {_id: '$country', total: {$sum: '$amount'}}}, {$sort: {total: -1}}, {$limit: 10}]). Pipeline execution is optimized: MongoDB reorders stages for efficiency (e.g., moves $match before $sort to filter before sorting). Indexes are used in $match stages if available. $lookup performs the join at query time and can be expensive — use it sparingly on large collections and index the foreign field. Aggregation pipelines replace the need for multiple queries and application-side processing."
      }
    },
    {
      "@type": "Question",
      "name": "What is the N+1 problem in document stores and how do you solve it?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The N+1 problem: you query for N parent documents, then issue one query per document to fetch related data (total N+1 queries). Example: fetch 20 blog posts, then for each post fetch the author document = 1 + 20 = 21 queries. Solutions: (1) Embed the relevant author fields directly in each post document (author_name, author_avatar). Denormalize. Reading posts no longer requires a separate author lookup. Trade-off: author data is duplicated across all posts; you must update all posts if the author’s name changes (or accept eventual consistency for non-critical fields). (2) Application-level batch fetch: fetch all 20 posts, collect unique author_ids, fetch all authors in one query (db.users.find({_id: {$in: author_ids}})), build a map, join in application code. 2 queries total. (3) MongoDB $lookup (join in the aggregation pipeline): handles this at the database level. For reads dominated by single-document fetches (user profile, product page): embed to eliminate N+1. For analytical queries across many documents: batch fetch or aggregation pipeline."
      }
    },
    {
      "@type": "Question",
      "name": "How does DynamoDB’s single-table design work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "DynamoDB’s single-table design colocates multiple entity types in one table to enable efficient access patterns. Every item has a PK (partition key) and SK (sort key). By using prefixes and composite keys, you store different entity types together: PK='USER#123', SK='PROFILE' (user profile); PK='USER#123', SK='ORDER#2024-01-15#456' (user’s order); PK='ORDER#456', SK='ITEM#789' (order item). Access patterns: 'get user profile' = GetItem(PK='USER#123', SK='PROFILE'); 'get all orders for user 123' = Query(PK='USER#123', SK begins_with 'ORDER#'); 'get all items in order 456' = Query(PK='ORDER#456', SK begins_with 'ITEM#'). This colocation enables relational-like queries with DynamoDB’s O(1) key-value lookups. GSI (Global Secondary Index): define an alternate PK/SK for different access patterns (e.g., a GSI on email for user lookup by email). Design discipline: enumerate all required access patterns before designing the schema — DynamoDB is access-pattern-driven, not schema-driven."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle schema migrations in a document store?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Document stores don’t enforce a schema, so 'migrations' are different from relational ALTER TABLE. Strategies: (1) Lazy migration: add a schema_version field to each document. Application code handles both the old and new schema formats (if version==1: use the old field name, else the new one). Documents are upgraded to the new schema on next write. Pros: no downtime, no batch job needed. Cons: application code carries migration logic indefinitely; reporting queries must handle both formats. (2) Background migration job: write a script that reads documents in batches, transforms them to the new schema, and writes them back. Run during low-traffic hours. Pros: clean code once all documents are migrated. Cons: risk of missing documents, script bugs. (3) Write-to-new, read-from-both (dual-read period): new writes use the new schema; reads fall back to the old schema. Once the migration job catches up, drop the old-format fallback. (4) New collection: write new documents to a new collection, migrate old documents, then rename. Zero-downtime with careful coordination."
      }
    }
  ]
}
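The batch-fetch fix for the N+1 problem described in the FAQ above can be sketched with in-memory collections (the `posts`/`authors` data and the `fetch_posts_with_authors` helper are illustrative, not a driver API):

```python
# Sketch: 2 queries instead of 21. Query 1 gets the posts; query 2 fetches
# every needed author in one $in-style batch; the join happens in app code.
authors = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
posts = [{"id": p, "author_id": 1 + p % 2, "title": f"post {p}"} for p in range(20)]

def fetch_posts_with_authors(posts, authors):
    author_ids = {p["author_id"] for p in posts}            # collect unique IDs
    author_map = {aid: authors[aid] for aid in author_ids}  # one batched fetch
    return [{**p, "author": author_map[p["author_id"]]} for p in posts]

joined = fetch_posts_with_authors(posts, authors)
print(joined[0]["title"], joined[0]["author"]["name"])  # post 0 Ada
```

The query count is independent of N: however many posts come back, only one extra round trip is needed for all of their authors.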