Core Entities
- Document: (doc_id, title, owner_id, workspace_id, current_version_id, status=ACTIVE|ARCHIVED|DELETED, created_at, updated_at)
- DocumentVersion: (version_id, doc_id, version_number, content_url, content_hash, size_bytes, created_by, created_at, change_summary)
- DocumentPermission: (perm_id, doc_id, principal_type=USER|GROUP|WORKSPACE, principal_id, permission=VIEWER|COMMENTER|EDITOR|OWNER)
- DocumentComment: (comment_id, doc_id, version_id, anchor_text, body, author_id, resolved_at, thread_id)
- DocumentTag: many-to-many
- ShareLink: (link_id, doc_id, token, permission, expires_at, is_active)
Versioning and Storage
Each save creates a new DocumentVersion. Content is stored in S3 as immutable objects keyed by content_hash, which deduplicates storage: if two users save identical content, both version records point to the same S3 object. Versions are numbered with sequential integers (1, 2, 3…) per document, and the Document row points to current_version_id for fast access. Version history lists all versions ordered by version_number descending. Restore sets current_version_id to the target version; no content is copied.
Delta storage: storing the full content for every version of a large document is expensive. The alternative is to store a diff (unified diff format) from the previous version and reconstruct a version by applying the diffs in order. For text documents whose edits are small relative to the document, this can cut storage by roughly 80-90%, at the cost of replaying diffs on read; storing a full snapshot at a fixed interval bounds that replay cost. Binary files (PDFs, images) always store the full file per version, because text diffs are not effective on them.
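The content-hash scheme described above can be sketched in a few lines. This is a minimal in-memory sketch, not a definitive implementation: the `VersionStore` class and its `blobs` dict are hypothetical stand-ins for the S3 bucket and the DocumentVersion table, and SHA-256 is assumed as the hash.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DocumentVersion:
    doc_id: str
    version_number: int
    content_hash: str
    size_bytes: int
    created_by: str


@dataclass
class VersionStore:
    # Stand-in for S3: content_hash -> immutable bytes (assumption for the sketch).
    blobs: dict = field(default_factory=dict)
    # doc_id -> list of DocumentVersion, oldest first.
    versions: dict = field(default_factory=dict)
    # doc_id -> version_number currently served (plays the role of current_version_id).
    current: dict = field(default_factory=dict)

    def save(self, doc_id: str, content: bytes, user: str) -> DocumentVersion:
        digest = hashlib.sha256(content).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blobs.setdefault(digest, content)
        history = self.versions.setdefault(doc_id, [])
        version = DocumentVersion(doc_id, len(history) + 1, digest, len(content), user)
        history.append(version)
        self.current[doc_id] = version.version_number
        return version

    def restore(self, doc_id: str, version_number: int) -> None:
        # Restore only moves the pointer; no content is copied.
        if not 1 <= version_number <= len(self.versions.get(doc_id, [])):
            raise ValueError("unknown version")
        self.current[doc_id] = version_number

    def read_current(self, doc_id: str) -> bytes:
        version = self.versions[doc_id][self.current[doc_id] - 1]
        return self.blobs[version.content_hash]
```

Saving the same bytes twice produces two version records but a single stored blob, which is the deduplication property the text relies on.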
Permission Model
Permissions are hierarchical: workspace-level defaults -> folder-level -> document-level, with the ordering OWNER > EDITOR > COMMENTER > VIEWER. Inheritance is most-specific-wins: if a document has an explicit permission entry that applies to the user, that entry decides; otherwise the user inherits from the folder, then from the workspace default. Group permissions: a user can be in multiple groups, so at the level that decides, the effective permission is the maximum across all applicable group entries plus any individual entry. Permission checks: resolve the user's effective permission on every document access and cache the result in Redis (key: user_id:doc_id, TTL: 5 minutes). Invalidate on any change that can alter the result: a document, folder, or workspace permission edit, or a group membership change. Share links bypass the regular permission check: verify the token, then check the link's permission level and expiry. Link tokens are cryptographically random (128-bit).
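A small sketch of permission resolution may help here. It assumes the most-specific-wins reading recommended in the Interview Tips (the first level with an applicable entry decides, and the maximum applies within that level). The `GRANTS`, `GROUPS`, and `DOC_SCOPES` tables and the dict-based cache are hypothetical stand-ins for the permission tables and Redis.

```python
import time
from enum import IntEnum
from typing import Optional


class Permission(IntEnum):
    VIEWER = 1
    COMMENTER = 2
    EDITOR = 3
    OWNER = 4


# Hypothetical sample data: scope -> {principal: Permission}.
GRANTS = {
    "workspace:w1": {"group:everyone": Permission.VIEWER},
    "folder:f1": {"group:eng": Permission.EDITOR},
    "doc:d1": {"user:alice": Permission.COMMENTER, "group:eng": Permission.VIEWER},
}
GROUPS = {"alice": ["everyone", "eng"], "bob": ["everyone"]}
# Scopes for each document, most specific first.
DOC_SCOPES = {"d1": ["doc:d1", "folder:f1", "workspace:w1"]}

CACHE_TTL_SECONDS = 300  # the 5-minute TTL from the text
_cache = {}  # stand-in for Redis: "user_id:doc_id" -> (permission, expiry)


def resolve(user_id: str, doc_id: str) -> Optional[Permission]:
    principals = [f"user:{user_id}"] + [f"group:{g}" for g in GROUPS.get(user_id, [])]
    for scope in DOC_SCOPES.get(doc_id, []):
        grants = GRANTS.get(scope, {})
        found = [grants[p] for p in principals if p in grants]
        if found:
            # First level with any applicable entry decides; take the max there.
            return max(found)
    return None  # no access at any level


def effective_permission(user_id: str, doc_id: str, now: Optional[float] = None):
    now = time.monotonic() if now is None else now
    key = f"{user_id}:{doc_id}"
    hit = _cache.get(key)
    if hit is not None and hit[1] > now:
        return hit[0]
    permission = resolve(user_id, doc_id)
    _cache[key] = (permission, now + CACHE_TTL_SECONDS)
    return permission


def invalidate_doc(doc_id: str) -> None:
    # Called on a permission change: drop every cached entry for the document.
    for key in [k for k in _cache if k.endswith(f":{doc_id}")]:
        del _cache[key]
```

With this sample data, alice resolves to COMMENTER on d1 even though her group has EDITOR on the parent folder, because the document-level entries take precedence. Under an additive model she would get EDITOR instead, which is why the inheritance rule is worth clarifying up front.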
Full-text Search
Index document content in Elasticsearch with the fields doc_id, title, content (extracted text), tags, owner_id, workspace_id, created_at, and permission_acl. Permission-filtered search: results must be restricted to documents the user can access. Either filter on the indexed permission_acl field, or, for large workspaces with thousands of documents, pre-compute each user's accessible document IDs on permission change (stored in Redis) and intersect the search results with that set at search time. Content extraction: PDF and DOCX files need text extraction before indexing (for example Apache Tika, or Amazon Textract for scanned PDFs and images). Re-index whenever a new version is created. Incremental indexing via Kafka: on DocumentVersion creation, publish an event; an indexing worker consumes the event and updates Elasticsearch.
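As an illustration of ACL filtering, the function below builds an Elasticsearch query body that combines the text match with a permission filter. It is a sketch under assumptions: `permission_acl` is taken to be a keyword field holding principal strings such as "user:alice" or "group:eng" (the text names the field but not its format), and `build_search_body` is a hypothetical helper, not part of any client library.

```python
def build_search_body(text, workspace_id, user_id, group_ids, size=20):
    """Build a query body: full-text match AND the caller appears in the ACL."""
    # Assumed ACL encoding: one string per principal allowed to see the document.
    principals = [f"user:{user_id}"] + [f"group:{g}" for g in group_ids]
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored clause: the actual full-text search.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title^2", "content", "tags"],
                        }
                    }
                ],
                # Non-scored clauses: tenant scoping and the permission filter.
                "filter": [
                    {"term": {"workspace_id": workspace_id}},
                    {"terms": {"permission_acl": principals}},
                ],
            }
        },
    }


body = build_search_body("quarterly roadmap", "w1", "alice", ["eng", "everyone"])
```

Putting the ACL in the `filter` clause keeps it out of relevance scoring and lets Elasticsearch cache it. The trade-off is that every permission change has to update `permission_acl` on the affected documents.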
Real-time Collaboration
Multiple users edit the same document simultaneously. Challenges: resolving conflicts between concurrent edits and keeping cursor positions consistent as the text shifts. Approach: Operational Transformation (OT), the algorithm underlying Google Docs. Each edit is an operation (insert char at position, delete char at position), and operations from concurrent editors are transformed against each other so that every replica converges to the same text. Alternative: CRDTs (Conflict-free Replicated Data Types), a newer approach that inspired Figma's multiplayer sync and is used by some editors. Each character has a unique ID, and concurrent inserts are ordered deterministically. Implementation: use a real-time collaboration library (ShareDB for OT, Yjs for CRDT) rather than writing the algorithm from scratch. Transport: one WebSocket connection per open document, with the server broadcasting operations to all connected editors. For simpler collaboration needs: last-write-wins with conflict detection (detect conflicts by comparing version numbers and prompt the user to resolve them).
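To make the transformation step concrete, here is a toy sketch of OT for single-character inserts and deletes. It is an illustration only, not the algorithm used by Google Docs or ShareDB: the `Op` type, the `site` tie-breaker, and the `transform` rules are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str       # "ins" or "del"
    pos: int        # character index the op refers to
    char: str = ""  # inserted character (unused for "del")
    site: int = 0   # editor id, used only to break ties between inserts


def apply(text: str, op: Optional[Op]) -> str:
    if op is None:  # the op was cancelled by transformation
        return text
    if op.kind == "ins":
        return text[:op.pos] + op.char + text[op.pos:]
    return text[:op.pos] + text[op.pos + 1:]


def transform(op: Op, other: Op) -> Optional[Op]:
    """Rewrite `op` so it can run after `other`; both were made on the same text."""
    if other.kind == "ins":
        if op.kind == "ins":
            # An insert to the left shifts us right; equal positions are
            # ordered by site id so both replicas pick the same order.
            if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
                return replace(op, pos=op.pos + 1)
            return op
        # A delete shifts right if the insert landed at or before its target.
        return replace(op, pos=op.pos + 1) if other.pos <= op.pos else op
    # `other` is a delete.
    if op.kind == "ins":
        return replace(op, pos=op.pos - 1) if other.pos < op.pos else op
    if other.pos < op.pos:
        return replace(op, pos=op.pos - 1)
    if other.pos == op.pos:
        return None  # both editors deleted the same character
    return op


base = "ac"
a = Op("ins", 1, "b", site=1)  # editor 1 types "b" between "a" and "c"
b = Op("ins", 1, "x", site=2)  # editor 2 types "x" at the same position
replica_1 = apply(apply(base, a), transform(b, a))
replica_2 = apply(apply(base, b), transform(a, b))
```

Both replicas end with the same text even though they applied the two operations in opposite orders, which is the convergence property OT is meant to provide.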
Interview Tips
- Versioning: the key design choice is full-content vs delta storage. Full-content is simpler and enables O(1) version access. Delta is cheaper but requires diff application on access.
- Permission resolution is complex with inheritance and groups. Always clarify: is inheritance additive (combine all permissions) or override (most specific wins)? For most document systems: the most specific level wins, and within that level the maximum across the user's individual and group permissions applies.
- For real-time collaboration, OT is the established standard (Google Docs). CRDTs are newer and often described as simpler to reason about. In an interview, naming both and explaining the trade-offs is sufficient.
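The full-content vs delta trade-off from the versioning tip can be sketched as follows. This is a minimal sketch with one loudly stated substitution: instead of unified diffs (which Python's standard library can produce but not apply), it uses a simple copy/add delta derived from `difflib.SequenceMatcher`. The `DeltaHistory` class is hypothetical, and the snapshot interval of 20 is the illustrative value used in this guide.

```python
import difflib


def make_delta(old, new):
    """Encode `new` as copy-ranges taken from `old` plus literal added lines."""
    delta = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("add", new[j1:j2]))
        # "delete" needs no entry: the old lines are simply not copied.
    return delta


def apply_delta(old, delta):
    out = []
    for entry in delta:
        if entry[0] == "copy":
            out.extend(old[entry[1]:entry[2]])
        else:
            out.extend(entry[1])
    return out


class DeltaHistory:
    def __init__(self, snapshot_every=20):
        self.snapshot_every = snapshot_every
        self.records = []  # index i holds version i + 1
        self._latest = []  # content of the most recent version

    def save(self, lines):
        index = len(self.records)
        if index % self.snapshot_every == 0:
            self.records.append(("snapshot", list(lines)))
        else:
            self.records.append(("delta", make_delta(self._latest, lines)))
        self._latest = list(lines)
        return index + 1  # the new version_number

    def load(self, version_number):
        index = version_number - 1
        if not 0 <= index < len(self.records):
            raise ValueError("unknown version")
        # Start from the nearest snapshot at or before the target version.
        base = index - index % self.snapshot_every
        lines = self.records[base][1]
        for _, delta in self.records[base + 1:index + 1]:
            lines = apply_delta(lines, delta)
        return list(lines)
```

With a snapshot every 20 versions, loading any version replays at most 19 deltas, so read cost stays bounded while most versions store only their changes.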
Frequently Asked Questions
How do you implement document version history with storage efficiency?
Two strategies: full-content storage vs delta storage. Full content: each version stores the complete document content in S3. Pros: O(1) version access (just fetch the S3 object). Cons: storage grows linearly with version count. A 1MB document with 100 versions = 100MB. Use content-hashing for deduplication: if two versions have identical content (SHA-256 hash), they point to the same S3 object. Deduplication saves storage when users save without making changes. Delta storage: store only the diff from the previous version (unified diff format for text). Pros: 80-90% storage reduction for text documents. Cons: accessing version N requires applying N diffs sequentially. Mitigation: store full snapshots every 20 versions, then apply at most 19 diffs. Binary files (PDFs, images): always full-content (diffs are not human-readable and rarely compress well).
How does hierarchical permission inheritance work in a document system?
Permissions cascade from workspace -> folder -> document. Resolution: find the most specific permission for a user on a document. Check document-level permissions first. If not found: check folder permissions. If not found: check workspace default. This is 'most specific wins' inheritance. Group membership: a user can belong to multiple groups. Their effective permission = maximum across (all group permissions + individual permission) at the most specific level. Algorithm: get_effective_permission(user_id, doc_id): check doc permissions for (user_id, all user groups). If found: return max. Check parent folder. If found: return max. Check workspace default. Return max. Caching: resolve and cache user:doc -> permission in Redis (TTL 5 min). Invalidation: on any permission change (doc, folder, workspace, or group membership), invalidate all cached permissions for affected documents and users.
How do you implement full-text search across documents with access control?
Challenge: search results must respect permissions; users should not see documents they cannot access. Options: (1) Post-filter: run the search, then filter results by permission check. Simple but inefficient: if the top 100 results are all restricted, you return nothing and must fetch more. (2) Pre-computed ACL in index: store the list of authorized user/group IDs on each document in Elasticsearch. At query time, filter by: must match the search query AND (user_id IN authorized_users OR any_user_group IN authorized_groups). Efficient but requires re-indexing when permissions change. (3) Permission-scoped shards: shard the index by workspace, and apply workspace-level access control at the shard level. Simple for single-tenant systems. Production systems use option 2 with careful indexing: when a document's permissions change, update the authorized_users field in Elasticsearch (partial update).
How do you implement real-time collaborative editing?
Real-time collaboration means multiple users editing the same document simultaneously and seeing each other's changes instantly. Architecture: WebSocket connection from each client to a collaboration server for the document. When a user makes an edit: encode it as an operation (insert, delete, retain). Send to collaboration server. Server broadcasts to all other clients. Conflict resolution: two users insert at the same position simultaneously. Operational Transformation (OT), the algorithm used by Google Docs, transforms concurrent operations to produce a consistent result. Each operation is transformed against all concurrent operations before being applied. Implementation: use ShareDB (OT) or Yjs (CRDT: Conflict-free Replicated Data Types, used by Notion). CRDTs are simpler to reason about: each character has a globally unique ID, concurrent inserts are deterministically ordered. For most interview purposes: describe the WebSocket + OT/CRDT architecture without implementing OT.
How would you design document sharing with time-limited public links?
Share links allow document access without requiring an account. ShareLink entity: (link_id, doc_id, token, permission_level=VIEWER|COMMENTER, expires_at, created_by, is_active, access_count). Token generation: cryptographically secure random token (128 bits, URL-safe base64 encoding = 22 characters). Expiry options: never, 1 day, 7 days, 30 days, custom. Access flow: user visits /share/{token}. Server looks up the ShareLink by token. Checks is_active=true and expires_at > NOW(). Serves the document at the link's permission level. No authentication required. Tracking: increment access_count on each visit. Log accesses (IP, user-agent, timestamp) for the document owner's analytics. Revocation: set is_active=false immediately invalidates the link (even if not expired). Security: tokens must be unguessable (128-bit entropy). Never expose doc_id in the share URL; the token alone identifies the document.
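The share-link access flow described above can be sketched with Python's `secrets` module. This is a minimal sketch: the in-memory `LINKS` dict stands in for the ShareLink table, and the `create_link`, `resolve_link`, and `revoke` helper names are hypothetical.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ShareLink:
    doc_id: str
    token: str
    permission: str                # "VIEWER" or "COMMENTER"
    expires_at: Optional[datetime]  # None means the link never expires
    is_active: bool = True
    access_count: int = 0


LINKS = {}  # stand-in for the ShareLink table, indexed by token


def create_link(doc_id: str, permission: str = "VIEWER",
                ttl: Optional[timedelta] = None) -> ShareLink:
    # 16 random bytes = 128 bits, which URL-safe base64 encodes as 22 characters.
    token = secrets.token_urlsafe(16)
    expires_at = datetime.now(timezone.utc) + ttl if ttl is not None else None
    link = ShareLink(doc_id, token, permission, expires_at)
    LINKS[token] = link
    return link


def resolve_link(token: str, now: Optional[datetime] = None) -> Optional[ShareLink]:
    """Return the link if it is usable, else None. No user authentication involved."""
    link = LINKS.get(token)
    if link is None or not link.is_active:
        return None
    now = now or datetime.now(timezone.utc)
    if link.expires_at is not None and link.expires_at <= now:
        return None
    link.access_count += 1  # tracking for the owner's analytics
    return link


def revoke(token: str) -> None:
    # Revocation takes effect immediately, even if the link has not expired.
    if token in LINKS:
        LINKS[token].is_active = False
```

The share URL would carry only the token (for example /share/{token}), so the doc_id never appears in it. A production lookup would hit an indexed token column rather than a dict.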