Core Requirements
A file sharing platform lets users upload, organize, and share files and folders. Key features: file upload and download, a folder hierarchy, sharing with fine-grained permissions, real-time sync across devices, and version history. Scale: Google Drive stores billions of files, and Dropbox has reported syncing more than a billion files per day, so the design must assume billions of objects and a very high daily upload rate. Main challenges: efficient storage (deduplication), sync (detecting and propagating changes across devices), and real-time collaboration. This is a common system design interview question at Dropbox, Google, and Microsoft.
Storage Architecture
File contents are stored in object storage (S3, GCS), not in a relational database: large binary blobs are a poor fit for SQL rows, so the database holds only metadata.

Schema:
File: file_id, owner_id, name, mime_type, size_bytes, storage_key (S3 object key), content_hash (SHA-256 of the file content), created_at, modified_at, is_trashed.
FileVersion: version_id, file_id, storage_key, content_hash, size_bytes, modified_at, modified_by.
FolderHierarchy: node_id, parent_id, name, type (FILE, FOLDER), owner_id, path (materialized path for fast tree queries).

Content-addressed storage: the storage_key is derived from the content_hash (e.g., hash[:2]/hash[2:4]/hash), so identical content always maps to the same object key.

Deduplication: if two users upload the same file (same SHA-256), only one copy is stored in S3, and both rows in the File table point to the same storage_key. This is “client-side deduplication” when the client sends the hash before uploading (which saves the transfer as well as the storage), or “server-side deduplication” when the server checks after upload (which saves storage only). Because several File and FileVersion rows can share one storage_key, the object can be deleted only once nothing references it. Storage savings depend heavily on the workload; reductions of 30-60% are often cited for office documents and photos.
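The key derivation and the dedup check described above can be sketched in a few lines. This is a minimal illustration, not a production design: DedupStore is a hypothetical in-memory stand-in for both the object store and the File table, and its method names are assumptions made for the example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex-encoded SHA-256 of the file content."""
    return hashlib.sha256(data).hexdigest()


def storage_key(digest: str) -> str:
    """Derive the object key from the hash: hash[:2]/hash[2:4]/hash."""
    return f"{digest[:2]}/{digest[2:4]}/{digest}"


class DedupStore:
    """Hypothetical in-memory stand-in for object storage plus the File table."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}  # storage_key -> content
        self.files: dict[str, str] = {}      # file_id -> storage_key

    def upload(self, file_id: str, data: bytes) -> tuple[str, bool]:
        """Record the file; return (storage_key, whether new bytes were stored)."""
        key = storage_key(content_hash(data))
        is_new = key not in self.objects
        if is_new:
            self.objects[key] = data  # first copy of this content
        self.files[file_id] = key     # metadata always points at the shared key
        return key, is_new


store = DedupStore()
key_a, stored_a = store.upload("alice-report", b"quarterly report")
key_b, stored_b = store.upload("bob-report", b"quarterly report")
print(key_a == key_b, stored_a, stored_b, len(store.objects))
```

Both uploads resolve to the same key, and only the first one writes bytes. The two-level prefix (hash[:2]/hash[2:4]) simply spreads objects across key prefixes.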
Upload Flow with Chunked Transfer
For large files (> 5 MB), the client splits the file into 5 MB chunks and uploads each chunk independently, so an interrupted transfer can resume from the last completed chunk. After all chunks are uploaded, the server assembles them (for example with an S3 multipart upload, where every part except the last must be at least 5 MiB).

Resumable upload with a dedup check: the client computes chunk hashes before uploading and sends the list to the server. The server returns which chunks it already has, and the client uploads only the missing ones. This is the idea behind Dropbox’s “block-based sync”: for a 100 MB file in which only 1 MB changed in place, only the modified chunks need to be uploaded. With fixed-size chunks this saving applies to in-place edits and appends; an insertion near the start of the file shifts every later chunk boundary and changes all the following hashes.

Direct-to-S3 upload: the client gets a pre-signed S3 URL from the server and uploads directly to S3, bypassing the application servers (which avoids their bandwidth and processing costs). The server learns of completion through an S3 event notification or a direct callback from the client.

Integrity check: the server verifies that the SHA-256 of the assembled file matches the hash the client provided.
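The chunk-hash exchange can be illustrated with a small sketch. The function names chunk_hashes and missing_chunks are hypothetical, the server's known chunks are modeled as a plain set, and the demo uses a 4-byte chunk size purely so the example stays readable; a real client would use the 5 MB size from the text.

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, as in the text


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """SHA-256 of each fixed-size chunk, in file order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def missing_chunks(client_hashes: list[str], server_has: set[str]) -> list[int]:
    """Indexes of chunks the server does not have yet (the dedup check)."""
    return [i for i, h in enumerate(client_hashes) if h not in server_has]


# Version 1 is fully uploaded, so the server knows all of its chunk hashes.
v1 = b"aaaa" + b"bbbb" + b"cccc" + b"dd"
server_has = set(chunk_hashes(v1, chunk_size=4))

# Version 2 changes one byte inside the third chunk.
v2 = b"aaaa" + b"bbbb" + b"cXcc" + b"dd"
to_upload = missing_chunks(chunk_hashes(v2, chunk_size=4), server_has)
print(to_upload)
```

Only the third chunk (index 2) is reported as missing, so the client would upload that chunk alone.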
Sync Protocol
Sync detects changes on one device and propagates them to the user’s other devices.

File system watcher: the desktop client watches for file system events (create, modify, delete, rename) using OS APIs (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows, which .NET exposes as FileSystemWatcher).

Change detection: on each event, the client computes the file’s hash and compares it to the last known server hash. If they differ, it uploads the changed chunks.

Server notification to other devices: the server publishes the change event to a long-poll endpoint or WebSocket channel keyed by user_id. The user’s other connected clients receive the notification and download the updated file or chunks.

Conflict resolution: suppose two devices modify the same file while offline. On reconnect, both submit their changes, and the server timestamps each one. The later change wins (last-writer-wins). The earlier version is saved as a conflict copy (“file (John’s conflicted copy 2024-01-15).docx”), and both files are shown to the user so that no edit is silently lost.

Sync ordering: changes are applied in the order they happened, tracked by a monotonic version counter per file. A change submitted against a stale version number is how the server recognizes a conflict.
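The conflict rule can be sketched with the per-file version counter. This is one possible reading under stated assumptions: each client sends the version it last synced (base_version), the arrival order at the server stands in for the server timestamp, and FileState, submit_change, and conflict_name are hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class FileState:
    """Hypothetical server-side record for one file."""
    name: str
    version: int = 0
    content_hash: str = ""
    modified_by: str = ""
    conflict_copies: list = field(default_factory=list)  # (name, content_hash)


def conflict_name(name: str, user: str, date: str) -> str:
    """Build 'file (John's conflicted copy 2024-01-15).docx' from 'file.docx'."""
    stem, ext = os.path.splitext(name)
    return f"{stem} ({user}'s conflicted copy {date}){ext}"


def submit_change(state: FileState, base_version: int, new_hash: str,
                  user: str, date: str) -> int:
    """Apply a change; a stale base_version means another edit arrived first."""
    if base_version != state.version:
        # The version already on the server is the earlier change:
        # keep it as a conflict copy before the later change overwrites it.
        copy_name = conflict_name(state.name, state.modified_by, date)
        state.conflict_copies.append((copy_name, state.content_hash))
    state.content_hash = new_hash
    state.modified_by = user
    state.version += 1  # monotonic per-file counter
    return state.version


doc = FileState(name="file.docx", version=1, content_hash="h1", modified_by="John")
# Both devices edited version 1 while offline; John's device reconnects first.
submit_change(doc, base_version=1, new_hash="hA", user="John", date="2024-01-15")
submit_change(doc, base_version=1, new_hash="hB", user="Mary", date="2024-01-15")
print(doc.version, doc.content_hash, doc.conflict_copies)
```

After both submissions the later change ("hB") is the current content, and John's earlier edit survives as a named conflict copy.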
Permissions and Sharing
Permission model: OWNER (full control, can delete), EDITOR (read + write), COMMENTER (read + comment), VIEWER (read only).

Share: a user shares a file or folder with another user, or through a public link. Schema:
Permission: permission_id, node_id, grantee_id (user or group), access_level, granted_by, granted_at.
Link: link_id, node_id, token (URL-safe string generated from 32 random bytes), access_level, expires_at, view_count.

Inherited permissions: when a folder is shared, all of its descendants inherit the folder’s permissions. Efficient inheritance check: for file F at path /A/B/C/F, check permissions on F, then C, B, A in that order; the first match (the nearest explicit grant) wins.

Materialized path: each node stores its full path (e.g., “/root/A/B/C”), so its ancestors can be read directly from the path without recursive parent lookups, and its descendants can be found with a single prefix query.

Permission cache: cache the resolved user→file permission in Redis for 5 minutes, since most files are accessed repeatedly. Invalidate on permission changes; because permissions are inherited, a change on a folder must also invalidate the cached entries of its descendants.

Public links: validate the token against Redis so that link lookups do not hit the primary database. Include rate limiting (100 req/min per link) in the link validation middleware.
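The nearest-ancestor check can be sketched against the materialized path. For brevity this example keys grants by (path, user_id) in a plain dict, whereas the schema above keys them by node_id; resolve_access, ancestors_nearest_first, and new_link_token are hypothetical names. The token helper uses the standard library's secrets.token_urlsafe, which encodes 32 random bytes as a URL-safe string.

```python
import secrets
from typing import Optional


def ancestors_nearest_first(path: str) -> list[str]:
    """'/A/B/C/F' -> ['/A/B/C/F', '/A/B/C', '/A/B', '/A']."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def resolve_access(path: str, user_id: str,
                   grants: dict[tuple[str, str], str]) -> Optional[str]:
    """Walk from the node up to the root; the first explicit grant wins."""
    for node in ancestors_nearest_first(path):
        level = grants.get((node, user_id))
        if level is not None:
            return level
    return None  # no grant anywhere on the path: no access


def new_link_token() -> str:
    """URL-safe token built from 32 random bytes, for a public link."""
    return secrets.token_urlsafe(32)


grants = {
    ("/A", "u1"): "VIEWER",      # folder A shared read-only
    ("/A/B/C", "u1"): "EDITOR",  # subfolder C shared with write access
}
print(resolve_access("/A/B/C/F", "u1", grants))
print(resolve_access("/A/X", "u1", grants))
print(resolve_access("/Z", "u1", grants))
```

The file under C resolves to EDITOR because the grant on C is nearer than the one on A, while /A/X falls back to the inherited VIEWER grant. The result of resolve_access is the value that would be cached per user and file.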