System Design Interview: File Storage and Sync Service (Dropbox)

Designing a file storage and sync service like Dropbox combines chunked file storage, delta sync, conflict resolution, and offline-first architecture. It’s a comprehensive system design problem that tests your understanding of distributed storage and client-server synchronization.

Core Requirements

Functional: Upload/download files of any size. Sync files across multiple devices in real-time. Share files/folders with other users. Version history (at least 30 days). Offline support with sync on reconnect.

Non-functional: 500M users, 1B files. Average file size 1MB (photos, documents). 10M active syncs simultaneously. Upload/download: maximize throughput. Sync latency < 5 seconds after file change. Bandwidth efficiency: only transfer changed bytes (delta sync).
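
A rough capacity sketch from these numbers (the 1MB average is the stated assumption):

```python
# Back-of-envelope sizing from the requirements above.
users = 500_000_000
files = 1_000_000_000
avg_file_bytes = 1 * 1024 * 1024            # 1 MB average (photos, documents)

raw_storage_bytes = files * avg_file_bytes  # ~1.05 PB before dedup/compression
concurrent_sse = 10_000_000                 # persistent connections the push tier must hold
```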

Chunked File Storage

Problem: large files (10GB videos) can't be uploaded atomically
  - Connection drops midway → must restart from beginning
  - Can't detect which part of file changed

Solution: split files into 4MB chunks

File upload flow:
  1. Client splits file into 4MB chunks
  2. Hash each chunk: SHA-256(chunk_data) → chunk_id
  3. Query server: which chunk_ids are already uploaded?
     POST /chunks/check {chunk_ids: ["abc123...", "def456..."]}
     → Server returns: {existing: ["abc123..."], missing: ["def456..."]}
  4. Upload only missing chunks (deduplication!)
     POST /chunks/upload {chunk_id: "def456...", data: binary}
     → Stored in S3 with key: chunks/{chunk_id}
  5. Create/update file metadata record:
     POST /files {name: "photo.jpg", chunks: ["abc123", "def456", "ghi789"], size: 12MB}

Deduplication benefits:
  If two users upload the same file → only one copy stored in S3
  If user edits a 1GB file in place (changes 1 paragraph):
    Only 1-2 changed chunks uploaded (4-8MB) vs re-uploading 1GB
    (Caveat: an insertion shifts every later fixed-size chunk boundary; content-defined chunking avoids this)
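
The client-side chunk-and-check flow (steps 1-4) can be sketched as follows; `plan_upload` stands in for the `/chunks/check` round trip, and both function names are illustrative:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as above

def chunk_file(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Steps 1-2: split a file into fixed-size chunks and SHA-256 each one.

    Returns the ordered chunk_id list (the file's recipe) and a
    chunk_id -> bytes map; identical chunks collapse to one entry.
    """
    order, chunks = [], {}
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        order.append(chunk_id)
        chunks[chunk_id] = chunk
    return order, chunks

def plan_upload(order, already_on_server):
    """Steps 3-4: upload only chunks the server doesn't already have."""
    seen, missing = set(), []
    for cid in order:
        if cid not in already_on_server and cid not in seen:
            missing.append(cid)
            seen.add(cid)
    return missing
```

Deduplication falls out of the content hashing: a repeated 4MB region produces the same chunk_id, so it is stored and uploaded once.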

Database Design

files table:
  id            UUID PRIMARY KEY
  owner_id      UUID
  name          TEXT
  parent_folder_id UUID NULL
  size_bytes    BIGINT
  mime_type     TEXT
  chunk_ids     TEXT[]  -- ordered list of SHA-256 chunk hashes
  version       INT DEFAULT 1
  is_deleted    BOOLEAN DEFAULT FALSE
  created_at    TIMESTAMPTZ DEFAULT NOW()
  modified_at   TIMESTAMPTZ DEFAULT NOW()

chunks table:
  id            CHAR(64) PRIMARY KEY  -- SHA-256 hash
  size_bytes    INT
  storage_path  TEXT  -- S3 key: chunks/{id}
  ref_count     INT DEFAULT 0  -- how many files reference this chunk

file_versions table (version history):
  id            UUID PRIMARY KEY
  file_id       UUID REFERENCES files(id)
  version_num   INT
  chunk_ids     TEXT[]
  modified_at   TIMESTAMPTZ
  modified_by   UUID

shares table:
  id            UUID PRIMARY KEY
  resource_id   UUID  -- file or folder
  resource_type TEXT  -- file / folder
  owner_id      UUID
  recipient_id  UUID  -- NULL for link sharing
  share_token   TEXT  -- for public link
  permission    TEXT  -- view / edit
  expires_at    TIMESTAMPTZ NULL

Sync Protocol

Multi-device sync challenge:
  Device A edits file.txt → uploads changes → server has version 5
  Device B (offline) edits same file.txt → comes online with local version 4

Sync engine: per-file version numbers plus a monotonically increasing server event log (logical timestamps)

Client state:
  Local DB (SQLite): {file_id, name, chunk_ids, server_version, local_version}
  Sync cursor: last_seen_server_event_id

Sync flow (client connects or file changes):
  1. Client → GET /events?since={cursor}
     Server returns: events since cursor
     [{event_id: 100, type: "file_updated", file_id: X, version: 5}, ...]

  2. For each event:
     If local file not modified since last sync:
       Apply server change → update local file
     If local file modified (conflict!):
       → Fork: keep both versions
         Local "file.txt" → "file (conflicted copy 2024-04-16).txt"
         Download server version → save as "file.txt"
         Both versions visible to user

  3. Upload local changes:
     For each locally modified file:
       Upload changed chunks → update server file record
       Server assigns new event_id → client updates cursor
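
A minimal sketch of step 2's decision rule (the function name and local-state shape are assumptions):

```python
def apply_server_events(events, local_files):
    """Apply server events to local state; fork on conflict (step 2 above).

    local_files maps file_id -> {"server_version": int, "dirty": bool},
    where dirty means the file was modified locally since the last sync.
    Returns the file_ids that must be forked into a conflicted copy.
    """
    conflicts = []
    for ev in events:
        state = local_files.get(ev["file_id"])
        if state is None or not state["dirty"]:
            # Local copy untouched since last sync: fast-forward to server state.
            local_files[ev["file_id"]] = {"server_version": ev["version"],
                                          "dirty": False}
        else:
            # Local edits exist: keep both. The local file gets renamed to
            # "name (conflicted copy <date>)" and the server version is downloaded.
            conflicts.append(ev["file_id"])
            state["server_version"] = ev["version"]
    return conflicts
```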

Real-Time Notification (Change Events)

When file changes on any device, all other devices must be notified:

Server-Sent Events (SSE) connection per device:
  Client → GET /events/stream (long-lived HTTP connection)
  Server pushes: "data: {event_id: 101, type: 'file_updated', file_id: X}\n\n"

Event fan-out:
  File update → DB write → Publish to Redis Pub/Sub channel: user:{user_id}:events
  SSE broadcaster (subscribed to Redis) → push to all connected devices for that user

Connection management:
  Multiple devices per user → SSE broadcaster tracks device connections
  Device disconnect → SSE connection closes → broadcaster drops it from its table; no other server-side state to clean up
  Device reconnect → GET /events?since={last_cursor} to catch up on missed events

Scale:
  10M active syncs × 1 SSE connection = 10M persistent connections
  Dedicated SSE server cluster (Go/nginx, highly concurrent)
  Redis Pub/Sub: fanout per user_id channel

Bandwidth Optimization

Delta encoding for text files:
  Instead of uploading entire 4MB chunk when 1 line changes:
  Compute diff (rsync algorithm): binary diff of old chunk vs new chunk
  Upload only the diff (often < 1KB for small edits)
  Server reconstructs new chunk: old_chunk + delta → new chunk

Rsync algorithm (simplified):
  1. Server sends checksums for each 1KB block of the old version
  2. Client slides a window over the new version, computing checksums
  3. For each matching block: send reference to server block (tiny)
  4. For each non-matching section: send raw bytes
  Net: often 99%+ reduction in bytes uploaded for document edits
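
A toy version of those four steps (for clarity it recomputes the weak checksum at every offset; real rsync rolls it forward in O(1) per byte):

```python
import hashlib

BLOCK = 1024  # 1 KB blocks, as in step 1

def weak_sum(data: bytes) -> int:
    # Adler-style weak checksum: cheap, and rollable one byte at a time.
    a = sum(data) % 65521
    b = sum((len(data) - i) * byte for i, byte in enumerate(data)) % 65521
    return (b << 16) | a

def block_signatures(old: bytes):
    # Step 1: signatures for each block of the old version, keyed by
    # weak checksum, then strong (MD5) checksum -> block offset.
    sigs = {}
    for off in range(0, len(old), BLOCK):
        block = old[off:off + BLOCK]
        sigs.setdefault(weak_sum(block), {})[hashlib.md5(block).digest()] = off
    return sigs

def compute_delta(new: bytes, sigs):
    # Steps 2-4: slide over the new version, emitting ("ref", offset)
    # for matched blocks and ("raw", bytes) for literal sections.
    delta, literal, i = [], bytearray(), 0
    while i < len(new):
        window, match = new[i:i + BLOCK], None
        if len(window) == BLOCK:
            strong_map = sigs.get(weak_sum(window))
            if strong_map:
                match = strong_map.get(hashlib.md5(window).digest())
        if match is not None:
            if literal:
                delta.append(("raw", bytes(literal)))
                literal = bytearray()
            delta.append(("ref", match))
            i += BLOCK
        else:
            literal.append(new[i])
            i += 1
    if literal:
        delta.append(("raw", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta) -> bytes:
    # Server side: reconstruct the new version from old blocks + literals.
    out = bytearray()
    for kind, val in delta:
        out += old[val:val + BLOCK] if kind == "ref" else val
    return bytes(out)
```

For a small insertion into a large file, the delta is a handful of raw bytes plus block references, which is where the large bandwidth savings come from.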

Compression: gzip/brotli all chunks before upload (text: 70% reduction)

Bandwidth throttling:
  Background sync: limit to 30% of available bandwidth
  Detect user activity: slow sync when user is actively using network

Version History and Restore

Dropbox keeps 30-day version history (180 days for Pro):

Every file update:
  INSERT INTO file_versions (file_id, version_num, chunk_ids, modified_at)
  → Old chunk data stays in S3 (referenced by version records)

Restore to previous version:
  User requests version 3 of file.txt:
    SELECT chunk_ids FROM file_versions WHERE file_id=? AND version_num=3
    → Reconstruct file from those chunks (already in S3)
    → Client downloads chunks, reassembles file
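
Reassembly is then just ordered concatenation; here `fetch_chunk` stands in for downloading `chunks/{chunk_id}` from S3:

```python
def restore_version(version_chunk_ids, fetch_chunk):
    # Rebuild a historical version by fetching each chunk by id and
    # concatenating in the order recorded in file_versions.chunk_ids.
    return b"".join(fetch_chunk(cid) for cid in version_chunk_ids)
```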

Chunk garbage collection:
  A chunk is deletable only when no current file references it AND no version within the 30-day retention window references it
  Background job: scan for chunks with ref_count=0 and no remaining version references
  Soft delete: mark for deletion, actually remove from S3 after a 7-day grace period
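
The GC scan with the 30-day retention and 7-day grace period could look like this sketch (the data shapes are assumptions):

```python
from datetime import datetime, timedelta

def collectable_chunks(chunks, version_refs, now,
                       retention_days=30, grace_days=7):
    """One GC pass: decide which chunks to soft-delete or finally delete.

    chunks: chunk_id -> {"ref_count": int, "marked_at": datetime | None}
    version_refs: chunk_id -> modified_at of the newest version still using it
    """
    retention_cutoff = now - timedelta(days=retention_days)
    grace = timedelta(days=grace_days)
    to_mark, to_delete = [], []
    for cid, meta in chunks.items():
        newest = version_refs.get(cid)
        unreferenced = (meta["ref_count"] == 0
                        and (newest is None or newest < retention_cutoff))
        if not unreferenced:
            continue
        if meta["marked_at"] is None:
            to_mark.append(cid)        # soft delete: start the grace period
        elif now - meta["marked_at"] >= grace:
            to_delete.append(cid)      # grace elapsed: safe to delete from S3
    return to_mark, to_delete
```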

Interview Discussion Points

  • Why SHA-256 chunk IDs instead of random IDs? Content-addressable storage: if two files share the same chunk (identical file sections), they point to the same chunk in S3 — stored once. The hash IS the identity of the content. Random IDs would store duplicates. Collision risk for SHA-256 is astronomically low: about 2^-256 for any given pair of chunks, and by the birthday bound a collision becomes likely only after roughly 2^128 distinct chunks.
  • How do you handle very large files (100GB video)? S3 multipart upload: chunk the file, upload parts in parallel (up to 10,000 parts × 5MB = 50GB max; for larger files, increase part size). S3 handles reassembly. Client tracks which parts uploaded — resumes after interruption. Streaming: client doesn’t buffer the whole file; reads and uploads 4MB at a time.
  • Conflict resolution strategy: Most cloud storage (Dropbox, iCloud) uses “last write wins” with a conflict copy — no merge attempt. Google Docs achieves true merge via Operational Transformation (OT) or CRDT because it’s a single document type with defined merge semantics. For arbitrary binary files, OT doesn’t apply.
