System Design Interview: Design Dropbox / Google Drive (Cloud File Storage)
Cloud file storage like Dropbox or Google Drive is a popular system design question testing file chunking, sync protocols, conflict resolution, and large-scale object storage. Commonly asked at Dropbox, Google, Microsoft (OneDrive), Box, and Apple (iCloud).
Requirements Clarification
Functional Requirements
- Upload, download, delete files and folders
- Sync files across multiple devices automatically
- Share files/folders with other users (view, edit permissions)
- Version history: restore previous file versions
- Offline access: changes sync when device comes online
- Collaboration: multiple users editing the same file (simplified — full Google Docs-style real-time co-editing is out of scope)
Non-Functional Requirements
- Scale: 500M users, 50 PB total storage, 10M concurrent connected devices
- File sizes: small (KB) to large (GB)
- Sync latency: changes visible on other devices within 5 seconds
- Bandwidth efficiency: only upload changed portions of files (delta sync)
- High durability: 99.999999999% (eleven nines) — achieved via triple redundancy or erasure coding in the block store
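A quick back-of-envelope check on those scale targets (arithmetic only — the per-user and per-block figures below are derived from the stated totals, not published Dropbox numbers):

```python
# Back-of-envelope sizing from the stated requirements.
users = 500_000_000
total_logical = 50 * 10**15                  # 50 PB of logical user data
per_user = total_logical // users            # average bytes per user: 100 MB
raw_with_3x = 3 * total_logical              # triple redundancy: 150 PB raw
total_blocks = total_logical // (4 * 2**20)  # 4 MB blocks: ~12 billion
                                             # block records to index
```

Roughly 12 billion block records is the key takeaway: block metadata alone is a serious database, which is why the hash index gets prefix-sharded.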
High-Level Architecture
Desktop/Mobile Client
↓
Upload Service → Block Storage (S3/GCS)
Metadata Service → PostgreSQL (file tree, versions)
Sync Service → WebSocket connections
Notification Service → Push/WebSocket
↓
CDN (for downloads of popular files)
Core Innovation: Block-Level Deduplication
```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks


def split_file_into_blocks(file_path: str) -> list[dict]:
    """
    Split a file into fixed-size blocks.
    Each block is identified by the SHA-256 hash of its content.
    Same content = same hash = no upload needed (deduplication).
    """
    blocks = []
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            blocks.append({
                'hash': hashlib.sha256(data).hexdigest(),
                'size': len(data),
                'data': data,  # only kept until the upload check
            })
    return blocks


class BlockStore:
    """
    Content-addressed block storage.
    A block is stored under its hash — if the hash exists, the block is
    already stored. This achieves global deduplication across all users.
    """

    def __init__(self, s3_client, bucket: str):
        self.s3 = s3_client
        self.bucket = bucket

    def _key(self, block_hash: str) -> str:
        return f"blocks/{block_hash[:2]}/{block_hash}"  # prefix sharding

    def upload_block(self, block_hash: str, data: bytes) -> bool:
        """Upload a block only if it doesn't already exist (deduplication)."""
        if self.block_exists(block_hash):
            return False  # already stored, skip upload
        self.s3.put_object(
            Bucket=self.bucket,
            Key=self._key(block_hash),
            Body=data,
            ServerSideEncryption='AES256',
        )
        return True

    def block_exists(self, block_hash: str) -> bool:
        try:
            self.s3.head_object(Bucket=self.bucket, Key=self._key(block_hash))
            return True
        except self.s3.exceptions.ClientError:
            return False

    def download_block(self, block_hash: str) -> bytes:
        response = self.s3.get_object(Bucket=self.bucket,
                                      Key=self._key(block_hash))
        return response['Body'].read()
```
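The dedup property is easy to see in isolation. A self-contained sketch (it restates the splitter above with a parameterized block size so tiny demo files span several blocks):

```python
import hashlib
import os
import tempfile


def split_into_blocks(path: str, block_size: int) -> list[str]:
    """Same idea as split_file_into_blocks above, returning just the hashes."""
    hashes = []
    with open(path, "rb") as f:
        while (chunk := f.read(block_size)):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes


with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "a.bin")
    b = os.path.join(d, "b.bin")
    shared = b"A" * 8192  # identical leading content in both files
    with open(a, "wb") as f:
        f.write(shared + b"tail-one")
    with open(b, "wb") as f:
        f.write(shared + b"tail-two")
    ha = split_into_blocks(a, 4096)
    hb = split_into_blocks(b, 4096)
    # The first two 4 KB blocks hash identically, so uploading the second
    # file only transfers its final (differing) block.
    overlap = sum(x == y for x, y in zip(ha, hb))
```

Here `overlap` comes out to 2 of 3 blocks: only each file's last block would be uploaded.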
File Metadata Model
```sql
CREATE TABLE files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    owner_id UUID NOT NULL,
    path TEXT NOT NULL,                 -- /Documents/report.pdf
    name VARCHAR(255) NOT NULL,
    size_bytes BIGINT NOT NULL DEFAULT 0,
    mime_type VARCHAR(100),
    is_folder BOOLEAN NOT NULL DEFAULT FALSE,
    parent_id UUID REFERENCES files(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    modified_at TIMESTAMPTZ DEFAULT NOW(),
    is_deleted BOOLEAN DEFAULT FALSE,   -- soft delete
    UNIQUE (owner_id, path)
);

CREATE TABLE file_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES files(id),
    version_number INT NOT NULL,
    size_bytes BIGINT NOT NULL,
    block_hashes TEXT[] NOT NULL,       -- ordered list of block hashes
    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by UUID NOT NULL,           -- which device/user created this version
    UNIQUE (file_id, version_number)
);

CREATE TABLE file_shares (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES files(id),
    shared_with UUID NOT NULL,          -- user_id
    permission VARCHAR(10) NOT NULL,    -- READ, WRITE, ADMIN
    created_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ
);

-- Indexes for efficient sync queries
CREATE INDEX idx_files_owner_modified ON files(owner_id, modified_at DESC);
CREATE INDEX idx_versions_file ON file_versions(file_id, version_number DESC);
```
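The core read path against this schema is "latest version of a file, with its block list". A simplified sqlite3 sketch (Postgres-specific types are adapted: TEXT ids stand in for UUID, a JSON string stands in for TEXT[]):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id TEXT PRIMARY KEY, owner_id TEXT, path TEXT,
                    UNIQUE (owner_id, path));
CREATE TABLE file_versions (
    file_id TEXT, version_number INTEGER, size_bytes INTEGER,
    block_hashes TEXT,  -- JSON array stands in for Postgres TEXT[]
    UNIQUE (file_id, version_number));
""")
conn.execute("INSERT INTO files VALUES ('f1', 'u1', '/Documents/report.pdf')")
for v, hashes in [(1, ['aa', 'bb']), (2, ['aa', 'cc'])]:
    conn.execute("INSERT INTO file_versions VALUES ('f1', ?, ?, ?)",
                 (v, 8_388_608, json.dumps(hashes)))

# Latest version = highest version_number; the client then downloads
# exactly the blocks listed (most of which it may already have locally).
row = conn.execute("""
    SELECT version_number, block_hashes FROM file_versions
    WHERE file_id = 'f1' ORDER BY version_number DESC LIMIT 1
""").fetchone()
latest, blocks = row[0], json.loads(row[1])
```

Note how version 2 reuses block `aa` from version 1 — version history is cheap because versions share unchanged blocks.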
Upload Flow: Delta Sync
```python
class DropboxClient:
    """
    Client-side logic for uploading and syncing files.
    Key optimization: only upload blocks that the server doesn't have.
    """

    def __init__(self, server_api, block_store):
        self.api = server_api
        self.block_store = block_store
        self.local_state = LocalSyncState()  # tracks last-known server state

    def upload_file(self, local_path: str, remote_path: str):
        """Upload a file with deduplication — only send new/changed blocks."""
        # Step 1: split into blocks
        blocks = split_file_into_blocks(local_path)
        block_hashes = [b['hash'] for b in blocks]

        # Step 2: ask the server which blocks it already has.
        # The server returns the hashes NOT in its storage.
        missing_hashes = set(self.api.check_blocks(block_hashes))

        # Step 3: upload only the missing blocks
        for block in blocks:
            if block['hash'] in missing_hashes:
                self.block_store.upload_block(block['hash'], block['data'])

        # Step 4: commit file metadata (block list, version)
        version = self.api.commit_file(
            path=remote_path,
            block_hashes=block_hashes,
            size=sum(b['size'] for b in blocks),
        )
        print(f"Uploaded {remote_path} as version {version}")

    def check_blocks_client_side(self, blocks: list[dict]) -> dict:
        """
        For large files, pre-check hashes on the client before uploading.
        Returns {hash: exists} for all blocks.
        """
        hashes = [b['hash'] for b in blocks]
        return self.api.batch_check_blocks(hashes)  # server returns a dict
```
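The server side of step 2 is essentially a set difference over the block index. A minimal sketch (`check_blocks` here is a stand-in for the server endpoint, with the stored-hash set passed in explicitly):

```python
def check_blocks(requested_hashes: list[str], stored_hashes: set[str]) -> list[str]:
    """Return the hashes the client must upload, preserving client order.
    In production, membership would be checked against the block index
    (e.g. a batched lookup), not an in-memory set."""
    stored = set(stored_hashes)
    return [h for h in requested_hashes if h not in stored]


# Only 'h2' is already stored, so two of three blocks need uploading:
missing = check_blocks(['h1', 'h2', 'h3'], {'h2'})  # -> ['h1', 'h3']
```

This is why the pre-upload round trip is cheap: the request carries hashes (32 bytes each), never block data.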
```python
class SyncService:
    """
    Server-side sync orchestration.
    Notifies connected devices when files change.
    Uses WebSocket connections, with long-polling as a fallback.
    """

    def __init__(self, db, notification_service):
        self.db = db
        self.notifier = notification_service
        self.connected_devices: dict[str, str] = {}  # device_id -> websocket

    def commit_file_upload(self, user_id: str, path: str,
                           block_hashes: list[str], size: int) -> int:
        """
        Commit a file upload:
        1. Create/update the file record
        2. Create a new version
        3. Notify the user's other devices
        """
        with self.db.transaction():
            # Upsert file record
            file_id = self.db.upsert_file(user_id, path, size)
            # Create new version
            current_version = self.db.get_latest_version(file_id)
            new_version = (current_version or 0) + 1
            self.db.create_version(file_id, new_version, block_hashes,
                                   size, user_id)

        # Notify other devices — async, and after the transaction commits,
        # so a device never sees a notification for an uncommitted version.
        self.notifier.notify_user_devices(user_id, {
            'type': 'FILE_CHANGED',
            'path': path,
            'version': new_version,
            'size': size,
        }, exclude_device=None)  # notify all devices
        return new_version

    def get_changes_since(self, user_id: str, cursor: str) -> dict:
        """
        Pull-based sync: the client sends its cursor (last sync timestamp
        or version); the server returns all changes since that cursor.
        Used for reconnecting after offline and initial sync on a new device.
        """
        changes = self.db.fetch("""
            SELECT f.path, fv.version_number, fv.block_hashes, fv.size_bytes,
                   fv.created_at, f.is_deleted
            FROM file_versions fv
            JOIN files f ON f.id = fv.file_id
            WHERE f.owner_id = $1
              AND fv.created_at > $2
            ORDER BY fv.created_at ASC
            LIMIT 1000
        """, user_id, cursor)
        new_cursor = changes[-1]['created_at'].isoformat() if changes else cursor
        return {'changes': changes, 'cursor': new_cursor}
```
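Because `get_changes_since` pages at 1000 rows, the client drains pages in a loop until one comes back empty. A sketch of that client loop (`fetch_page` stands in for the API call; the page dicts mirror the `{'changes': ..., 'cursor': ...}` shape above):

```python
def sync_all_changes(fetch_page, cursor):
    """Cursor-based pull sync: keep fetching until a page is empty,
    carrying the server-issued cursor forward between requests."""
    applied = []
    while True:
        page = fetch_page(cursor)
        if not page['changes']:
            return applied, cursor  # caller persists this cursor locally
        applied.extend(page['changes'])
        cursor = page['cursor']


# Fake server returning two non-empty pages, then an empty one:
pages = iter([
    {'changes': ['a.txt v3'], 'cursor': 't1'},
    {'changes': ['b.txt v1'], 'cursor': 't2'},
    {'changes': [], 'cursor': 't2'},
])
changes, final_cursor = sync_all_changes(lambda c: next(pages), 't0')
```

Persisting the final cursor is what makes offline recovery cheap: after a week offline, the device asks only for changes since `t2`, not a full re-scan.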
Conflict Resolution
```python
def resolve_conflict(client_version: dict, server_version: dict) -> str:
    """
    When two devices edit the same file offline simultaneously:
    Dropbox's strategy is last-writer-wins plus a conflict copy, so
    neither side's changes are lost.
    More sophisticated systems (Google Docs) use Operational Transform
    or CRDTs for real-time collaborative editing.
    """
    # Simple strategy: keep both versions
    if client_version['modified_at'] > server_version['modified_at']:
        # Client has newer changes — upload as the new version and
        # preserve the server version as a conflict copy.
        conflict_name = f"{server_version['name']} (conflicted copy)"
        create_conflict_copy(server_version, conflict_name)
        return 'CLIENT_WINS'
    else:
        # Server is newer — download the server version and preserve
        # the client's changes as a conflict copy.
        conflict_name = f"{client_version['name']} (conflicted copy)"
        create_conflict_copy(client_version, conflict_name)
        return 'SERVER_WINS'
```
Key Design Decisions
- Block size 4-8MB: Larger blocks = fewer requests but more wasted upload when file changes. Dropbox uses variable-size blocks (Rabin fingerprinting for content-defined chunking) to handle insertions/deletions better than fixed-size blocks.
- Content-addressed storage: Block hash = block identity. Same content = same storage. Global deduplication: if 1M users upload the same popular video, only 1 copy stored. Estimate: 30-50% storage savings from deduplication.
- Notification via WebSocket for push, long-poll as fallback: When file changes, server pushes to connected devices immediately. For mobile (battery/network constraints): use push notifications via APNS/FCM.
- Version limits: Dropbox keeps 30 days of version history (Plus: 180 days). Implement as TTL-based cleanup job that deletes old file_versions records and their blocks (if no other version references them).
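The content-defined chunking mentioned above can be sketched with a toy rolling hash — a stand-in for Rabin fingerprinting. The window, mask, and modulus here are illustrative (average chunk ~2 KB, versus megabytes in production):

```python
import random


def cdc_chunks(data: bytes, window: int = 16,
               mask: int = (1 << 11) - 1) -> list[bytes]:
    """Toy content-defined chunker: cut wherever a rolling hash of the
    last `window` bytes has its low bits all zero. Because boundaries
    depend only on local content, an insertion shifts only nearby
    chunks; fixed-size blocks would shift every block after the edit."""
    base, mod = 257, (1 << 31) - 1
    power = pow(base, window - 1, mod)
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        if i - start >= window:                      # slide the window:
            h = (h - data[i - window] * power) % mod  # drop oldest byte
        h = (h * base + data[i]) % mod                # add newest byte
        if i - start + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])          # boundary found
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                   # final partial chunk
    return chunks


random.seed(42)
data = bytes(random.randrange(256) for _ in range(50_000))
chunks = cdc_chunks(data)
assert b"".join(chunks) == data  # chunking is lossless
```

Because the chunker is streaming, appending data never disturbs earlier chunk boundaries, and an insertion in the middle realigns after roughly one chunk — the property that makes delta sync robust to edits that shift byte offsets.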
FAQ
How does Dropbox reduce upload bandwidth with block-level deduplication?
Dropbox splits files into 4-8 MB blocks and identifies each block by its SHA-256 content hash. When uploading a file: (1) the client computes hashes for all blocks, (2) sends the hash list to the server (a tiny request), (3) the server checks which hashes already exist in block storage, (4) the client uploads only the missing blocks. If you upload a 1 GB video that differs by only 100 KB from a previous version, only the changed blocks are uploaded — potentially just 4-8 MB instead of 1 GB. For large unchanged files (common on re-upload), upload bandwidth is near zero. Dropbox uses content-defined chunking (Rabin fingerprinting) to handle insertions and deletions that shift block boundaries.

How does file sync work across multiple devices?
The sync protocol is push-based for real-time updates and pull-based for reconnects. Real-time: the server maintains persistent WebSocket connections to all online devices. When a file changes on Device A: Device A uploads its changes → the server commits a new version → the server pushes a FILE_CHANGED notification to the user's other connected devices → those devices download the changed blocks. Reconnect/offline: a device sends its last sync cursor (timestamp or version ID) to the server; the server returns all changes since that cursor (paginated), and the device applies them and updates its cursor. Each device maintains a local database mapping path → last-known version, enabling efficient delta computation.

How are conflicts handled when two devices edit the same file?
Dropbox uses last-writer-wins with conflict-copy creation: if Device A and Device B both edit file.txt while offline and then come online, the first to sync wins. The second device's changes are saved as "file (conflicted copy from Device B 2024-01-15).txt". Both versions are preserved — no data is lost. This is simple but requires manual resolution by the user. Google Docs takes a different approach: Operational Transform (OT) or Conflict-free Replicated Data Types (CRDTs) allow true concurrent editing where changes from multiple users are merged automatically. For file-level systems like Dropbox, the conflict-copy approach is standard practice.

How do you scale cloud storage to petabytes across millions of users?
(1) Block storage in S3/GCS — object storage handles petabyte scale natively with eleven-nines durability via erasure coding. (2) Metadata database — PostgreSQL/MySQL for file trees and versions, sharded by user_id. (3) CDN for downloads — popular (publicly shared) files served from the edge. (4) Deduplication — content-addressed storage yields roughly 30-50% storage savings; popular files uploaded by many users are stored once. (5) Tiered storage — recently accessed files on fast SSD-backed S3; cold files (not accessed for 90 days) move to S3 Glacier (roughly 10x cheaper). (6) Async block upload — clients upload to S3 directly via presigned URLs, bypassing application servers.

What is the difference between Dropbox, Google Drive, and iCloud architecture?
Dropbox: the desktop client does the heavy lifting (block splitting, hashing, delta computation), so the server is relatively simple. It integrates with native OS file-system notifications (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). Sync is file-level: even small changes upload at least one full block. Google Drive: server-centric — the Drive API handles version management, and real-time collaborative editing (Docs/Sheets) is supported via Operational Transform, with tight Google Cloud integration. Apple iCloud: uses the CloudKit framework, tight iOS/macOS integration, and background sync via silent push notifications. All three use content-addressed block storage on the backend but differ significantly in client architecture and collaboration features.