The Problem: Counting at Scale
Incrementing a counter sounds trivial — until you need to count 10 million page views per second across 100 servers without a central bottleneck. The naive approach: every event runs UPDATE counters SET count = count + 1 WHERE id = X. At high throughput, row-level lock contention on that single row becomes the bottleneck. At Twitter scale (500M tweets/day ≈ 5,800/sec), a single counter update per like would saturate one database row.
Approach 1: Redis INCR
Redis INCR is an atomic O(1) operation. For read-heavy metrics (like view counts), Redis handles millions of increments per second. Persist asynchronously to the database every N seconds or N events.
import redis

r = redis.Redis()

def increment_view(post_id: str, user_id: str):
    # Deduplicate: only count unique views.
    # SADD returns 1 if the user was not in the set (i.e., this is a new view).
    is_new_view = r.sadd(f"views:{post_id}:users", user_id)
    if is_new_view:
        r.incr(f"views:{post_id}:count")

def get_view_count(post_id: str) -> int:
    return int(r.get(f"views:{post_id}:count") or 0)

# Periodic flush to DB (every 60 seconds)
def flush_counters():
    for key in r.scan_iter("views:*:count"):
        post_id = key.decode().split(':')[1]
        # GETSET atomically reads the counter and resets it to 0
        count = int(r.getset(key, 0) or 0)
        if count > 0:
            db.execute("UPDATE posts SET view_count = view_count + %s WHERE id = %s",
                       [count, post_id])
Approach 2: Sharded Counters
For ultra-high-traffic counters (YouTube view count, Twitter like count), shard the counter into N shards. Each increment goes to a random shard. Read requires summing all shards.
import random

NUM_SHARDS = 16

def increment(counter_name: str):
    shard = random.randint(0, NUM_SHARDS - 1)
    r.incr(f"counter:{counter_name}:shard:{shard}")

def get_count(counter_name: str) -> int:
    keys = [f"counter:{counter_name}:shard:{i}" for i in range(NUM_SHARDS)]
    values = r.mget(keys)
    return sum(int(v or 0) for v in values)
Trade-off: reads are O(NUM_SHARDS). With 16 shards, writes spread across 16 keys, eliminating hot-key contention. Reads require 16 operations, but these can be batched into a single MGET (one network round-trip).
Approach 3: Lambda Architecture
For analytics that need both real-time and historical accuracy:
- Speed layer: Kafka + Flink/Spark Streaming. Process events in near-real-time windows (1-minute, 5-minute). Store in Redis. Low latency, approximate accuracy.
- Batch layer: Kafka + Spark batch jobs process complete historical data daily. Store in data warehouse (BigQuery, Snowflake, Redshift). High accuracy, high latency.
- Serving layer: merge batch + speed layer results. Batch provides accurate historical; speed provides recent events not yet in batch.
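The serving-layer merge can be sketched in a few lines. This is a minimal in-memory illustration, not a real implementation: `batch_counts` and `speed_counts` are hypothetical dictionaries standing in for the data warehouse and Redis stores, where the speed layer holds only events after the last batch cutoff.

```python
from collections import defaultdict

def merge_counts(batch_counts: dict, speed_counts: dict) -> dict:
    """Serving layer: accurate batch totals plus recent speed-layer deltas.

    batch_counts: per-key totals up to the last completed batch run.
    speed_counts: per-key counts for events after the batch cutoff only.
    """
    merged = defaultdict(int)
    for key, count in batch_counts.items():
        merged[key] += count
    for key, count in speed_counts.items():
        merged[key] += count
    return dict(merged)

# Batch layer has processed through yesterday; speed layer holds today's events.
batch = {"video:1": 1_000_000, "video:2": 500}
speed = {"video:1": 42, "video:3": 7}
print(merge_counts(batch, speed))
# {'video:1': 1000042, 'video:2': 500, 'video:3': 7}
```

The key invariant is non-overlap: an event must appear in exactly one layer, or the merge double-counts it.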
Approach 4: Time-Series Databases
For metrics with a time dimension (requests/minute, error rate over time), use a time-series database:
- InfluxDB: designed for time-series, supports downsampling (retain 1-minute resolution for 7 days, 1-hour resolution for 1 year), native aggregations.
- Prometheus: pull-based metrics collection. Scraped from services every 15 seconds. PromQL for querying. Grafana for visualization.
- TimescaleDB: PostgreSQL extension for time-series. Full SQL support, automatic partitioning by time.
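The downsampling idea behind these databases can be illustrated with a small pure-Python sketch (hypothetical in-memory data, no database involved): raw per-second points are rolled up into fixed-width time buckets, which is what TimescaleDB's time_bucket() or an InfluxDB downsampling task does at storage scale.

```python
from collections import defaultdict

def downsample(events: list, bucket_seconds: int = 60) -> dict:
    """Roll up (timestamp, value) points into fixed-width time buckets.

    Keeps one aggregated row per bucket instead of every raw point —
    the essence of retaining 1-minute resolution instead of raw events.
    """
    buckets = defaultdict(int)
    for ts, value in events:
        bucket = int(ts) - int(ts) % bucket_seconds  # bucket start time
        buckets[bucket] += value
    return dict(buckets)

# Three requests in minute 0, one in minute 1.
raw = [(3.2, 1), (15.0, 1), (59.9, 1), (61.0, 1)]
print(downsample(raw))  # {0: 3, 60: 1}
```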
Real-Time Leaderboards
# Redis Sorted Set: O(log n) insert, O(log n) rank query
def add_score(leaderboard: str, user_id: str, score: float):
    r.zadd(leaderboard, {user_id: score})

def increment_score(leaderboard: str, user_id: str, delta: float):
    r.zincrby(leaderboard, delta, user_id)

def get_top_k(leaderboard: str, k: int = 10):
    # ZREVRANGE: highest scores first, with scores
    return r.zrevrange(leaderboard, 0, k - 1, withscores=True)

def get_rank(leaderboard: str, user_id: str):
    rank = r.zrevrank(leaderboard, user_id)  # 0-indexed, None if not found
    return (rank + 1) if rank is not None else None
Sliding Window Rate Counter
import time
import uuid

def is_rate_limited(user_id: str, window_seconds: int = 60, max_requests: int = 100) -> bool:
    now = time.time()
    key = f"rate:{user_id}"
    # Remove events older than the window
    r.zremrangebyscore(key, '-inf', now - window_seconds)
    # Count events in window
    count = r.zcard(key)
    if count >= max_requests:
        return True  # rate limited
    # Add current event; a UUID member avoids collisions when two
    # requests share the same timestamp (the score carries the time)
    r.zadd(key, {str(uuid.uuid4()): now})
    r.expire(key, window_seconds)
    return False
Exactly-Once Event Counting
At-least-once delivery (Kafka default) means events may be counted twice on reprocessing. For exact counts:
- Use an event ID (UUID) per event. Before counting, check whether the event_id is already in a "processed" set (Bloom filter for memory efficiency, Redis SET for accuracy). Only count if not seen before.
- For session-based deduplication (count unique views, not total views), use a per-user per-item set: SADD views:{post_id}:unique {user_id} returns 0 if the user is already in the set.
- HyperLogLog: Redis PFADD/PFCOUNT for approximate unique counts with 0.81% standard error at O(1) memory per counter — perfect for "unique visitors" type metrics.
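The event-ID dedup pattern can be sketched without a Redis server. This is a minimal in-memory stand-in: the `seen` set below plays the role of the Redis SET (or Bloom filter) of processed IDs, so the counter is idempotent under redelivery.

```python
import uuid

class DedupCounter:
    """Count each event ID at most once, making at-least-once delivery safe.

    In production, `seen` would live in Redis (SADD returns 1 only on first
    insert) or a Bloom filter; a plain Python set keeps this sketch runnable.
    """
    def __init__(self):
        self.seen = set()
        self.count = 0

    def process(self, event_id: str) -> bool:
        if event_id in self.seen:
            return False  # duplicate delivery: ignore
        self.seen.add(event_id)
        self.count += 1
        return True

counter = DedupCounter()
e1, e2 = str(uuid.uuid4()), str(uuid.uuid4())
counter.process(e1)
counter.process(e2)
counter.process(e1)  # redelivered event: not counted again
print(counter.count)  # 2
```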
Interview Questions
Q: How would you design YouTube’s view count system?
Views are write-heavy (each play increments the counter). Deduplicate within a session window (same user within 24 hours counts once). Architecture: (1) Player sends view event to a Kafka topic. (2) A Flink job deduplicates by {video_id, user_id, day} and increments a sharded Redis counter (16 shards per video). (3) Every 5 minutes, a flush job reads Redis shards, sums them, and applies the delta to the database with an atomic ADD. (4) View counts are read from Redis for real-time display; the database is the authoritative store for historical analytics.
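Steps (2) and (3) of this answer can be simulated in memory to show how the dedup key and the sharded counter compose. The helper names below are hypothetical; a real system would hold `seen_views` in Flink keyed state and `shards` in Redis.

```python
import random

NUM_SHARDS = 16
seen_views = set()   # stands in for Flink dedup state, keyed by (video, user, day)
shards = {}          # stands in for per-video Redis shard keys

def record_view(video_id: str, user_id: str, day: str) -> None:
    key = (video_id, user_id, day)        # dedup on {video_id, user_id, day}
    if key in seen_views:
        return                            # same user, same day: counted already
    seen_views.add(key)
    shard = random.randint(0, NUM_SHARDS - 1)
    shards[(video_id, shard)] = shards.get((video_id, shard), 0) + 1

def flush(video_id: str) -> int:
    """Sum and reset all shards; the result is the delta applied to the DB."""
    return sum(shards.pop((video_id, s), 0) for s in range(NUM_SHARDS))

record_view("v1", "alice", "2024-01-01")
record_view("v1", "alice", "2024-01-01")  # deduplicated
record_view("v1", "bob", "2024-01-01")
print(flush("v1"))  # 2
```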
Asked at: Twitter/X, Netflix, LinkedIn, Databricks, Snap