Core Entities
User: user_id, username, display_name, bio, avatar_url, follower_count, following_count, created_at.
Post: post_id, author_id, content, media_urls[], like_count, comment_count, share_count, visibility (PUBLIC, FOLLOWERS, PRIVATE), created_at.
Follow: follower_id, followee_id, created_at.
Like: user_id, post_id, created_at.
Comment: comment_id, post_id, author_id, content, parent_comment_id (for replies), like_count, created_at.
FeedItem: user_id, post_id, score, created_at (cached news feed entry).
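The entities above can be sketched as Python dataclasses to make the field types and defaults concrete. This is an illustrative in-memory model, not a claim about any particular platform's schema; the field names follow the list above, and the defaults (e.g., visibility PUBLIC, zeroed counters) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class User:
    user_id: int
    username: str
    display_name: str = ""
    bio: str = ""
    avatar_url: str = ""
    follower_count: int = 0   # denormalized, updated on follow/unfollow
    following_count: int = 0
    created_at: float = 0.0   # epoch seconds

@dataclass
class Post:
    post_id: int
    author_id: int
    content: str
    media_urls: list = field(default_factory=list)
    like_count: int = 0       # cached count, see Like and Comment Systems
    comment_count: int = 0
    share_count: int = 0
    visibility: str = "PUBLIC"  # PUBLIC | FOLLOWERS | PRIVATE
    created_at: float = 0.0

@dataclass
class Comment:
    comment_id: int
    post_id: int
    author_id: int
    content: str
    parent_comment_id: Optional[int] = None  # None for top-level comments
    like_count: int = 0
    created_at: float = 0.0
```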
Post Creation and Storage
Post creation flow: (1) User submits text + media. (2) Media uploaded directly to object storage (S3) via presigned URL. (3) Post record created in PostgreSQL: post_id, author_id, content, media_urls[]. (4) Post published to Kafka topic posts. (5) Feed fanout service consumes and distributes to follower feeds. Post storage at scale: shard posts table by author_id (most queries are author-scoped: “show me my posts”). Index on (author_id, created_at DESC) for profile pages. For global post IDs: use a distributed ID generator (Snowflake-style: timestamp + machine_id + sequence) to maintain rough time ordering. Media CDN: serve images and videos via CDN (CloudFront, Cloudflare) — never serve directly from S3 in production.
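The Snowflake-style ID generator mentioned above can be sketched as follows. The bit layout (41-bit millisecond timestamp, 10-bit machine_id, 12-bit per-millisecond sequence) and the custom epoch are illustrative choices, not a specification of any production system.

```python
import threading
import time

class SnowflakeId:
    """64-bit IDs: 41-bit ms timestamp | 10-bit machine_id | 12-bit sequence.

    Because the timestamp occupies the high bits, IDs sort in rough time order.
    """
    EPOCH = 1288834974657  # example custom epoch in ms (assumed, not canonical)

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024, "machine_id must fit in 10 bits"
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # sequence exhausted for this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence
```

Successive IDs from one generator are strictly increasing, which is what keeps the posts table roughly time-ordered without a central sequence.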
Feed Generation: Push vs Pull
Push (fanout on write): when a post is created, immediately write it to every follower feed. Feed reads are O(1). Downside: for users with millions of followers (celebrities), a single post triggers millions of feed writes. Use push for regular users (followers < 10K). Pull (fanout on read): when a user opens their feed, fetch the latest posts from all followed accounts and merge. No precomputation. Downside: expensive for users following many accounts (merge 1000 latest feeds). Use pull for celebrities. Hybrid: use push for regular followees, skip the fanout for celebrity followees. On feed read: combine precomputed feed (from push) with real-time fetched posts from celebrities. Twitter and Instagram use this hybrid approach.
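The read-time merge in the hybrid approach can be sketched as below. The in-memory dict stands in for a real-time query against the posts store, and sorting by timestamp stands in for the ranking step; both are assumptions for illustration.

```python
import heapq

def read_feed(precomputed_feed, celebrity_posts, followed_celebrities, k=50):
    """Merge the precomputed (push) feed with real-time pulls from celebrities.

    precomputed_feed: list of (timestamp, post_id) written by the fanout worker.
    celebrity_posts: dict celebrity_id -> list of (timestamp, post_id); a stand-in
    for fetching each celebrity's latest posts at read time.
    """
    pulled = []
    for celeb_id in followed_celebrities:
        pulled.extend(celebrity_posts.get(celeb_id, []))
    # Newest first; a production system would sort by ranking score instead.
    return [post_id for _, post_id in heapq.nlargest(k, precomputed_feed + pulled)]
```

Only the handful of celebrity followees are pulled at read time; the thousands of regular followees were already materialized by the push path.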
Feed Ranking
Chronological feed: simplest, show posts in reverse time order. Algorithmic feed: rank by engagement signals. Features: post_age (recency), user_engagement_history (like rate with this author), post_engagement_rate ((likes + comments) / impressions), media_type (video often ranked higher), relationship_strength (how often you interact). Ranking model: gradient boosted trees or a neural network trained on click/like/share signals. Score each candidate post: score = model.predict(features). Sort by score, return top K. Update frequency: re-score the feed every time the user opens the app (or every 5 minutes for active users). Cache the scored feed in Redis per user with a short TTL (e.g., 5 minutes).
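A minimal sketch of the score-and-sort step, using a hand-tuned linear score in place of the trained model. The weights, the 24-hour exponential recency decay, and the feature dict keys are all assumptions; a real system learns the weights from click/like/share data.

```python
import math

# Hypothetical hand-tuned weights; a trained GBT/NN replaces these in practice.
WEIGHTS = {"recency": 2.0, "relationship_strength": 1.5,
           "engagement_rate": 3.0, "is_video": 0.3}

def score(post: dict) -> float:
    recency = math.exp(-post["age_hours"] / 24)  # decays toward 0 over a day
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["relationship_strength"] * post["relationship_strength"]
            + WEIGHTS["engagement_rate"] * post["engagement_rate"]
            + WEIGHTS["is_video"] * (1.0 if post["is_video"] else 0.0))

def rank_feed(candidates: list, k: int = 20) -> list:
    # Score every candidate, return the top K highest-scoring posts.
    return sorted(candidates, key=score, reverse=True)[:k]
```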
Like and Comment Systems
Likes: store in a likes table (user_id, post_id) with unique constraint. Like count: cached on the post row (like_count column). On like: INSERT INTO likes + UPDATE posts SET like_count=like_count+1. On unlike: DELETE FROM likes + UPDATE posts SET like_count=like_count-1. At extreme scale: use Redis INCR for the count, sync to DB periodically. Has-liked check: Redis SET per post with user_ids (SISMEMBER for O(1) check). Comments: threaded using parent_comment_id (self-referential FK). Fetch top-level comments with count of replies; expand on click. Sort comments by: newest, oldest, or most-liked. Comment count also cached on the post row.
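The like/unlike/has-liked operations can be sketched with an in-memory store that mirrors the Redis semantics described above (INCR-style counter plus a per-post set for O(1) membership checks). The class and method names are illustrative; in production these map to Redis `INCR`/`SADD`/`SISMEMBER` with periodic sync to the database.

```python
class LikeStore:
    """In-memory stand-in for the Redis counter + per-post liker set."""

    def __init__(self):
        self.counts = {}  # post_id -> like count  (Redis INCR counter)
        self.likers = {}  # post_id -> set of user_ids  (Redis SET)

    def like(self, user_id, post_id) -> bool:
        liked_by = self.likers.setdefault(post_id, set())
        if user_id in liked_by:
            return False  # unique (user_id, post_id) constraint: already liked
        liked_by.add(user_id)
        self.counts[post_id] = self.counts.get(post_id, 0) + 1
        return True

    def unlike(self, user_id, post_id) -> bool:
        if user_id not in self.likers.get(post_id, set()):
            return False
        self.likers[post_id].discard(user_id)
        self.counts[post_id] -= 1
        return True

    def has_liked(self, user_id, post_id) -> bool:
        # O(1) membership check, like SISMEMBER
        return user_id in self.likers.get(post_id, set())
```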
Notification System
Notification types: like on your post, comment on your post, new follower, mention in a comment, re-share. Generation: consume events from Kafka (LikeEvent, CommentEvent, FollowEvent). For each event: create a Notification record (recipient_id, type, actor_id, reference_id, is_read=false, created_at). Delivery: push via FCM/APNs for mobile, WebSocket for web. Aggregation: instead of “Alice liked your post”, “Bob liked your post”, “Carol liked your post” — show “Alice and 2 others liked your post.” Aggregate within a 1-hour window per (recipient, reference_id, type). Rate limiting: cap notification delivery at N per hour per user to avoid spam. Users can configure per-type notification preferences.
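The windowed aggregation step can be sketched as below. The event dict shape and the phrasing of the rendered message are assumptions; the grouping key (recipient, reference_id, type) and the 1-hour window follow the description above.

```python
from collections import defaultdict

def aggregate(notifs: list, window: int = 3600) -> list:
    """Collapse events sharing (recipient, reference_id, type) within `window` seconds.

    Each notif: {"recipient_id", "reference_id", "type", "actor_id", "created_at"}.
    The "type" value is assumed to read as a verb phrase, e.g. "liked".
    """
    buckets = defaultdict(list)  # key -> list of [window_start, [actor_ids]]
    for n in sorted(notifs, key=lambda x: x["created_at"]):
        key = (n["recipient_id"], n["reference_id"], n["type"])
        groups = buckets[key]
        if groups and n["created_at"] - groups[-1][0] <= window:
            groups[-1][1].append(n["actor_id"])  # falls inside the open window
        else:
            groups.append([n["created_at"], [n["actor_id"]]])  # start a new window
    messages = []
    for (_, _, typ), groups in buckets.items():
        for _, actors in groups:
            if len(actors) == 1:
                messages.append(f"{actors[0]} {typ} your post")
            else:
                messages.append(f"{actors[0]} and {len(actors) - 1} others {typ} your post")
    return messages
```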
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does the hybrid push-pull news feed work for celebrities vs regular users?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The challenge: a celebrity with 10 million followers posts once. Pure push: 10 million feed write operations immediately. Pure pull: every feed read must query all followed accounts and merge (expensive for users following many accounts). Hybrid solution: push for regular users (followers < threshold, e.g., 10K): fanout writes to all follower feeds in the background via a queue. Pull for celebrities: skip the fanout. On feed read, fetch the last N posts from each celebrity the user follows in real-time and merge with the precomputed feed. Merge at read time: combine the precomputed feed (from push) with real-time celebrity posts, sort by timestamp or score, return the top K. This balances write amplification (no celebrity fanout) with read cost (only a few celebrity pulls, not thousands of regular user pulls)."
}
},
{
"@type": "Question",
"name": "How does a social media platform handle the like count at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Like counts are read frequently (displayed on every post) but must be approximately accurate. Naive approach: COUNT(*) FROM likes WHERE post_id=X — O(n) per read, too slow at scale. Cached count: store like_count as a column on the post row. Increment/decrement atomically on each like/unlike: UPDATE posts SET like_count = like_count + 1 WHERE post_id = X. This is O(1) per read. Problem at extreme scale: millions of likes per second on viral posts create hot rows (database row-level locking contention). Solution: Redis counter with INCR per post. The like action increments the Redis counter atomically. A background job periodically syncs the Redis count to the database. Eventual consistency: the displayed count may lag by a few seconds. Approximate count: for very popular posts, display rounded counts (1.2M likes) — users do not need exact precision."
}
},
{
"@type": "Question",
"name": "How do you implement threaded comments efficiently?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Comments on a post may be nested (replies to replies). Storage: adjacency list model — comments table with parent_comment_id (NULL for top-level comments, foreign key to comment_id for replies). Queries: fetch top-level comments: SELECT WHERE post_id=X AND parent_comment_id IS NULL ORDER BY created_at DESC LIMIT 20. Fetch reply count per top-level comment: nested query or cached reply_count on each comment row. Load replies on demand: SELECT WHERE parent_comment_id=Y. Index on (post_id, parent_comment_id, created_at). For deep nesting (rare): limit nesting depth to 2 levels (Twitter-style) for UI simplicity. Alternative storage for very deep threads: nested sets or closure table — more complex to maintain but efficient for fetching all descendants without recursive queries. For interview purposes, adjacency list is the correct first answer."
}
},
{
"@type": "Question",
"name": "How does news feed ranking work algorithmically?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Feed ranking predicts which posts a user is most likely to engage with. Feature engineering: content features (post_age, media_type, is_video, hashtags), author features (relationship_strength = interaction frequency with this author, is_close_friend), engagement features (like_rate = likes/impressions, comment_rate, share_rate), user features (time_of_day, device_type, historical preferences by category). Training: collect implicit feedback (click, like, comment, share, skip) as positive and negative signals. Label: binary (engaged=1, scrolled_past=0) or multi-class. Model: gradient boosted trees (GBT) for speed, neural networks for higher accuracy with more features. Two-stage: (1) candidate generation — retrieve the most recent N posts from follows + trending + interest signals. (2) Ranking — score each candidate with the model. Return top K."
}
},
{
"@type": "Question",
"name": "How do you implement the follower-followee graph at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The follow graph is sparse and large (billions of edges for a platform like Twitter). Storage: follows table (follower_id, followee_id, created_at) with indices on both columns for bidirectional lookups. At scale: shard by follower_id (most queries are 'who does user X follow'). Cached counts: follower_count and following_count columns on the user row, updated atomically on follow/unfollow. Graph queries: 'followers of user X' — SELECT follower_id WHERE followee_id=X (index on followee_id). 'following of user X' — SELECT followee_id WHERE follower_id=X (index on follower_id). Mutual follows: intersection of both queries. Suggestions: second-degree connections ('people you may know') via graph traversal or collaborative filtering. For celebrity accounts with millions of followers: cache the follower list in Redis as a sorted set (scored by follow timestamp) to avoid repeated database scans."
}
}
]
}
Asked at: Meta, Snap, Twitter/X, LinkedIn.