Core Entities
User: user_id, username, display_name, bio, avatar_url, follower_count, following_count, created_at.
Post: post_id, author_id, content, media_urls[], like_count, comment_count, share_count, visibility (PUBLIC, FOLLOWERS, PRIVATE), created_at.
Follow: follower_id, followee_id, created_at.
Like: user_id, post_id, created_at.
Comment: comment_id, post_id, author_id, content, parent_comment_id (for replies), like_count, created_at.
FeedItem: user_id, post_id, score, created_at (cached news feed entry).
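The entities above can be sketched as Python dataclasses to make the field types and defaults concrete. This is an illustrative in-memory model, not a claim about any particular platform's schema; the field names follow the list above, and the defaults (e.g., visibility PUBLIC, zeroed counters) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class User:
    user_id: int
    username: str
    display_name: str = ""
    bio: str = ""
    avatar_url: str = ""
    follower_count: int = 0   # denormalized, updated on follow/unfollow
    following_count: int = 0
    created_at: float = 0.0   # epoch seconds

@dataclass
class Post:
    post_id: int
    author_id: int
    content: str
    media_urls: list = field(default_factory=list)
    like_count: int = 0       # cached count, see Like and Comment Systems
    comment_count: int = 0
    share_count: int = 0
    visibility: str = "PUBLIC"  # PUBLIC | FOLLOWERS | PRIVATE
    created_at: float = 0.0

@dataclass
class Comment:
    comment_id: int
    post_id: int
    author_id: int
    content: str
    parent_comment_id: Optional[int] = None  # None for top-level comments
    like_count: int = 0
    created_at: float = 0.0
```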
Post Creation and Storage
Post creation flow: (1) User submits text + media. (2) Media uploaded directly to object storage (S3) via presigned URL. (3) Post record created in PostgreSQL: post_id, author_id, content, media_urls[]. (4) Post published to Kafka topic posts. (5) Feed fanout service consumes and distributes to follower feeds. Post storage at scale: shard posts table by author_id (most queries are author-scoped: “show me my posts”). Index on (author_id, created_at DESC) for profile pages. For global post IDs: use a distributed ID generator (Snowflake-style: timestamp + machine_id + sequence) to maintain rough time ordering. Media CDN: serve images and videos via CDN (CloudFront, Cloudflare) — never serve directly from S3 in production.
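The Snowflake-style ID generator mentioned above can be sketched as follows. The bit layout (41-bit millisecond timestamp, 10-bit machine_id, 12-bit per-millisecond sequence) and the custom epoch are illustrative choices, not a specification of any production system.

```python
import threading
import time

class SnowflakeId:
    """64-bit IDs: 41-bit ms timestamp | 10-bit machine_id | 12-bit sequence.

    Because the timestamp occupies the high bits, IDs sort in rough time order.
    """
    EPOCH = 1288834974657  # example custom epoch in ms (assumed, not canonical)

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024, "machine_id must fit in 10 bits"
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # sequence exhausted for this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence
```

Successive IDs from one generator are strictly increasing, which is what keeps the posts table roughly time-ordered without a central sequence.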
Feed Generation: Push vs Pull
Push (fanout on write): when a post is created, immediately write it to every follower feed. Feed reads are O(1). Downside: for users with millions of followers (celebrities), a single post triggers millions of feed writes. Use push for regular users (followers < 10K). Pull (fanout on read): when a user opens their feed, fetch the latest posts from all followed accounts and merge. No precomputation. Downside: expensive for users following many accounts (merge 1000 latest feeds). Use pull for celebrities. Hybrid: use push for regular followees, skip the fanout for celebrity followees. On feed read: combine precomputed feed (from push) with real-time fetched posts from celebrities. Twitter and Instagram use this hybrid approach.
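The read-time merge in the hybrid approach can be sketched as below. The in-memory dict stands in for a real-time query against the posts store, and sorting by timestamp stands in for the ranking step; both are assumptions for illustration.

```python
import heapq

def read_feed(precomputed_feed, celebrity_posts, followed_celebrities, k=50):
    """Merge the precomputed (push) feed with real-time pulls from celebrities.

    precomputed_feed: list of (timestamp, post_id) written by the fanout worker.
    celebrity_posts: dict celebrity_id -> list of (timestamp, post_id); a stand-in
    for fetching each celebrity's latest posts at read time.
    """
    pulled = []
    for celeb_id in followed_celebrities:
        pulled.extend(celebrity_posts.get(celeb_id, []))
    # Newest first; a production system would sort by ranking score instead.
    return [post_id for _, post_id in heapq.nlargest(k, precomputed_feed + pulled)]
```

Only the handful of celebrity followees are pulled at read time; the thousands of regular followees were already materialized by the push path.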
Feed Ranking
Chronological feed: simplest, show posts in reverse time order. Algorithmic feed: rank by engagement signals. Features: post_age (recency), user_engagement_history (like rate with this author), post_engagement_rate ((likes + comments) / impressions), media_type (video often ranked higher), relationship_strength (how often you interact). Ranking model: gradient boosted trees or a neural network trained on click/like/share signals. Score each candidate post: score = model.predict(features). Sort by score, return top K. Update frequency: re-score the feed every time the user opens the app (or every 5 minutes for active users). Cache the scored feed in Redis per user with a short TTL (e.g., 5 minutes).
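A minimal sketch of the score-and-sort step, using a hand-tuned linear score in place of the trained model. The weights, the 24-hour exponential recency decay, and the feature dict keys are all assumptions; a real system learns the weights from click/like/share data.

```python
import math

# Hypothetical hand-tuned weights; a trained GBT/NN replaces these in practice.
WEIGHTS = {"recency": 2.0, "relationship_strength": 1.5,
           "engagement_rate": 3.0, "is_video": 0.3}

def score(post: dict) -> float:
    recency = math.exp(-post["age_hours"] / 24)  # decays toward 0 over a day
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["relationship_strength"] * post["relationship_strength"]
            + WEIGHTS["engagement_rate"] * post["engagement_rate"]
            + WEIGHTS["is_video"] * (1.0 if post["is_video"] else 0.0))

def rank_feed(candidates: list, k: int = 20) -> list:
    # Score every candidate, return the top K highest-scoring posts.
    return sorted(candidates, key=score, reverse=True)[:k]
```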
Like and Comment Systems
Likes: store in a likes table (user_id, post_id) with unique constraint. Like count: cached on the post row (like_count column). On like: INSERT INTO likes + UPDATE posts SET like_count=like_count+1. On unlike: DELETE FROM likes + UPDATE posts SET like_count=like_count-1. At extreme scale: use Redis INCR for the count, sync to DB periodically. Has-liked check: Redis SET per post with user_ids (SISMEMBER for O(1) check). Comments: threaded using parent_comment_id (self-referential FK). Fetch top-level comments with count of replies; expand on click. Sort comments by: newest, oldest, or most-liked. Comment count also cached on the post row.
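The like/unlike/has-liked operations can be sketched with an in-memory store that mirrors the Redis semantics described above (INCR-style counter plus a per-post set for O(1) membership checks). The class and method names are illustrative; in production these map to Redis `INCR`/`SADD`/`SISMEMBER` with periodic sync to the database.

```python
class LikeStore:
    """In-memory stand-in for the Redis counter + per-post liker set."""

    def __init__(self):
        self.counts = {}  # post_id -> like count  (Redis INCR counter)
        self.likers = {}  # post_id -> set of user_ids  (Redis SET)

    def like(self, user_id, post_id) -> bool:
        liked_by = self.likers.setdefault(post_id, set())
        if user_id in liked_by:
            return False  # unique (user_id, post_id) constraint: already liked
        liked_by.add(user_id)
        self.counts[post_id] = self.counts.get(post_id, 0) + 1
        return True

    def unlike(self, user_id, post_id) -> bool:
        if user_id not in self.likers.get(post_id, set()):
            return False
        self.likers[post_id].discard(user_id)
        self.counts[post_id] -= 1
        return True

    def has_liked(self, user_id, post_id) -> bool:
        # O(1) membership check, like SISMEMBER
        return user_id in self.likers.get(post_id, set())
```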
Notification System
Notification types: like on your post, comment on your post, new follower, mention in a comment, re-share. Generation: consume events from Kafka (LikeEvent, CommentEvent, FollowEvent). For each event: create a Notification record (recipient_id, type, actor_id, reference_id, is_read=false, created_at). Delivery: push via FCM/APNs for mobile, WebSocket for web. Aggregation: instead of “Alice liked your post”, “Bob liked your post”, “Carol liked your post” — show “Alice and 2 others liked your post.” Aggregate within a 1-hour window per (recipient, reference_id, type). Rate limiting: cap notification delivery at N per hour per user to avoid spam. Users can configure per-type notification preferences.
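The windowed aggregation step can be sketched as below. The event dict shape and the phrasing of the rendered message are assumptions; the grouping key (recipient, reference_id, type) and the 1-hour window follow the description above.

```python
from collections import defaultdict

def aggregate(notifs: list, window: int = 3600) -> list:
    """Collapse events sharing (recipient, reference_id, type) within `window` seconds.

    Each notif: {"recipient_id", "reference_id", "type", "actor_id", "created_at"}.
    The "type" value is assumed to read as a verb phrase, e.g. "liked".
    """
    buckets = defaultdict(list)  # key -> list of [window_start, [actor_ids]]
    for n in sorted(notifs, key=lambda x: x["created_at"]):
        key = (n["recipient_id"], n["reference_id"], n["type"])
        groups = buckets[key]
        if groups and n["created_at"] - groups[-1][0] <= window:
            groups[-1][1].append(n["actor_id"])  # falls inside the open window
        else:
            groups.append([n["created_at"], [n["actor_id"]]])  # start a new window
    messages = []
    for (_, _, typ), groups in buckets.items():
        for _, actors in groups:
            if len(actors) == 1:
                messages.append(f"{actors[0]} {typ} your post")
            else:
                messages.append(f"{actors[0]} and {len(actors) - 1} others {typ} your post")
    return messages
```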
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does the hybrid push-pull news feed work for celebrities vs regular users?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The challenge: a celebrity with 10 million followers posts once. Pure push: 10 million feed write operations immediately. Pure pull: every feed read must query all followed accounts and merge (expensive for users following many accounts). Hybrid solution: push for regular users (followers < threshold, e.g., 10K): fanout writes to all follower feeds in the background via a queue. Pull for celebrities: skip the fanout. On feed read, fetch the last N posts from each celebrity the user follows in real-time and merge with the precomputed feed. Merge at read time: combine the precomputed feed (from push) with real-time celebrity posts, sort by timestamp or score, return the top K. This balances write amplification (no celebrity fanout) with read cost (only a few celebrity pulls, not thousands of regular user pulls)."
}
},
{
"@type": "Question",
"name": "How does a social media platform handle the like count at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Like counts are read frequently (displayed on every post) but must be approximately accurate. Naive approach: COUNT(*) FROM likes WHERE post_id=X — O(n) per read, too slow at scale. Cached count: store like_count as a column on the post row. Increment/decrement atomically on each like/unlike: UPDATE posts SET like_count = like_count + 1 WHERE post_id = X. This is O(1) per read. Problem at extreme scale: millions of likes per second on viral posts create hot rows (database row-level locking contention). Solution: Redis counter with INCR per post. The like action increments the Redis counter atomically. A background job periodically syncs the Redis count to the database. Eventual consistency: the displayed count may lag by a few seconds. Approximate count: for very popular posts, display rounded counts (1.2M likes) — users do not need exact precision."
}
},
{
"@type": "Question",
"name": "How do you implement threaded comments efficiently?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Comments on a post may be nested (replies to replies). Storage: adjacency list model — comments table with parent_comment_id (NULL for top-level comments, foreign key to comment_id for replies). Queries: fetch top-level comments: SELECT WHERE post_id=X AND parent_comment_id IS NULL ORDER BY created_at DESC LIMIT 20. Fetch reply count per top-level comment: nested query or cached reply_count on each comment row. Load replies on demand: SELECT WHERE parent_comment_id=Y. Index on (post_id, parent_comment_id, created_at). For deep nesting (rare): limit nesting depth to 2 levels (Twitter-style) for UI simplicity. Alternative storage for very deep threads: nested sets or closure table — more complex to maintain but efficient for fetching all descendants without recursive queries. For interview purposes, adjacency list is the correct first answer."
}
},
{
"@type": "Question",
"name": "How does news feed ranking work algorithmically?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Feed ranking predicts which posts a user is most likely to engage with. Feature engineering: content features (post_age, media_type, is_video, hashtags), author features (relationship_strength = interaction frequency with this author, is_close_friend), engagement features (like_rate = likes/impressions, comment_rate, share_rate), user features (time_of_day, device_type, historical preferences by category). Training: collect implicit feedback (click, like, comment, share, skip) as positive and negative signals. Label: binary (engaged=1, scrolled_past=0) or multi-class. Model: gradient boosted trees (GBT) for speed, neural networks for higher accuracy with more features. Two-stage: (1) candidate generation — retrieve the most recent N posts from follows + trending + interest signals. (2) Ranking — score each candidate with the model. Return top K."
}
},
{
"@type": "Question",
"name": "How do you implement the follower-followee graph at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The follow graph is sparse and large (billions of edges for a platform like Twitter). Storage: follows table (follower_id, followee_id, created_at) with indices on both columns for bidirectional lookups. At scale: shard by follower_id (most queries are 'who does user X follow'). Cached counts: follower_count and following_count columns on the user row, updated atomically on follow/unfollow. Graph queries: 'followers of user X' — SELECT follower_id WHERE followee_id=X (index on followee_id). 'following of user X' — SELECT followee_id WHERE follower_id=X (index on follower_id). Mutual follows: intersection of both queries. Suggestions: second-degree connections ('people you may know') via graph traversal or collaborative filtering. For celebrity accounts with millions of followers: cache the follower list in Redis as a sorted set (scored by follow timestamp) to avoid repeated database scans."
}
}
]
}
Asked at: Meta, Snap, Twitter/X, LinkedIn.