What Is a Social Graph?
A social graph represents relationships between users: friendships (Facebook), follow relationships (Twitter/Instagram), professional connections (LinkedIn). Core operations: add/remove edge, check if two users are connected, find mutual friends, suggest new connections (friends-of-friends), and traverse degrees of separation. LinkedIn has 1B nodes and 30B edges; Facebook’s social graph has 3B nodes.
Graph Representation
Adjacency list stored in a relational DB or graph DB:
-- Undirected (friendship):
friendships: user_id_1, user_id_2, created_at
-- Directed (follow):
follows: follower_id, followee_id, created_at
Indexes: (user_id_1, user_id_2) unique for O(1) connection check; (user_id_1) for O(degree) neighbor lookup; (user_id_2) for reverse lookup.
Mutual Friends
Mutual friends between user A and user B = intersection of A’s friends and B’s friends. Naive: SELECT friends of A, SELECT friends of B, intersect in application code. Better with SQL: SELECT friend_id FROM friendships WHERE user_id = A INTERSECT SELECT friend_id FROM friendships WHERE user_id = B. For high-degree nodes (celebrities with 10M followers): this query is expensive. Cache friend sets in Redis sorted sets; compute intersection in Redis: SINTERSTORE result friends:A friends:B. Redis SINTER on sets of size M runs in O(M * N) where M is the smaller set — fast for typical friend counts.
Friend-of-Friend Suggestions
Suggest users who are 2 hops away (friends of friends) and not already friends. Algorithm:
- Get user’s friends (1-hop neighbors): set F
- Get all friends of friends (2-hop): union of F[f] for all f in F
- Remove already-known users (user themselves + F)
- Rank by mutual friend count (higher = stronger suggestion)
At scale: 2-hop traversal can explode — a user with 1000 friends, each with 1000 friends = 1M candidates before dedup. Precompute top-K suggestions per user with a weekly batch job (Spark), store results in a suggestions table. Serve from cache.
Graph Database vs. Relational
Relational (PostgreSQL): works well for 1-2 hop queries with good indexing. Struggles with deep traversals (6 degrees of separation). Schema: foreign keys, joins for every hop.
Graph DB (Neo4j, Amazon Neptune): optimized for multi-hop traversal. Stores edges with direct pointers (no join needed). MATCH (a)-[:FRIEND*2]->(b) WHERE a.id=123 traverses 2 hops natively. Better for: recommendation engines, fraud ring detection, org-chart traversal.
LinkedIn uses a custom distributed graph store (Espresso + custom graph layer). Facebook uses TAO (The Associations and Objects) — a distributed key-value store with graph semantics built on MySQL.
High-Degree Node Problem (Celebrities)
A celebrity with 10M followers creates challenges: storing 10M edges takes significant space, reading them for fan-out (timeline) is expensive, and mutual-friend queries become slow. Solutions:
- Edge splitting: for “following” (asymmetric), only store follower→celebrity edges; fan-out is pull-based
- Separate data paths: celebrity posts distributed via CDN/pub-sub rather than iterating all followers
- Approximate mutual friends: HyperLogLog for approximate intersection count when exact is too slow
Interview Tips
- Friendship (undirected): store once as (min(A,B), max(A,B)) to avoid duplicate edges.
- Connection check: O(1) with index on both users; don’t iterate all edges.
- BFS for shortest path (6 degrees): bidirectional BFS from both ends meets in the middle — O(b^(d/2)) instead of O(b^d).
- High-degree nodes: always ask about the distribution — if power-law, need special handling for celebrities.