System Design Interview: Autocomplete and Typeahead Service

Designing an autocomplete (typeahead) service is one of the most common system design interview questions, frequently asked at Google, Meta, and Twitter. It combines trie data structures, caching, ranking by popularity, and low-latency distributed systems concerns.

Requirements

Functional: Return top-5 completions for a prefix as user types, ranked by query frequency. Support all languages (Unicode). Update suggestions based on recent trending queries.

Non-functional: Low latency (p99 < 100ms end-to-end). Handle 10M QPS (Google-scale). Suggestions update within ~1 hour of trending events. 99.99% availability.

High-Level Architecture

Client (browser/app)
    │  types "app"
    ▼
CDN / Edge Cache ──→ (cache hit for popular prefixes)
    │ (miss)
    ▼
Autocomplete Service (stateless, horizontally scaled)
    │
    ├── Trie Cache (in-memory, per region)
    │   └── Redis cluster (shared trie, L2 cache)
    │
    └── Query Log Aggregation Pipeline
        Kafka → Flink → Frequency Counter → Trie Builder

Trie Builder (batch, runs every hour):
  → Reads aggregated query frequencies from last 7 days
  → Builds new trie with top-5 completions at each node
  → Snapshots to S3 → deploys to Autocomplete Service nodes

Trie Node Design with Pre-Computed Top-K

# Pre-computing top-5 at each node eliminates DFS at serving time
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.top_k = []      # pre-sorted list of (freq, query), at most 5

# Build phase: insert queries, then propagate top-K upward
def build_trie(queries):     # queries: iterable of (query, freq) pairs
    root = TrieNode()
    for query, freq in queries:
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Merge into this node's top-K (keep sorted by freq desc, max 5)
            node.top_k = sorted(node.top_k + [(freq, query)], reverse=True)[:5]
    return root

# Serving: O(prefix_length) — just traverse, no DFS needed
def get_completions(root, prefix):
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    return [query for _, query in node.top_k]

Data Collection: Query Log Pipeline

Query frequency pipeline:
  1. User submits search "apple" → log event to Kafka
     {query: "apple", ts: 1714000000, user_id: hash(user)}

  2. Flink streaming job:
     - Deduplicate per user (same user typing shouldn't inflate count)
     - Aggregate: COUNT(query) per 1-hour tumbling window
     - Output: {query: "apple", count: 145023, window: "2024-04-25T14:00"}

  3. Aggregated counts → ClickHouse or BigQuery for storage
     Query top 1M queries by total count (last 7 days)

  4. Trie builder job (hourly cron):
     - Load top 1M (query, frequency) pairs
     - Build trie with pre-computed top-5 at each node
     - Serialize to protobuf, upload to S3
     - Trigger rolling deployment to autocomplete nodes
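The dedup-and-aggregate step above can be sketched in plain Python (a stand-in for the Flink job; the event shape matches the Kafka log line, but the function name and window key are illustrative):

```python
from collections import defaultdict

def aggregate_hourly(events, window_seconds=3600):
    """Count each query at most once per user per 1-hour tumbling window,
    then sum counts per (query, window)."""
    seen = set()               # (user_id, query, window_start) triples
    counts = defaultdict(int)  # (query, window_start) -> count
    for e in events:
        window = e["ts"] - e["ts"] % window_seconds
        key = (e["user_id"], e["query"], window)
        if key in seen:
            continue           # same user, same query, same window: skip
        seen.add(key)
        counts[(e["query"], window)] += 1
    return dict(counts)
```

A real Flink job would keep this state per key with watermarked event time; the dedup-before-aggregate ordering is the part that matters.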

Size estimate:
  Top 1M queries × avg 20 chars = 20MB raw
  Trie with pointers: ~100MB (fits in memory on each node)
  With 5 top-K strings per node: ~500MB — still fits on a 64GB RAM node

Caching Strategy

L1: Application memory (trie in-process)
  - Complete trie loaded at startup from S3 snapshot
  - Serves all prefix lookups in-process (no network hop)
  - Updated hourly via rolling restart with new snapshot
  - Latency: < 1ms

L2: Redis cluster (L2 for popular prefixes)
  - Cache: prefix → top-5 completions JSON
  - TTL: 10 minutes
  - Useful when multiple autocomplete nodes share results
  - Latency: 2-5ms

L3: CDN edge cache (for very popular prefixes)
  - "a", "ap", "app", "appl", "apple" cached at CDN
  - Cache-Control: max-age=60, stale-while-revalidate=300
  - Latency: < 5ms from edge
  - Only for top ~10K most common prefixes
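The L2 layer is a cache-aside lookup; here is a minimal sketch with a dict plus expiry timestamps standing in for Redis (the 10-minute TTL is from the text; the class and method names are illustrative):

```python
import time

class PrefixCache:
    """Cache-aside: prefix -> (completions, expiry). Mirrors the Redis L2 TTL."""
    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, prefix, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(prefix)
        if entry and entry[1] > now:
            return entry[0]    # hit, still fresh
        return None            # miss or expired

    def put(self, prefix, completions, now=None):
        now = time.time() if now is None else now
        self.store[prefix] = (completions, now + self.ttl)
```

With actual Redis, `put` would be a single `SETEX prefix 600 json` and expiry is handled server-side.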

Personalization Layer

Base completions (global frequency) + personalization boost:

score(query, user) = 0.7 × global_freq(query)
                   + 0.3 × personal_boost(query, user)

personal_boost sources:
  - User's own search history (last 30 days)
  - Location-based (local businesses ranked higher)
  - Language/locale preference

Implementation:
  Global completions → Autocomplete Service (from trie)
  User history     → Redis: key=user:{id}:history, value=sorted set of queries
  Rerank → merge global top-5 with user history top-3
         → deduplicate → return top-5 final
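The rerank step can be sketched as a weighted merge using the 0.7/0.3 split from the scoring formula (assuming both global frequencies and personal boosts are pre-normalized to [0, 1]; the function name is illustrative):

```python
def rerank(global_top5, user_top3, w_global=0.7, w_personal=0.3, k=5):
    """Merge global completions with user-history completions,
    dedupe by query, and return the top-k by combined score.
    Inputs are lists of (query, normalized_score) pairs."""
    scores = {}
    for query, freq in global_top5:
        scores[query] = w_global * freq
    for query, boost in user_top3:
        # A query in both lists gets both contributions (dedupe + boost)
        scores[query] = scores.get(query, 0.0) + w_personal * boost
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [query for query, _ in ranked[:k]]
```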

Latency budget for personalization:
  Trie lookup: < 1ms
  Redis user history: < 3ms (parallel fetch)
  Merge + rank: < 1ms
  Total: < 5ms (well within 100ms budget)

Trending Queries: Near Real-Time Updates

Problem: earthquake happens at 2pm, users search "earthquake"
         but trie was built at 1pm → missing from completions

Solution: Trending overlay
  Flink job: detect queries with spike (> 10x hourly average)
  → Publish to "trending" Redis sorted set: ZADD trending score query
  → Autocomplete API merges trending set with trie results
  → Top-3 trending shown even if not in trie

Decay function: trending score decays by 0.9 per hour
  score_t = initial_score × 0.9^(hours_since_peak)
  Automatically fade out after ~24 hours without more traffic
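The decay and overlay-merge steps above, as a sketch (the 0.9 hourly factor and top-3 trending limit come from the text; function names are illustrative):

```python
def trending_score(initial_score, hours_since_peak, decay=0.9):
    """Exponential decay: score_t = initial_score * decay ** hours_since_peak."""
    return initial_score * decay ** hours_since_peak

def merge_trending(trie_top5, trending, k_trending=3):
    """Prepend up to k trending queries not already in the trie results.
    `trending` is a list of (query, score) pairs, sorted by score desc,
    as read from the Redis sorted set."""
    extra = [q for q, _ in trending if q not in trie_top5][:k_trending]
    return extra + trie_top5
```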

Scaling to 10M QPS

10M QPS × 100ms budget = 1M concurrent requests

Horizontal scaling:
  - Autocomplete service: stateless, scale to 1000 pods
  - Each pod: trie in memory + 10K req/s throughput
  - 1000 pods × 10K = 10M QPS

Trie deployment:
  - New trie snapshot: upload to S3, update config pointer
  - Rolling restart: update 50 pods/min × 1000 pods = 20 min full rollout
  - During rollout: old pods serve old trie (acceptable — hourly updates)

Geographic distribution:
  - Deploy trie snapshots to each region
  - Global queries aggregated centrally, regional queries per-region
  - Users in JP see JP-language completions ranked by JP query frequency

Interview Discussion Points

  • Why pre-compute top-K at each node? DFS from a prefix node to find all completions is O(total_words_with_prefix) — slow for popular prefixes like “a”. Pre-computing turns serving into O(prefix_length) — just traverse nodes, no DFS.
  • How to handle Unicode/multi-byte characters? Key each trie node by Unicode code point, not by byte. Python strings iterate by code point natively; in Java, iterate code points rather than UTF-16 chars to handle characters outside the BMP. Limit trie depth to 50 characters to cap memory.
  • Trie vs inverted index? Trie is optimal for prefix matching. Elasticsearch inverted index handles full-text search (any word in query) but doesn’t naturally support prefix-only matching as efficiently. For autocomplete specifically, trie wins on latency.
  • How to filter sensitive/inappropriate completions? Maintain a blocklist (Redis set). At serving time, filter completions against blocklist before returning. Update blocklist in < 1 minute via Redis SADD without redeploying trie.
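The serve-time blocklist filter from the last point, sketched with a plain Python set standing in for the Redis set (the normalization step is an assumption):

```python
def filter_completions(completions, blocklist):
    """Drop any completion whose lowercased form is in the blocklist.
    Runs at serve time, so blocklist updates take effect immediately
    without rebuilding or redeploying the trie."""
    return [q for q in completions if q.lower() not in blocklist]
```

Against real Redis, the membership check would be `SMISMEMBER` on the blocklist set (batched for all 5 candidates).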

