String Hashing Interview Patterns: Rabin-Karp, Rolling Hash, and Polynomial Hashing (2025)

Why String Hashing?

Naive string matching (check every position in the text for the pattern) is O(n*m) where n = text length, m = pattern length. String hashing enables O(1) substring comparison by reducing a string to a single integer. Combined with a rolling hash, we can compare any substring of length m in O(1) after O(n) preprocessing, giving O(n+m) total for pattern matching. Polynomial hashing: hash(s) = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]*p^0 (mod M). p is a prime base (~31 for lowercase letters), M is a large prime (~10^9+7) to reduce collisions.
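A minimal sketch of the polynomial hash itself, using Horner's rule to evaluate the formula left to right (base 31 and modulus 10^9+7 as above):

```python
def poly_hash(s, base=31, mod=10**9 + 7):
    # hash(s) = s[0]*base^(m-1) + ... + s[m-1]*base^0 (mod `mod`),
    # evaluated left to right by Horner's rule in O(m).
    h = 0
    for c in s:
        h = (h * base + (ord(c) - ord('a') + 1)) % mod  # 'a' -> 1, ..., 'z' -> 26
    return h

print(poly_hash("abc"))  # 1*31^2 + 2*31 + 3 = 1026
```

Mapping 'a' to 1 rather than 0 matters: with 'a' = 0, the strings "a", "aa", and "aaa" would all hash to 0 and collide.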

Rolling Hash (Rabin-Karp Algorithm)

Compute the hash of the first window of size m. Slide the window by removing the leftmost character and adding the new rightmost character in O(1). When the new hash equals the pattern hash: verify character by character (handles hash collisions). O(n+m) average, O(n*m) worst case (all collisions).

def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n: return []
    BASE, MOD = 31, 10**9 + 7

    def char_val(c): return ord(c) - ord('a') + 1

    # Precompute pattern hash and first window hash
    pat_hash = 0
    win_hash = 0
    power = 1  # BASE^(m-1) mod MOD

    for i in range(m):
        pat_hash = (pat_hash * BASE + char_val(pattern[i])) % MOD
        win_hash = (win_hash * BASE + char_val(text[i])) % MOD
        if i < m - 1:
            power = power * BASE % MOD

    result = []
    for i in range(n - m + 1):
        if win_hash == pat_hash and text[i:i+m] == pattern:
            result.append(i)
        if i < n - m:
            # Roll: remove text[i], add text[i+m]
            win_hash = (win_hash - char_val(text[i]) * power) % MOD
            win_hash = (win_hash * BASE + char_val(text[i+m])) % MOD
            win_hash = (win_hash + MOD) % MOD  # no-op in Python (its % is non-negative); required in C++/Java
    return result
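A quick sanity check of the rolling step in isolation: after one slide, the rolled hash must equal a hash of the new window computed from scratch. A sketch with the same BASE/MOD as the matcher above:

```python
BASE, MOD = 31, 10**9 + 7

def char_val(c):
    return ord(c) - ord('a') + 1

def window_hash(s):
    # Hash of a whole window from scratch, O(len(s))
    h = 0
    for c in s:
        h = (h * BASE + char_val(c)) % MOD
    return h

text, m = "abcde", 3
power = pow(BASE, m - 1, MOD)               # BASE^(m-1) mod MOD

h = window_hash(text[:m])                   # hash of "abc"
h = (h - char_val(text[0]) * power) % MOD   # remove leading 'a'
h = (h * BASE + char_val(text[m])) % MOD    # shift left, append 'd'
assert h == window_hash(text[1:m + 1])      # same as hashing "bcd" directly
```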

Prefix Hash Array for O(1) Substring Hash

Precompute prefix_hash[i] = hash(s[0..i-1]) and powers[i] = BASE^i mod MOD. Then hash(s[l..r]) = (prefix_hash[r+1] - prefix_hash[l] * powers[r-l+1]) mod MOD. This enables O(1) hash computation for any substring after O(n) preprocessing. Use this for: comparing many pairs of substrings efficiently (LC 1044 Longest Duplicate Substring with binary search + hashing), finding all distinct substrings, or checking if two strings share a common substring of length k.

class StringHasher:
    def __init__(self, s, base=131, mod=10**18+9):
        n = len(s)
        self.mod = mod
        self.base = base
        self.pw = [1] * (n + 1)
        self.h = [0] * (n + 1)
        for i in range(n):
            self.pw[i+1] = self.pw[i] * base % mod
            self.h[i+1] = (self.h[i] * base + ord(s[i])) % mod

    def query(self, l, r):  # hash of s[l..r] inclusive
        return (self.h[r+1] - self.h[l] * self.pw[r-l+1]) % self.mod
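Exercising the query formula directly (a functional sketch mirroring StringHasher, same base/mod defaults):

```python
def prefix_hashes(s, base=131, mod=10**18 + 9):
    # pw[i] = base^i mod `mod`; h[i] = hash of the prefix s[:i]
    n = len(s)
    pw, h = [1] * (n + 1), [0] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] * base % mod
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return pw, h

def substring_hash(pw, h, mod, l, r):
    # Hash of s[l..r] inclusive in O(1), as in StringHasher.query
    return (h[r + 1] - h[l] * pw[r - l + 1]) % mod

s = "abcabc"
mod = 10**18 + 9
pw, h = prefix_hashes(s, mod=mod)
assert substring_hash(pw, h, mod, 0, 2) == substring_hash(pw, h, mod, 3, 5)  # "abc" == "abc"
assert substring_hash(pw, h, mod, 0, 2) != substring_hash(pw, h, mod, 1, 3)  # "abc" != "bca"
```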

Double Hashing to Reduce Collisions

A single hash has a ~1/MOD collision probability per comparison. With n^2 comparisons, expected collisions = n^2/MOD. For n=10^5 and MOD=10^9, expected ~10 false positives per run. Double hashing uses two independent hash functions with different (BASE, MOD) pairs. The combined hash is (hash1, hash2). Collision probability = 1/(MOD1*MOD2) — negligible. Use double hashing when correctness is critical (competitive programming judges use anti-hash tests). Implementation: run two StringHasher instances with different parameters; compare both hashes.
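A minimal sketch of that setup: two prefix-hash tables with independent (base, mod) pairs (the specific values here are illustrative), and a query that returns both hashes as a tuple:

```python
def prefix_hashes(s, base, mod):
    # pw[i] = base^i mod `mod`; h[i] = hash of the prefix s[:i]
    n = len(s)
    pw, h = [1] * (n + 1), [0] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] * base % mod
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return pw, h

class DoubleHasher:
    # Two independent (base, mod) pairs; a substring's fingerprint is the tuple.
    PARAMS = [(131, 10**9 + 7), (137, 998244353)]

    def __init__(self, s):
        self.tables = [(prefix_hashes(s, b, m), m) for b, m in self.PARAMS]

    def query(self, l, r):  # fingerprint of s[l..r] inclusive
        return tuple(
            (h[r + 1] - h[l] * pw[r - l + 1]) % m
            for (pw, h), m in self.tables
        )

dh = DoubleHasher("abcabc")
assert dh.query(0, 2) == dh.query(3, 5)  # "abc" == "abc"
assert dh.query(0, 2) != dh.query(1, 3)  # "abc" != "bca"
```

A false positive now requires a simultaneous collision in both moduli, which is what drives the combined probability down to ~1/(MOD1*MOD2).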

Applications

LC 1044 (Longest Duplicate Substring): binary search on the length L. For each L, use a rolling hash to collect the hashes of all substrings of length L in a set; if any hash repeats (verified against the actual characters), length L is achievable. O(n log n). LC 1062 (Longest Repeating Substring): the same binary search + rolling hash pattern on a smaller input. LC 187 (Repeated DNA Sequences): fixed window of 10 characters with a rolling hash; report every window that appears more than once. LC 28 (Find the Index of the First Occurrence in a String): Rabin-Karp is one solution (KMP is another). LC 718 (Maximum Length of Repeated Subarray): binary search on the length, with rolling hashes over both arrays to test whether they share a subarray of that length.
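As a concrete application, a sketch of LC 187 with its fixed window of 10: since the DNA alphabet has only 4 letters, base 4 makes a length-10 window fit in 20 bits, so the "hash" is an exact encoding and needs no modulus or collision check:

```python
def find_repeated_dna(s, k=10):
    # LC 187: all length-k substrings appearing more than once (k = 10).
    # With a 4-letter alphabet and base 4, a length-10 window fits in
    # 4^10 = 2^20 values, so the rolling code is collision-free.
    if len(s) < k:
        return []
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    mask = (1 << (2 * k)) - 1                # keep only the low 2k bits

    h, seen, repeated = 0, set(), set()
    for i, c in enumerate(s):
        h = ((h << 2) | code[c]) & mask      # roll: drop old char, add new
        if i >= k - 1:                       # h now encodes s[i-k+1..i]
            if h in seen:
                repeated.add(s[i - k + 1:i + 1])
            seen.add(h)
    return sorted(repeated)

print(find_repeated_dna("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```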

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does the rolling hash in Rabin-Karp avoid O(n*m) recomputation?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Naive string matching recomputes the hash of each window of size m from scratch in O(m) time, giving O(n*m) total. Rolling hash reuses the previous window's hash: new_hash = (old_hash - char_val(leaving_char) * BASE^(m-1)) * BASE + char_val(entering_char). Removing the leftmost character: subtract its contribution (char_val * BASE^(m-1)), then multiply by BASE to shift everything left. Adding the new rightmost character: add char_val. Each slide is O(1). The key insight: polynomial hashing is linear, so hash(s[i+1..i+m]) can be derived from hash(s[i..i+m-1]) with one subtract, one multiply, and one add. Precompute BASE^(m-1) mod MOD once before the sliding loop. This transforms O(n*m) into O(n+m)."
      }
    },
    {
      "@type": "Question",
      "name": "What causes hash collisions and how do you handle them?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A hash collision occurs when two different strings produce the same hash value. With MOD = 10^9+7, the probability of any single collision is ~10^-9. But if you make O(n^2) comparisons (all pairs of substrings), the expected number of collisions is n^2 * 10^-9. For n = 10^5, that's ~10 false positives. Handling: when hashes match, verify with a direct character-by-character comparison, which is O(m) in the worst case but rarely triggered. Double hashing: use two independent (BASE, MOD) pairs. A false positive then requires both hashes to collide simultaneously, probability ~10^-18, negligible. In competitive programming, adversarial test cases are sometimes crafted to break single-hash solutions; double hashing is the defense. For LeetCode, single hashing with a large MOD (10^18+9) is typically sufficient."
      }
    },
    {
      "@type": "Question",
      "name": "How do you use hashing to find the longest duplicate substring?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Binary search on the answer length L. For each candidate L, use a rolling hash to compute the hash of every substring of length L in O(n). Store all hashes in a set. If any hash appears twice (with character verification to rule out collisions), a duplicate of length L exists. Binary search: if L is achievable, try longer; otherwise try shorter. O(n log n) total. In LC 1044 the tricky part is that the answer might be the empty string (length 0), so set the binary search range to [0, n-1]. For the character verification on collision, use a dict mapping hash -> list of start positions. On a duplicate hash, compare the actual substrings at those positions; if they match, you found a true duplicate."
      }
    },
    {
      "@type": "Question",
      "name": "When should you use hashing vs KMP for pattern matching?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "KMP (Knuth-Morris-Pratt): O(n+m) worst case, no false positives, deterministic. Rabin-Karp hashing: O(n+m) average, O(n*m) worst case (all collisions), with character verification. Use KMP when matching a single pattern against a single text and deterministic performance with no false positives is required. Use hashing when: matching multiple patterns simultaneously (compute hashes of all patterns, check each window against the set, O(n + sum(m_i)) total, whereas KMP would require building an Aho-Corasick automaton), substring uniqueness checks (store hashes in a set, O(1) lookup), or comparing substrings at arbitrary positions (prefix hash array). The prefix hash array (O(1) per query after O(n) preprocessing) has no KMP equivalent for arbitrary substring comparison. Hashing wins for complex multi-query substring problems."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle the modular arithmetic to keep rolling hash values positive?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "After the rolling window update, win_hash can become negative in languages with signed integers: win_hash -= char_val * power can drop below zero. Fix: add MOD before taking mod, i.e. win_hash = (win_hash - char_val * power % MOD + MOD) % MOD. The +MOD step ensures the result is in [0, MOD-1] even if the subtraction gave a negative value. In Python, the % operator always returns a non-negative result for a positive MOD, so this is less of an issue, but it's still good practice. For the multiply-then-mod step: with 64-bit integers (Java long, C++ long long), a * b still fits when both operands are below ~10^9 (the product stays under 2^63), but it overflows when MOD is around 10^18. Use __int128 in C++, or keep MOD below the square root of the integer limit (pairing two ~10^9 moduli via double hashing instead of one huge modulus). Python handles arbitrary precision natively."
      }
    }
  ]
}
