Why String Hashing?
Naive string matching (check every position in the text for the pattern) is O(n*m) where n = text length, m = pattern length. String hashing enables O(1) substring comparison by reducing a string to a single integer. Combined with a rolling hash, we can compare any substring of length m in O(1) after O(n) preprocessing, giving O(n+m) total for pattern matching. Polynomial hashing: hash(s) = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]*p^0 (mod M). p is a prime base (~31 for lowercase letters), M is a large prime (~10^9+7) to reduce collisions.
Rolling Hash (Rabin-Karp Algorithm)
Compute the hash of the first window of size m. Slide the window by removing the leftmost character and adding the new rightmost character in O(1). When the new hash equals the pattern hash: verify character by character (handles hash collisions). O(n+m) average, O(n*m) worst case (all collisions).
def rabin_karp(text, pattern):
n, m = len(text), len(pattern)
if m > n: return []
BASE, MOD = 31, 10**9 + 7
def char_val(c): return ord(c) - ord('a') + 1
# Precompute pattern hash and first window hash
pat_hash = 0
win_hash = 0
power = 1 # BASE^(m-1) mod MOD
for i in range(m):
pat_hash = (pat_hash * BASE + char_val(pattern[i])) % MOD
win_hash = (win_hash * BASE + char_val(text[i])) % MOD
if i < m - 1:
power = power * BASE % MOD
result = []
for i in range(n - m + 1):
if win_hash == pat_hash and text[i:i+m] == pattern:
result.append(i)
if i < n - m:
# Roll: remove text[i], add text[i+m]
win_hash = (win_hash - char_val(text[i]) * power) % MOD
win_hash = (win_hash * BASE + char_val(text[i+m])) % MOD
win_hash = (win_hash + MOD) % MOD # keep positive
return result
Prefix Hash Array for O(1) Substring Hash
Precompute prefix_hash[i] = hash(s[0..i-1]) and powers[i] = BASE^i mod MOD. Then hash(s[l..r]) = (prefix_hash[r+1] – prefix_hash[l] * powers[r-l+1]) mod MOD. This enables O(1) hash computation for any substring after O(n) preprocessing. Use this for: comparing many pairs of substrings efficiently (LC 1044 Longest Duplicate Substring with binary search + hashing), finding all distinct substrings, or checking if two strings share a common substring of length k.
class StringHasher:
def __init__(self, s, base=131, mod=10**18+9):
n = len(s)
self.mod = mod
self.base = base
self.pw = [1] * (n + 1)
self.h = [0] * (n + 1)
for i in range(n):
self.pw[i+1] = self.pw[i] * base % mod
self.h[i+1] = (self.h[i] * base + ord(s[i])) % mod
def query(self, l, r): # hash of s[l..r] inclusive
return (self.h[r+1] - self.h[l] * self.pw[r-l+1]) % self.mod
Double Hashing to Reduce Collisions
A single hash has a ~1/MOD collision probability per comparison. With n^2 comparisons, expected collisions = n^2/MOD. For n=10^5 and MOD=10^9, expected ~10 false positives per run. Double hashing uses two independent hash functions with different (BASE, MOD) pairs. The combined hash is (hash1, hash2). Collision probability = 1/(MOD1*MOD2) — negligible. Use double hashing when correctness is critical (competitive programming judges use anti-hash tests). Implementation: run two StringHasher instances with different parameters; compare both hashes.
Applications
LC 1044 (Longest Duplicate Substring): binary search on the length L. For each L, use a rolling hash to find all substrings of length L; store hashes in a set; if any hash repeats, L is achievable. O(n log n). LC 1062 (Longest Repeating Substring): same pattern without the binary search guarantee. LC 187 (Repeated DNA Sequences): fixed window (10 chars), rolling hash — find all windows that appear more than once. LC 28 (Find Needle in Haystack): Rabin-Karp is one solution (KMP is another). LC 718 (Maximum Length of Repeated Subarray): 2D extension — rolling hash on rows or binary search + 2D hash.
Asked at: Meta Interview Guide
Asked at: Apple Interview Guide
Asked at: Coinbase Interview Guide
Asked at: LinkedIn Interview Guide