The String Matching Problem
Given a text T of length n and a pattern P of length m, find all occurrences of P in T. Naive approach: for each position i in T, check if T[i:i+m] == P. This is O(n*m). For competitive programming and interview problems involving substring search, a rolling hash (Rabin-Karp) reduces this to O(n) average case. Rolling hash also appears in problems like: longest duplicate substring (LC 1044), distinct substrings, and string matching with wildcards.
Rolling Hash (Polynomial Hash)
Hash a string of length m using a polynomial: hash(s) = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]*p^0, where p is a prime base (31 for lowercase letters, 131 for all ASCII) and values are taken mod a large prime M (10^9+7 or 10^9+9).
Rolling: when the window slides right (remove leftmost character, add rightmost), update the hash in O(1): new_hash = (old_hash – s[left]*p^(m-1)) * p + s[left+m]. Precompute p^(m-1) mod M. The “remove and shift” formula avoids recomputing from scratch. Total time for sliding over all positions: O(n).
def search(text, pattern):
n, m = len(text), len(pattern)
if m > n: return []
MOD, BASE = 10**9 + 7, 31
def char_val(c): return ord(c) - ord('a') + 1
# compute pattern hash and initial window hash
ph, th, power = 0, 0, 1
for i in range(m):
ph = (ph * BASE + char_val(pattern[i])) % MOD
th = (th * BASE + char_val(text[i])) % MOD
if i > 0: power = power * BASE % MOD
results = []
for i in range(n - m + 1):
if th == ph and text[i:i+m] == pattern: # confirm match
results.append(i)
if i < n - m:
th = (th - char_val(text[i]) * power) % MOD
th = (th * BASE + char_val(text[i + m])) % MOD
th = (th + MOD) % MOD # keep positive
return results
Hash collision: two different strings can have the same hash (false positive). Always verify with direct string comparison on hash match. Probability of false positive with one hash mod 10^9+7 is about 1/10^9. Use double hashing (two different MOD values) for near-zero collision probability.
Rabin-Karp for Multiple Pattern Search
Search for K patterns simultaneously: compute the hash of each pattern, store in a hash set. For each window in the text: compute the rolling hash; if it matches any pattern hash, verify with string comparison. Total time: O((n + K*m) average). Used for plagiarism detection: hash every K-gram (substring of length K) from a document and compare against a database of known K-grams.
Binary Search + Rolling Hash Pattern
Many problems combine binary search on the answer with rolling hash for substring uniqueness checks. Longest duplicate substring (LC 1044): binary search on length L. For a given L, check if any substring of length L appears more than once using rolling hash to compute all L-length substrings’ hashes and check for duplicates with a hash set. O(n log n) total. Longest repeated substring (various): same approach. Distinct substrings count: for each length L from 1 to n, count unique hashes. Sum gives total distinct substrings (or compute analytically with suffix arrays for O(n log n)).
Interview Applications
LC 1044 (Longest Duplicate Substring): binary search + rolling hash. LC 187 (Repeated DNA Sequences): sliding window of length 10, use a hash set of hashes (or encode 4 bases as 2 bits each for exact O(1) comparison). LC 718 (Maximum Length of Repeated Subarray): DP or binary search + rolling hash. LC 214 (Shortest Palindrome): rolling hash to find longest palindromic prefix. String contains all permutations of pattern (LC 567): sliding window + frequency array (not rolling hash, but related). Recognize rolling hash when the problem involves finding repeated or matching substrings of a fixed length across a long text.
Asked at: Meta Interview Guide
Asked at: Apple Interview Guide
Asked at: Coinbase Interview Guide
Asked at: Databricks Interview Guide