The String Matching Problem
Given a text T of length n and a pattern P of length m, find all occurrences of P in T. Naive approach: for each position i in T, check if T[i:i+m] == P. This is O(n*m). For competitive programming and interview problems involving substring search, a rolling hash (Rabin-Karp) reduces this to O(n) average case. Rolling hash also appears in problems like: longest duplicate substring (LC 1044), distinct substrings, and string matching with wildcards.
Rolling Hash (Polynomial Hash)
Hash a string of length m using a polynomial: hash(s) = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]*p^0, where p is a prime base (31 for lowercase letters, 131 for all ASCII) and values are taken mod a large prime M (10^9+7 or 10^9+9).
Rolling: when the window slides right (remove leftmost character, add rightmost), update the hash in O(1): new_hash = (old_hash – s[left]*p^(m-1)) * p + s[left+m]. Precompute p^(m-1) mod M. The “remove and shift” formula avoids recomputing from scratch. Total time for sliding over all positions: O(n).
def search(text, pattern):
n, m = len(text), len(pattern)
if m > n: return []
MOD, BASE = 10**9 + 7, 31
def char_val(c): return ord(c) - ord('a') + 1
# compute pattern hash and initial window hash
ph, th, power = 0, 0, 1
for i in range(m):
ph = (ph * BASE + char_val(pattern[i])) % MOD
th = (th * BASE + char_val(text[i])) % MOD
if i > 0: power = power * BASE % MOD
results = []
for i in range(n - m + 1):
if th == ph and text[i:i+m] == pattern: # confirm match
results.append(i)
if i < n - m:
th = (th - char_val(text[i]) * power) % MOD
th = (th * BASE + char_val(text[i + m])) % MOD
th = (th + MOD) % MOD # keep positive
return results
Hash collision: two different strings can have the same hash (false positive). Always verify with direct string comparison on hash match. Probability of false positive with one hash mod 10^9+7 is about 1/10^9. Use double hashing (two different MOD values) for near-zero collision probability.
Rabin-Karp for Multiple Pattern Search
Search for K patterns simultaneously: compute the hash of each pattern, store in a hash set. For each window in the text: compute the rolling hash; if it matches any pattern hash, verify with string comparison. Total time: O((n + K*m) average). Used for plagiarism detection: hash every K-gram (substring of length K) from a document and compare against a database of known K-grams.
Binary Search + Rolling Hash Pattern
Many problems combine binary search on the answer with rolling hash for substring uniqueness checks. Longest duplicate substring (LC 1044): binary search on length L. For a given L, check if any substring of length L appears more than once using rolling hash to compute all L-length substrings’ hashes and check for duplicates with a hash set. O(n log n) total. Longest repeated substring (various): same approach. Distinct substrings count: for each length L from 1 to n, count unique hashes. Sum gives total distinct substrings (or compute analytically with suffix arrays for O(n log n)).
Interview Applications
LC 1044 (Longest Duplicate Substring): binary search + rolling hash. LC 187 (Repeated DNA Sequences): sliding window of length 10, use a hash set of hashes (or encode 4 bases as 2 bits each for exact O(1) comparison). LC 718 (Maximum Length of Repeated Subarray): DP or binary search + rolling hash. LC 214 (Shortest Palindrome): rolling hash to find longest palindromic prefix. String contains all permutations of pattern (LC 567): sliding window + frequency array (not rolling hash, but related). Recognize rolling hash when the problem involves finding repeated or matching substrings of a fixed length across a long text.
{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “What makes rolling hash O(n) instead of O(n*m) for string matching?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Naive string matching compares the pattern against every position in the text, each comparison taking O(m) time = O(n*m) total. Rolling hash avoids recomputing the hash from scratch at each position. Polynomial hash: h = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]. When the window slides right by one: remove the contribution of s[0] (subtract s[0]*p^(m-1)) and add s[m] (add s[m]*p^0). This update is O(1). Across all n-m+1 positions: O(n) hash computations. When a hash matches: do a direct string comparison (O(m)) to confirm. False positives are rare (probability 1/MOD per position). Expected total comparisons: O(n) hash updates + O(1) expected string comparisons = O(n) average, O(n*m) worst case with adversarial input (all same character).”
}
},
{
“@type”: “Question”,
“name”: “How do you handle hash collisions in Rabin-Karp?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A hash collision occurs when two different strings produce the same hash value, causing a false positive match. Mitigation: (1) Large modulus: use MOD = 10^9 + 7 (prime). Probability of collision per position u2248 1/MOD u2248 10^-9. For n = 10^6 positions: expected 10^-3 false positives — near zero. (2) Double hashing: compute two independent hashes with different (BASE, MOD) pairs. A collision requires both hashes to match simultaneously: probability 1/(MOD1 * MOD2) u2248 10^-18. (3) Always verify: whenever the hash matches, do a direct string comparison before recording a match. This makes the algorithm always correct (zero false positives) at the cost of O(m) per hash match. In practice, with a large prime MOD, double hashing is used only when the input might be adversarially crafted (competitive programming, not typical interviews).”
}
},
{
“@type”: “Question”,
“name”: “How does the binary search + rolling hash pattern solve “longest duplicate substring”?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Longest duplicate substring (LC 1044): find the longest substring that appears at least twice in s. Binary search on the answer length L (range: 1 to len(s)-1). For a given L, check if any substring of length L appears more than once: compute rolling hashes for all n-L+1 substrings of length L using a sliding window. Store hashes in a set. If any hash appears twice, a duplicate of length L exists (verify with string comparison to handle collisions). If a duplicate exists for L, try larger. If not, try smaller. Binary search takes O(log n) iterations. Each iteration: O(n) to compute all hashes. Total: O(n log n). Without rolling hash, each check would be O(n*L) = O(n^2) giving O(n^2 log n) — too slow for n = 30,000.”
}
},
{
“@type”: “Question”,
“name”: “How do you use rolling hash to find all anagrams of a pattern in a string?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Anagram detection with rolling hash requires a hash that is order-independent (same characters = same hash regardless of order). Standard polynomial rolling hash is order-sensitive (position matters) — “ab” and “ba” have different hashes. Instead use a frequency-based hash: hash(s) = sum of (char_value^2) or a product of primes (each character maps to a distinct prime, hash = product of character primes). This is commutative — anagrams have the same hash. Sliding window: initialize the hash for the first m characters. Slide right: subtract the outgoing character’s contribution, add the incoming. Compare hash with the pattern hash at each position; verify on match. Simpler alternative (and preferred in interviews): sliding window with a frequency difference array (26 integers for lowercase). Track how many characters currently match the required frequency. O(n) time, O(1) space.”
}
},
{
“@type”: “Question”,
“name”: “What is the difference between rolling hash and suffix arrays for substring problems?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Rolling hash: simple to implement, O(n) to check if a specific substring exists or appears more than once, O(n log n) for binary search + hash. Has false positive risk (mitigated by double hashing). Works online (streaming). Cannot enumerate all unique substrings efficiently. Suffix arrays: O(n log n) or O(n) to build. Supports: finding the longest repeated substring (LCP array), counting distinct substrings (n*(n+1)/2 – sum of LCP array), finding all occurrences of a pattern in O(m + log n + occurrences) via binary search on the suffix array. No false positives. More complex to implement correctly. Suffix arrays are the theoretically superior tool for offline substring problems, but rolling hash is faster to code in an interview setting and sufficient for most problems.”
}
}
]
}
Asked at: Meta Interview Guide
Asked at: Apple Interview Guide
Asked at: Coinbase Interview Guide
Asked at: Databricks Interview Guide