String Hashing Interview Patterns: Rabin-Karp, Rolling Hash, and Polynomial Hashing (2025)

Why String Hashing?

Naive string matching (check every position in the text for the pattern) is O(n*m) where n = text length, m = pattern length. String hashing enables O(1) substring comparison by reducing a string to a single integer. Combined with a rolling hash, we can compare any substring of length m in O(1) after O(n) preprocessing, giving O(n+m) total for pattern matching. Polynomial hashing: hash(s) = s[0]*p^(m-1) + s[1]*p^(m-2) + … + s[m-1]*p^0 (mod M). p is a prime base (~31 for lowercase letters), M is a large prime (~10^9+7) to reduce collisions.
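A minimal sketch of the polynomial hash itself, using Horner's rule to evaluate the formula left to right (base 31 and modulus 10^9+7 as above):

```python
def poly_hash(s, base=31, mod=10**9 + 7):
    # hash(s) = s[0]*base^(m-1) + ... + s[m-1]*base^0 (mod `mod`),
    # evaluated left to right by Horner's rule in O(m).
    h = 0
    for c in s:
        h = (h * base + (ord(c) - ord('a') + 1)) % mod  # 'a' -> 1, ..., 'z' -> 26
    return h

print(poly_hash("abc"))  # 1*31^2 + 2*31 + 3 = 1026
```

Mapping 'a' to 1 rather than 0 matters: with 'a' = 0, the strings "a", "aa", and "aaa" would all hash to 0 and collide.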

Rolling Hash (Rabin-Karp Algorithm)

Compute the hash of the first window of size m. Slide the window by removing the leftmost character and adding the new rightmost character in O(1). When the new hash equals the pattern hash: verify character by character (handles hash collisions). O(n+m) average, O(n*m) worst case (all collisions).

def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n: return []
    BASE, MOD = 31, 10**9 + 7

    def char_val(c): return ord(c) - ord('a') + 1

    # Precompute pattern hash and first window hash
    pat_hash = 0
    win_hash = 0
    power = 1  # BASE^(m-1) mod MOD

    for i in range(m):
        pat_hash = (pat_hash * BASE + char_val(pattern[i])) % MOD
        win_hash = (win_hash * BASE + char_val(text[i])) % MOD
        if i < m - 1:
            power = power * BASE % MOD

    result = []
    for i in range(n - m + 1):
        if win_hash == pat_hash and text[i:i+m] == pattern:
            result.append(i)
        if i < n - m:
            # Roll: remove text[i], add text[i+m]
            win_hash = (win_hash - char_val(text[i]) * power) % MOD
            win_hash = (win_hash * BASE + char_val(text[i+m])) % MOD
            win_hash = (win_hash + MOD) % MOD  # no-op in Python (its % is non-negative); required in C++/Java
    return result
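A quick sanity check of the rolling step in isolation: after one slide, the rolled hash must equal a hash of the new window computed from scratch. A sketch with the same BASE/MOD as the matcher above:

```python
BASE, MOD = 31, 10**9 + 7

def char_val(c):
    return ord(c) - ord('a') + 1

def window_hash(s):
    # Hash of a whole window from scratch, O(len(s))
    h = 0
    for c in s:
        h = (h * BASE + char_val(c)) % MOD
    return h

text, m = "abcde", 3
power = pow(BASE, m - 1, MOD)               # BASE^(m-1) mod MOD

h = window_hash(text[:m])                   # hash of "abc"
h = (h - char_val(text[0]) * power) % MOD   # remove leading 'a'
h = (h * BASE + char_val(text[m])) % MOD    # shift left, append 'd'
assert h == window_hash(text[1:m + 1])      # same as hashing "bcd" directly
```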

Prefix Hash Array for O(1) Substring Hash

Precompute prefix_hash[i] = hash(s[0..i-1]) and powers[i] = BASE^i mod MOD. Then hash(s[l..r]) = (prefix_hash[r+1] - prefix_hash[l] * powers[r-l+1]) mod MOD. This enables O(1) hash computation for any substring after O(n) preprocessing. Use this for: comparing many pairs of substrings efficiently (LC 1044 Longest Duplicate Substring with binary search + hashing), finding all distinct substrings, or checking if two strings share a common substring of length k.

class StringHasher:
    def __init__(self, s, base=131, mod=10**18+9):
        n = len(s)
        self.mod = mod
        self.base = base
        self.pw = [1] * (n + 1)
        self.h = [0] * (n + 1)
        for i in range(n):
            self.pw[i+1] = self.pw[i] * base % mod
            self.h[i+1] = (self.h[i] * base + ord(s[i])) % mod

    def query(self, l, r):  # hash of s[l..r] inclusive
        return (self.h[r+1] - self.h[l] * self.pw[r-l+1]) % self.mod
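Exercising the query formula directly (a functional sketch mirroring StringHasher, same base/mod defaults):

```python
def prefix_hashes(s, base=131, mod=10**18 + 9):
    # pw[i] = base^i mod `mod`; h[i] = hash of the prefix s[:i]
    n = len(s)
    pw, h = [1] * (n + 1), [0] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] * base % mod
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return pw, h

def substring_hash(pw, h, mod, l, r):
    # Hash of s[l..r] inclusive in O(1), as in StringHasher.query
    return (h[r + 1] - h[l] * pw[r - l + 1]) % mod

s = "abcabc"
mod = 10**18 + 9
pw, h = prefix_hashes(s, mod=mod)
assert substring_hash(pw, h, mod, 0, 2) == substring_hash(pw, h, mod, 3, 5)  # "abc" == "abc"
assert substring_hash(pw, h, mod, 0, 2) != substring_hash(pw, h, mod, 1, 3)  # "abc" != "bca"
```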

Double Hashing to Reduce Collisions

A single hash has a ~1/MOD collision probability per comparison. With n^2 comparisons, expected collisions = n^2/MOD. For n=10^5 and MOD=10^9, expected ~10 false positives per run. Double hashing uses two independent hash functions with different (BASE, MOD) pairs. The combined hash is (hash1, hash2). Collision probability = 1/(MOD1*MOD2) — negligible. Use double hashing when correctness is critical (competitive programming judges use anti-hash tests). Implementation: run two StringHasher instances with different parameters; compare both hashes.
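A minimal sketch of that setup: two prefix-hash tables with independent (base, mod) pairs (the specific values here are illustrative), and a query that returns both hashes as a tuple:

```python
def prefix_hashes(s, base, mod):
    # pw[i] = base^i mod `mod`; h[i] = hash of the prefix s[:i]
    n = len(s)
    pw, h = [1] * (n + 1), [0] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] * base % mod
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return pw, h

class DoubleHasher:
    # Two independent (base, mod) pairs; a substring's fingerprint is the tuple.
    PARAMS = [(131, 10**9 + 7), (137, 998244353)]

    def __init__(self, s):
        self.tables = [(prefix_hashes(s, b, m), m) for b, m in self.PARAMS]

    def query(self, l, r):  # fingerprint of s[l..r] inclusive
        return tuple(
            (h[r + 1] - h[l] * pw[r - l + 1]) % m
            for (pw, h), m in self.tables
        )

dh = DoubleHasher("abcabc")
assert dh.query(0, 2) == dh.query(3, 5)  # "abc" == "abc"
assert dh.query(0, 2) != dh.query(1, 3)  # "abc" != "bca"
```

A false positive now requires a simultaneous collision in both moduli, which is what drives the combined probability down to ~1/(MOD1*MOD2).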

Applications

LC 1044 (Longest Duplicate Substring): binary search on the length L. For each L, use a rolling hash to collect the hashes of all substrings of length L in a set; if any hash repeats (verified against the actual characters), length L is achievable. O(n log n). LC 1062 (Longest Repeating Substring): the same binary search + rolling hash pattern on a smaller input. LC 187 (Repeated DNA Sequences): fixed window of 10 characters with a rolling hash; report every window that appears more than once. LC 28 (Find the Index of the First Occurrence in a String): Rabin-Karp is one solution (KMP is another). LC 718 (Maximum Length of Repeated Subarray): binary search on the length, with rolling hashes over both arrays to test whether they share a subarray of that length.
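As a concrete application, a sketch of LC 187 with its fixed window of 10: since the DNA alphabet has only 4 letters, base 4 makes a length-10 window fit in 20 bits, so the "hash" is an exact encoding and needs no modulus or collision check:

```python
def find_repeated_dna(s, k=10):
    # LC 187: all length-k substrings appearing more than once (k = 10).
    # With a 4-letter alphabet and base 4, a length-10 window fits in
    # 4^10 = 2^20 values, so the rolling code is collision-free.
    if len(s) < k:
        return []
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    mask = (1 << (2 * k)) - 1                # keep only the low 2k bits

    h, seen, repeated = 0, set(), set()
    for i, c in enumerate(s):
        h = ((h << 2) | code[c]) & mask      # roll: drop old char, add new
        if i >= k - 1:                       # h now encodes s[i-k+1..i]
            if h in seen:
                repeated.add(s[i - k + 1:i + 1])
            seen.add(h)
    return sorted(repeated)

print(find_repeated_dna("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```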

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does the rolling hash in Rabin-Karp avoid O(n*m) recomputation?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Naive string matching recomputes the hash of each window of size m from scratch in O(m) time, giving O(n*m) total. Rolling hash reuses the previous window's hash: new_hash = (old_hash - char_val(leaving_char) * BASE^(m-1)) * BASE + char_val(entering_char). Removing the leftmost character: subtract its contribution (char_val * BASE^(m-1)), then multiply by BASE to shift everything left. Adding the new rightmost character: add char_val. Each slide is O(1). The key insight: polynomial hashing is linear, so hash(s[i+1..i+m]) can be derived from hash(s[i..i+m-1]) with one subtract, one multiply, and one add. Precompute BASE^(m-1) mod MOD once before the sliding loop. This transforms O(n*m) into O(n+m)."
      }
    },
    {
      "@type": "Question",
      "name": "What causes hash collisions and how do you handle them?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A hash collision occurs when two different strings produce the same hash value. With MOD = 10^9+7, the probability of any single collision is ~10^-9. But if you make O(n^2) comparisons (all pairs of substrings), the expected number of collisions is n^2 * 10^-9. For n = 10^5, that's ~10 false positives. Handling: when hashes match, verify with a direct character-by-character comparison, which is O(m) in the worst case but rarely triggered. Double hashing: use two independent (BASE, MOD) pairs. A false positive then requires both hashes to collide simultaneously, probability ~10^-18, negligible. In competitive programming, adversarial test cases are sometimes crafted to break single-hash solutions; double hashing is the defense. For LeetCode, single hashing with a large MOD (10^18+9) is typically sufficient."
      }
    },
    {
      "@type": "Question",
      "name": "How do you use hashing to find the longest duplicate substring?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Binary search on the answer length L. For each candidate L, use a rolling hash to compute the hash of every substring of length L in O(n). Store all hashes in a set. If any hash appears twice (with character verification to rule out collisions), a duplicate of length L exists. Binary search: if L is achievable, try longer; otherwise try shorter. O(n log n) total. In LC 1044 the tricky part is that the answer might be the empty string (length 0), so set the binary search range to [0, n-1]. For the character verification on collision, use a dict mapping hash -> list of start positions. On a duplicate hash, compare the actual substrings at those positions; if they match, you found a true duplicate."
      }
    },
    {
      "@type": "Question",
      "name": "When should you use hashing vs KMP for pattern matching?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "KMP (Knuth-Morris-Pratt): O(n+m) worst case, no false positives, deterministic. Rabin-Karp hashing: O(n+m) average, O(n*m) worst case (all collisions), with character verification. Use KMP when matching a single pattern against a single text and deterministic performance with no false positives is required. Use hashing when: matching multiple patterns simultaneously (compute hashes of all patterns, check each window against the set, O(n + sum(m_i)) total, whereas KMP would require building an Aho-Corasick automaton), substring uniqueness checks (store hashes in a set, O(1) lookup), or comparing substrings at arbitrary positions (prefix hash array). The prefix hash array (O(1) per query after O(n) preprocessing) has no KMP equivalent for arbitrary substring comparison. Hashing wins for complex multi-query substring problems."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle the modular arithmetic to keep rolling hash values positive?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "After the rolling window update, win_hash can become negative in languages with signed integers: win_hash -= char_val * power can drop below zero. Fix: add MOD before taking mod, i.e. win_hash = (win_hash - char_val * power % MOD + MOD) % MOD. The +MOD step ensures the result is in [0, MOD-1] even if the subtraction gave a negative value. In Python, the % operator always returns a non-negative result for a positive MOD, so this is less of an issue, but it's still good practice. For the multiply-then-mod step: with 64-bit integers (Java long, C++ long long), a * b still fits when both operands are below ~10^9 (the product stays under 2^63), but it overflows when MOD is around 10^18. Use __int128 in C++, or keep MOD below the square root of the integer limit (pairing two ~10^9 moduli via double hashing instead of one huge modulus). Python handles arbitrary precision natively."
      }
    }
  ]
}
