🔍 String Matching: Finding Needles in Haystacks
The Detective Story
Imagine you’re a detective with a magnifying glass looking for a specific word hidden in a giant book. You could check every single letter, one by one… but that would take forever! What if you had magical shortcuts that helped you skip ahead when you knew the word couldn’t possibly be there?
That’s exactly what string matching algorithms do. They’re like clever detectives who find patterns in text — super fast!
🧵 What is String Matching?
String matching means finding where a small piece of text (called the pattern) appears inside a bigger piece of text (called the text).
Simple Example:
- Text: “I love eating bananas and apples”
- Pattern: “banana”
- Result: Found at position 14!
Think of it like finding your friend in a crowd. You know what they look like (the pattern), and you scan through people (the text) until you spot them.
🐌 The Slow Way (Brute Force)
Before we learn the magic tricks, let’s see the simple way:
def simple_search(text, pattern):
n = len(text)
m = len(pattern)
for i in range(n - m + 1):
match = True
for j in range(m):
if text[i + j] != pattern[j]:
match = False
break
if match:
return i
return -1
Problem: If the text has 1 million letters and pattern has 1000 letters, we might check 1 billion comparisons! 😰
🚀 KMP Algorithm: The Smart Skipper
The Story
Imagine you’re looking for the word “ABABC” in a long text.
You start matching: A-B-A-B… but then the next letter is wrong! The slow detective would go back to the start and try again from the next position.
But the KMP detective is smarter! They think: “Wait, I just saw A-B at the end of what I matched. That could be the BEGINNING of my pattern! I don’t need to start completely over!”
How KMP Works
Step 1: Build a “Failure Table” (also called LPS array)
This table tells us: “If I fail at position X, where should I jump back to?”
For pattern “ABABC”:
Pattern: A B A B C
Index: 0 1 2 3 4
LPS: 0 0 1 2 0
What does this mean?
- At position 3 (second B), LPS = 2
- This means: the first 2 characters (A-B) match the last 2 we just saw!
- So if we fail after matching “ABAB”, we can pretend we already matched “AB”!
Building the LPS Table
def build_lps(pattern):
m = len(pattern)
lps = [0] * m
length = 0 # Length of previous
# longest prefix suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
The KMP Search
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
lps = build_lps(pattern)
i = 0 # Text pointer
j = 0 # Pattern pointer
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
return i - j # Found!
else:
if j != 0:
j = lps[j - 1] # Smart skip!
else:
i += 1
return -1
Why KMP is Magic
graph TD A[Start Matching] --> B{Characters Match?} B -->|Yes| C[Move Both Pointers] C --> D{Pattern Complete?} D -->|Yes| E[🎉 Found It!] D -->|No| B B -->|No| F{Pattern Pointer = 0?} F -->|Yes| G[Move Text Pointer] F -->|No| H[Jump Using LPS Table] H --> B G --> B
KMP never goes backward in the text! It only moves forward, making it very fast.
Time: O(n + m) — blazing fast! ⚡
🎲 Rabin-Karp: The Fingerprint Detective
The Story
Imagine every word has a unique fingerprint (a number). Instead of comparing letter by letter, you just compare fingerprints!
If the fingerprints don’t match → definitely not the same word. If they match → probably the same word (check to be sure).
The Hash Function
We turn letters into numbers using a clever formula:
# Simple hash: treat string as a number
# "ABC" → A×100 + B×10 + C×1
def calculate_hash(s, base=256, mod=101):
h = 0
for char in s:
h = (h * base + ord(char)) % mod
return h
The Rolling Hash Magic
Here’s the real trick! When we slide our window by one character, we don’t recalculate from scratch:
Old window: "ABC" → hash = 65×256² + 66×256 + 67
New window: "BCD"
Rolling update:
1. Remove 'A': subtract 65×256²
2. Shift left: multiply by 256
3. Add 'D': add 68
def rabin_karp(text, pattern):
n, m = len(text), len(pattern)
base, mod = 256, 101
# Calculate h = base^(m-1) % mod
h = pow(base, m - 1, mod)
# Calculate initial hashes
p_hash = 0 # Pattern hash
t_hash = 0 # Text window hash
for i in range(m):
p_hash = (base * p_hash +
ord(pattern[i])) % mod
t_hash = (base * t_hash +
ord(text[i])) % mod
# Slide the window
for i in range(n - m + 1):
if p_hash == t_hash:
# Hash match! Verify actual match
if text[i:i+m] == pattern:
return i
if i < n - m:
# Roll the hash forward
t_hash = (base * (t_hash -
ord(text[i]) * h) +
ord(text[i + m])) % mod
if t_hash < 0:
t_hash += mod
return -1
Why Use Rabin-Karp?
graph TD A[Calculate Pattern Hash] --> B[Calculate First Window Hash] B --> C{Hashes Match?} C -->|No| D[Roll Hash Forward] D --> E{More Windows?} E -->|Yes| C E -->|No| F[Not Found 😢] C -->|Yes| G[Verify Characters] G -->|Match| H[🎉 Found It!] G -->|No Match| D
Best for: Finding MULTIPLE patterns at once!
Time: O(n + m) average, O(nm) worst case
🔗 Longest Common Prefix (LCP)
The Story
You have two words. How many letters at the START are the same?
- “flower” and “flow” → “flow” (4 letters)
- “dog” and “racecar” → “” (0 letters — nothing in common!)
Simple LCP for Two Strings
def lcp_two(s1, s2):
i = 0
while (i < len(s1) and
i < len(s2) and
s1[i] == s2[i]):
i += 1
return s1[:i]
# Example
print(lcp_two("flower", "flow"))
# Output: "flow"
LCP for Many Strings
What if you have a whole list of words?
def lcp_array(strings):
if not strings:
return ""
# Start with first string
prefix = strings[0]
# Compare with each other string
for s in strings[1:]:
# Shrink prefix until it matches
while not s.startswith(prefix):
prefix = prefix[:-1]
if not prefix:
return ""
return prefix
# Example
words = ["flower", "flow", "flight"]
print(lcp_array(words))
# Output: "fl"
Smarter LCP: Divide and Conquer
def lcp_divide_conquer(strings):
if not strings:
return ""
def lcp_two(s1, s2):
i = 0
while (i < len(s1) and
i < len(s2) and
s1[i] == s2[i]):
i += 1
return s1[:i]
def helper(left, right):
if left == right:
return strings[left]
mid = (left + right) // 2
lcp_left = helper(left, mid)
lcp_right = helper(mid + 1, right)
return lcp_two(lcp_left, lcp_right)
return helper(0, len(strings) - 1)
LCP Visualization
graph TD A["flower, flow, flight"] --> B["flower, flow"] A --> C["flight"] B --> D["flower"] B --> E["flow"] D --> F["LCP: flow"] E --> F F --> G["LCP: fl"] C --> G G --> H["Final: fl"]
🎯 When to Use Each Algorithm?
| Situation | Best Choice | Why |
|---|---|---|
| Find one pattern once | KMP | Guaranteed fast |
| Find multiple patterns | Rabin-Karp | Hash all at once |
| Check pattern start | LCP | Direct comparison |
| Pattern has repetition | KMP | LPS table helps |
| Need simple code | Rabin-Karp | Easy to understand |
🌟 Real World Examples
KMP in Action
- Find all “error” in log files — KMP scans millions of lines quickly
- DNA sequence matching — Find gene patterns in chromosomes
Rabin-Karp in Action
- Plagiarism detection — Check if document chunks appear elsewhere
- Spam filtering — Match known spam phrases
LCP in Action
- Autocomplete — All suggestions share common prefix with what you typed
- File path matching — Find common directory root
🧠 Key Takeaways
- KMP = Build a smart failure table, never go backward
- Rabin-Karp = Use hash fingerprints for quick rejection
- LCP = Find matching beginnings of strings
All three are about being clever — not checking everything, but knowing what to skip!
💪 You’ve Got This!
String matching might sound scary, but remember:
- KMP is just a detective who remembers what they’ve seen
- Rabin-Karp is a detective who uses fingerprints
- LCP is just comparing beginnings
You’re now equipped with the tools to find ANY needle in ANY haystack! 🎉