String Matching

Loading concept...

🔍 String Matching: Finding Needles in Haystacks

The Detective Story

Imagine you’re a detective with a magnifying glass looking for a specific word hidden in a giant book. You could check every single letter, one by one… but that would take forever! What if you had magical shortcuts that helped you skip ahead when you knew the word couldn’t possibly be there?

That’s exactly what string matching algorithms do. They’re like clever detectives who find patterns in text — super fast!


🧵 What is String Matching?

String matching means finding where a small piece of text (called the pattern) appears inside a bigger piece of text (called the text).

Simple Example:

  • Text: “I love eating bananas and apples”
  • Pattern: “banana”
  • Result: Found at position 14!

Think of it like finding your friend in a crowd. You know what they look like (the pattern), and you scan through people (the text) until you spot them.


🐌 The Slow Way (Brute Force)

Before we learn the magic tricks, let’s see the simple way:

def simple_search(text, pattern):
    n = len(text)
    m = len(pattern)
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            return i
    return -1

Problem: If the text has 1 million letters and pattern has 1000 letters, we might check 1 billion comparisons! 😰


🚀 KMP Algorithm: The Smart Skipper

The Story

Imagine you’re looking for the word “ABABC” in a long text.

You start matching: A-B-A-B… but then the next letter is wrong! The slow detective would go back to the start and try again from the next position.

But the KMP detective is smarter! They think: “Wait, I just saw A-B at the end of what I matched. That could be the BEGINNING of my pattern! I don’t need to start completely over!”

How KMP Works

Step 1: Build a “Failure Table” (also called LPS array)

This table tells us: “If I fail at position X, where should I jump back to?”

For pattern “ABABC”:

Pattern:  A  B  A  B  C
Index:    0  1  2  3  4
LPS:      0  0  1  2  0

What does this mean?

  • At position 3 (second B), LPS = 2
  • This means: the first 2 characters (A-B) match the last 2 we just saw!
  • So if we fail after matching “ABAB”, we can pretend we already matched “AB”!

Building the LPS Table

def build_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  # Length of previous
                # longest prefix suffix
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

The KMP Search

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = build_lps(pattern)

    i = 0  # Text pointer
    j = 0  # Pattern pointer

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                return i - j  # Found!
        else:
            if j != 0:
                j = lps[j - 1]  # Smart skip!
            else:
                i += 1
    return -1

Why KMP is Magic

graph TD A[Start Matching] --> B{Characters Match?} B -->|Yes| C[Move Both Pointers] C --> D{Pattern Complete?} D -->|Yes| E[🎉 Found It!] D -->|No| B B -->|No| F{Pattern Pointer = 0?} F -->|Yes| G[Move Text Pointer] F -->|No| H[Jump Using LPS Table] H --> B G --> B

KMP never goes backward in the text! It only moves forward, making it very fast.

Time: O(n + m) — blazing fast! ⚡


🎲 Rabin-Karp: The Fingerprint Detective

The Story

Imagine every word has a unique fingerprint (a number). Instead of comparing letter by letter, you just compare fingerprints!

If the fingerprints don’t match → definitely not the same word. If they match → probably the same word (check to be sure).

The Hash Function

We turn letters into numbers using a clever formula:

# Simple hash: treat string as a number
# "ABC" → A×100 + B×10 + C×1

def calculate_hash(s, base=256, mod=101):
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h

The Rolling Hash Magic

Here’s the real trick! When we slide our window by one character, we don’t recalculate from scratch:

Old window: "ABC" → hash = 65×256² + 66×256 + 67
New window: "BCD"

Rolling update:
1. Remove 'A': subtract 65×256²
2. Shift left: multiply by 256
3. Add 'D': add 68
def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    base, mod = 256, 101

    # Calculate h = base^(m-1) % mod
    h = pow(base, m - 1, mod)

    # Calculate initial hashes
    p_hash = 0  # Pattern hash
    t_hash = 0  # Text window hash

    for i in range(m):
        p_hash = (base * p_hash +
                  ord(pattern[i])) % mod
        t_hash = (base * t_hash +
                  ord(text[i])) % mod

    # Slide the window
    for i in range(n - m + 1):
        if p_hash == t_hash:
            # Hash match! Verify actual match
            if text[i:i+m] == pattern:
                return i

        if i < n - m:
            # Roll the hash forward
            t_hash = (base * (t_hash -
                     ord(text[i]) * h) +
                     ord(text[i + m])) % mod
            if t_hash < 0:
                t_hash += mod

    return -1

Why Use Rabin-Karp?

graph TD A[Calculate Pattern Hash] --> B[Calculate First Window Hash] B --> C{Hashes Match?} C -->|No| D[Roll Hash Forward] D --> E{More Windows?} E -->|Yes| C E -->|No| F[Not Found 😢] C -->|Yes| G[Verify Characters] G -->|Match| H[🎉 Found It!] G -->|No Match| D

Best for: Finding MULTIPLE patterns at once!

Time: O(n + m) average, O(nm) worst case


🔗 Longest Common Prefix (LCP)

The Story

You have two words. How many letters at the START are the same?

  • “flower” and “flow” → “flow” (4 letters)
  • “dog” and “racecar” → “” (0 letters — nothing in common!)

Simple LCP for Two Strings

def lcp_two(s1, s2):
    i = 0
    while (i < len(s1) and
           i < len(s2) and
           s1[i] == s2[i]):
        i += 1
    return s1[:i]

# Example
print(lcp_two("flower", "flow"))
# Output: "flow"

LCP for Many Strings

What if you have a whole list of words?

def lcp_array(strings):
    if not strings:
        return ""

    # Start with first string
    prefix = strings[0]

    # Compare with each other string
    for s in strings[1:]:
        # Shrink prefix until it matches
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""

    return prefix

# Example
words = ["flower", "flow", "flight"]
print(lcp_array(words))
# Output: "fl"

Smarter LCP: Divide and Conquer

def lcp_divide_conquer(strings):
    if not strings:
        return ""

    def lcp_two(s1, s2):
        i = 0
        while (i < len(s1) and
               i < len(s2) and
               s1[i] == s2[i]):
            i += 1
        return s1[:i]

    def helper(left, right):
        if left == right:
            return strings[left]
        mid = (left + right) // 2
        lcp_left = helper(left, mid)
        lcp_right = helper(mid + 1, right)
        return lcp_two(lcp_left, lcp_right)

    return helper(0, len(strings) - 1)

LCP Visualization

graph TD A["flower, flow, flight"] --> B["flower, flow"] A --> C["flight"] B --> D["flower"] B --> E["flow"] D --> F["LCP: flow"] E --> F F --> G["LCP: fl"] C --> G G --> H["Final: fl"]

🎯 When to Use Each Algorithm?

Situation Best Choice Why
Find one pattern once KMP Guaranteed fast
Find multiple patterns Rabin-Karp Hash all at once
Check pattern start LCP Direct comparison
Pattern has repetition KMP LPS table helps
Need simple code Rabin-Karp Easy to understand

🌟 Real World Examples

KMP in Action

  • Find all “error” in log files — KMP scans millions of lines quickly
  • DNA sequence matching — Find gene patterns in chromosomes

Rabin-Karp in Action

  • Plagiarism detection — Check if document chunks appear elsewhere
  • Spam filtering — Match known spam phrases

LCP in Action

  • Autocomplete — All suggestions share common prefix with what you typed
  • File path matching — Find common directory root

🧠 Key Takeaways

  1. KMP = Build a smart failure table, never go backward
  2. Rabin-Karp = Use hash fingerprints for quick rejection
  3. LCP = Find matching beginnings of strings

All three are about being clever — not checking everything, but knowing what to skip!


💪 You’ve Got This!

String matching might sound scary, but remember:

  • KMP is just a detective who remembers what they’ve seen
  • Rabin-Karp is a detective who uses fingerprints
  • LCP is just comparing beginnings

You’re now equipped with the tools to find ANY needle in ANY haystack! 🎉

Loading story...

No Story Available

This concept doesn't have a story yet.

Story Preview

Story - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

Interactive Preview

Interactive - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Interactive Content

This concept doesn't have interactive content yet.

Cheatsheet Preview

Cheatsheet - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Cheatsheet Available

This concept doesn't have a cheatsheet yet.

Quiz Preview

Quiz - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Quiz Available

This concept doesn't have a quiz yet.