Knuth-Morris-Pratt String Matching Algorithm

Subscribe Send me a message home page tags


Related Reading

Details

The idea of KMP algorithm is to skip unnecessary comparison of text and pattern. Suppose we start the search of the pattern \(P\) in the text \(T\) at position \(k\). For a brutal search approach, if there is a mismatch, it will start the next start from position \(k+1\). It turns out we can do much better than this. The key component is the longest prefix of the pattern string that is also a suffix.

KMP_algorithm.png

Definition - Happy prefix: A happy prefix of a string is a prefix that is also a suffix of the string and is not equal to the string itself.

The key data structure is an array (denoted by A) defined as follow:

For an index k, A[k] is the index of the last char in the longest happy prefix of substring \(P_0P_1...P_k\). In other words, \(P_0P_1...P_{A[k]}\) is the longest happy prefix of \(P_0P_1...P_k\).

If the substring does not have any happy prefix then A[k] is set to -1.

Note that by definition, A[0] = -1.

Here is the code to construct the array. The recursive relationship is straightforward:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
public int[] constructArray(String s) {
    if (s.length() <= 1) {
        return new int[]{};
    }


    int[] A = new int[s.length()];
    A[0] = -1;


    for (int i = 1; i < s.length(); ++i) {

        int k = A[i-1];
        while (k >= 0 && s.charAt(i) != s.charAt(k+1)) {
            k = A[k];
        }

        if (s.charAt(i) == s.charAt(k+1)) {
            A[i] = k + 1;
        } else {
            A[i] = k; // k is -1 in this case.
        }
    }

    return A;
}

Now, what is the time complexity of this algorithm? The answer is \(O(N)\) where \(N\) is the length of the pattern string. The proof is quite elegant.

We first rewrite the code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
public int[] constructArray(String s) {
    if (s.length() <= 1) {
        return new int[]{};
    }


    int[] A = new int[s.length()];
    A[0] = -1;
    int k = -1;

    for (int i = 1; i < s.length(); ++i) {

        while (k >= 0 && s.charAt(i) != s.charAt(k+1)) {
            k = A[k];
        }

        if (s.charAt(i) == s.charAt(k+1)) {
            ++k;
        }

        A[i] = k;
    }

    return A;
}

Consider the quantity \(q = 2i - k\), for every execution of statement in the for loop or the while loop, this quantity can only increase. If we are in the while loop then k decrease thus \(q\) increases. If we are at line 18 and k increases by one; however, \(i\) increases by one as well in the the outer for loop so the net result is \(q\) increases by 1.

The lower bound of \(q\) is 1 when \(i = 0\) and \(k = -1\). The upper bound is \(2N\) so there will be no more than \(2N\) execution in the loop.

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?

/static/shopping_demo.png