# Knuth-Morris-Pratt String Matching Algorithm

### Details

The idea of KMP algorithm is to skip unnecessary comparison of text and pattern. Suppose we start the search of the pattern $$P$$ in the text $$T$$ at position $$k$$. For a brutal search approach, if there is a mismatch, it will start the next start from position $$k+1$$. It turns out we can do much better than this. The key component is the longest prefix of the pattern string that is also a suffix. Definition - Happy prefix: A happy prefix of a string is a prefix that is also a suffix of the string and is not equal to the string itself.

The key data structure is an array (denoted by A) defined as follow:

For an index k, A[k] is the index of the last char in the longest happy prefix of substring $$P_0P_1...P_k$$. In other words, $$P_0P_1...P_{A[k]}$$ is the longest happy prefix of $$P_0P_1...P_k$$.

If the substring does not have any happy prefix then A[k] is set to -1.

Note that by definition, A = -1.

Here is the code to construct the array. The recursive relationship is straightforward:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
public int[] constructArray(String s) {
if (s.length() <= 1) {
return new int[]{};
}

int[] A = new int[s.length()];
A = -1;

for (int i = 1; i < s.length(); ++i) {

int k = A[i-1];
while (k >= 0 && s.charAt(i) != s.charAt(k+1)) {
k = A[k];
}

if (s.charAt(i) == s.charAt(k+1)) {
A[i] = k + 1;
} else {
A[i] = k; // k is -1 in this case.
}
}

return A;
}


Now, what is the time complexity of this algorithm? The answer is $$O(N)$$ where $$N$$ is the length of the pattern string. The proof is quite elegant.

We first rewrite the code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
public int[] constructArray(String s) {
if (s.length() <= 1) {
return new int[]{};
}

int[] A = new int[s.length()];
A = -1;
int k = -1;

for (int i = 1; i < s.length(); ++i) {

while (k >= 0 && s.charAt(i) != s.charAt(k+1)) {
k = A[k];
}

if (s.charAt(i) == s.charAt(k+1)) {
++k;
}

A[i] = k;
}

return A;
}


Consider the quantity $$q = 2i - k$$, for every execution of statement in the for loop or the while loop, this quantity can only increase. If we are in the while loop then k decrease thus $$q$$ increases. If we are at line 18 and k increases by one; however, $$i$$ increases by one as well in the the outer for loop so the net result is $$q$$ increases by 1.

The lower bound of $$q$$ is 1 when $$i = 0$$ and $$k = -1$$. The upper bound is $$2N$$ so there will be no more than $$2N$$ execution in the loop.

----- END -----