Algorithm - Approximate String Matching

Subscribe Send me a message home page

Related Reading


Data Structure

We will maintain a state vector which tracks the current state of pattern matching. More specifically, let \(S_j\) denote the state vector of size \(m\) given a current position j in the text. \(S_j\) contains individual states of the search between each prefix of \(P\) and the corresponding substring of \(t\). Namely, for \(1 \leq i \leq m\), \(S_j[i]\) is the number of mismatches between \(p_1...p_i\) and \(t_{j-i+1}...t_j\).

This vector can be viewed as a number of base \(2^b\), where \(b = \lceil log_2(m+1) \rceil \). (We don't really need to encode all \(m\) numbers, setting \(b = \lceil log_2(k+1) \rceil \) should suffice as well.)

Boyer-Moor Strategy

The insight of Boyer-Moor algorithm is that we don't necessarily need to scan all characters in the text. In the context of approximate string matching, if l is the maximum length of the substring of text ending at \(t_j\) where the distance between this substring and the prefix of pattern P (e.g. \(p_1p_2...p_l\)) is less than \(k\), then we know the next candidate of a matched substring in the text should end at least at \(j + m - l\).


Boyer-Moore Strategy to Approximate String Matching

The pseudo code below is copied from [1]:


Notice that when processing the orange part in the figures below, we start from the right side and move towards left. This is the same as what we have in Boyer-Moor algorithm when doing the exact matching. Another observation is that updating the orange part should not have impact on the condition check in line 4 thanks to the bit vector representation.


----- END -----

Send me a message Subscribe to blog updates

Want some fun stuff?