FFT Algorithm For Solving Exact Pattern Matching With Don't Cares

This algorithm uses the Fast Fourier Transform to solve the Exact Pattern Matching With Don't Cares problem.

Formal Description

This algorithm treats both the text and the pattern as strings of integer values (the ASCII value of each character, for example), with all "don't care" characters encoded as 0. It then takes advantage of the following summation:

S_i = \sum_{j=1}^m(p_j^3t_{i+j-1}-2p_j^2t_{i+j-1}^2+p_jt_{i+j-1}^3)

which equals 0 if and only if there is an exact match with don't cares starting at index i of the text.[1]
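
To see why, note that each summand factors into non-negative parts (assuming a non-negative encoding):

S_i = \sum_{j=1}^m p_j t_{i+j-1}(p_j-t_{i+j-1})^2

A term vanishes exactly when p_j = 0 (a don't care in the pattern), t_{i+j-1} = 0 (a don't care in the text), or p_j = t_{i+j-1} (the characters agree); hence S_i = 0 precisely when every position of the alignment matches.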

Time Complexity

The summation requires the calculation of three correlations, each of which can be determined by computing a convolution via the Fourier Transform in O(n log n) time. This can be further reduced to O(n log m) time by splitting the text into n/m overlapping sub-strings of size 2m and performing the same computation on each sub-string; the overlap is necessary, as otherwise matches spanning two adjacent sub-strings would be missed.
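
As a sketch of this splitting (match_chunk is a hypothetical helper, not from the actual source, which runs the three-correlation test on a single chunk and reports matches using the global offset):

    /* Chunks of length 2m advance by m, so every alignment i lies wholly
       inside the chunk starting at (i / m) * m; matches spanning a stride
       boundary are therefore still found. */
    void search_chunks(const double *text, int n, const double *pattern, int m)
    {
        for (int start = 0; start + m <= n; start += m) {
            int len = (start + 2 * m <= n) ? 2 * m : n - start;  /* final chunk may be short */
            match_chunk(text + start, len, pattern, m, start);
        }
    }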

Implementation

The library FFTW was used in this implementation; it must be installed to compile the source files. Code for all of the algorithms used to solve this problem can be found here.

In general, the text and pattern were first converted to arrays of their ASCII values, their ASCII values squared and their ASCII values cubed. As these arrays were produced, the pattern was also reversed, allowing the later point-wise multiplication to compute a correlation by way of a convolution, and "padded" with zeros so that the pattern and text were of equal length. The Fourier Transform of each of these arrays was taken and the required convolutions obtained, before applying the inverse Fourier Transform. Each possible alignment of text and pattern was then tested by combining the results of the three convolutions as required by the above formula (at this stage, we also remove the scaling applied by FFTW).
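
As an illustration, a minimal sketch of one of the three correlations using FFTW's real-to-complex interface is given below. This is an assumed reconstruction, not the actual source: text holds the n encoded character values (or their squares/cubes) and revpat the reversed, zero-padded pattern raised to the complementary power.

    #include <fftw3.h>

    /* Correlate the text against the pattern by convolving with the
       reversed, zero-padded pattern; res receives one value per alignment. */
    void fft_correlation(double *text, double *revpat, double *res, int n)
    {
        int nc = n / 2 + 1;                /* length of a real-to-complex DFT */
        fftw_complex *T = fftw_malloc(sizeof(fftw_complex) * nc);
        fftw_complex *P = fftw_malloc(sizeof(fftw_complex) * nc);

        fftw_plan ft = fftw_plan_dft_r2c_1d(n, text, T, FFTW_ESTIMATE);
        fftw_plan fp = fftw_plan_dft_r2c_1d(n, revpat, P, FFTW_ESTIMATE);
        fftw_execute(ft);
        fftw_execute(fp);

        /* point-wise product in the frequency domain: a convolution of the
           text with the reversed pattern, i.e. the desired correlation */
        for (int i = 0; i < nc; i++) {
            double re = T[i][0] * P[i][0] - T[i][1] * P[i][1];
            double im = T[i][0] * P[i][1] + T[i][1] * P[i][0];
            T[i][0] = re;
            T[i][1] = im;
        }

        fftw_plan bt = fftw_plan_dft_c2r_1d(n, T, res, FFTW_ESTIMATE);
        fftw_execute(bt);
        for (int i = 0; i < n; i++)
            res[i] /= n;                   /* undo FFTW's unnormalised scaling */

        fftw_destroy_plan(ft); fftw_destroy_plan(fp); fftw_destroy_plan(bt);
        fftw_free(T); fftw_free(P);
    }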

Owing to floating point inaccuracies, a match was considered to have been found if the absolute value of the result was < 0.1. Further work is required to verify that this is a suitable threshold; in particular, we require that it makes both false negatives and false positives impossible.
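
Continuing the sketch, the combination and threshold step might look as follows; the array names are illustrative, holding the descaled convolutions of p^3 with t, p^2 with t^2 and p with t^3, indexed here so that entry i corresponds to the alignment starting at text position i:

    #include <math.h>
    #include <stdio.h>

    void report_matches(const double *ppp_t, const double *pp_tt,
                        const double *p_ttt, int n, int m)
    {
        for (int i = 0; i + m <= n; i++) {
            double s = ppp_t[i] - 2.0 * pp_tt[i] + p_ttt[i];
            if (fabs(s) < 0.1)             /* tolerance for rounding error */
                printf("match at position %d\n", i);
        }
    }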

A number of variations in the implementation were considered in order to extract the best possible running times. These include:

Standard O(n log n)

     The basic version of this algorithm, which has a time complexity of O(n log n). This follows the implementation outlined above.

Standard O(n log m)

     This applies the standard trick outlined above, which reduces the time complexity to O(n log m). Care must be taken to avoid re-computing values in a loop, including: the squares and cubes of the pattern and text values (the text values being a little more subtle: because the sub-strings overlap, computation on the original text string inside a loop would otherwise be duplicated); the DFTs of the pattern (squared/cubed); and memory allocation for the convolutions and DFTs of the text (squared/cubed).
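
A skeleton of this structure, with all chunk-size-dependent work hoisted out of the loop, might look as follows; the helper names and types are hypothetical, not from the actual source:

    /* Squares and cubes of the whole text are computed once, the pattern's
       three DFTs once, and the per-chunk buffers are allocated once and
       reused on every iteration. */
    void search_nlogm(const double *t, int n, const double *p, int m)
    {
        int len = 2 * m;                                 /* chunk length */
        double *t2 = powers_of(t, n, 2);                 /* whole-text squares */
        double *t3 = powers_of(t, n, 3);                 /* whole-text cubes */
        pattern_dfts *P = transform_pattern(p, m, len);  /* DFTs of p, p^2, p^3 */
        chunk_buffers *B = alloc_chunk_buffers(len);     /* reused DFT space */

        for (int start = 0; start + m <= n; start += m) {
            int clen = (start + 2 * m <= n) ? 2 * m : n - start;
            match_chunk(t + start, t2 + start, t3 + start, clen, P, B, start);
        }
        /* ... free t2, t3, P and B ... */
    }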

Padding to a power of 2

     For both the O(n log n) and O(n log m) versions, a variant where the text and pattern were padded to a power of 2 was considered. This may allow FFTW to compute the Fourier Transforms faster (e.g. these benchmarks), perhaps by the use of algorithms such as Cooley-Tukey, which are greatly simplified when n = 2^k.

In the O(n log m) case, care must be taken to extract the best performance. In particular, we must maintain (i.e. not increase) the overlap of m characters between consecutive sub-strings, otherwise we waste computation in each transform.
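
For illustration (again an assumption about the structure, not the actual source), the padded chunk length and stride might be computed as:

    /* Round the chunk length up to a power of two, then advance each chunk
       by (len - m) so that consecutive chunks still overlap by exactly m
       characters and no transform computation is wasted. */
    static int next_pow2(int x)
    {
        int p = 1;
        while (p < x)
            p <<= 1;
        return p;
    }

    /* usage: int len = next_pow2(2 * m);
              for (int start = 0; start + m <= n; start += len - m) ... */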

Pre-planned transforms

     In all of the above cases, FFTW has been "estimating" the best way to compute the Fourier Transform (determining the best way to break the transform down into smaller problems or setting the best twiddle factors, for example). Performing a full measurement for each n and m at run-time was found to be prohibitively expensive. However, using the Wisdom feature, it is possible to plan transforms of a given size in advance and save this information (although it may still achieve worse performance than a run-time measurement [2]). Hence plans were created for transform sizes of 2^k for k = 0, 1, ..., 21 (note that this implies the above padding to a power of 2).
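
A sketch of how such pre-planning might be done with FFTW's wisdom mechanism follows; the file name is illustrative, and a real implementation would plan the inverse (complex-to-real) transforms in the same way:

    #include <fftw3.h>

    /* Plan each power-of-two size once with a thorough planner; the
       accumulated wisdom survives fftw_destroy_plan and is saved to disk,
       to be re-loaded later with fftw_import_wisdom_from_filename. */
    void build_wisdom(void)
    {
        for (int k = 0; k <= 21; k++) {
            int n = 1 << k;
            double *in = fftw_malloc(sizeof(double) * n);
            fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));
            fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_PATIENT);
            fftw_destroy_plan(plan);
            fftw_free(in);
            fftw_free(out);
        }
        fftw_export_wisdom_to_filename("fft_match.wisdom");
    }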

Given a specific search application, it seems reasonable that time could be set aside to pre-compute these plans.

Real-to-Real Transforms

The above transforms have all used FFTW's real-to-complex (forward) and complex-to-real (backward) transforms; however, a real-to-real interface is also provided[3], and a variant using it was also considered.
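
With real-to-real transforms (kind FFTW_R2HC), the forward output is in "halfcomplex" order r0, r1, ..., r_{n/2}, i_{(n+1)/2-1}, ..., i_1, so the point-wise product must respect that packing. A sketch of this step (an assumption about how the variant might be written, not the actual source):

    #include <fftw3.h>

    /* Multiply two halfcomplex arrays a and b of length n into c, as
       produced by fftw_plan_r2r_1d(..., FFTW_R2HC, ...); the result can be
       transformed back with kind FFTW_HC2R. */
    void halfcomplex_multiply(const double *a, const double *b, double *c, int n)
    {
        c[0] = a[0] * b[0];                     /* DC term is purely real */
        for (int k = 1; k < (n + 1) / 2; k++) {
            double re = a[k] * b[k] - a[n - k] * b[n - k];
            double im = a[k] * b[n - k] + a[n - k] * b[k];
            c[k] = re;
            c[n - k] = im;
        }
        if (n % 2 == 0)
            c[n / 2] = a[n / 2] * b[n / 2];     /* Nyquist term, also real */
    }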

No Wild Cards In Text

A variation on the algorithm where wild cards were allowed only in the pattern, and not in the text, was also considered. It seems entirely plausible that this kind of situation may arise in use. This allows the above formula for determining if a match exists to be modified to:

S_i = \sum_{j=1}^m(p_j^3-2p_j^2t_{i+j-1}+p_jt_{i+j-1}^2)

This allows the sum of the cubes of the pattern values to be computed once, hence reducing the number of Fourier Transforms that need to be applied.
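
Expanding shows that this is the same factorisation with the text term dropped:

S_i = \sum_{j=1}^m p_j(p_j-t_{i+j-1})^2

Since \sum_{j=1}^m p_j^3 does not depend on i, it can be computed once directly; only the correlations of p^2 with t and of p with t^2 vary with the alignment, so two convolutions are needed rather than three.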

Minimum Sub-string length

As will be discussed in the following section, the run-time of the O(n log m) version of the algorithm has been seen to be highly affected by small variations in m when m is small. Hence a version enforcing a minimum sub-string size of 2^11 = 2048 was considered in order to alleviate this problem.

Randomized Methods

It is possible to reduce the number of required convolutions from 3 to 2 by introducing randomization [4]. This is a Monte Carlo algorithm and thus introduces the possibility of a false positive, though it is worth noting that, in practice, it may be possible to verify potential matches naively without any real effect on the run-time.

If we then further consider the special case where there are no wild cards in the text, only one convolution is required.

Combinations Of The Above

Clearly a number of the above variations can be combined.

Actual Running Time

For details of the set-up used to generate these graphs and comparisons with other methods for solving the problem, see here.

The inconsistency of O(n log m) varieties

The following graph shows some of the FFT-based methods for a fixed text size of 2^21 = 2097152 and a varying but small pattern size:

Graph of the FFT-based methods for solving exact matching with don't cares, for a fixed text size and varying pattern size

As can be seen, both the unpadded and padded versions of the O(n log m) algorithm fluctuate greatly. There are a number of contributing factors:

  • There are large fluctuations in the performance of FFTW (and, indeed, other implementations of the Fourier Transform) for varying transform sizes when the transform size is not a power of 2. For further details, see FFTW's benchmarks and the pages linked from there.
  • Even if the transform size is a power of 2, as in the padded case, FFTW is slower for small transform sizes (see FFTW's benchmark page and the links from there). Hence even if m changes by only a small amount, the speed of the transform can change, and given the large number of transforms being applied in this case, the effect can be amplified in the results.
  • Variations can be observed even for the same sized text and pattern; this seems to have particularly affected the pre-planned O(n log m) algorithm for small m. The causes of these variations are somewhat a mystery, though it should be noted that the processor (or rather, its architecture) on which these tests were benchmarked is apparently known to be inconsistent (Matteo Frigo, co-developer of FFTW, email communication).

It may be important to consider the sheer number of transforms that are potentially being applied in the O(n log m) case. For example, if n = 1000000 and m = 32, then there are roughly n / m = 31250 sub-strings to consider. Each sub-string involves 3 forward transforms for the text and 3 inverse transforms for the point-wise products, giving a total of approximately 31250 * 6 = 187500 transforms. So a total variation of, say, 5 seconds is equivalent to a difference of roughly 27μs per transform.

Given these variations, the "forced-minimum-size" variation was implemented, in which a minimum sub-string size of 2048 is used. This size was picked simply because it appeared to be the smallest power of 2 for which the run-times were not erratic; further work could attempt to determine an optimum sub-string size. The next section considers only the forced-minimum-size approaches, as they are considered to be the 'better' option (though it focuses on larger pattern sizes, at which point the minimum size clearly has no effect).

Comparison Of FFT-based methods

The following graph shows the run-time of the FFT-based methods with a fixed text size of 2^21 = 2097152 and varying pattern sizes:

Available in full resolution here or in SVG format here

The first thing to notice is that all versions of the O(n log m) algorithm are susceptible to variations in run-time, even though they all have sub-strings padded to a power of 2. The explanation seems to be that if a given m is 'far' from a power of 2, the sub-string size must be rounded up considerably, whereas if m is 'close' to a power of 2, it is only rounded up a small amount. In the former case each sub-string of text covers more alignments, and hence fewer overlapping sub-strings are required overall. As m grows, these variations diminish and we instead observe the perhaps expected behaviour of the run-time being unaffected by changes in m between consecutive powers of 2.

Next we consider the point at which the O(n log m) algorithms become slower than their O(n log n) counterparts. The above graph shows 4 variations of the FFT approach (standard; real-to-real transforms; no wild cards in the text; and real-to-real combined with no wild cards in the text) using both the O(n log n) and O(n log m) algorithms. In all 4 cases, the cross-over point is at m \approx n/4.

Finally, we note that as m increases past n/2, the O(n log m) algorithm becomes significantly worse than the O(n log n) approach, because it is then computing transforms of size greater than n. This is not shown on the above graph but can be seen in this graph in PNG format or in SVG format.


Practicality

Open Questions

A number of questions have not been answered by this study:

  • What is a suitable threshold for determining if S_i = 0, given floating point inaccuracies?
  • How is the run-time affected if we vary the sub-string size in the O(n log m) method (e.g. using a larger multiple of m), whilst maintaining the overlap of m characters?

References

  1. P. Clifford, R. Clifford, Simple deterministic wildcard matching, in: Information Processing Letters 101 (2007) 53–54 link
  2. Caveats in Using Wisdom link
  3. Real-to-Real transforms link
  4. Adam Kalai. Efficient Pattern-Matching with Don't Cares. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 655-656, 2002