Exact Pattern Matching With Don't Cares

From StringPedia

Jump to: navigation, search

Exact Pattern Matching With Don't Cares is a variant of the Exact Pattern Matching problem which allows a wild-card character (typically '?') to represent any single character from the alphabet. The wild-card may appear multiple times in both pattern and text.

Contents

Formal Description

Given text T of length n and a pattern P of length m, P matches T if:

\exists i such that P[j] = T[i + j − 1] or P[j] = ? or T[i + j − 1] = ? \forall 0\leq j<m

Algorithms

Comparison of Methods

Methodology

Computer Specifications:

  • 2x Intel Xeon 1700Mhz Processors
  • 2GBs of memory
  • Running Ubuntu 9.10 (Karmic)

Compiler: gcc 4.4.1

Flags: -Wall -pedantic -std=c99 -O2

The text was generated as an − 1b and the pattern as an − 1b, ensuring worst-case performance. Note that although neither text nor pattern contain wild-cards in this case, the presence (or lack of) wildcards should have no effect on run times, but leaving them out does allow a comparison against methods where we assume there are no wild cards in the text.

Each test was run 7 times and an average taken.

Run Time

More detailed analysis of the run times of each method can be found on the pages for the respective algorithms ( The naive approach, FFT-based, Using Number Theory). This section concerns itself with the comparison between each method.

When is naive really naive?

The first question to be answered when comparing run times is: at what point does the naive method become slower than the alternatives? Although it has a worse asymptotic time complexity, for small m it will perform better than some other methods. The following graph compares the naive method, the FLINT library for number theory and a number of variations on the FFT-based algorithm which are discussed in detail here. In this case, the text size has been fixed at 220 = 1048576 and the pattern varies between 32 and 1000.

Available in full resolution here or in SVG format here

As we can see, the point at which each method becomes faster than the naive approach varies from around 50 up to about 800 characters, in the case of the FLINT library.

The best O(nlogn) approach in this case is the pre-planned, real-to-real transform, which runs faster for m \gtrsim 300

The best O(nlogm) approach in this case is the pre-planned, real-to-real transform with a forced minimum sub-string size, which runs faster for m \gtrsim 80

If we consider the variation where we only allow wild-cards in the pattern and not the text these bounds are futher lowered to ~200 and ~50 respectively.

The worst FFT-based approach by far is in fact the only FFT-based method which has not included pre-planning. Given a dedicated searching application, however, it seems reasonable that pre-planned transforms would be available.

These observations seem to hold within reasonable bounds (perhaps with the exception of the FLINT library, which seems to improve in comparison to the naive method as m decreases); graphs of different text sizes can be found at the end of this section.

New Comparisons With Randomized Algorithms

New comparisons involving a randomized approach can be seen here in SVG format or in PNG format. This shows a comparison of the best FFT-based methods with two randomized approaches, one of which only allows wild cards in the text. All of the FFT methods are O(nlogm), pre-planned and use a minimum sized transform.

The use of randomization reduces the point at which the naive method becomes slower. In the case where wild cards are allowed in both text and pattern, it is reduced from around 80 to around 45; if wild cards are only allowed in the pattern it is reduce from around 50 to around 25.

Further discussion on the randomized algorithms can be found here.

What happens as m grows?

The following graph shows a comparison between the FLINT number theory library and the two standard pre-planned FFT implementations. The naive method is far too slow to be included for comparison (see here for example running times of the naive method for larger pattern sizes).

Available in full resolution here or in SVG format here

As can be seen, the FLINT library is substantially slower than the FFT implementations, whilst the O(nlogm) algorithm is faster than the O(nlogn) algorithm whilst m is relatively small. Further comparison between the FFT-based methods can be found here.

List Of Graphs

The following is a list of graphs that compare most or all of these methods.


Memory Usage

Literature References

The following is a (no-doubt incomplete) listing of references to this problem in the literature:

  • R. Cole, R. Hariharan, Verifying candidate matches in sparse and wildcard matching, in: Proceedings of the Annual ACM Symposium on Theory of Computing, 2002, pp. 592–601.
  • M. Fischer, M. Paterson, String matching and other products, in: R. Karp (Ed.), Proceedings of the 7th SIAM–AMS Complexity of Computation, 1974, pp. 113–125.
  • P. Indyk, Faster algorithms for string matching problems: Matching the convolution bound, in: Proceedings of the 38th Annual Symposium on Foundations of Computer Science, 1998, pp. 166–173.
  • A. Kalai, Efficient pattern-matching with don’t cares, in: Proceedings of the 13th Annual ACM–SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics,Philadelphia, PA, USA, 2002, pp. 655–656.
  • P. Clifford, R. Clifford, Simple deterministic wildcard matching, in: Information Processing Letters 101 (2007) 53–54 link
Personal tools