First Know Thy Self: Understanding String Matching Performance With ASCII
Introduction
When we seek performance, it is important to understand not only when we are under-performing, but also why. Both of these can be difficult questions to answer, especially once the tasks we want our code to perform become non-trivial. While we have a lot of tools at our disposal, ranging from algorithmic analysis to benchmarks, applying them consistently, or well, is a matter of both research and practice. We often assume that developers have these skills, but we rarely talk about the specifics: how do you determine which tools to use, or believe, when answering such questions?
In this article, we will continue our series on dealing with ASCII text. We have already considered performance, but in a relatively simple context, both in terms of what was under-performing and how we discovered it. This time, we will take a look at a more difficult problem: string matching on ASCII strings. We will discuss setting up the right benchmarks to evaluate the performance of such a task, and then present three solutions to the problem. We will describe and analyze each one before comparing their performance according to our benchmarks, discussing why we are seeing the results that we are. We will also consider how our different analysis tools agree, or disagree, throughout.
To enjoy this article to its fullest, it helps to be broadly familiar with algorithm analysis and automata theory though neither is required.
The problem
The problem of string matching requires us, when given a string to find (the 'needle') and a larger string to search within (the 'haystack'), either to state that the needle does not occur anywhere in the haystack, or to state the first position in the haystack at which a copy of the needle begins. In our case, we will assume that our strings are ASCII; following on from our article about ASCII text, we used ByteArray as the underlying representation. Thus, throughout this article, we will use ByteArrays to represent both needles and haystacks, with the understanding that they both contain only valid ASCII.
To specify our problem precisely, we will need some definitions. Throughout, let s,t be sequences, whose lengths are k, l respectively. We will use the conventional x[i] to refer to the item in sequence x at the (zero-indexed) position i. Intervals are discrete unless otherwise stated. We assume valid index positions for s (and analogously for t) are in [0,k).
A factor of s is some t such that:
l≤k;
There exists i∈[0,k−l] such that s[i]=t[0]; and
For any j∈[0,l), s[i+j]=t[j].
For example, "cat" is a factor of "catamaran" and "non-categorical", but isn't a factor of "miaow" or "commentariat".
We can now describe our problem. Given two ASCII sequences (called the 'needle' and 'haystack'), either verify that the needle is not a factor of the haystack, or produce a valid haystack index i such that:
There is a copy of the needle in the haystack, starting from i; and
There is no valid haystack index j<i such that there is a copy of the needle in the haystack starting from j (that is, i is minimal).
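Before we try to make this fast, it helps to pin the specification down as executable code. Below is a minimal (and deliberately slow) transcription of the above, over lists of Word8 rather than ByteArrays; the name matchSpec is ours, purely for illustration.

import Data.List (findIndex, isPrefixOf, tails)
import Data.Word (Word8)

matchSpec :: [Word8] -> [Word8] -> Maybe Int
matchSpec needle haystack = findIndex (needle `isPrefixOf`) (tails haystack)

Since tails yields suffixes in order of increasing starting index, the first index at which the needle is a prefix of a suffix is exactly the minimal i required above.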
First attempt: brute force
Let's begin with the first thing that could possibly work:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

-- Imports used by the code throughout this article:
import Control.Monad.ST (runST)
import Data.Bits (setBit, unsafeShiftL, (.&.))
import Data.Foldable (for_)
import Data.Primitive.ByteArray
  (ByteArray, compareByteArrays, indexByteArray, sizeofByteArray)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as Vector
import qualified Data.Vector.Unboxed.Mutable as MVector
import Data.Word (Word64, Word8)

matchRestarting :: ByteArray -> ByteArray -> Maybe Int
matchRestarting needle haystack
  | needleLen == 0 = Just 0
  | needleLen > haystackLen = Nothing
  | otherwise = do
      let first = indexByteArray needle 0
      findFirst haystack 0 first >>= go first
  where
    haystackLen :: Int
    !haystackLen = sizeofByteArray haystack
    needleLen :: Int
    !needleLen = sizeofByteArray needle
    limit :: Int
    !limit = haystackLen - needleLen
    go :: Word8 -> Int -> Maybe Int
    go !first !ix
      | ix > limit = Nothing
      | otherwise = case compareByteArrays haystack ix needle 0 needleLen of
          EQ -> pure ix
          _ -> findFirst haystack (ix + 1) first >>= go first

-- Helper
findFirst :: ByteArray -> Int -> Word8 -> Maybe Int
findFirst !ba !start !target = case go start of
  (-1) -> Nothing
  i -> Just i
  where
    go :: Int -> Int
    go !i
      | i >= len = -1
      | indexByteArray ba i == target = i
      | otherwise = go (i + 1)
    len :: Int
    !len = sizeofByteArray ba
This corresponds directly to the brute force algorithm: we loop through the entire haystack, trying to find matches to the needle. If we fail to match, we step forward, then try again, until we either find a match or run out of haystack to search.
We make a few small changes to the algorithm as described, mostly for efficiency:
If we have a needle larger than the haystack, we don't bother doing any searching, as we could never find a match anyway.
Instead of starting at the beginning and stepping one byte per 'miss', we use findFirst to 'skip forward' to the next instance of the first byte in the needle.
Instead of an inner loop to drive the comparison, we use compareByteArrays, which is much more efficient[1].
The second change deserves some discussion. The original definition of the brute force algorithm only advances a single index position every time we fail to match the needle (or 'miss'). However, if the byte at this new position does not match the first byte of the needle, there is no point checking whether we have a match: we are guaranteed to 'miss'. Thus, we use findFirst to skip over any such cases, so that we can be sure that any test for a match at least starts with the right byte. This can potentially net us significant improvements in cases where the first byte of the needle is not common in the haystack.
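To make the behaviour concrete, here is a quick check in GHCi, using a small helper to build ASCII ByteArrays (the bytes helper and the example strings are ours, not part of the article's benchmark code):

>>> import Data.Primitive.ByteArray (byteArrayFromList)
>>> import Data.Word (Word8)
>>> let bytes s = byteArrayFromList (map (fromIntegral . fromEnum) s :: [Word8])
>>> matchRestarting (bytes "cat") (bytes "non-categorical")
Just 4
>>> matchRestarting (bytes "cat") (bytes "miaow")
Nothing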
Before we do any measuring, it is worth considering the inherent time and space complexity of our choice. If we have a needle of length k and a haystack of length n, the brute force algorithm for string matching famously has worst-case Θ(nk) time complexity, and Θ(1) space complexity. This time complexity arises because every time we 'miss', we have to examine a sequence of bytes with length k, but only advance one byte, which potentially means every byte requires Θ(k) examinations; as there are n bytes, we get Θ(nk). Our implementational improvements don't affect the worst-case analysis in any way, as findFirst may not gain us a larger 'skip', and while compareByteArrays is more efficient than a simple compare loop in practice, it still must examine Θ(k) bytes in the worst case. The space complexity stems from all our overheads being fixed: essentially only enough information to track our loops.
At the same time, practical running times mirroring worst-case algorithmic complexity require fairly pathological cases. More typically, we can expect better performance: on average, the total number of byte comparisons in a single run of the brute force algorithm is approximately 2n, irrespective of needle length, which suggests an average-case time complexity of Θ(n). However, this analysis makes some assumptions that might not be true of 'typical' ASCII text, much less the kind of inputs we can expect to actually see.
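One back-of-the-envelope way to see where a figure like 2n can come from, under the (strong) assumption that haystack bytes are independent and uniformly distributed over an alphabet of size σ: the expected number of byte comparisons at any single alignment is at most 1 + 1/σ + 1/σ² + ⋯ < σ/(σ−1) ≤ 2, and with at most n alignments, the expected total stays below 2n, independently of the needle length k. Real ASCII text is, of course, far from uniformly random.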
To this end, we can now discuss, and perform, some measurements of our approach in practice. After this, we can consider how we can improve our approach.
Measuring performance
To ensure that we get meaningful results, we have to select our benchmarking cases carefully. This is especially the case with string matching, as we have to choose not only the size of the needle and haystack, but also their composition. The reasons for this are to do with pathological cases exhibited by many string matching algorithms, including both the naive approach above, as well as the two more optimized approaches we will discuss later.
To that end, we will use three different benchmark cases. For all of these, the haystacks will be roughly half a megabyte, while the needles will vary in size between 8 and 64 bytes. The compositions of each case are as follows:
The haystack is 8096 copies of 64 'a' bytes, followed by the needle. The needle is 63 'a' bytes, followed by a 'b' byte. We will call this the 'worst case', for reasons we will discuss later.
The haystack is 8096 copies of 64 'b' bytes, followed by the needle. The needle is the same as for the 'worst case'. We will call this the 'best case', again, for reasons we are about to discuss.
The haystack is 11,775 copies of the bytes "All work and no play makes Jack a dull boy." plus a newline byte, all followed by the needle. The needle is the bytes of "overseer". We will call this the 'average case'.
We will discuss the motivations of each of the 'cases' in turn. The composition of the 'worst case' is designed specifically to make the brute force approach work as hard as possible. To see why, let us consider what would happen during a 'run' of the brute force implementation:
Since the starting byte of the needle is 'a', findFirst would find an instance at position 0.
Trying the match would fail, as the 64th byte is an 'a', rather than a 'b'.
findFirst would find another 'a' byte at position 1.
Trying the match would fail again, as the 65th byte is an 'a', rather than a 'b'.
findFirst would find another 'a' byte at position 2.
Etc.
This effectively means that we would need to run a linear number of matches, each of equal length to the needle, before we find the needle at the very end. This would mean every 'a' leading up to the needle match at the end of the haystack would be checked k times, where k is the length of the needle, which, given a haystack of length n, gives Θ(nk) performance exactly. This truly is the worst possible performance, hence our choice.
The 'best case' is meant to act as almost the polar opposite of the 'worst case'. As findFirst will be able to skip all the 'b' bytes before reaching the needle at the end of the haystack, we only need to try one needle match, which immediately succeeds. This would essentially make the performance linear in the size of the haystack. We include this case specifically to test the ability to 'skip' as many places in the haystack as possible which couldn't be matches.
Lastly, the 'average case' is designed to look like reasonable (if somewhat strange) ASCII text. Whereas both the 'best case' and 'worst case' benchmarks are designed specifically to find limits in each implementation, the 'average case' instead gives a more realistic demonstration of performance. Specifically, the haystack contains several 'o' bytes we could skip to, as well as a mixture of different bytes in both the needle and haystack.
By choosing these specific cases, we can show several things. Firstly, we can show the limits of the brute force approach, both on a highly favourable, and a highly unfavourable, set of parameters. Secondly, we can see whether our subsequent attempts improve on these limits in any way. Lastly, via the 'average case', we can see whether improvements at these limits mean anything in practice. This should give us a good indication of which of our attempts is the 'right choice'.
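As a sketch of what constructing these inputs might look like (the helper names here are ours; the article's actual benchmark suite is not shown), using byteArrayFromList from Data.Primitive.ByteArray:

import Data.Primitive.ByteArray (byteArrayFromList)

ascii :: String -> [Word8]
ascii = map (fromIntegral . fromEnum)

worstNeedle, worstHaystack, bestHaystack, averageNeedle, averageHaystack :: ByteArray
worstNeedle = byteArrayFromList (ascii (replicate 63 'a' ++ "b"))
worstHaystack =
  byteArrayFromList
    (concat (replicate 8096 (ascii (replicate 64 'a'))) ++ ascii (replicate 63 'a' ++ "b"))
bestHaystack =
  byteArrayFromList
    (concat (replicate 8096 (ascii (replicate 64 'b'))) ++ ascii (replicate 63 'a' ++ "b"))
averageNeedle = byteArrayFromList (ascii "overseer")
averageHaystack =
  byteArrayFromList
    (concat (replicate 11775 (ascii "All work and no play makes Jack a dull boy.\n"))
       ++ ascii "overseer")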
How good is our first try?
Our initial performance measurements are as follows:
All
Worst-case, naive: OK
1.77 ms ± 9.6 μs, 101 B allocated, 2 B copied, 21 MB peak memory
Best-case, naive: OK
219 μs ± 3.0 μs, 79 B allocated, 0 B copied, 25 MB peak memory
Average, naive: OK
334 μs ± 6.6 μs, 87 B allocated, 1 B copied, 26 MB peak memory
We can quickly see the disparity between the 'worst case' and 'best case', despite the needle being the same in both situations. This is unsurprising, as we are comparing quadratic performance to linear performance. We can also see that for the 'average case', we're closer to the 'best case', being only about 50% worse.
It is worth mentioning here that we could potentially improve the 'worst case' by not using findFirst. Had we implemented the brute force algorithm as originally defined, we would advance to the next index every time we 'missed' instead of calling findFirst. This would decrease overheads in the worst case considerably, as we would no longer need to make Θ(n) calls to a function. However, this would come at a large cost for both the 'best case' and 'average case':
All
Worst-case, no-skip: OK
1.54 ms ± 14 μs, 117 B allocated, 5 B copied, 21 MB peak memory
Best-case, no-skip: OK
1.44 ms ± 24 μs, 178 B allocated, 10 B copied, 25 MB peak memory
Average, no-skip: OK
1.54 ms ± 11 μs, 86 B allocated, 2 B copied, 26 MB peak memory
We can see that, while the 'worst case' does become better, it does so by a small amount (~15%), but the 'best case' becomes almost seven times worse, while the 'average case' becomes just as bad as the 'worst case'[3]. This highlights the importance of benchmark choice: had we ignored the 'best case' and 'average case', we would have concluded that findFirst was actually detrimental.
Second attempt: Boyer-Moore-Horspool
While the brute force approach is definitely far from unusable, we would prefer to improve on it if we could. Before we can do this, however, we need to understand what exactly makes the brute force approach weak. We can then attempt to address it, and see if it helps.
The key problem with the brute force approach is how we handle match failure, or, more precisely, what we do after a match to the needle fails. When the brute force approach detects a match failure of the needle at index i, it tries again at index i + 1. In general, this is unavoidable, as for an arbitrary needle failing to match inside an arbitrary haystack, we cannot be sure that 'skipping' any further doesn't risk missing a match. However, this ignores the fact that, for some needles, we can potentially 'skip' much more, depending on the needle and what we mismatched on. By not making use of this information, the brute force approach is forced to take the worst option 'just in case'.
To see why this can help, consider the needle "cats". Suppose we try to match our needle against a haystack, and detect a mismatch at position i, where, instead of what we expected, we saw a 'c'. This already tells us a lot: there is no possible way that a match starting anywhere before i could work, as a 'c' can only occur at the start of our needle, and nowhere else. Thus, we could safely 'skip' our next starting match point to i, without having to worry that we ignored potential matches. In fact, for any needle, we can calculate what 'safe skip distances' we have by simply recording the last position in the needle at which any given byte occurs: that position tells us exactly how far it is safe to skip on a mismatch.
This specific insight enables the Boyer-Moore algorithm, which preprocesses the input needle and constructs a table of safe 'skip distances'. Then, depending on where a mismatch is detected, the algorithm can use that table to 'skip' as much as is safe. Thus, the Boyer-Moore (or BM) algorithm is used in several standard libraries (specifically Boost and the Go standard library), as well as GNU grep.
We choose instead to use a simplified version of Boyer-Moore, known as Boyer-Moore-Horspool, or BMH. The simplification relates to how we calculate the 'skip' on a needle mismatch. Suppose that, at position i in the haystack, we detect a mismatch on a needle of length k. Instead of trying to establish where the mismatch arose, we examine the byte at position i + k - 1 (that is, the haystack byte lining up with the last needle byte), and use it to calculate the 'skip' we require by checking the skip table. This works because, regardless of whether that last position matched or not, as long as our next attempt aligns the rightmost occurrence of that byte within the needle (ignoring the needle's final position) with that haystack position, we can be sure we won't miss any possible matches. Furthermore, this simplifies the implementation to a small extension of the brute force method.
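As a small worked example of such a table, here is a throwaway sketch of our own (keyed by Char for readability; it is not the implementation below) that computes BMH skip distances for a needle given as a String:

import qualified Data.Map.Strict as Map

bmhSkips :: String -> Map.Map Char Int
bmhSkips needle =
  let k = length needle
   in Map.fromList [(c, k - 1 - i) | (i, c) <- zip [0 ..] (take (k - 1) needle)]

For the needle "cats", this gives fromList [('a',2),('c',3),('t',1)]; any byte not listed (including 's', which only appears in the final position) keeps the default skip of 4, the full needle length. Note that Map.fromList retains the last value for a repeated key, mirroring how later writes overwrite earlier ones in the real table: a repeated byte's skip is always determined by its rightmost occurrence before the final position.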
Putting all this together, we get the following implementation:
matchBMH :: ByteArray -> ByteArray -> Maybe Int
matchBMH needle haystack
  | needleLen == 0 = Just 0
  | needleLen > haystackLen = Nothing
  | otherwise = go 0
  where
    haystackLen :: Int
    !haystackLen = sizeofByteArray haystack
    needleLen :: Int
    !needleLen = sizeofByteArray needle
    limit :: Int
    !limit = haystackLen - needleLen
    go :: Int -> Maybe Int
    go !ix
      | ix > limit = Nothing
      | otherwise = case compareByteArrays haystack ix needle 0 needleLen of
          EQ -> pure ix
          _ ->
            let !lastHaystackByte :: Word8 = indexByteArray haystack (ix + needleLen - 1)
             in go (ix + skipVector Vector.! fromIntegral lastHaystackByte)
    skipVector :: Vector Int
    !skipVector = runST $ do
      mv <- MVector.replicate 128 needleLen -- only ASCII needed: indices 0 through 127
      for_ [0, 1 .. needleLen - 2] $ \i -> do
        let needleByteAsIx = fromIntegral @Word8 . indexByteArray needle $ i
        MVector.write mv needleByteAsIx (needleLen - 1 - i)
      Vector.unsafeFreeze mv
We use a mutable Vector of Ints for our skip table[3]. While this representation has a lot of duplicate entries, in practice, being able to index quickly using the mismatched byte itself makes up for this. Additionally, given that we only need entries corresponding to ASCII, these 'wasted' entries aren't really a problem. The use of Ints for skip distances is probably excessive here, as we are unlikely to see needles long enough to need the representable positive range of Int; however, it is convenient for us to work with, and doesn't add that much to the size of the skip table[4]. We fill the skip table in ST, as our construction depends on looping through the needle, which means doing it mutably is much more efficient than the equivalent work over an immutable Vector.
One important detail to single out regarding the skip table is that we disregard the last byte of the needle. This is because a direct extension of the way we generate the skip table would produce the wrong answer: at needleLen - 1, the skip distance would be 0, which would produce an infinite loop. Furthermore, regardless of what the last byte of the needle is, we already know what its skip distance should be without even having to examine it:
If the last byte of the needle does not occur anywhere else in the needle, we can skip a needle's worth of bytes, as this could not align with any other part of the needle.
If the last byte of the needle does occur somewhere else in the needle, we will already have calculated its entry correctly from previous steps.
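Using the bmhSkips sketch from earlier, the needle "abab" illustrates both points: bmhSkips "abab" gives fromList [('a',1),('b',2)], so the final byte 'b' already has a correct entry (2) from its earlier occurrence at index 1, while extending the loop to index needleLen - 1 would have overwritten that entry with 0 and caused an infinite loop.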
Beyond the skip table, our implementation is very similar to matchRestarting from before, with the main difference being that we do not use findFirst, either at the start or when we detect a mismatch. Instead, we always begin at position 0, and when we detect a mismatch at position i, we check the byte at i + needleLen - 1 in our skip table, and advance accordingly.
The decision not to use findFirst may seem surprising here. In theory, using findFirst could improve performance, as it would potentially allow larger skips. However, we would end up in the same situation as we did when we eliminated findFirst from the brute force approach: the overheads of the use of findFirst would drown out its benefits. Furthermore, we can see that the BMH algorithm already inherently has the ability to skip sections as large as the size of the needle. Thus, it's not as important here as it was for matchRestarting.
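As a quick sanity check (reusing the bytes helper from our earlier GHCi session; the example strings are again ours), matchBMH should agree with matchRestarting:

>>> matchBMH (bytes "cats") (bytes "concatenation of cats")
Just 17
>>> matchRestarting (bytes "cats") (bytes "concatenation of cats")
Just 17
>>> matchBMH (bytes "cats") (bytes "concatenation")
Nothing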
Before we run any benchmarks, let us consider whether this approach has any algorithmic complexity advantage. To properly state this complexity, we must now consider σ, which is the size of the alphabet used to construct the inputs, as well as the needle length k and haystack length n. Typically for such algorithms, σ is taken to be 256, but in our case it's actually 128, as we only ever deal with ASCII.
Unfortunately for us, there is no improvement to be gained in the worst case. The worst-case time complexity of BMH is now Θ(nk+σ), while the space complexity is now Θ(σ)[4]. We can't gain an algorithmic improvement in time complexity as, while the largest possible 'skip' we can get on a mismatch is now k bytes, the smallest is still only 1 byte. This means a sufficiently pathological input could still give us quadratic performance in the same way as the brute force algorithm. The extra σ in both time and space complexity stems from the skip table, whose length will be proportional to the alphabet size in the worst case.
However, the average case complexity is provably better for BMH[2]: the expected number of comparisons for each byte in the input lies in the range [1/σ, 2/(σ+1)]. Thus, the average case complexity should be Θ(n+k+σ²+kσ). This analysis is a good example of parameterized complexity, which aims to more sharply define worst-case behaviour against a parameter: some measurable property of the input that is not its length. While typically used in complexity theory for problems that (probably) lie outside of P, it can be just as useful in cases like the one we are examining. Specifically, we can see that the real 'quadraticness' of the problem stems from a combination of alphabet size and needle length.
How does this analysis reflect the actual performance, though?
All
Worst-case, naive: OK
1.77 ms ± 9.6 μs, 101 B allocated, 2 B copied, 21 MB peak memory
Worst-case, BMH: OK
1.74 ms ± 33 μs, 1.2 KB allocated, 10 B copied, 21 MB peak memory
Best-case, naive: OK
219 μs ± 3.0 μs, 79 B allocated, 0 B copied, 25 MB peak memory
Best-case, BMH: OK
24.5 μs ± 432 ns, 1.1 KB allocated, 0 B copied, 25 MB peak memory
Average, naive: OK
334 μs ± 6.6 μs, 87 B allocated, 1 B copied, 26 MB peak memory
Average, BMH: OK
206 μs ± 2.5 μs, 1.1 KB allocated, 0 B copied, 26 MB peak memory
We can see that in the 'worst case', BMH is no better than the naive approach. Furthermore, our allocations are larger for BMH due to the cost of the skip table. This matches our analysis well.
At this stage, it is worth discussing that, while pathological cases exist for both BMH and the brute force approach, including our 'worst-case'[5], the number of pathological cases for BMH is significantly smaller. To see why, let us consider what kind of situation would trigger frequent repeated re-checks of the same byte in both algorithms. For the brute force approach, any haystack with frequent occurrences of the first symbol of the needle, but few or no actual matches, would start to become quadratic. For BMH, however, we would need not only frequent occurrence of a specific byte from the needle, but also a homogeneous, or nearly homogeneous, needle to defeat 'skipping'. While clearly, any pathological BMH case is also a pathological brute force case, the converse is provably not true. Thus, the probability that we end up with worst-case performance from BMH is much lower than for brute force, even in absolute terms.
Furthermore, for other cases, we see the benefits of BMH on display. The 'best case' is almost 10 times faster using BMH as compared to brute force, and the average case is about 40% faster. This not only tracks with our analysis above, but also acts as good evidence to confirm our assumptions regarding the use of findFirst (or rather, our choice not to use findFirst). The length of the needle helps us here, as we potentially skip as many as 64 bytes at a time without having to pay any costs of 'calling into' a different routine.
This is certainly an improvement, but can we do even better?
Third attempt: backward nondeterministic DAWG matching
Once again, before we try to improve on our earlier attempt, we need to consider where BMH is inadequate. In order to state this problem precisely, we will need some more definitions. We say that s is a prefix of t if s is a factor of t whose occurrence starts at index 0 (that is, s[j]=t[j] for all j∈[0,k)); analogously, s is a suffix of t if s is a factor of t whose occurrence ends at the final index of t (that is, s[j]=t[l−k+j] for all j∈[0,k)). Continuing our examples from the problem description, "cat" is a prefix of "catamaran", but not of "non-categorical", whereas "rat" is a suffix of "socorrat", but not of "crate". We also note that, given a sequence s = s[0], s[1], …, s[k−1] and its reversal sʳ = s[k−1], s[k−2], …, s[0], any prefix of s is a suffix of sʳ, and vice versa.
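As an aside, the reversal observation is simple enough to state as an executable property over plain lists (a QuickCheck-style sketch; the property name is ours):

import Data.List (isPrefixOf, isSuffixOf)

prop_prefixReversal :: [Word8] -> [Word8] -> Bool
prop_prefixReversal s t =
  (s `isPrefixOf` t) == (reverse s `isSuffixOf` reverse t)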
We can see from both the descriptions, and implementations, of the brute force approach and BMH, that they work in a similar way:
Start at the first possible position in the haystack that could contain a needle match (a 'candidate position').
Check increasingly longer prefixes of the needle at the candidate position.
If the longest prefix of the needle exists starting at the candidate position, report the candidate position and stop.
Otherwise, calculate a new candidate position; if you can't, stop and report failure.
Go to step 2 with the new candidate position.
In particular, step 2 involves starting with the shortest non-empty prefix of the needle (that is, the prefix whose length is 1), and repeatedly trying to extend it, until we either fail to match, or match the prefix of maximum length. This does not differ between the brute force and BMH algorithms or implementations: instead, they vary in how new candidate positions are calculated.
This prefix-extension approach certainly works, but unfortunately, it is limited both in terms of how quickly it can detect failures, and how much information it can supply to step 4 of the above description. To see why this is the case, consider the situation where we match a prefix of length 2, but fail to match its length 3 extension. The matched prefix is definitely a factor of the needle, but it need not occur only at the start of the needle; thus, we have to be careful that our next candidate index doesn't accidentally exclude this possibility. However, to do this optimally would require solving a different problem: we would need some way of finding, for whatever factor of the needle we have just matched, its last occurrence within the needle. While this problem is definitely solvable, doing so efficiently is another challenge altogether. Neither the brute force approach nor BMH attempts this task, instead choosing a conservative estimate of the furthest-away candidate index, based on different measurements. We can see that BMH can give better answers than the brute force approach, but not necessarily in all cases.
Thus, we can see that to improve further, we need some way of solving the 'last factor' finding problem described above. This would allow us to optimize our 'skips'. Fortunately for us, solutions for doing this efficiently exist, but we will have to consider our problem from a different perspective, namely that of automata theory. As this is a deep topic with many results, we will have to take a focused, and thus somewhat cursory, look at it. We provide multiple sources (via links) for those who want to understand every detail.
The key tool that enables us to solve the 'last factor' problem fast is a directed acyclic word graph (DAWG), also known as a suffix automaton. This is a finite automaton which, given some string p, recognizes the language consisting of all suffixes of p. The chief benefit of the suffix automaton is that it is efficient and relatively easy to construct. In particular, it can be shown that for any string of length n, its suffix automaton will have O(n) states and transitions. When we combine this fact with the observation that we can use this construction to recognize prefixes as well (by reversing the string), we can get an algorithm that is similar to Boyer-Moore, called Reverse Factor. This is an approach that is quite similar to BMH, but uses a different formalism. We can see from its analysis that it performs similarly: notably, it does not solve the 'last factor' problem.
However, we can use this approach as a springboard to construct an implementation which does solve the 'last factor' problem, provided we can accept a small restriction. This algorithm, called BNDM (for Backward Nondeterministic DAWG Matching), makes use of several noteworthy features of both suffix automata and typical hardware, allowing us to both solve the 'last factor' problem and also run competitively with, or even outright beat, BMH.
To describe the combination of techniques used by BNDM, we first observe that, by convention, a suffix automaton is taken to be deterministic: that is, for every symbol in the alphabet, every state of a suffix automaton has a transition to another state. This has some beneficial properties: in particular, given any input, and any position in that input, a suffix automaton will always be in exactly one state. However, it is possible to view suffix automata as nondeterministic: instead of being in a single state at any given position in an input, they can be in multiple states simultaneously (or none whatsoever). While this complicates running the automaton, it makes defining it much more straightforward. Furthermore, if we accept nondeterminism, we can construct a variant of a suffix automaton for some string p that is also able to recognize factors of p. Moreover, for any string of length n, such an automaton has Θ(n) states and transitions, but is considerably smaller in practice than the equivalent deterministic automaton.
However, the downside is that when we run a nondeterministic automaton, we must track not just a single state, but the set of states that the automaton is in at any point. This is normally an untenable proposition: given a set S of states, any of the 2^∥S∥ subsets of S could be the current set of states at any point. Furthermore, during any transition, we would need to consider every state we are currently in to determine the set of states the automaton will be in next, which in general means Θ(∥S∥) cost per transition. Fortunately, Navarro and Raffinot showed that this is still practical, provided that we limit the length of the needle to the word size of the machine (typically 64 bits on modern architectures). They called their approach 'Backward Nondeterministic DAWG Matching', or BNDM.
The key insight used in BNDM is that, to simulate a nondeterministic automaton for detecting the factors of a needle that has length k, we need to track being in only at most k states at once. This means that, provided that k is not larger than the size of a single machine word, we can maintain the current state as a single machine word. Furthermore, because of the specific form of such an automaton, we can compute a transition using only a fixed number of bitwise operations, regardless of the current, or future, state. This means that each transition requires Θ(1) time. This combination allows rapid checking for the 'last factor' problem, while also needing minimal additional time or space.
We implement BNDM as below. This implementation is based on this 'Coding on the edges' article.
matchDAWG :: ByteArray -> ByteArray -> Maybe Int
matchDAWG needle haystack
  | needleLen == 0 = Just 0
  | needleLen > haystackLen = Nothing
  | needleLen > 64 = matchBMH needle haystack -- needles longer than a machine word fall back to BMH
  | otherwise = go 0
  where
    haystackLen :: Int
    !haystackLen = sizeofByteArray haystack
    needleLen :: Int
    !needleLen = sizeofByteArray needle
    limit :: Int
    !limit = haystackLen - needleLen
    go :: Int -> Maybe Int
    go !i
      | i > limit = Nothing
      | otherwise = do
          let !haystackIx = i + needleLen - 1
          let !maskIx = fromIntegral @Word8 . indexByteArray haystack $ haystackIx
          let !mask = maskVector Vector.! maskIx
          goInner (needleLen - 1) mask i
    goInner :: Int -> Word64 -> Int -> Maybe Int
    goInner !j !mask !i
      -- We missed
      | mask == 0 = go (i + j + 1)
      -- We found
      | j == 0 = pure i
      | otherwise = do
          let !maskVectorIndex = fromIntegral @Word8 . indexByteArray haystack $ i + j - 1
          let !lookedUp = maskVector Vector.! maskVectorIndex
          let !newMask = (mask `unsafeShiftL` 1) .&. lookedUp
          goInner (j - 1) newMask i
    maskVector :: Vector Word64
    !maskVector = runST $ do
      mv <- MVector.new 128 -- only ASCII needed
      for_ [0, 1 .. needleLen - 1] $ \i -> do
        let needleByteAsIx = fromIntegral @Word8 . indexByteArray needle $ i
        MVector.modify mv (\w64 -> setBit w64 (needleLen - 1 - i)) needleByteAsIx
      Vector.unsafeFreeze mv
We can see that, in many ways, our implementation of BNDM mirrors the one for BMH. We still need to pre-process the needle to build up a table: however, the table now corresponds to a collection of bit masks. This 'mask table' represents a mapping between ASCII bytes and sets of positions where such bytes occur in the needle. For example, in the word "catamaran", the ASCII byte 'a' would be paired with a bit mask that has 4 set bits, the ASCII byte 't' would be paired with a bit mask that has 1 set bit, while the ASCII byte 'C' would be paired with an empty bit mask. We set the bits 'in reverse' in these masks, as we have to match in reverse (as for Reverse Factor).
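To make the mask table concrete, here is a throwaway sketch of our own (keyed by Char for readability; it mirrors what maskVector computes, but is not the implementation above):

import Data.Bits (setBit, (.|.))
import qualified Data.Map.Strict as Map
import Data.Word (Word64)

bndmMasks :: String -> Map.Map Char Word64
bndmMasks needle =
  let k = length needle
   in Map.fromListWith (.|.) [(c, setBit 0 (k - 1 - i)) | (i, c) <- zip [0 ..] needle]

For "catamaran", Map.lookup 'a' (bndmMasks "catamaran") gives Just 170, i.e. 0b10101010: four set bits, one per 'a', with positions written 'in reverse' as described above; a byte such as 'C', which never occurs in the needle, has no entry, corresponding to an all-zero mask in the real table.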
The actual implementation also works similarly to BMH. The largest difference stems from how we match the needle: instead of comparing an entire block of the haystack, we simulate the nondeterministic automaton we described previously, starting at the end of the block, using a machine word as a 'state accumulator'. If our 'state accumulator' is ever all-zero, it suggests we have 'fallen off', as we have found a sequence that is not a factor of the needle. We can then use the last 'step' we took as an indicator of how much we can skip, as this last 'step' would indicate the last time we had a factor match. To keep the 'state accumulator' updated, we use a combination of a lookup in our table for the byte we just read from the haystack, a shift, and a bitwise AND.
As previously mentioned, we will first consider algorithmic performance. To make our analysis as clear as possible, we will need an additional parameter w, which is the size of a machine word in bits; alternatively, we can see w as the maximum length of the needle we are allowed to search for using the optimized method. Due to this, we have to be careful how we define 'worst case' complexity, as for needles longer than a machine word's bit length, we defer to the BMH algorithm, and thus would inherit its complexity. Thus, we will assume that the needle length is bounded by w, and thus use w instead of min{k,w} in the worst case. Furthermore, for consistency, we will assume that allocating a single entry in the mask table requires Θ(w) work, not Θ(1), as otherwise, w is being treated as a constant for space, but not time[6].
If we consider the worst-case time complexity, we can unfortunately see that, even with all of our improvements, we are still vulnerable to the same performance issue as both the brute force approach and BMH on sufficiently pathological inputs. Thus, our worst-case time complexity is Θ(w(σ+n+1)), while our space complexity is Θ(wσ). While this looks like we eliminated the k factor of needle length completely, in fact, it applies in the exact same way as with our other approaches. This suggests that needle length still affects complexity multiplicatively in the worst case, and that a pathological case would behave similarly. The average-case complexity is harder to calculate, but according to the authors of BNDM, it is effectively O(n·logσ(k)/k). This is difficult to compare to our existing average case for BMH, as the bounds are different; however, we can see that only the logarithm of k is a multiplicative factor for BNDM, unlike BMH where it is k itself; furthermore, we note that k is also a divisor here, which means that as needles become longer, our average performance actually improves.
But what happens in practice?
All
Worst-case, naive: OK
1.77 ms ± 9.6 μs, 101 B allocated, 2 B copied, 21 MB peak memory
Worst-case, BMH: OK
1.74 ms ± 33 μs, 1.2 KB allocated, 10 B copied, 21 MB peak memory
Worst-case, DAWG: OK
24.7 ms ± 397 μs, 1.5 KB allocated, 40 B copied, 21 MB peak memory
Best-case, naive: OK
219 μs ± 3.0 μs, 79 B allocated, 0 B copied, 25 MB peak memory
Best-case, BMH: OK
24.5 μs ± 432 ns, 1.1 KB allocated, 0 B copied, 25 MB peak memory
Best-case, DAWG: OK
12.6 μs ± 112 ns, 1.1 KB allocated, 0 B copied, 25 MB peak memory
Average, naive: OK
334 μs ± 6.6 μs, 87 B allocated, 1 B copied, 26 MB peak memory
Average, BMH: OK
206 μs ± 2.5 μs, 1.1 KB allocated, 0 B copied, 26 MB peak memory
Average, DAWG: OK
55.5 μs ± 836 ns, 1.1 KB allocated, 0 B copied, 26 MB peak memory
We can see here that the 'worst case' is far worse for the BNDM implementation as compared to the brute force approach or the BMH implementation: nearly 14x worse in fact! This stems largely from the fact that the BNDM implementation cannot benefit from the speed of compareByteArrays[7], as this routine would not provide enough information for the BNDM implementation. However, the 'best case' and 'average case' are significantly faster as compared to BMH: the 'best case' is about twice as fast, and the 'average case' is almost four times faster. We also see that there is no additional cost in allocations, as the mask table happens to need the same space as the skip table did for the BMH implementation.
These results seem to bear out the algorithmic analysis for BNDM, as it seems to be much better than BMH on 'typical' or 'easy' cases. While it would be good to see how much of its poor performance on the 'worst case' is due to the case's pathology, versus the limitations of the implementation, this is difficult to assess. It is worth noting that the improvements came at the expense of limiting the length of needle our BNDM implementation could work with. However, we do not have to give up generality, as our implementation shows: for needles over 64 bytes, we can defer to the BMH implementation. This means we can still get good performance even on long needles, and keep excellent performance on shorter ones, at little to no real cost.
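Since the three implementations should be observationally identical, one cheap way to build confidence in the hybrid is a QuickCheck-style property (a sketch only; the property name and the ASCII clamping are ours, and byteArrayFromList comes from Data.Primitive.ByteArray):

prop_implementationsAgree :: [Word8] -> [Word8] -> Bool
prop_implementationsAgree rawNeedle rawHaystack =
  let toAscii = map (`mod` 128) -- keep every generated byte within ASCII, as assumed throughout
      needle = byteArrayFromList (toAscii rawNeedle)
      haystack = byteArrayFromList (toAscii rawHaystack)
      expected = matchRestarting needle haystack
   in matchBMH needle haystack == expected && matchDAWG needle haystack == expected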
Conclusion
Our goal throughout this process was to illustrate how performance can be assessed, and improved upon. We showed this in multiple ways throughout:
Use of algorithmic analysis in worst and average-case, to show the 'weak points' of our performance we could theoretically see.
Construction of benchmarking cases to highlight 'weak points' in our practical performance.
Seeking out improvements based on what both of these measurements told us.
Seeing if those improvements help us in practice.
By going through three approaches, each of increasing complexity, with an eye to using each subsequent one to address the problems of the previous, we also demonstrated the practical implications of dealing with these measurement outcomes. We could see that ultimately, both theoretical analysis and practical measurements can guide us right, but can also guide us wrong, and in some cases, we needed to make adjustments for issues of our implementation language.
The string matching problem serves as an interesting case study, as it is both a useful problem to solve as efficiently as possible, while also being complex enough to make analysis of its performance less than straightforward. By showing the practical implications of three different algorithms and their implementations on performance, we show that in nearly all cases, tradeoffs exist, and even noticing the impact of these tradeoffs can be challenging. These tradeoffs can often be 'hidden' by both theoretical analysis and practical measurement, and we must take some care to have each inform the other. We also saw the problems of pathological cases in full force: just because such cases exist doesn't mean that they are common, or even relevant.
Overall, a hybrid approach between BMH and BNDM seems to be a good choice for 'typical' ASCII text inputs. While more optimizations are likely possible, these are more likely to come from use cases narrower than just 'ASCII text', or would require capabilities GHC Haskell doesn't (currently) provide us with. Nevertheless, there is significant performance to be found if we know how to look for it, and GHC Haskell still gives us enough tools to improve our performance, if not always in the most obvious way. If you like what you’re reading and this sounds relevant to your business, please don’t hesitate to get in touch.
[1] compareByteArrays calls down to memcmp via the runtime, which is heavily optimized at the level of the C stdlib. Additionally, because this is part of the runtime, it takes no FFI penalty. We could hardly improve on that.
[2] At least, under the same assumptions as for the brute force approach, however flawed.
[3] Primitive, or unboxed, Vectors are the best choice here, as they're more compact in memory.
[4] This assumes that the length of any needle is bounded by the maximum positive number representable using a machine word. This isn't really much of an assumption, and tends to be made silently in descriptions of similar algorithms.
[5] It's almost like we did this on purpose.
[6] This is notably different from how we treated machine words when analyzing BMH. However, the assumption we needed for BMH was much weaker, and this assumption tends to get made silently by algorithm analysis anyway.
[7] Technically speaking, it should be possible to write a comparison-like operation similar to compareByteArrays that serves the same role as the inner loop of the BNDM implementation. The inner loop is essentially a prefix scan, which is known to be data parallel. However, implementing this efficiently requires either actual SIMD hardware with wide registers or a significant reduction in permissible needle sizes. This means that implementing this in Haskell is unlikely to help much.