Fast Findings with SWAR and the FFI: The Good, the Bad, and the Inefficient

Introduction

In a previous article, we discussed the use of SIMD-within-a-register (SWAR) techniques to speed up ASCII validation. We showed considerable performance improvements, without ever having to leave Haskell, but also demonstrated the technique's limitations. In a different article, we worked on the problem of string matching in ASCII; as part of that, we needed an auxiliary function to find the first match to a specific byte. While we found that in some cases, the use of this auxiliary function helped with performance, we also noticed quickly that in other cases, performance actually became worse.

These observations and articles naturally led us to wonder:

  • What is the cause of this surprising loss of performance under some conditions?

  • Could we do better with the FFI than with SWAR?

  • How do the answers to the two previous questions affect each other?

With that in mind, we decided to investigate. Starting from the problem of finding the first byte match in an ASCII sequence, we will compare a naive implementation to a SWAR implementation to show what kind of speedup we could theoretically get. In the process, we will show a more complex example of SWAR methods. We will then compare to an implementation via the FFI, as we observed previously that what we want is just memchr, while discussing how to properly set up and use the FFI. We will then investigate whether the performance losses we noticed stemmed from call overhead, and see how bad such overheads are with each of the naive, SWAR and FFI implementations. Lastly, we will consider how these performance losses can, and cannot, be addressed.

We recommend reading both of the articles that inspired this write-up before reading this one. While the material in this article stands alone, a lot of what we use, and why, is better explained in the previous articles. Familiarity with these articles will help contextualize both what we are doing and why. To make them easier to find, the links are below:

The problem

For clarity, we will reiterate the specific functionality from our string matching article that we will work with. However, to make the problem more interesting, we will also allow callers to specify a substring of the original input to search within, rather than just a suffix.

More precisely, given the following:

  • An ASCII ByteArray (the 'haystack'),

  • A byte (the 'needle'),

  • A starting index in the haystack, and

  • A span in bytes

We need to find the first (lowest-index) occurrence of the needle in the haystack, within the index range from the starting index to the starting index plus the span. If no such occurrence exists, we need to indicate this.

Throughout, we will assume that the haystack contains only ASCII bytes. We feel this is safe to do, as the context of the original problem was over ASCII, and thus, anything passed to such a function would have been checked for validity previously. However, we will not assume this for the needle, which could be any byte; instead, we will check it explicitly. We could potentially skip this check too, but as it is cheap, and Haskell lacks a dedicated ASCII byte type, we feel it's not worth it.

Our previous attempt

We'll begin by adapting the findFirst implementation we used in our string matching article:

findFirstNaive :: ByteArray -> Int -> Int -> Word8 -> Maybe Int
findFirstNaive haystack start span needle
  | needle >= 128 = Nothing
  | start < 0 = Nothing
  | start >= len = Nothing
  | span < 1 = Nothing
  | otherwise = case go start of
      (-1) -> Nothing
      i -> Just i
  where
    len :: Int
    !len = sizeofByteArray haystack
    limit :: Int
    !limit = min (start + span) len
    go :: Int -> Int
    go !ix
      | ix == limit = -1
      | indexByteArray haystack ix == needle = ix
      | otherwise = go (ix + 1)

The only changes here are replacing uses of guard with explicit pattern guards returning Nothing[1], as well as some additional work to allow specifying a starting point and what span from that point to search in. Otherwise, this amounts to a loop over indexes, checking for the first match to the needle.
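To illustrate the intended behaviour, here is a hypothetical GHCi session (ours, not from the original article), using fromList via ByteArray's IsList instance to build a small haystack:

ghci> import GHC.Exts (fromList)
ghci> let ba = fromList [0x68, 0x61, 0x73, 0x6B] :: ByteArray  -- "hask"
ghci> findFirstNaive ba 0 4 0x61  -- 'a' sits at index 1
Just 1
ghci> findFirstNaive ba 2 2 0x61  -- the range [2, 4) contains no 'a'
Nothing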

To see how this performs, let's set up a basic benchmark. To design this benchmark, we first note that the most work will be done when:

  • start is 0;

  • span is the length of the haystack; and

  • The needle is not in the haystack.

We will use a 2MiB ByteArray, entirely full of zero bytes, and try to find a nonzero byte (in our case, 0x01), with start and span as above. Initial results are as follows:

All
  Naive: OK
    880  μs ±  17 μs,  74 B  allocated,   4 B  copied,  53 MB peak memory

Since we've processed 2MiB (2,097,152 bytes) of data, this gives us an effective worst-case processing rate of about 2.2 GiB per second. This is certainly not terrible, and it proved useful enough for us when implementing naive string matching. We will use this as a baseline to see how much we can improve.

Better with SWAR?

findFirstNaive suffers from performance limitations similar to those we observed in our previous SWAR article: we process only one byte at a time, thus making little use of temporal or spatial locality. Since our register size alone gives us an eight-byte 'window', we should be able to speed up our search using techniques similar to those we used for detecting non-ASCII bytes. However, as with all SWAR, we have to reformulate how we process these eight-byte blocks to get the answer we want.

Our first goal is to find some way of taking an unknown block of eight bytes and creating a new block such that any byte in that block that matches our needle takes one known form, while any other byte takes a different, but still known, form. To help visualize this process, we will use some ASCII diagrams: to keep it manageable, we will use four-byte blocks here.

Suppose that our needle is the byte 0110_0001 (the ASCII for 'a'), and we load a block where the second byte matches, but none of the others do. We will label bits of unknown state using x.

| i + 3     | i + 2     | i + 1     | i         |
|-----------|-----------|-----------|-----------|
| 0xxx_xxxx | 0xxx_xxxx | 0xxx_xxxx | 0xxx_xxxx |

We have our indexes (relative to our loop-tracked position i) along the top of this diagram, with the actual bit patterns along the bottom. The ordering here appears backwards from what we expect, as higher indices occur 'to the left' rather than 'to the right'. This is due to how bytes are ordered within a word (endianness), specifically when we use indexByteArray (or indexUnalignedByteArray). While this is technically platform-specific, in our case, earlier bytes (that is, those with lower indexes) end up in the less significant positions of the block[6].

We can see that at this stage, all but the high bits are in an unknown state. The reason we can be sure that the high bit is 0 is that our haystack has only ASCII bytes. This will become critical later.

Our first step involves an XOR operation with a block whose every byte is a copy of our needle. Since XOR zeroes any bit in the input block that matches the corresponding bit of the needle, we end up with a situation like this:

|       | i + 3     | i + 2     | i + 1     | i         |
|-------|-----------|-----------|-----------|-----------|
|       | 0xxx_xxxx | 0xxx_xxxx | 0xxx_xxxx | 0xxx_xxxx |
| XOR   | 0110_0001 | 0110_0001 | 0110_0001 | 0110_0001 |
| gives | 0xxx_xxxx | 0xxx_xxxx | 0000_0000 | 0xxx_xxxx | 

Since we stated that the second byte (at i + 1) matches our needle, we can see that it is now zeroed. This fulfils our first goal: any matching bytes now have a known form. However, non-matching bytes are still uncertain, but this step does give us a bit more information than before. As the haystack (and thus, the original block) contains only ASCII bytes, the high bit of every byte remained clear after the XOR step. Additionally, since only an exact match to the needle would be zeroed, we know that any byte that didn't match must be nonzero. Putting these observations together, we now also know that non-matching bytes are in the range [1, 127]. While not a large increase in what we know about the block, it's enough for our next step.
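As an aside, the 'needle in every lane' block never has to be built byte by byte: multiplying the needle by a block with 0x01 in every lane broadcasts it into all eight lanes, which is exactly what our full implementation below does. A minimal sketch, assuming Data.Word is in scope and NumericUnderscores is enabled:

-- Broadcast a byte into every lane of a Word64.
-- broadcast 0x61 == 0x61_61_61_61_61_61_61_61
broadcast :: Word8 -> Word64
broadcast w8 = fromIntegral w8 * 0x01_01_01_01_01_01_01_01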

Our next step is the critical part of our SWAR implementation: we subtract a block with 0x01 in every lane. Why this works, or even what it does, is not obvious, so before explaining it fully, we will go through a small example on just two bytes. Suppose after an XOR step, we have this:

| 0100_1110 | 0000_0000 |

The block we will subtract will look like this:

| 0000_0001 | 0000_0001 |

As subtraction is just addition of the two's complement[5], we actually have the following operation:

|       | 0100_1110 | 0000_0000 |
|-------|-----------|-----------|
| +     | 1111_1110 | 1111_1111 |
| gives | 0100_1100 | 1111_1111 |

We observe that the result has a 1 in its high bit only for the byte that was 0 initially. This persists even if we swap the bytes:

|       | 0000_0000 | 0100_1110 |
|-------|-----------|-----------|
| +     | 1111_1110 | 1111_1111 |
| gives | 1111_1111 | 0100_1101 |

We can extend this to any number of bytes in a block and still see the same outcome. To see why this is, let's consider our cases:

  1. The least significant byte of the block (the byte at index i), where the byte is 0x00.

  2. The least significant byte of the block, where the byte is in [1, 127].

  3. Any other byte, where the byte is 0x00.

  4. Any other byte, where the byte is in [1, 127].

Case 1 is the most straightforward: we're essentially doing 0000_0000 plus 1111_1111, which gives 1111_1111, with no carry. Thus, in that situation, we will have a high bit of 1 in the result.

Case 2 is less straightforward. First, we observe that as the byte is in the range [1, 127], we have a 1 bit somewhere in the byte (except at the highest position). As we are adding 1111_1111, this must produce a carry bit. Given that the high bit in the byte is always 0, we end up having the addition 0 + 1 with a carry bit, which must give 0, while 'sending' the carry into the next byte. This will happen regardless of how many 1 bits the byte contains, as adding a 1 bit will never discard a carry, whether we're adding to 1 or 0. Thus, case 2 will have a high bit of 0.

Before we discuss cases 3 and 4, we note that, as we are looking for the first match, we only consider these cases relevant if we do not have a zero byte in a less significant position in the block. Thus, for both cases 3 and 4, we assume that every byte we looked at previously must be nonzero, and thus, we have a carry; all other situations don't matter, as we will have enough information to properly detect our byte.

For case 3, we are doing the equivalent of 0000_0000 plus 1111_1110, but with a carry. Thus, this is the same as case 1, and thus will also give us a high bit of 1. For case 4, as with case 2, we will have a 1 bit somewhere in the byte (except at the highest position). Even though we are adding 1111_1110, because of the carry from the previous byte, this is effectively the same as adding 1111_1111, which gives an identical result to case 2. Thus, the high bit of such a byte will remain 0.

Applying this to our example, we now have this:

|       | i + 3      | i + 2      | i + 1     | i         |
|-------|------------|------------|-----------|-----------|
|       | 0xxx_xxxx  | 0xxx_xxxx  | 0000_0000 | 0xxx_xxxx |
| +     | 1111_1110  | 1111_1110  | 1111_1110 | 1111_1111 |
| gives | irrelevant | irrelevant | 1111_1111 | 0xxx_xxxx | 

We are now close to having a fixed form for both matching and non-matching bytes, or at least, ones relevant to detection. However, we still have a problem: a relevant byte that doesn't match, but lacks a known form. To fix this, we will use an AND with 1000_0000 in every lane:

|       | i + 3       | i + 2      | i + 1     | i         |
|-------|-------------|------------|-----------|-----------|
|       | irrelevant  | irrelevant | 1111_1111 | 0xxx_xxxx |
| AND   | 1000_0000   | 1000_0000  | 1000_0000 | 1000_0000 |
| gives | x000_0000   | x000_0000  | 1000_0000 | 0000_0000 | 

Having done all this, we can see that the form of (relevant) non-matches is now always the zero byte, while the form of (relevant) matches is now 1000_0000. We can now employ a similar technique to the one we used when checking for non-ASCII bytes:

  1. Check if the result is all zero bytes; if not, we can skip it completely.

  2. If any nonzero bytes exist, count trailing zeroes, then (integer) divide by 8. This is the number of bytes offset from i where our first match is.

However, at first glance from the example, it isn't clear why the first step is always safe. Indeed, at i + 3 and i + 2, we see that the high bit is uncertain, which could potentially trip up our check. However, we in fact know that the high bit of any such byte can only be 1 in one of two circumstances:

  • It is a match itself; or

  • There was a match in a byte of lower significance (thus, at a smaller offset from i).

Therefore, there is no possibility of such bytes causing false positives. Furthermore, as we use a trailing zero count in step 2, any bytes after the first match will be completely ignored. This makes the process completely safe[7].
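To make this concrete, here is the whole pipeline run on a single block in GHCi (a purely illustrative session of ours). The block holds the bytes of "haskell!", stored in the 'backwards' order described earlier, so the byte at index i is least significant; the needle is 'a' (0x61), sitting at offset 1:

ghci> import Data.Bits (xor, (.&.), countTrailingZeros)
ghci> let block = 0x21_6C_6C_65_6B_73_61_68 :: Word64  -- "haskell!"
ghci> let xored = block `xor` 0x61_61_61_61_61_61_61_61
ghci> let result = (xored - 0x01_01_01_01_01_01_01_01) .&. 0x80_80_80_80_80_80_80_80
ghci> countTrailingZeros result `quot` 8
1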

Putting it all together, we get the following:

findFirstSWAR :: ByteArray -> Int -> Int -> Word8 -> Maybe Int
findFirstSWAR ba start span w8
  | w8 >= 128 = Nothing
  | start < 0 = Nothing
  | start >= len = Nothing
  | span < 1 = Nothing
  | otherwise =
      let !toGo = limit - start
          !bigSteps = toGo `quot` 8
       in case goBig start bigSteps of
            (-1) -> Nothing
            i -> Just i
  where
    len :: Int
    !len = sizeofByteArray ba
    limit :: Int
    !limit = min (start + span) len
    goBig :: Int -> Int -> Int
    goBig !ix !bigStepsLeft
      | bigStepsLeft == 0 = goSmall ix
      | otherwise =
          let !chunk = indexUnalignedByteArray ba ix
              !masked = chunk `xor` mask
              !subtracted = masked - ones
              !result = subtracted .&. highBits
           in if result == 0
                then goBig (ix + 8) (bigStepsLeft - 1)
                else ix + (countTrailingZeros result `quot` 8)
    goSmall :: Int -> Int
    goSmall !ix
      | ix == limit = -1
      | indexByteArray ba ix == w8 = ix
      | otherwise = goSmall (ix + 1)
    mask :: Word64
    !mask = fromIntegral w8 * ones
    ones :: Word64
    !ones = 0x01_01_01_01_01_01_01_01
    highBits :: Word64
    !highBits = 0x80_80_80_80_80_80_80_80

Many of the organizational techniques used here (such as loop sectioning) will look familiar from our SWAR ASCII verifier. A major difference here is that indexUnalignedByteArray may actually be doing unaligned reads, as there is no guarantee that start aligns to our block size. While this isn't a real issue for us, it can potentially lead to worse performance, or even outright bugs, on some platforms[2].

How much better is this solution? In theory, we should get an eightfold speedup, but the reality is not quite there:

All
  Naive: OK
    880  μs ±  17 μs,  74 B  allocated,   4 B  copied,  53 MB peak memory
  SWAR:  OK
    123  μs ± 2.3 μs,  23 B  allocated,   0 B  copied,  53 MB peak memory

We see that SWAR gives about a sevenfold speedup, which is similar to what we saw for ASCII validation. We also notice that allocations are lower by about two-thirds (though it hardly makes a difference). This now gives us a worst-case processing rate of about 15.9 GiB per second. This is definitely a worthwhile improvement.

How much better can we do with the FFI?

The big guns: FFI

In our article on string matching, we observed that findFirst (and indeed, our whole problem for this article) is essentially the memchr function from the C standard library. While we avoided the use of the FFI at that stage, we would like to try it here to see just how much of a difference it will make. However, if we look at the (C) type signature of memchr, we can immediately see a problem:

// slightly simplified presentation
void* memchr (const void* s, int c, size_t n);

For those less familiar with C, memchr takes three arguments:

  • A pointer to something of unspecified type (like Haskell Any), promising not to modify that something;

  • An int needle, which confusingly is treated as an unsigned char; and

  • A span to search in, specified using the size_t type.

If memchr finds a match by comparing c to all memory starting at s up to, but not including, s + n, it returns a pointer to that match, or NULL otherwise. We will need some additional logic to sensibly use memchr, for reasons we are about to discuss.

Whenever we want to use the FFI, we essentially follow the same process. Some of these steps can be skipped in some cases; for example, we may not need C-side wrappers if the function already suits our use case.

  1. Provide a C header file with prototype(s) for our functionality. Place this header into our project as a separate file; typically, this would go into a subfolder called cbits.

  2. Define the functionality we want to call in C, either directly or as a wrapper. Place this C file in the same directory as the header from Step 1.

  3. Specify to Cabal that we are using functionality in C, as well as any flags we want to pass to the C compiler when compiling it.

  4. For each function we want to call from Step 1, add a foreign import declaration in our Haskell code, specifying a Haskell type signature for that function. This signature should faithfully reflect the C type given to the function in Steps 1 and 2.

  5. Write a 'Haskelly' wrapper which calls the foreign imports, performing any wrapping and unwrapping needed along the way.

Before we can launch into this process, however, we must take some time to discuss how we can supply C functions with Haskell values. In particular, we need to discuss non-pointer types (which are relatively straightforward) and pointer types (which are much less straightforward), as in our case, we need to use both.

In Foreign.C.Types, base provides us with a collection of (non-pointer) types corresponding to C types of similar names. These have a (mostly) uniform naming scheme; some examples follow.

  • int becomes CInt

  • size_t becomes CSize

  • ptrdiff_t becomes CPtrdiff

Foreign.C.Types provides equivalents for many, but not all, types required by the C99 standard, as well as some types from the POSIX standard. A noteworthy exception is the family of fixed-width types, such as uint8_t[4]. These are taken to be equivalent to corresponding Haskell fixed-width types, with X standing for their bit size:

  • uintX_t becomes WordX

  • intX_t becomes IntX

The situation with pointer types is more complicated. While a full discussion of this topic is beyond the scope of this article, for our purposes, we are only interested in how ByteArrays interact with pointer arguments. This will form the focus of what we are about to discuss. For a fuller treatment of pointers, we recommend the appropriate documentation.

ByteArrays are secretly newtypes over a primitive type ByteArray#, which represents a packed, counted array of bytes. While ByteArrays cannot be passed to the FFI directly, ByteArray# can be, provided we enable the UnliftedFFITypes extension. Furthermore, to actually use types like ByteArray# in a type signature, we must enable the MagicHash extension, which allows us to use # in names.

To explain more clearly how a ByteArray# is treated by the FFI, we have to take a small detour to talk about memory pinning. The GHC runtime treats most memory as unpinned, meaning that it is free to move it to avoid memory fragmentation. However, we can have pinned memory, which the runtime promises never to move. This is significant when we are dealing with pointer arguments in the FFI, as references to unpinned memory aren't guaranteed to stay the same over time. Therefore, if we are passing values to the FFI as pointers, we need to ensure they refer to pinned memory only, or we're in for a nasty surprise if the garbage collector decides to move that memory. This also applies to returning pointers from the FFI: if we return a pointer into an unpinned region of memory, there's no guarantee that it will point to anything sensible.

Most allocations by the GHC runtime are unpinned, as this gives the runtime maximum freedom to avoid memory fragmentation. ByteArrays (and ByteArray#s) are special in that they can be allocated as either pinned or unpinned. However, there are two snags:

  • Most of the default ByteArray functionality, including its IsList instance, allocates unpinned; and

  • Once a ByteArray is allocated, we cannot change its 'pinned status'.

This seems to pose a huge problem for us, as we cannot guarantee that the ByteArray being passed to us is pinned, and if it isn't, we can't do anything about it. This seems to doom us to failure no matter what: whether we use memchr directly or through a wrapper, we have to use pointers somehow. Luckily for us, the UnliftedFFITypes extension comes to our rescue, as it allows us to take a pointer to a ByteArray#, whether it is pinned or not, without risk. Essentially, the runtime promises not to move the memory of the ByteArray# until the FFI call finishes, effectively simulating it being pinned temporarily. This allows us to use a ByteArray# for a pointer argument in the FFI.
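For completeness: if we did need pinned memory (say, because a C function held onto a pointer after the call returned), the primitive package lets us allocate it explicitly. A minimal sketch, with names of our own choosing:

import Control.Monad.ST (runST)
import Data.Primitive.ByteArray
  (ByteArray, isByteArrayPinned, newPinnedByteArray,
   setByteArray, unsafeFreezeByteArray)
import Data.Word (Word8)

-- Allocate a pinned ByteArray of the given size, filled with zero bytes.
-- The runtime promises never to move this memory.
pinnedZeroes :: Int -> ByteArray
pinnedZeroes size = runST $ do
  mba <- newPinnedByteArray size
  setByteArray mba 0 size (0 :: Word8)
  unsafeFreezeByteArray mba

-- isByteArrayPinned (pinnedZeroes 16) evaluates to True.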

We will now follow the five-step process we defined previously. First, we will define a C header file with the type signature of our function:

#ifndef CBITS_H
#define CBITS_H
#include <stddef.h>

ptrdiff_t find_first(void*, unsigned char, size_t, size_t);

#endif /* CBITS_H */

We use void* as the pointer argument representing a ByteArray#, which effectively means 'a pointer to anything'. We could have been more specific and stated that it is an unsigned char * or even a uint8_t*, possibly with a const qualification as well, but in our situation, this isn't really important, as all we will be doing is calling memchr. We use an unsigned char argument for the needle, and two size_t arguments to represent the starting point (as an offset from the pointer) and the span. We use size_t for both of these, as that is the correct type to use for offsets.

The return type of find_first is worth discussing further. As memchr returns a pointer to the match (or NULL if there isn't one), we cannot use its result directly. Even though the runtime guarantees that the memory of the ByteArray# will be pinned for the duration of the FFI call, it makes no such promises after the call finishes! Thus, we instead calculate an offset from the start of the ByteArray# in the FFI call, and return that, hence the ptrdiff_t type. As the offset will be in bytes, it is guaranteed to stay correct no matter where the memory is moved to, and is, in fact, the (byte) index into the ByteArray# that contains our first match.

Having defined this header, we will place it into a subdirectory cbits in our project, under the name cbits.h. We then implement find_first in a file find_first.c in the same directory:

#include "cbits.h"
#include <stdlib.h>
#include <string.h>

ptrdiff_t find_first(void *haystack, unsigned char needle, size_t start, size_t span) {
  void *start_ptr = haystack + start;
  void *result_ptr = memchr(start_ptr, needle, span);
  if (result_ptr == NULL) {
    return -1;
  } else {
    return (result_ptr - haystack);
  }
}

We skip doing any checks of our arguments, as we can (and should) do these on the Haskell side. find_first proceeds roughly as follows:

  1. Calculate the pointer argument to memchr by using start as an offset from the pointer to the start of our region of interest.

  2. Call memchr using the pointer we calculated in Step 1, searching for needle over an extent of span bytes.

  3. If Step 2 gives NULL, return an impossible index; otherwise, subtract the haystack pointer from the result pointer to get a number of bytes (and thus, an index into the whole array) and return that.

Next, we need to notify our Cabal file about our new FFI dependencies. The fields below need to go into whichever stanza contains the Haskell modules that will use the FFI definitions:

c-sources: cbits/find_first.c
include-dirs: cbits
cc-options:
    -std=gnu11
    -O2
    -Wall
    -Wextra

The c-sources field must list all .c files that define anything we want to call via the FFI. The include-dirs field should list the directory containing our header files; if you need several directories, list them all here. cc-options is the trickiest part, as this consists of the flags to be passed to the C compiler that GHC must know about. What exact compiler this is will vary based on the platform:

  • Windows bundles a copy of clang, whose version will depend on the GHC version being used.

  • macOS will use the system compiler, which is Clang. The version will be determined by the OS.

  • Linux, by default, will use the system C compiler, which is almost always GCC, but can be different on some distros.

Thus, the flags provided in cc-options should take this into consideration. A good set of flags that will work in general are as follows:

  • -std=gnu11 specifies that you want to use C11, with GNU extensions. While a default for this does exist for every compiler, it depends heavily on the compiler and its version, so it's best to specify it explicitly. Other options include c11 if you don't want GNU extensions, c99 if you don't need C11 features, and gnu99 if you want GNU extensions.

  • -O2 works similarly to GHC. While it is possible to use -O3 on both GCC and Clang, it's not necessarily worth it, and can make code worse in some cases.

  • -Wall enables all standard warnings. Given how much latitude C gives you in writing things that make no sense, this is a really critical flag to set unless you know exactly what you are doing.

  • -Wextra enables additional warnings. These are also worth having, for the exact same reason as -Wall.

Depending on your code, you may need other flags, but this is a topic that is far out of scope for the current article. Having done all this, we have now made our project aware of the C code we want to FFI into. We next have to write a foreign import declaration in our Haskell code:

foreign import capi unsafe "cbits.h find_first"
  cFindFirst ::
    ByteArray# -> CUChar -> CSize -> CSize -> CPtrdiff

There are several important things going on here:

  • foreign import is a stock indicator that we are writing a signature for a function called via the FFI.

  • capi refers to the calling convention we are using. This is the calling convention to use unless you know exactly what you are doing, as it prevents several possible issues, while being more capable in general. Ensure you have the CApiFFI extension enabled.

  • unsafe specifies that our FFI function does not need to call back into Haskell. This allows less setup and thus a faster call.

  • "cbits.h find_first" specifies the identifier our signature will correspond to in C, as well as the header file we can find its definition in. The header file is a requirement of the capi calling convention, while the identifier specifies what exactly gets called.

We then have a Haskell name to refer to the FFI function: we start the name with c to indicate this. We can see that the signature for cFindFirst lines up with the signature given by the cbits.h header for find_first:

  • ByteArray# gets passed as the void* argument;

  • CUChar gets passed as the unsigned char argument;

  • CSize gets passed for the start and the span arguments; and

  • The result is given as CPtrdiff.

When defining a foreign import declaration, you may get errors from GHC stating that for types in Foreign.C.Types, their constructors must be imported to use them in the FFI. This is because technically, the types in Foreign.C.Types are newtypes over lower-level types of appropriate size to the platform, and unless the compiler can see those definitions, it won't know what to do.
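The fix is simply to import the constructors along with the types; for example, for the types used above:

import Foreign.C.Types (CPtrdiff (..), CSize (..), CUChar (..))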

Lastly, we need to write a wrapper which will call cFindFirst, but will handle the initial checks, as well as the various wrapping and unwrapping required to convert more typical Haskell types like Int into the Foreign.C.Types types like CSize:

findFirstFFI :: ByteArray -> Int -> Int -> Word8 -> Maybe Int
findFirstFFI ba@(ByteArray ba#) start span w8 = do
 guard (w8 < 128)
 guard (start >= 0)
 guard (start < len)
 guard (span > 0)
 let !limit = min (start + span) len
 case cFindFirst ba# (fromIntegral w8) (fromIntegral start) (fromIntegral (limit - start)) of
   (-1) -> Nothing
   i -> Just . fromIntegral $ i
 where
   len :: Int
   !len = sizeofByteArray ba

It is generally good practice to hide the details of the FFI from the public API of your functionality, as well as to do as many 'safety checks' as possible outside of the FFI. This has several advantages: it is clearer (less code in the FFI), more efficient (the FFI code can make more assumptions about its inputs) and easier to use (callers don't have to concern themselves with FFI types or even know if the FFI is in use). We follow this practice with our implementation, by doing all argument checks, as well as unwrapping into FFI types, then rewrapping into Maybe, in Haskell, leaving the FFI to do only the call to memchr.
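This layering also makes the wrapper easy to test against our pure implementations. For example, a QuickCheck property along the following lines (a sketch of ours, masking the haystack bytes to keep them ASCII) would give us some confidence that the FFI version agrees with the naive reference:

import Data.Bits ((.&.))
import GHC.Exts (fromList)
import Test.QuickCheck (Property, (===))

prop_ffiMatchesNaive :: [Word8] -> Int -> Int -> Word8 -> Property
prop_ffiMatchesNaive bytes start span w8 =
  let ba = fromList (map (.&. 0x7F) bytes) -- force an ASCII haystack
   in findFirstFFI ba start span w8 === findFirstNaive ba start span w8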

But how fast is it? In a word, extremely:

All
  Naive: OK
    880  μs ±  17 μs,  74 B  allocated,   4 B  copied,  53 MB peak memory
  SWAR:  OK
    123  μs ± 2.3 μs,  23 B  allocated,   0 B  copied,  53 MB peak memory
  FFI:   OK
    16.8 μs ± 176 ns,  16 B  allocated,   0 B  copied,  53 MB peak memory

As we make use of memchr, which is optimized using SIMD instructions in assembly at the C standard library level, we gain another sevenfold speedup or so on top of what SWAR can give us. This means that using the FFI for this task is over fifty times faster than what we could do with 'regular' Haskell. More precisely, our worst-case processing rate is about 116.3 GiB per second; this shows just how efficient using functions like this through the FFI can be. While we would be unlikely to see such improvements with hand-written C[3], this is still an improvement that would be impossible to match any other way in many cases.

The downside: repeated calls in a loop

Given such incredible outcomes, it seems like FFI use should be a no-brainer. After all, if we can get 50x speedups, we should take them! However, this benefit is not free: even if we set aside considerations such as knowledge of C and the complexities of the FFI in general, there is an overhead to FFI calls we must pay. This overhead isn't large, but if we expect to make many repeated calls to the FFI to do a small amount of work each time, we can easily dwarf whatever benefits we might see.

To demonstrate this, we will define a different kind of functionality: given a needle byte, an ASCII haystack, a starting point and a span to search in, return all positions that match, rather than just one. We will define four versions of this functionality:

  • A 'direct' implementation using filter;

  • Three solutions that use unfoldr together with one of our findFirst implementations from above.

The code for these is below:

findAllNaive :: ByteArray -> Int -> Int -> Word8 -> [Int]
findAllNaive ba start span w8
 | w8 > 127 = []
 | start < 0 = []
 | start >= len = []
 | span <= 0 = []
 | otherwise =
     let !limit = min (start + span) len - 1
      in filter (\ix -> indexByteArray ba ix == w8) [start, start + 1 .. limit]
 where
   len :: Int
   !len = sizeofByteArray ba

findAllLoop :: ByteArray -> Int -> Int -> Word8 -> [Int]
findAllLoop ba start span w8 = unfoldr go start
 where
   go :: Int -> Maybe (Int, Int)
   -- Shrink the remaining span so that we never search past start + span.
   go ix = (\x -> (x, x + 1)) <$> findFirstNaive ba ix (span - (ix - start)) w8

findAllSWAR :: ByteArray -> Int -> Int -> Word8 -> [Int]
findAllSWAR ba start span w8 = unfoldr go start
 where
   go :: Int -> Maybe (Int, Int)
   go ix = (\x -> (x, x + 1)) <$> findFirstSWAR ba ix (span - (ix - start)) w8

findAllFFI :: ByteArray -> Int -> Int -> Word8 -> [Int]
findAllFFI ba start span w8 = unfoldr go start
 where
   go :: Int -> Maybe (Int, Int)
   go ix = (\x -> (x, x + 1)) <$> findFirstFFI ba ix (span - (ix - start)) w8

To demonstrate these overheads, we will construct a benchmark where we need to make a lot of repeated calls to each of the findFirst functions, while having each call do a minimal amount of work. Our haystack will be 2MiB in size, and consist of the block 0x01_00_00_00_00_00_00_00 repeated over and over. We will start from index 1, and span until the end of the haystack, with our needle being 0x01. This means that each call to our findFirst functions only needs to process eight bytes per call. While not a perfect measurement of overheads, it will do for our purposes.
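For reference, one way to build such a haystack (an assumption on our part, as the article's benchmark setup code is not shown) is via the IsList instance:

import GHC.Exts (fromList)

-- 2MiB of the repeating block, with 0x01 as the last byte of each
-- eight-byte group (matching the little-endian picture from earlier).
overheadHaystack :: ByteArray
overheadHaystack =
  fromList (concat (replicate (2 * 1024 * 1024 `quot` 8) block))
  where
    block :: [Word8]
    block = [0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01]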

The results are as follows:

All
  Naive:      OK
    2.26 ms ±  40 μs,  20 MB allocated, 153 B  copied,  50 MB peak memory
  Loop:       OK
    3.05 ms ±  46 μs,  26 MB allocated, 208 B  copied,  50 MB peak memory
  Loop, SWAR: OK
    2.24 ms ±  41 μs,  26 MB allocated, 201 B  copied,  50 MB peak memory
  Loop, FFI:  OK
    3.42 ms ±  54 μs,  32 MB allocated, 226 B  copied,  50 MB peak memory

We can see that the overheads on the FFI version are by far the worst. Not only do we need about 50% more time than the naive solution (despite the 50x speedup of the underlying functionality!), we also allocate about 60% more memory. Our SWAR-based loop is better than the FFI version in terms of overheads, but once again, the overheads completely drown out any benefit we get from using SWAR.

Seeing these results, it isn't surprising that in our string matching article, we saw no benefit to using findFirst with BMH. Even though the SWAR-driven findFirst is seven times faster than the naive version, the overheads completely drown out this improvement unless we give the function enough work to do in each call. For the FFI, this is even more severe: whereas for SWAR, the penalty comes mostly from the added setup to make use of SWAR techniques, for FFI, we have to pay the cost of calling out of Haskell.

Going further

The solution to this performance problem seems straightforward: write the equivalent of findFirstNaive, but using either SWAR or the FFI. However, this is not as straightforward as it looks. Both SWAR and the FFI run up against fundamental problems that have no solutions that are both easy and efficient.

To see why SWAR would give us little benefit, consider the following snippet of Haskell, designed to act as the working logic of a goBig-style worker for counting all matches:

let !chunk = indexUnalignedByteArray haystack ix
    !masked = chunk `xor` mask
    !added = masked + 0x7F_7F_7F_7F_7F_7F_7F_7F
    !result = added .&. highBits
  in if result == highBits
        then goBig (ix + 8) (bigStepsLeft - 1)
        else ... -- what should happen here?

Most of this is familiar from findFirstSWAR, with some changes, as there are no irrelevant bytes anymore. However, once we have detected one (or more) matches, we are forced to construct an [Int] whose length varies with how many matches we detected. Worse still, the goBig we had in findFirstSWAR accumulated into a Maybe Int, meaning that when we hit the else, we were done and wouldn't need to do any more work. For this block of code, however, we would have to potentially construct these lists many times, and then use <> to accumulate. This is famously Θ(n) in the length of its first argument, which on the whole gives us quadratic performance. Needless to say, no matter how fast we can detect matches using SWAR methods, the need to construct lists in this way completely drowns out any benefits we might get.

With the FFI, we are in an even worse situation. C isn't aware of Haskell lists as a structure by default; the only way to make it aware of such is to resort to calling Haskell from C. Not only is this significantly more complicated to both implement and build, but it would force us to call back and forth between Haskell and C, magnifying the FFI penalty we already know to be a problem. Worse still, this would prohibit us from using unsafe foreign imports, which would make our performance even worse. Needless to say, even if we could figure out how to employ C-side improvements for performance, these overheads would again drown them out.

Thus, it isn't really possible to speed up findAll using either of these approaches, as the need to work with Haskell lists forces us to give up any performance benefits we could potentially get. However, we can potentially use these techniques to define other, similar functions:

  • Counting needle matches

  • Finding the last index with a match

  • Counting needle mismatches

  • Determining what fraction of the haystack consists of the needle byte

  • Replacing every match found with a different (ASCII) byte, assuming the haystack is mutable

The commonality in all these cases is that each such functionality is described by some Monoid:

  • Counting matches or mismatches is Sum Int, where a match (or mismatch) contributes 1 (see the sketch after this list).

  • Finding the last index is Last Int (and findFirst corresponds to First Int).

  • Determining what fraction matches is (Sum Int, Sum Int); the first component counts how many matches we saw, while the second counts how many bytes we saw (so far), regardless of whether they matched or not.

  • Replacement is technically (), but with a mutation effect. More precisely, it is PrimMonad m => Ap m () for some m.
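To make the counting case concrete, here is a hedged sketch of the per-block logic, reusing the addition trick from the earlier counting snippet; since the per-block outcome is a fixed-size count, no list construction is needed (the function name is ours, and we assume Data.Bits and Data.Word are in scope):

-- Count how many lanes of an ASCII block match an ASCII needle.
countInBlock :: Word8 -> Word64 -> Int
countInBlock needle block =
  let mask   = fromIntegral needle * 0x01_01_01_01_01_01_01_01
      added  = (block `xor` mask) + 0x7F_7F_7F_7F_7F_7F_7F_7F
      result = added .&. 0x80_80_80_80_80_80_80_80
   in 8 - popCount result -- one high bit per non-matching lane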

However, findAll corresponds to a Monoid as well; specifically, it is [Int], or the free monoid[8] on Ints. The difference between [Int] as compared to Sum Int, (Sum Int, Sum Int), Last Int and even m () is that [Int] is variable-length, while the others are fixed-length. This nicely generalizes our observations from before: when we are forced to 'expand' a fixed-length outcome into a variable-length result, SWAR and the FFI both struggle performance-wise.

Conclusion

We've covered a lot of ground in this article! Starting from a small piece of functionality in string matching, we showed how SWAR can improve its performance drastically. We then discussed how we can use the FFI to improve it even further, with some discussion of how exactly we can do this. However, we also confirmed an observation from our earlier article: the overhead of SWAR and the FFI cannot be ignored, and if we don't give our SWAR or FFI functions enough work to do, we can potentially end up worse off.

We also spent some time discussing exactly what we can, and cannot, easily apply SWAR techniques to. We found a preliminary characterization by way of Monoids and fixed width, but this needs further exploration. We showed how the problem of finding first matches fits into that characterization, while the problem of finding all matches does not, and we talked about how the FFI relates to this characterization. It is worth looking into this characterization further - let us know if you'd be interested in such an article!

What we've seen shows that performance is a surprisingly subtle property, and while a technique can be extremely effective, there is always a cost, and one that might not be easy to spot. While SWAR and the FFI are effective and powerful tools, they need to be used with care and understanding, as they can backfire on us. The observations and experiments that led to this article certainly gave the author some surprising insights, which we wanted to share with other Haskellers. So give these techniques a try: if nothing else, you will learn a lot about Haskell, GHC and computing.

If you like what you’re reading and this sounds relevant to your business, please don’t hesitate to get in touch.


  1. In our experiments, we found that guard produces significantly worse code, especially for certain monads. [] in particular was awful, but Maybe wasn't great either.
  2. Luckily for us, no Tier 1 GHC platform is one of them.
  3. Unless we can optimize to the same level as C standard library developers.
  4. Speaking very technically, the C99 standard does not require such types to be provided. However, it would be difficult to find a non-museum architecture in 2025 that can avoid providing these types and still be standard-compliant. Furthermore, all Tier 1 platforms for GHC have these types available in the C compiler it must make available to us.
  5. Or at least, how it's defined on every machine you're likely to see outside of a museum.
  6. While technically there is no requirement that GHC does this, on all GHC Tier 1 platforms, this is the behaviour we will see. Thus, we won't consider endianness further.
  7. This example highlights just how specific we have to be with SWAR operations. The author of this article only caught this 'problem' when writing up an explanation of the method, only to later discover that it, in fact, was not a problem at all!
  8. Being very pedantic, [] isn't technically a free monoid due to laziness. However, if used the way we are using it here, it qualifies.