The 'A' is for 'Accelerated': Checking ASCII with SWAR
Introduction
Performance is a sought-after property of our code, following just after correctness. It drives considerations ranging from the choice of data structures and algorithms to specific language constructs used for implementation. In an ideal world, optimization is something we could entirely leave to compilers: in practice, if we really want the most out of our code, we must do some of the work ourselves.
Haskell, and GHC, are no exception. While Haskell's purity, as well as GHC's extensive engineering, can do remarkable things for the performance of otherwise naively written code, they are not a panacea. In many cases, we can get significantly better performance by making better choices, or writing in a 'lower-level' way, than what GHC could provide us with. While tedious, these methods can, and should, be used where appropriate. Many of these techniques end up resembling more 'traditional' or 'imperative' approaches, particularly in their choice of structures, and thus, end up being neglected by Haskellers.
In particular, we will focus on techniques around data parallelism working over packed data. Using ASCII verification as a test case, we will show not only how such techniques can be used, but how big of an improvement they present over 'regular' ways of solving the same problem in Haskell. As part of this, we will discuss various optimizations enabled by such an approach, as well as explain why they work. Lastly, we will see how this could be taken further, and what currently prevents us from doing this (easily) in GHC and Haskell today.
The problem
In a previous article, we needed a check for whether a given input was ASCII. As it was not the primary focus, we gave little thought to its optimization. As we are interested in demonstrating performance optimization techniques over packed data, we can use this check as an example.
For our packed representation, we will continue to use ByteArray, for the same reasons as before. We will also define a type for the result of our check, in order to avoid Boolean blindness:
data IsAsciiResult = IsAscii | InvalidByte Int Word8
  deriving stock (Eq, Show)

-- Needed for `tasty-bench` to work properly
instance NFData IsAsciiResult where
  rnf = \case
    IsAscii -> rnf ()
    InvalidByte ix w8 -> seq (rnf ix) (rnf w8)
Thus, our goal is to construct, and optimize, a function with the signature ByteArray -> IsAsciiResult, which produces IsAscii if every byte in the argument is an ASCII byte, and InvalidByte otherwise, indicating the position and value of the first such byte (in order of indices).
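To make the intended behaviour concrete, here is a minimal sketch (the demo helper and the isAscii parameter are ours, standing in for whichever implementation we end up with; byteArrayFromList comes from Data.Primitive.ByteArray):

-- A small sketch of the expected behaviour, not part of the implementation.
demo :: (ByteArray -> IsAsciiResult) -> (IsAsciiResult, IsAsciiResult)
demo isAscii =
  let allAscii = byteArrayFromList ([0x48, 0x69, 0x21] :: [Word8]) -- "Hi!"
      withBad  = byteArrayFromList ([0x48, 0xC3, 0xA9] :: [Word8]) -- "Hé", UTF-8 encoded
   in (isAscii allAscii, isAscii withBad)
-- Expected result: (IsAscii, InvalidByte 1 0xC3), as 0xC3 (195) is the first
-- byte with its high bit set, at index 1.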
A first attempt
We'll start with the simplest thing that could possibly work:
isAsciiFold :: ByteArray -> IsAsciiResult
isAsciiFold ba = foldl' go IsAscii [0, 1 .. sizeofByteArray ba - 1]
  where
    go :: IsAsciiResult -> Int -> IsAsciiResult
    go acc ix = case acc of
      InvalidByte _ _ -> acc
      IsAscii -> let indexed :: Word8 = indexByteArray ba ix
                  in if indexed >= 0x80
                       then InvalidByte ix indexed
                       else acc
This is essentially an indexed fold over the ByteArray, with a 'shortcut out' if we ever discover a non-ASCII byte. We could have also written this using an actual indexed fold, but this would first involve converting the ByteArray to a list, as ByteArray currently lacks an instance of FoldableWithIndex.
This implementation is quite similar to the approach we described in a previous article, the only difference being the better error reporting. Due to its simplicity and obvious correctness, we will use it as a reference for future versions, to ensure our optimizations don't break anything.
Measuring our performance
Before we do anything performance-related, we must first establish what our performance already is. Doing this requires choosing some benchmarks, then running our initial attempt against them to establish a baseline.
One important choice here is the size of the data we will be measuring with. This can't be too small: not only would that make it difficult to measure our performance accurately, it may also fail to reflect realistic performance. For problems involving packed memory representations like ours, this is especially important, as the cache hierarchy can easily make a badly-performing function look good on small inputs. We went with a 2 MiB ByteArray, as this is big enough to give good readings while still being realistic.
At the same time, size is not the only thing that matters for our specific problem. As we 'shortcut out' when finding a non-ASCII byte, two inputs that have the same size can have vastly different running times. Thus, we want to select an input that takes as long as possible for the given size. We chose to have every byte but the last be ASCII, with the very last being non-ASCII. This ensures that not only do we end up checking the entire input, but we also test how the failure-finding path performs as well.
Running isAsciiFold against the benchmark described above yields the following:
Naive: OK
891 μs ± 3.0 μs, 94 B allocated, 1 B copied, 10 MB peak memory
If we assume the worst case (894 microseconds for 2 MiB of data), we have a processing rate of about 2,346 bytes per microsecond, which is about 2.2 GiB per second. This is already fairly impressive and might be fine for some use cases. We now have a starting point: let's try and improve on it.
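For reference, here is how the throughput figures in this article are derived; a small sketch using the worst-case timing above (the helper names are ours):

-- 2 MiB processed in roughly 894 microseconds (worst case above).
naiveBytesPerMicrosecond :: Double
naiveBytesPerMicrosecond = (2 * 1024 * 1024) / 894 -- roughly 2,346 bytes per microsecond

naiveGiBPerSecond :: Double
naiveGiBPerSecond = naiveBytesPerMicrosecond * 1000000 / (1024 * 1024 * 1024) -- roughly 2.2 GiB/s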
What we aren't doing, but should be
Before we can improve the performance of our verifier, we need to consider what we aren't doing as well as we could be. From an algorithmic point of view, there isn't much to be had here: in the worst case, we must check every single byte of the input before we can confirm that there are no non-ASCII bytes. Thus, our improvements must stem from how efficiently we process this data linearly: that is, reducing our overheads.
If we examine isAsciiFold, we can see that we process one byte at a time. While checking every byte is conceptually necessary, processing them one at a time significantly limits our performance in practice. To understand why, we must first talk a little about how modern-day CPUs, and memory, work. While these are going to be generalizations, all Tier 1 GHC architectures basically work this way, with small differences we don't need to worry about. Specifically, we will discuss register-based workloads, memory hierarchies and layouts, and superscalarity, as these will be key to our design choices later.
All CPUs targeted by Tier 1 GHC architectures[6] must pull data out of memory and into registers before they can process it. However, on all of these architectures, registers (at least general-purpose ones) are 64 bits wide. This means that if we process only at the byte level, we're wasting 7/8ths of the available register. This is a severe inefficiency already: if we are going to all the effort to pull data into a register, we want to use all 64 bits, not just 8.
This inefficiency is made worse by the fact that the dominant cost of processing data is moving it between memory and registers: the so-called 'memory wall'. This has two implications:
A memory move essentially costs the same no matter how large the move is; and
We should limit memory moves where at all possible.
Luckily, current CPU designs help us with this in two ways. The first is cache hierarchies. Typically, we imagine that our computer memory looks like this:
Registers
---------
Memory
In practice, there are multiple levels of caches 'in the way':
Registers
---------
L1 cache
---------
L2 cache
---------
L3 cache (on some platforms)
---------
Memory
Each 'higher' cache in the hierarchy is smaller, but faster, and when a memory fetch is requested in code, in addition to moving the requested data to a register, that data (plus some more) is moved into caches as well. The amount of data moved into the cache (a cache line) is typically eight machine words on modern architectures (and definitely is the case for all Tier 1 GHC platforms). Therefore, if data we need soon after a fetch is physically nearby, it won't need to be fetched from memory: instead, it would come from a cache, which is faster (by a considerable margin). This concept is known as spatial locality; this is one major reason why packed data is faster to traverse and operate over.
Secondly, CPUs bring over more than one register's worth of data when a memory move is requested. For GHC Tier 1 platforms, this is a cache line, 64 bytes in size. Thus, while the conceptual result of a memory move might look like this:
Registers: [b2] [ ] [ ] .... [ ]
Memory: [ b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11 ]
the real picture is more like this:
Registers: [b2] [ ] [ ] .... [ ]
L1 cache: [ b2, b3, b4, b5, b6, b7, b8, b9, b10, b11 ]
Memory: [ b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11 ]
As we can see, even though we only requested a move of the byte b2, we ended up getting faster access to the entire cache line containing b2. This is another advantage of packed data: if we end up needing this, or nearby, data again soon, we already have it close to hand. This reuse of recently fetched data, known as temporal locality, is essential to performance, as an L1 cache move is about 200 times faster than a memory move, while an L2 cache move is about 20 times faster.
We have one last advantage we could exploit. While conceptually CPUs execute instructions one at a time, in practice they are superscalar, allowing them to execute instructions in parallel, provided no data dependencies are broken. This is partly because it's an easy way for hardware manufacturers to obtain performance: doubling the number of logic units on a CPU is much easier and cheaper than doubling the speed of those logic units. This happens automatically, and invisibly, within the CPU: data dependencies are detected, and work assigned, using an approach similar to Tomasulo's algorithm, stalling the work only when a dependency requires it. Together with the memory hierarchies and caching described above, this means well-written code can run significantly faster, despite seemingly needing to do the same amount of work.
What do the above concepts tell us? Firstly, since we're already making the effort to move one byte into a register for processing, we might as well move as many bytes as we can while we're at it. Secondly, we should do as much work at each 'fold step' as possible, provided we can avoid data dependencies. Lastly, both of the above are facilitated by using a packed representation like ByteArray, as the data is packed closely in memory, allowing us better spatial locality than a linked representation like a list would have. We already have good temporal locality, as we process the data in its entirety, start to finish.
At the same time, while we can move more data easily, how to process these larger blocks is far from obvious. Instead of treating our registers as containing a singular blob of bits, we want to treat them as fixed-length arrays of bytes and apply the same operation to every byte at once. This is a form of parallelism, called Single Instruction Multiple Data, or SIMD. SIMD approaches are designed to take maximum advantage of memory hierarchies, spatial and temporal locality, and superscalarity, by essentially enabling both large memory moves and caching, as well as parallel processing using a large number of ALUs. However, SIMD approaches require dedicated hardware. We can get similar benefits using a technique called SWAR, standing for 'SIMD within a register'. This uses similar ideas to SIMD, but instead of using dedicated hardware, does a similar kind of processing using a regular register and regular register operations.
We will now try using SWAR to speed up our processing, as well as to show just how large a speedup such techniques can achieve.
Using SWAR, first attempt
As we have just discussed, our main task is to use data parallelism to process eight-byte blocks 'all at once'. As part of that, we need not only to check whether a block contains only ASCII bytes, but also to properly detect the 'first' non-ASCII one by its index and report it. A further problem is that our input length may not be an exact multiple of eight. We will solve both of these problems, and then see how much improvement this gets us.
To solve the first problem, let's first address detecting whether a non-ASCII byte exists in an eight-byte block. We observe that any valid ASCII byte is less than 0x80; thus, an ASCII byte's high bit is always clear. As is typical with SWAR-style parallelism, we want to operate on the whole block in parallel (meaning, on all of its bytes at once) to put each byte into one of two forms:
A form representing 'this byte is ASCII'; and
A form representing 'this byte isn't ASCII'.
Since we only care about the high bit to classify bytes, we can achieve the above result by masking. This technique uses bitwise operations to clear all bits in particular positions. In our case, we want to clear everything but the high bit in each byte. This involves taking a bitwise AND with 0x80 (or 10000000 in binary) in each 'lane', which leaves the high bit as it was but clears all other bits.
We can now use this to detect if any of the bytes in our block were non-ASCII by comparing it to a block where every byte is zero. If these compare equal, we know that we had only ASCII bytes, and can proceed. Otherwise, we need to isolate the first byte in the block that was the problem.
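As a minimal sketch of this detection step (the helper names are ours; this assumes the same 64-bit Tier 1 platforms as the rest of the article):

import Data.Bits ((.&.))
import Data.Word (Word64)

-- A mask with only the high bit of each byte 'lane' set.
highBits :: Word64
highBits = 0x8080808080808080

-- True if any byte in the eight-byte block has its high bit set,
-- i.e. if the block contains at least one non-ASCII byte.
blockHasNonAscii :: Word64 -> Bool
blockHasNonAscii block = block .&. highBits /= 0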
We can potentially 'drop down' into a byte-by-byte check of the block, but this is undesirable: we already have the information we need, and should be able to extract it. To see why we already have this information, we observe that at this stage, the only possible set bits in our modified block must be the high bits of any byte. Therefore, we have the following:
If the first byte in the block is non-ASCII, there will be a set bit immediately.
If the first byte in the block is ASCII, but not the second, there will be eight clear bits, followed by a set bit.
If the first and second bytes in the block are both ASCII, but not the third, there will be sixteen clear bits, followed by a set bit.
If we continue this pattern, we observe two things:
The number of initial contiguous clear bits in a block with non-ASCII bytes is always a multiple of 8; and
The length of such a sequence determines where in such a block a non-ASCII byte is.
Thus, we need a way of counting the length of such a sequence of initial contiguous clear bits. Luckily for us, we have just such an operation: countTrailingZeros from Data.Bits[3]. This operation is also quite efficient, needing only a single hardware instruction on any Tier 1 GHC platform. countTrailingZeros gives us an offset in bits, which we can convert into an offset in bytes using a division by 8. This is also quite fast, as it amounts to a shift, since indexes are Ints by convention. Combined with the information we already have from looping, we can compute the index of the byte in the entire input, and then retrieve the problem byte with a single indexing operation. This solves the first problem, without forcing us to use a secondary loop at all.
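A minimal sketch of this locating step, assuming a little-endian platform as per footnote [3] (the helper name is ours):

import Data.Bits (countTrailingZeros, (.&.))
import Data.Word (Word64)

-- Given an eight-byte block known to contain a non-ASCII byte, return the
-- offset (0 to 7) of the first such byte within the block.
firstNonAsciiOffset :: Word64 -> Int
firstNonAsciiOffset block =
  let masked = block .&. 0x8080808080808080 -- keep only the high bit of each lane
   in countTrailingZeros masked `quot` 8    -- convert a bit offset into a byte offset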
The second problem is fortunately more straightforward: we can employ a technique called loop sectioning. More specifically, instead of taking 'steps' of exactly one byte throughout, we take two kinds of steps:
'Big steps', which are eight bytes at a time; followed by
'Small steps', which are one byte at a time.
Since the input length is known, we can calculate exactly how many steps of each size are required. This works for any input length, even when it isn't a multiple of eight. Computing those step counts is cheap (our index is just an Int). Putting it all together, the implementation looks like this:
isAsciiSwar :: ByteArray -> IsAsciiResult
isAsciiSwar ba = goBig 0 bigSteps
  where
    bigSteps :: Int
    !bigSteps = sizeofByteArray ba `quot` 8
    smallSteps :: Int
    !smallSteps = sizeofByteArray ba `rem` 8
    goBig :: Int -> Int -> IsAsciiResult
    goBig byteIx bigStepsLeft
      | bigStepsLeft == 0 = goSmall byteIx smallSteps
      | otherwise =
          let !indexed :: Word64 = indexUnalignedByteArray ba byteIx
              !masked = indexed .&. 0x80_80_80_80_80_80_80_80
           in if masked == 0
                then goBig (byteIx + 8) (bigStepsLeft - 1)
                else
                  let !q = countTrailingZeros masked `quot` 8
                      !badIx = byteIx + q
                   in InvalidByte badIx (indexByteArray ba badIx)
    goSmall :: Int -> Int -> IsAsciiResult
    goSmall byteIx smallStepsLeft
      | smallStepsLeft == 0 = IsAscii
      | otherwise =
          let indexed :: Word8 = indexByteArray ba byteIx
           in if indexed >= 0x80
                then InvalidByte byteIx indexed
                else goSmall (byteIx + 1) (smallStepsLeft - 1)
We use two recursive functions to perform the work: goBig, which processes eight bytes at a time using SWAR techniques, and goSmall, which works one byte at a time. We use a combination of quot and rem to calculate the number of both kinds of steps, then use a countdown to know when to switch over (or stop).
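For instance, a quick sketch of how the step counts work out (the 21-byte length is a hypothetical example, and stepCounts is our own helper):

-- For a hypothetical 21-byte input:
--   21 `quot` 8 = 2 big steps   (covering bytes 0 to 15)
--   21 `rem` 8  = 5 small steps (covering bytes 16 to 20)
stepCounts :: Int -> (Int, Int)
stepCounts len = (len `quot` 8, len `rem` 8)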
goBig is where SWAR techniques are applied. Each call of goBig processes a block of 64 bits, which we treat as eight one-byte 'lanes'. For each block, the following happens:
1. The block is loaded using indexUnalignedByteArray[1].
2. A mask is applied to the entire block, which zeroes all bits except the high bit of each byte.
3. The result of 2 is compared to a block where every byte is zero.
If the comparison from 3 shows that the blocks are equal, we have all ASCII bytes and can proceed. Otherwise, we use the method we described previously to locate the first non-ASCII byte in the block, then use that information, together with the index of the block, to compute the index of the problem byte. Then, we only need one indexing operation to extract it.
But how much of an improvement did we get? In theory, we should be getting roughly an eight-fold improvement, as we now process eight bytes per step instead of one. In practice?
SWAR: OK
140 μs ± 2.1 μs, 103 B allocated, 0 B copied, 10 MB peak memory, 0.16x
This is certainly an improvement: with essentially the same allocations, we have now gone up to (again, worst-case) about 14,769 bytes per microsecond, which is about 13.75 GiB per second. This is around a six-fold improvement, which is not quite the eight-fold we would have expected in theory.
This result illustrates one of the inevitabilities of parallelism: parallel cost. Due to the change in how we must write parallel processing operations, there is always an overhead we must pay, no matter how much parallelism we end up using. Most of this cost comes from the need to avoid branching, but in our case, the SWAR techniques themselves end up costing us something as well. This means that, once we find some parallelism, we want as large a parallel width as we can sustain, as this amortizes the parallel cost.
This is already a large improvement - but can we do even better?
Loop unrolling for even more speed
With our last changes, we have optimized our register use internally: now that we are processing eight bytes at a time, no space within a register is being wasted. However, we aren't really using our registers to their full capability: as we discussed above, superscalarity can allow us to do much more work in parallel than we expect, provided no data dependencies exist between the work items.
To improve our register use, we can use a technique called loop unrolling. As a brief demonstration of how this works, imagine we had a routine step, which we called n times. Loop unrolling would reduce the number of calls by a factor of k, instead calling the routine step k times on each 'rotation'. For example, suppose n = 10 and k = 2. Then, loop unrolling would transform a loop like this:
-- We use `for_` here for simplicity, but this converts to a fold just as easily
for_ [0, 1 .. 9] step
into a loop like this:
for_ [0, 2 .. 8] (\i -> step i *> step (i + 1))
At first glance, this seems pointless: in either case, we are calling step the same number of times. However, there are two (conditional) advantages:
If step calls don't depend on each other, we expose more work for superscalarity purposes.
We pay the loop overhead fewer times.
Together, these two benefits potentially allow us to run an unrolled computation more efficiently than its non-unrolled version. We typically describe loop unrolling by talking about the factor k from above; if k = 2, we call this a 'twofold unroll', if k = 4 it's a 'fourfold unroll', and so on.
Loop unrolling can be combined with loop sectioning. Our original description of loop sectioning assumes two kinds of 'step', but ultimately, we can have more than two if we so choose. For example, we could have a scheme like this:
'Huge steps' of sixteen bytes, using a twofold unroll; then
'Large steps' of eight bytes, using our original SWAR technique without unrolling; and lastly
'Small steps' of one byte.
This could be extended to as many 'sections' as we choose, of whichever sizes we choose. The above scheme seems like a reasonable starting point; the code that results is below:
isAsciiSwar2x :: ByteArray -> IsAsciiResult
isAsciiSwar2x ba = goHuge 0 hugeSteps
  where
    hugeSteps :: Int
    !hugeSteps = sizeofByteArray ba `quot` 16
    bigSteps :: Int
    !bigSteps = (sizeofByteArray ba `rem` 16) `quot` 8
    smallSteps :: Int
    !smallSteps = (sizeofByteArray ba `rem` 16) `rem` 8
    goHuge :: Int -> Int -> IsAsciiResult
    goHuge byteIx hugeStepsLeft
      | hugeStepsLeft == 0 = goBig byteIx bigSteps
      | otherwise =
          let !indexed1 :: Word64 = indexUnalignedByteArray ba byteIx
              !indexed2 :: Word64 = indexUnalignedByteArray ba (byteIx + 8)
              !combined = (indexed1 .|. indexed2) .&. 0x80_80_80_80_80_80_80_80
           in if combined == 0
                then goHuge (byteIx + 16) (hugeStepsLeft - 1)
                else
                  let !masked1 = indexed1 .&. 0x80_80_80_80_80_80_80_80
                   in if masked1 == 0
                        then
                          let !masked2 = indexed2 .&. 0x80_80_80_80_80_80_80_80
                              !q = countTrailingZeros masked2 `quot` 8
                              !badIx = byteIx + 8 + q
                           in InvalidByte badIx (indexByteArray ba badIx)
                        else
                          let !q = countTrailingZeros masked1 `quot` 8
                              !badIx = byteIx + q
                           in InvalidByte badIx (indexByteArray ba badIx)
    goBig :: Int -> Int -> IsAsciiResult
    goBig byteIx bigStepsLeft
      | bigStepsLeft == 0 = goSmall byteIx smallSteps
      | otherwise =
          let !indexed :: Word64 = indexUnalignedByteArray ba byteIx
              !masked = indexed .&. 0x80_80_80_80_80_80_80_80
           in if masked == 0
                then goBig (byteIx + 8) (bigStepsLeft - 1)
                else
                  let !q = countTrailingZeros masked `quot` 8
                      !badIx = byteIx + q
                   in InvalidByte badIx (indexByteArray ba badIx)
    goSmall :: Int -> Int -> IsAsciiResult
    goSmall byteIx smallStepsLeft
      | smallStepsLeft == 0 = IsAscii
      | otherwise =
          let indexed :: Word8 = indexByteArray ba byteIx
           in if indexed >= 0x80
                then InvalidByte byteIx indexed
                else goSmall (byteIx + 1) (smallStepsLeft - 1)
This is quite similar to isAsciiSwar, and re-uses goBig and goSmall essentially unchanged. goHuge works similarly to goBig, but to account for the larger processing size, we have to adjust the process in two ways.
Firstly, we need to modify the detection process:
1. Load two eight-byte blocks with indexUnalignedByteArray, with the second using a manual offset.
2. Use a bitwise OR to combine the two blocks into one.
3. Mask the result of 2 the same way as in goBig.
4. Compare the result of 3 to a block with zero bytes everywhere.
To see why this works, consider our previous observation that any ASCII byte will have its high bit clear. As bitwise OR sets a bit whenever either argument bit is set, combining the blocks this way gives us a block where, for each byte, the high bit is set if the corresponding byte in either of the blocks we read in step 1 had its high bit set. Then, by applying the mask in step 3, we get a block of all zero bytes only if both of the step 1 blocks contained nothing but ASCII bytes.
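A small property sketch of why this combination is sound (the names are ours; the property could be checked with QuickCheck if desired):

import Data.Bits ((.&.), (.|.))
import Data.Word (Word64)

highBits :: Word64
highBits = 0x8080808080808080

-- The OR-combined check agrees with checking each block separately:
-- the combined block is 'clean' exactly when both inputs are.
orTrickAgrees :: Word64 -> Word64 -> Bool
orTrickAgrees block1 block2 =
  ((block1 .|. block2) .&. highBits == 0)
    == (block1 .&. highBits == 0 && block2 .&. highBits == 0)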
If we find no non-ASCII bytes, we can continue as before. However, if the comparison in step 4 gives a not-equal result, we must take a few more steps to isolate the culprit. This is because the result of step 3 only tells us whether we had a non-ASCII byte, not where, as combining the blocks loses information about positions. This is different from isAsciiSwar, where the analogous result could also be used to find the non-ASCII byte quickly.
Thus, we instead have to perform a similar countTrailingZeros check, but now on each of the loaded blocks individually. We can save ourselves some work by using a check against all-zero to avoid one of the countTrailingZeros operations. While this adds a little more work, it's still a fixed overhead, with the countTrailingZeros instruction doing most of the work[4].
In theory, we should gain another twofold speedup, as we are now processing data twice as fast as isAsciiSwar. In practice, however, it's not quite what happens:
2x: OK
84.4 μs ± 901 ns, 115 B allocated, 0 B copied, 10 MB peak memory, 0.60x
While the improvement is still noteworthy, with the running time dropping by about 40%, it's definitely not the theoretical twofold speedup. Despite this, this improved version of our function crunches through almost 23 GiB per second - over 10x what we started with!
But can we go even faster? If a twofold unroll got us somewhere, would a fourfold unroll still be worthwhile? We can extend our approach fairly straightforwardly to achieve this:
isAsciiSwar4x :: ByteArray -> IsAsciiResult
isAsciiSwar4x ba = goHuge 0 hugeSteps
  where
    hugeSteps :: Int
    !hugeSteps = sizeofByteArray ba `quot` 32
    bigSteps :: Int
    !bigSteps = (sizeofByteArray ba `rem` 32) `quot` 8
    smallSteps :: Int
    !smallSteps = (sizeofByteArray ba `rem` 32) `rem` 8
    goHuge :: Int -> Int -> IsAsciiResult
    goHuge byteIx hugeStepsLeft
      | hugeStepsLeft == 0 = goBig byteIx bigSteps
      | otherwise =
          let !indexed1 :: Word64 = indexUnalignedByteArray ba byteIx
              !indexed2 = indexUnalignedByteArray ba (byteIx + 8)
              !indexed3 = indexUnalignedByteArray ba (byteIx + 16)
              !indexed4 = indexUnalignedByteArray ba (byteIx + 24)
              !combined = (indexed1 .|. indexed2 .|. indexed3 .|. indexed4) .&. 0x80_80_80_80_80_80_80_80
           in if combined == 0
                then goHuge (byteIx + 32) (hugeStepsLeft - 1)
                else
                  let !masked1 = indexed1 .&. 0x80_80_80_80_80_80_80_80
                   in if masked1 == 0
                        then
                          let !masked2 = indexed2 .&. 0x80_80_80_80_80_80_80_80
                           in if masked2 == 0
                                then
                                  let !masked3 = indexed3 .&. 0x80_80_80_80_80_80_80_80
                                   in if masked3 == 0
                                        then
                                          let !masked4 = indexed4 .&. 0x80_80_80_80_80_80_80_80
                                              !q = countTrailingZeros masked4 `quot` 8
                                              !badIx = byteIx + 24 + q
                                           in InvalidByte badIx (indexByteArray ba badIx)
                                        else
                                          let !q = countTrailingZeros masked3 `quot` 8
                                              !badIx = byteIx + 16 + q
                                           in InvalidByte badIx (indexByteArray ba badIx)
                                else
                                  let !q = countTrailingZeros masked2 `quot` 8
                                      !badIx = byteIx + 8 + q
                                   in InvalidByte badIx (indexByteArray ba badIx)
                        else
                          let !q = countTrailingZeros masked1 `quot` 8
                              !badIx = byteIx + q
                           in InvalidByte badIx (indexByteArray ba badIx)
    goBig :: Int -> Int -> IsAsciiResult
    goBig byteIx bigStepsLeft
      | bigStepsLeft == 0 = goSmall byteIx smallSteps
      | otherwise =
          let !indexed :: Word64 = indexUnalignedByteArray ba byteIx
              !masked = indexed .&. 0x80_80_80_80_80_80_80_80
           in if masked == 0
                then goBig (byteIx + 8) (bigStepsLeft - 1)
                else
                  let !q = countTrailingZeros masked `quot` 8
                      !badIx = byteIx + q
                   in InvalidByte badIx (indexByteArray ba badIx)
    goSmall :: Int -> Int -> IsAsciiResult
    goSmall byteIx smallStepsLeft
      | smallStepsLeft == 0 = IsAscii
      | otherwise =
          let indexed :: Word8 = indexByteArray ba byteIx
           in if indexed >= 0x80
                then InvalidByte byteIx indexed
                else goSmall (byteIx + 1) (smallStepsLeft - 1)
The noteworthy changes here are that we now read (and combine) four blocks instead of two, and have to perform a fairly extensive set of nested ifs to find bad bytes[2].
However, the results are somewhat disappointing:
4x: OK
63.0 μs ± 1.2 μs, 115 B allocated, 0 B copied, 10 MB peak memory, 0.75x
We only gained about a 25% improvement compared to the twofold unroll, which is a far cry from the further twofold improvement we should have gained theoretically. This also illustrates that, after a certain point, more parallel width stops yielding benefits, a limit related to Amdahl's law. In our specific case, a likely source of the stall is that, while we can load multiple blocks (largely) in parallel, having to combine them before checking forces us to slow down. While this still gets us about 30.5 GiB per second, we are definitely reaching the point of diminishing returns, especially if we factor the readability of the code into our calculations.
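As a rough illustration of the limit being described here, a sketch of Amdahl's law (not a model of this exact benchmark; the function name is ours):

-- Amdahl's law: if a fraction p of the work can be sped up by a factor s,
-- the overall speedup is bounded by 1 / ((1 - p) + p / s).
amdahlSpeedup :: Double -> Double -> Double
amdahlSpeedup p s = 1 / ((1 - p) + p / s)

-- For example, if only 90% of the work benefits from extra parallel width,
-- the overall speedup can never exceed 10x:
-- amdahlSpeedup 0.9 1e9 is roughly 10.0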
Going even further
Could we go faster than even this? Not with Haskell alone, unfortunately. We might be able to gain small speedups here or there, but ultimately, the limiting factor is that we are unable to process larger blocks than eight bytes in one 'step'. This is despite most consumer hardware (including every Tier 1 GHC platform) having support for instructions that process at least 16 bytes at a time if not more.
Why can we not use these instructions? At the moment, there are basically two ways to access such capabilities:
GHC's built-in SIMD support; and
The FFI.
The first of these has traditionally been quite limited: aside from certain compiler builtins calling into C standard library functions, SIMD support was restricted to the LLVM backend until recently. Even then, the set of supported operations is limited: at the time of writing, neither logical nor comparison operations are supported, which makes our task essentially impossible.
The second option can definitely achieve much more, as you have access to everything that the platform's C compiler can provide. However, there are downsides here too:
There is always a penalty for FFI calls, which you have to factor in.
Returning more complex Haskell values from the FFI is difficult, which can create friction or require complex wrapping logic.
SIMD instructions are very platform-specific, and Cabal currently does not provide any mechanism to detect these differences.
The default C compiler for GHC is highly platform-dependent (and even installation-dependent in some cases), and as FFI code is run through whatever C compiler is associated with GHC, getting consistent results can be difficult.
This doesn't mean that the FFI can't be used to access the hardware needed for significant parallel speedups. A good example is bytestring, which uses SIMD instructions to great effect to speed up UTF-8 validation. However, the linked implementation also shows just what kind of work is needed to properly detect, and use, supported instructions on a per-platform basis consistently. These costs can be worth paying, but the required time and expertise might not justify the improvements.
Conclusion
Working with packed data is one way to realize large performance gains today. While Haskell and GHC weren't designed with these kinds of workloads in mind, using SWAR techniques that make use of the memory hierarchy, superscalarity, and spatial and temporal locality, we can still gain many performance improvements. We demonstrated this by showing more than a 10x speedup[5] on a task involving processing a packed linear collection as compared to the naive Haskell way of solving this problem. Furthermore, our work didn't require leaving the safety or convenience of Haskell: we didn't use the FFI or even unboxed operations!
While our example is somewhat toy, we can see that, using clever techniques, we can have GHC work better with our hardware, giving us speedups far in excess of what would be possible normally. While such techniques (in the context of GHC) can't use the hardware to its maximum, and only work with the right representations, we believe that as a tool in your toolbox, they are worth knowing about and using. SWAR specifically can be used with only safe Haskell operations, which are familiar and easy to use.
We hope this article both encourages increased use of such techniques among Haskellers and helps motivate better support for true SIMD in GHC, which would allow us to go much further. If you like what you're reading and this sounds relevant to your business, please reach out.
[1] Here, the use of 'unaligned' is a bit inaccurate. We never actually load unaligned blocks, as we start reading from the beginning of the ByteArray, and GHC maintains word alignment, which on the testing platform was 8 bytes. We need this function because it lets us use byte indices to access whole words. Technically we could avoid it, but that would require maintaining two sets of indices, which makes the work fiddlier without any performance gains.
[2] Readers familiar with C would note that a fallthrough switch statement could do this with significantly fewer lines of code, and probably with fewer branches. However, we can't easily write something like this in Haskell, or have any certainty about how it would translate.
[3] Technically this only applies to little-endian architectures. However, as no big-endian architecture is currently Tier 1 for GHC, we're going to disregard this concern.
[4] A question some might ask is why we prefer checking for a zero block instead of counting trailing zeroes in both blocks every time. This is because counting trailing zeroes is a much slower operation on hardware than a comparison to zero; the latter is used so often (such as for loop control) that practically all architectures and compilers heavily optimize it.
[5] Closer to 15x, actually.
[6] And, in practice, all architectures you're likely to encounter outside of a museum nowadays.