Our Performance is massiv: Getting the Most Out of Your Hardware in Haskell

25 Jul

Introduction

To get performance from current-day hardware, irrespective of language, requires the use of two techniques:

Temporal and spatial locality; and
Parallelism.

We have previously published a demonstration of just how much temporal and spatial locality techniques benefit workloads. Key to this is the effective use of arrays: by being compact in memory, they allow us to benefit from both temporal and spatial locality easily. While Haskell is not directly oriented around working with arrays, the Haskell ecosystem has libraries designed to make it easy to work with them. The premier such library is vector, which is both convenient and fairly efficient when used correctly. Thus, vector is used widely in the Haskell ecosystem: it is practically impossible to find any major project in Haskell that doesn't depend on vector somehow. MLabs' work is no exception in this regard.

However, for all of vector's strengths, it does have some notable downsides. The biggest of these is its lack of support for parallelism of any kind: even if arrays could be generated using parallel processing, this doesn't happen. However, a second problem also exists: vector only works natively with one-dimensional arrays (that is, linear collections). While this is sufficient for many applications, higher-dimensional arrays are something that, when you need, you really need. While we can manually simulate higher-dimensional arrays by using index projections, this is tedious, error-prone and quite hard to understand when encountered in code written by others. These two issues limit vector's capabilities, both from the point of view of performance, and also from the point of view of programmer productivity.

massiv is a library that overcomes both of these limitations, while also providing other conveniences and safety measures. By supporting both higher-dimensional indexing and parallelism directly, massiv makes many computations both faster to write and faster to run. Together with the additional convenience and safety that it provides, massiv is an excellent replacement for vector in most situations. At MLabs, we have made effective use of massiv, specifically by way of its support for parallelism, to hugely improve the performance of client code, with minimal difficulty, on at least one occasion. Thus, we wanted to discuss how massiv is useful and can be used, especially as it can be a little cryptic sometimes, as well as to compare it to vector in both performance and usability.

We will use the example of cellular automata to demonstrate how both vector and massiv can be used to solve the same problem. We will compare two instances of cellular automata, demonstrating two-dimensional and three-dimensional array use, for both implementation convenience and ease, and also performance. We will show that massiv can not only make such computations easier and clearer to define, but can also give huge performance improvements. We will additionally discuss one helpful added feature of massiv (array stencils), while also mentioning some care that we must take when measuring performance, especially parallel performance. While this is not a tutorial about every feature of massiv, we will spend some time discussing some parts of massiv to show the benefits of their design.

Cellular automata in brief

Cellular automata are a model of computation. Briefly, the basic unit of computation is the cell, which is allowed to be in one of several discrete states (for example, 'off' and 'on'). Additionally, each cell has a neighbourhood, which is some number of other cells it is connected to. You can think of the overall computation taking place in some space, with each neighbourhood being a sub-space of that space. Typically, this space is represented finitely using a periodic boundary condition. We also have a discrete notion of time, meaning that time moves in 'steps', starting at some initial 'zero time'.

At each 'time step', the state of a cell can change depending on the state of its neighbourhood, as well as its own state. These changes are applied 'simultaneously' to all cells: more specifically, the state of a cell at time t+1 depends on the state of its neighbourhood, and itself, at time t only. This essentially makes a cellular automaton a type of finite state machine. The way the cells change is typically described by a set of rules, which specify how a cell's state changes depending on its neighbourhood.

The choice of space, neighbourhood, cell states and rules determines the cellular automaton entirely. Cellular automata are used in a range of applications, ranging from biology to chemistry to physics. We use cellular automata as a demonstration of where arrays (of various dimensionality) are a good choice of data structure.

Our cellular automata

To demonstrate how both vector and massiv can be used to solve the same problem, we will use two cellular automata:

The Game of Life; and
Architecture.

Both of these automata have cells that can be in one of two states: alive or dead. The Game of Life works over a 2-dimensional space; we will represent this using a 2D grid, with a periodic boundary condition to 'wrap around' row and column dimensions. Architecture instead works over a 3-dimensional space, which we will represent with a 3D cube of cells, using a similar 'wrap around' periodic boundary condition. Both use a Moore neighbourhood. More specifically, for Game of Life, the neighbourhood is the eight surrounding cells in X and Y, whereas for Architecture, the neighbourhood is the 26 surrounding cells in X, Y and Z.

The major difference between these two cellular automata is their rules. Since in both cases, cells can only be in two states, for the state of a neighbourhood, it is enough to know how many cells in that neighbourhood are alive. Thus, we can describe the rules for both cellular automata using transition tables, which have a combination of the state of a cell, and the state of its neighbourhood, indicating its next state.

For Game of Life, the table looks like this:

 
   
       
       0
       1
       2
       3
       4
       5
       6
       7
       8
     

   
       Alive
       Dead
       Dead
       Alive
       Alive
       Dead
       Dead
       Dead
       Dead
       Dead
     

       Dead
       Dead
       Dead
       Dead
       Alive
       Dead
       Dead
       Dead
       Dead
       Dead
     

 

For simplicity, we can coalesce ranges of neighbourhood states that produce the same results:

 
       0-1
       2
       3
       4-8
     
       Alive
       Dead
       Alive
       Alive
       Dead
     
       Dead
       Dead
       Dead
       Alive
       Dead

The rules for Architecture (coalesced similarly) look like this:

  
        0-2
        3
        4-6
        7-26
      
        Alive
        Dead
        Dead
        Alive
        Dead
      
        Dead
        Dead
        Alive
        Dead
        Dead

Initial set-up

We first define a type to represent the state of a cell, along with some helpers:

newtype CellState = CellState Word8
   deriving (Eq, Prim) via Word8
   
instance Show CellState where
   show = \case
       Dead -> "Dead"
       Alive -> "Alive"
       
pattern Dead :: CellState
pattern Dead <- CellState 0
   where Dead = CellState 0
   
pattern Alive :: CellState
pattern Alive <- CellState 1
   where Alive = CellState 1

{-# COMPLETE Dead, Alive #-}

We use a Word8 to represent the CellState here, as this allows us to use the more efficient representation of both vector Vectors and massiv Arrays. We could have defined a sum type, or used Bool as our newtype representation instead, but in both cases, this would have meant using boxed representations, which take more space and are less compact in memory[1]. However, this introduces a lot of unnecessary invalid states; we avoid this problem by defining a pair of pattern synonyms, along with a COMPLETE pragma to ensure that we get an exhaustive match once we match on both of them.

Another reason behind our choice of Word8 has to do with our need to determine the number of living cells in a neighbourhood quickly. With Word8, we can directly sum over representations, as living cells are 1, while dead ones are 0. If we had used a sum type (or Bool), this would involve an additional conversion, whereas in our case, a coerce is all we need.

Having these definitions, we can now implement both transition tables:

gameOfLifeRules :: Word8 -> CellState -> CellState
gameOfLifeRules living c
   | living == 2 = c
   | living == 3 = Alive
   | otherwise = Dead

architectureRules :: Word8 -> CellState -> CellState
architectureRules living c
   | living == 3 = case c of Dead -> Alive; Alive -> Dead
   | 4 <= living && living <= 6 = c
   | otherwise = Dead

We use Word8 for our living neighbour count here, based on our representation choice as described above.

Solving with vector

For simplicity, we will assume our grid (or cube) is always square in the subsequent implementations. We do this only to avoid having to deal with more difficult index arithmetic, but our work could be extended to use different dimensions without issue.

For the Game of Life, we need a two-dimensional array. vector does not support such constructions inherently, which means we must choose how to project (or 'flatten') our two-dimensional array into a Vector. We will use a row-major order: this means that, given dim number of both rows and columns, the cell at row i and column j is projected to Vector index i * dim + j[2]. Put another way, row-major ordering 'slices apart' a two-dimensional array along its rows, then puts these rows end-to-end.

As we also need to track the dimension (size of our rows and columns), we need a wrapper type to go with our Vector. We will use the Vector from Data.Vector.Primitive, as CellState is an instance of Prim.

data GameOfLife = GameOfLife Int (Vector CellState)
   deriving stock (Eq, Show)

-- Will be necessary for benchmarks later
instance NFData GameOfLife where
   rnf (GameOfLife dim v) = seq (rnf dim) (rnf v)

Lastly, we need a way of applying our transition to every cell simultaneously. We can drive the work using imap, as this allows us to apply a function to every element of a Vector, together with its index, both of which we need. We will then use a helper function taking the index, value and original array to calculate the number of living cells in our neighbourhood, then apply the transition.

transitionGOL :: GameOfLife -> GameOfLife
transitionGOL (GameOfLife dim v) = 
   GameOfLife dim . Vector.imap go $ v
   where
       go :: Int -> CellState -> CellState
       go ix c = 
           let !rowIx = ix `quot` dim
               !colIx = ix `rem` dim
               !rowIxBehind = (rowIx - 1) `mod` dim
               !rowIxAhead = (rowIx + 1) `rem` dim
               !colIxBehind = (colIx - 1) `mod` dim
               !colIxAhead = (colIx + 1) `rem` dim
               !rowIxBehindDim = rowIxBehind * dim
               !rowIxDim = rowIx * dim
               !rowIxAheadDim = rowIxAhead * dim
               !living :: Word8 =
                   coerce (v Vector.! (rowIxBehindDim + colIxBehind))
                   + coerce (v Vector.! (rowIxBehindDim + colIx))
                   + coerce (v Vector.! (rowIxBehindDim + colIxAhead))
                   + coerce (v Vector.! (rowIxDim + colIxBehind))
                   -- Skip rowIx, colIx pair
                   + coerce (v Vector.! (rowIxDim + colIxAhead))
                   + coerce (v Vector.! (rowIxAheadDim + colIxBehind))
                   + coerce (v Vector.! (rowIxAheadDim + colIx))
                   + coerce (v Vector.! (rowIxAheadDim + colIxAhead))
               in gameOfLifeRules living c

The most complex process here is determining which indexes the neighbourhood of any given cell maps to. The first step is to convert a 'combined' index into a combination of row and column indexes. Since, given a dim, row indexes i and j and combined index ix, we have:

ix = dim * i + j

we see that the combination of quot and rem with dim will produce i and j,respectively. Then, we need to calculate both the row and column indexes that come before i and j, but also the row and column indexes that come after i and j. This is made more complicated by the periodic boundary condition: row or column index -1 must become dim - 1 instead, while row or column index dim must become 0. We handle these cases separately:

For cases where an index may end up as -1, we use mod dim to produce dim - 1; and
For cases where an index may end up as dim, we use rem dim to produce 0.

We could use mod in both cases (and must use mod for the first case), but rem is slightly more efficient, so it's better to use it where possible. Lastly, to produce the entire neighbourhood, we use a list comprehension to pair every possible combination of row indexes (previous, current, last) and column indexes (previous, current, last), avoiding the index i, j, rebuild them into combined indexes, then count any living cells at any of those indexes. Thanks to our choice of Word8 as a representation, this can be done with a combination of indexing and coerce.

We wrote the indexing and sum in a manual way, instead of relying on something like sum or a list comprehension. Indeed, we could have written living as

sum . fmap (coerce . (v Vector.!)) $ [ i * dim + j | 
                                      i <- [rowIxBehind, rowIx, rowIxAhead],
                                      j <- [colIxBehind, colIx, colIxAhead],
                                      i /= rowIx || j /= colIx
                                      ]

However, this is inefficient on several levels:

There is no guarantee that the list comprehension will get fused away[3];
We redundantly re-compute rowIxBehind * dim, rowIx * dim and rowIxAhead * dim unless GHC is smart enough to float them out[4];
We are forced to check the condition nine times, even though we know it can only happen once.

If these occurred only once, it would probably be acceptable in the name of brevity and convenience. However, living is constructed for every cell, and thus, inefficiencies magnify considerably.

We can take a very similar approach with Architecture, but adding one more dimension:

data Architecture = Architecture Int (Vector CellState)
    deriving stock (Eq, Show)

-- As previously mentioned, we will need this for benchmarking later
instance NFData Architecture where
    rnf (Architecture dim v) = seq (rnf dim) (rnf v)

transitionArchitecture :: Architecture -> Architecture
transitionArchitecture (Architecture dim v) = 
    Architecture dim . Vector.imap go $ v
    where
        go :: Int -> CellState -> CellState
        go ix c = 
            let !dimSquared = dim * dim
                !xIx = ix `quot` dimSquared
                !rest = ix `rem` dimSquared
                !yIx = rest `quot` dim
                !zIx = rest `rem` dim
                !xIxBehind = (xIx - 1) `mod` dim
                !xIxAhead = (xIx + 1) `rem` dim
                !yIxBehind = (yIx - 1) `mod` dim
                !yIxAhead = (yIx + 1) `rem` dim
                !zIxBehind = (zIx - 1) `mod` dim
                !zIxAhead = (zIx + 1) `rem` dim
                !xIxBehindDimSquared = xIxBehind * dimSquared
                !xIxDimSquared = xIx * dimSquared
                !xIxAheadDimSquared = xIxAhead * dimSquared
                !yIxBehindDim = yIxBehind * dim
                !yIxDim = yIx * dim
                !yIxAheadDim = yIxAhead * dim
                !xy1 = xIxBehindDimSquared + yIxBehindDim
                !xy2 = xIxBehindDimSquared + yIxDim
                !xy3 = xIxBehindDimSquared + yIxAheadDim
                !xy4 = xIxDimSquared + yIxBehindDim
                !xy5 = xIxDimSquared + yIxDim
                !xy6 = xIxDimSquared + yIxAheadDim
                !xy7 = xIxAheadDimSquared + yIxBehindDim
                !xy8 = xIxAheadDimSquared + yIxDim
                !xy9 = xIxAheadDimSquared + yIxAheadDim
                living :: Word8 =
                    coerce (v Vector.! (xy1 + zIxBehind))
                    + coerce (v Vector.! (xy1 + zIx))
                    + coerce (v Vector.! (xy1 + zIxAhead))
                    + coerce (v Vector.! (xy2 + zIxBehind))
                    + coerce (v Vector.! (xy2 + zIx))
                    + coerce (v Vector.! (xy2 + zIxAhead))
                    + coerce (v Vector.! (xy3 + zIxBehind))
                    + coerce (v Vector.! (xy3 + zIx))
                    + coerce (v Vector.! (xy3 + zIxAhead))
                    + coerce (v Vector.! (xy4 + zIxBehind))
                    + coerce (v Vector.! (xy4 + zIx))
                    + coerce (v Vector.! (xy4 + zIxAhead))
                    + coerce (v Vector.! (xy5 + zIxBehind))
                    -- Skip xIx, yIx, zIx triple
                    + coerce (v Vector.! (xy5 + zIxAhead))
                    + coerce (v Vector.! (xy6 + zIxBehind))
                    + coerce (v Vector.! (xy6 + zIx))
                    + coerce (v Vector.! (xy6 + zIxAhead))
                    + coerce (v Vector.! (xy7 + zIxBehind))
                    + coerce (v Vector.! (xy7 + zIx))
                    + coerce (v Vector.! (xy7 + zIxAhead))
                    + coerce (v Vector.! (xy8 + zIxBehind))
                    + coerce (v Vector.! (xy8 + zIx))
                    + coerce (v Vector.! (xy8 + zIxAhead))
                    + coerce (v Vector.! (xy9 + zIxBehind))
                    + coerce (v Vector.! (xy9 + zIx))
                    + coerce (v Vector.! (xy9 + zIxAhead))
            in architectureRules living c

We can see that transitionArchitecture does something quite similar to transitionGOL, differing only in how neighbourhoods are determined and what rules to use. The way we calculate the neighbourhood indexes is also quite similar to transitionGOL, but we must take another dimension into account. Since we need to use the square of the size of any given dimension frequently, we calculate that value once and then reuse it. We similarly save repeated combinations of our first two dimensions (multiplied by appropriate constants) to save recalculating them.

Evaluating the vector solution

Clearly, vector gives us the tools to define both the Game of Life and Architecture cellular automata. However, the process has two major deficiencies: one relating to ease of implementation, the other to performance.

For our use case, and many others, we need higher-dimensional arrays (2D, 3D or higher). As vector has no concept of such arrays (only 'flat' ones), we have to handle the higher dimensionality ourselves. This is complex, fiddly and error-prone, especially once we get beyond two dimensions. Worse, if we wanted to provide APIs to the public that work with such arrays, we are again on our own: vector doesn't know of any such concept. This forces yet more work on us.

Furthermore, we are currently making zero use of parallel execution, despite our workload being very well-suited to it. As we are using imap, we only ever write to each index of our result once; furthermore, even though we read multiple indexes for each write, we never read a value that we wrote to (effectively, there are no read-after-write dependencies). This shows a lack of contention, which means in theory we can perform the work of applying our transition rules largely in parallel. However, vector doesn't give us any way to do this: imap is always applied serially.

To help us see exactly how much of a limitation this lack of parallelism is, as well as to see how our performance is more generally in practice, we will define and run some benchmarks. To ensure meaningful results, we will use a space of 224=16777216224=16777216 cells. For the 2D automaton, this amounts to a row and column size of 212=4096212=4096, while for the 3D automaton, this gives a size of 28=25628=256 for each dimension. We will measure the time it takes to complete one transitionGOL and one transitionArchitecture, and also the memory this requires.

Running these, we get the following:

All
  Vector, life:          OK
    238  ms ± 3.6 ms,  16 MB allocated, 3.9 KB copied, 4.6 GB peak memory
  Vector, architecture:  OK
    419  ms ± 7.1 ms,  16 MB allocated, 257 KB copied, 4.6 GB peak memory

We notice that the 3D case requires more time, even though it's working over the same number of cells. This is because each cell update must do more work: instead of reading an eight-cell neighbourhood, it must read a 26-cell neighbourhood. Given these results, we can expect to run about four generations a second with Game of Life, and about two generations a second with Architecture. Even if all we wanted to do was render the results, this is quite slow: if we tried to animate this, we would have significant lag and jerky motion.

We will now look at massiv and see if we can do better.

massiv, and how it is better

massiv is an array library that is designed to be more expressive and more performant than vector. Like vector, massiv supports both immutable and mutable arrays, performs array fusion, and has a range of array representations for various data. However, massiv is more capable than vector:

Higher dimensions are supported directly. 2D, 3D or even higher-dimensional arrays can be defined and manipulated using an explicit interface, rather than implicitly like for vector.
Explicit separation between 'manifest' arrays (ones that have already been computed) and 'delayed' arrays (which are not yet computed). This makes array fusion much clearer and easier to reason about.
Direct support for (task) parallel evaluation of arrays with a built-in scheduler. This can be controlled on a per-array and per-evaluation basis.
A capability-based API, similar to mtl, allowing powerful combinators that work across many different representations and array shapes.
Support for array stencilling (which we will discuss in more detail later).

To see each of these in action, consider the index function provided in Data.Massiv.Array:

index :: forall ix r e . (Index ix, Manifest r e) => Array r ix e -> ix -> Maybe e

To unpack this definition, we first look at Array, which is the type of arrays provided by massiv. It has three type parameters:

A representation, which describes how the array looks in memory and how it is computed. This is typically denoted by r.
An index type, which describes the array's dimensionality. This is typically denoted by ix.
An element type, which describes what the array is 'full of'. This is typically denoted by e.

We also see the capability-based form that massiv's functionality mostly takes. Here, we see that there are two constraints: Index ix, which states that ix is a valid index type, and Manifest r e, which says that the combination of representation and element type can be represented directly in memory and computed. In particular, this tells us that the Array argument cannot be 'delayed' (meaning, yet to be computed); thus, indexing as an operation requires us to first compute an array. Knowing all this, the signature basically states "given an array that has been computed already, and a valid index into that array, either return the element at that index, or Nothing if the index is out of bounds".

But what exactly do valid ix and r look like? We can see this by looking first at instances of the Index type class:

-- there are more than these, but these are the ones we're most interested in
-- implementations are omitted, some simplifications are used
instance Index Int

data Ix2 = !Int :. !Int

instance Index Ix2

data IxN (n :: Nat) = !Int :> !(IxN (n - 1))

instance Index IxN 3

instance (4 <= n, KnownNat n, KnownNat (n - 1), Index (Ix (n - 1))) => 
   Index (IxN n)

Essentially, massiv provides indexing over tuples of Ints of any size, together with some additional safety to ensure that we never try to use an index of the wrong dimensionality by forcing an Array to indicate what kind of index it can work with. If we look at the index function's type signature, we see the benefits of this: we need to give an array indexed by the type ix, along with a term of that same type.

For r, the situation is more complex, as we don't consider r standalone most of the time. For example, looking at the constraints for the index function from before, we see that it requires Manifest r e to hold. This is because a representation doesn't only define what an array will look like in memory but also its elements. Some representations directly mirror those of vector:

B corresponds to boxed arrays, which are full of pointers to their elements.
S corresponds to storable arrays, which have Storable instances for both themselves and their elements.
U corresponds to unboxed arrays.
P corresponds to primitive arrays, which are filled with elements whose type has a Prim instance.

However, other representations are unique to massiv. A good example is D, which represents an array that is delayed: instead of being a physical collection of elements in memory (a manifest array), a delayed array is a function from its indexes to its elements. This additional information allows massiv to be safer, more efficient, and clearer than vector can.

To see the advantages of this kind of design, let us examine a different function from Data.Massiv.Array:

imap :: forall r ix e a. (Index ix, Source r e) => (ix -> e -> a) -> Array r ix e -> Array D ix a

This is analogous to a similar function from vector. We will compare it to the imap provided by Data.Vector.Generic, as all other versions of imap from vector ultimately delegate to the implementation in Data.Vector.Generic:

imap :: (Vector v a, Vector v b) => (Int -> a -> b) -> v a -> v b

We can see these functions do essentially the same thing: given a function argument f taking an index and an element, and an array, produce another array of the same size (and in massiv's case, shape). This new array, at each index i of the array argument that held the value x, will have f i x at i instead. However, we can see two notable differences between the versions provided by vector and massiv:

The vector version produces a result array of the same type as the argument array (but with different contents); the massiv version produces a delayed array instead.
The vector version deals only with one-dimensional arrays, while the massiv version can deal with any shape or dimension of array.

The significance of the more general indexing is clear to see, as it avoids the need for multiple similar functions: ultimately, imap doesn't really need to specialize by shape or dimension. The significance of the delayed result of massiv is more subtle, but just as major. Consider the following combinations of operations:

-- from vector, compiles just fine
tryVector = (imap f v) ! 2

-- from massiv, will not compile
tryMassiv = (imap f v) `index` (Ix1 2)

This is where massiv potentially protects us from a performance hit. The reason behind this is that imap as an operation is subject to array fusion: specifically, it follows the same law as fmap:

imap f . imap g = imap (\i -> f i . g i)

Thus, it is best to delay computing the results of any such operation as long as possible, to avoid constructing intermediate arrays. Indexing, however, must work over an array that has been computed already. Unlike massiv, vector does not make it clear whether an array is manifest or delayed: thus, something like tryVector would cause the result of imap f v to be constructed, without us being informed. tryMassiv on the other hand does not compile. To see why this is, we note that Manifest D a is not an instance for any choice of a. This protects us from the implicit construction of an array: we instead have to do this explicitly if we really want to. In this sense, massiv is both more efficient and clearer in its intent.

We can see that massiv is far more suited than vector to our current task. As we need two and even three-dimensional indexing, which massiv supports natively, one of our two identified problems with the vector-based implementation is already solved. We will now see how, by choosing to implement our functionality with another massiv feature, we can solve the second problem too.

A second attempt

How would we implement our two cellular automata using massiv? We could replicate the method we used with vector and gain the benefits of first-class higher-rank indexing. However, massiv gives us a much more powerful tool for this: array stenciling. More properly known as Iterative Stencil Loops (or ISL), array stenciling can be thought of as a generalization of the (indexed) mapping operation over arrays. However, while a mapping operation can use only the element (and index) that is to be replaced to compute the new value, a stencil computation can use adjacent elements as well.

More precisely, let us examine the imap operation from massiv one more time:

imap :: forall r ix e a. (Index ix, Source r e) => (ix -> e -> a) -> Array r ix e -> Array D ix a

We can see that the function argument to imap is what determines what the resulting array will contain at each index. We can use the element at the same index from the original array, together with that index, to compute the new element. Stencil computations are similar, but instead of being given a single element, they are given a subarray of fixed size, also called a neighbourhood; the original element is a part of that subarray.

To better see how this works, we can look at the signatures for makeStencil and mapStencil from Data.Massiv.Array.Stencil:

makeStencil :: forall ix e a . (Index ix) => Sz ix -> ix -> ((ix -> e) -> a) -> Stencil ix e a

mapStencil :: forall ix r e a . (Index ix, Manifest r e) => Border e -> Stencil ix e a -> Array r ix e -> Array DW ix a

We can immediately see the similarity between mapStencil and other map functions. The difference is that, instead of a function argument, we provide a Stencil, as well as a Border, whose function we will describe shortly. To make a stencil, we need to supply the following:

A size (represented by Sz), which specifies the dimensions of the subarray that the stencil can use.
A 'centre' for the stencil. This is an index for an array of the size specified by the Sz parameter.
A function argument to produce the result. This gets to use a 'getter' function, which, given a relative index into a subarray of the specified size, retrieves the element at that position.

We also note the DW representation: this stands for 'delayed windowed', and it exists specifically to serve the needs of stenciling. Its differences from D (the standard delayed representation) aren't hugely important for our purposes.

We need to discuss what we mean by relative indexing here. The centre of the stencil represents the element of the original array that will be replaced by the stencil computation's result: inside the function argument, this is considered to be the index with 0 in all positions. All other indexes used in the function argument (as arguments to the 'getter') must be written as relative to the centre (and thus, 0 in all positions), which can sometimes lead to negative indexes. To show what this means in practice, suppose we defined a stencil as follows:

makeStencil (Sz1 3) 1 $ \get -> f

This stencil effectively operates over a three-element sliding window, centred at the element to be modified. The valid indexes to get inside f would be any of the following:

-1 (left of the centre)
0 (the centre)
1 (right of centre)

Had we defined the stencil like this instead:

makeStencil (Sz1 3) 0 $ \get -> f

Then the sliding window would not be centred on the element to be modified. Instead, we would have a two-element 'lookahead', giving the following valid indexes to get inside f:

0 (the 'centre')
1 (look one element ahead)
2 (look two elements ahead)

However, this leads to a natural question: what happens to elements on the 'edges' of an array? In our vector-based definitions, we needed to be careful to take this into account when calculating neighbourhoods for our cells; do we have to do the same in our stencils? The answer is 'no': stencils don't need to concern themselves with how they are provided their subarrays, merely that they are, and that any (relative) index into those subarrays will be valid. So, how can we ensure that we don't run into issues with array 'edges'?

massiv chooses to defer this problem to the point where the stencil is applied, rather than defined. This is where the Border argument to mapStencil comes in. Effectively, Border specifies what we want to do about the 'edges' of an array. There are several options:

Fill, which substitutes every element 'off the edge' with a fixed value
Wrap, which performs 'wraparound' behaviour (just like we used in our vector-based definitions from before)
Edge, which duplicates the element 'on the edge' to fill any gaps
Reflect, which mirrors, rather than 'wraps around', the array
Continue, which is like Reflect, but doesn't duplicate the element 'on the edge'

This separation of concerns, together with 'pre-baked' functionality for handling 'edges', is convenient, as we can now define stencils without worrying about handling 'off the edge' indexes, while also being able to run the same stencils in different ways as required.

Putting all of this together, we can define the Game of Life using the following stencil!

lifeStencil :: Stencil Ix2 CellState CellState
lifeStencil = makeStencil (Sz2 3 3) (1 :. 1) go
 where
   go :: (Ix2 -> CellState) -> CellState
   go get =
     let centre = get (0 :. 0)
         living :: Word8 =
           coerce (get ((-1) :. -1))
             + coerce (get ((-1) :. 0))
             + coerce (get ((-1) :. 1))
             + coerce (get (0 :. -1))
             -- skip (0 :. 0)
             + coerce (get (0 :. 1))
             + coerce (get (1 :. -1))
             + coerce (get (1 :. 0))
             + coerce (get (1 :. 1))
      in gameOfLifeRules living centre

We can see that this is almost a direct translation of the Game of Life rules. The stencil specifies a 3 by 3 working area, with the center being put directly in the middle. We then only need to collect all the neighbours (by summing them), before calling the gameOfLifeRules function. We can see relative indexing in full effect here: to get the centre, we call get (0 :. 0), to get the element 'to the right' of the centre, we call get (0 :. 1), and so on.

We can do something very similar with Architecture: the main difference being the need to work in three dimensions instead of two:

architectureStencil :: Stencil (IxN 3) CellState CellState
architectureStencil = makeStencil (Sz3 3 3 3) (Ix3 1 1 1) go
 where
   go :: (IxN 3 -> CellState) -> CellState
   go get =
     let !centre = get (Ix3 0 0 0)
         living :: Word8 =
           coerce (get (Ix3 (-1) (-1) (-1)))
             + coerce (get (Ix3 (-1) (-1) 0))
             + coerce (get (Ix3 (-1) (-1) 1))
             + coerce (get (Ix3 (-1) 0 (-1)))
             + coerce (get (Ix3 (-1) 0 0))
             + coerce (get (Ix3 (-1) 0 1))
             + coerce (get (Ix3 (-1) 1 (-1)))
             + coerce (get (Ix3 (-1) 1 0))
             + coerce (get (Ix3 (-1) 1 1))
             + coerce (get (Ix3 0 (-1) (-1)))
             + coerce (get (Ix3 0 (-1) 0))
             + coerce (get (Ix3 0 (-1) 1))
             + coerce (get (Ix3 0 0 (-1)))
             -- skip (Ix3 0 0 0)
             + coerce (get (Ix3 0 0 1))
             + coerce (get (Ix3 0 1 (-1)))
             + coerce (get (Ix3 0 1 0))
             + coerce (get (Ix3 0 1 1))
             + coerce (get (Ix3 1 (-1) (-1)))
             + coerce (get (Ix3 1 (-1) 0))
             + coerce (get (Ix3 1 (-1) 1))
             + coerce (get (Ix3 1 0 (-1)))
             + coerce (get (Ix3 1 0 0))
             + coerce (get (Ix3 1 0 1))
             + coerce (get (Ix3 1 1 (-1)))
             + coerce (get (Ix3 1 1 0))
             + coerce (get (Ix3 1 1 1))
      in architectureRules living centre

If we compare these to the definitions we needed to write for transitionGOL and transitionArchitecture, we can see that the massiv versions separate responsibilities much better. Whereas transitionGOL and transitionArchitecture had to deal with converting higher-dimensional indexes into one-dimensional indexes, our stencil code doesn't need to concern itself with this due to massiv's direct support of higher-dimensional indexing. Furthermore, our stencil code doesn't deal with index wraparound at all: we can simply assume that the get function behaves correctly relative to the subarray the stencil can 'see' without needing to worry about how that subarray is provided. This makes writing stencil code much less error-prone and much more declarative. too.

To run these stencils, we will use mapStencil with the Wrap border strategy. We can see that mapStencil produces a delayed array, just like massiv's version of imap. This allows us to make use of massiv's inherent support of parallelism. As mentioned before, delayed arrays are not really arrays in the traditional sense: instead of being physical collections of elements in memory, they are functions from indexes to elements. To produce a manifest array from a delayed one, we need to evaluate it. massiv has a range of options for doing this, but the one we will consider is computeP:

computeP :: forall r ix e r'. (Manifest r e, Load r' ix e) => Array r' ix e -> Array r ix e

We can see that the result of computeP must be a manifest array of the same dimensions and elements as the input array. However, the input array does not have to be manifest; instead, its representation is constrained by Load. Load essentially specifies that this combination of representation, index and element can be computed to produce a manifest array. Of particular interest to us is the following Load instance:

-- Definition of the instance omitted
instance (Index ix) => Load DW ix e

Thus, we can use computeP to transform the result of applying our stencil into a manifest array. Furthermore, computeP attempts to be as parallel as possible (hence, the P) by making use of as many cores as possible. We can control the degree of core utilization if we wish, but we won't consider this here. Thus, our execution pipeline will look something like this:

-- And analogously for Architecture
computeP . mapStencil Wrap lifeStencil $ originalState

This once again shows massiv's combination of convenience and performance: we can get parallelism when we want it by calling a single function, without having to consider (too much) exactly what code we give to it. While this doesn't necessarily make parallel workloads trivial, it does significantly reduce the tedium involved in parallelizing.

Comparing the performance

How do our massiv-based solutions compare to their vector-based equivalents? While we have suitable benchmarks to compare with, we have to be somewhat careful, as parallel execution is easy to mis-measure. Furthermore, before we do any measuring, to get the most performance possible, we want to convert both of our stencils to use makeUnsafeStencil. This conversion will speed up execution by not performing boundary checks, and is straightforward:

lifeStencil :: Stencil Ix2 CellState CellState
lifeStencil = makeUnsafeStencil (Sz2 3 3) (1 :. 1) (const go)
-- go is the same as before

-- architectureStencil will be the same

Furthermore, we must ensure that we compile with parallelism enabled, which means passing

-threaded -rtsopts -with-rtsopts=-N to our build. It is also a good idea to pass -O2 as well, as many optimizations that benefit computations over arrays only really happen at -O2 with any degree of consistency. We need to do this for our benchmarking code as well: this is unusual, as typically benchmarks are run without parallelism of this kind enabled, for reasons of accuracy5. Lastly, as we are using tasty-bench, we have to pass --time-mode=wall as an argument. Without this argument, instead of seeing the time taken to complete the benchmark, we will see the sum time taken by all cores: this will mean that, on a six-core speedup, we will see six times the runtime!

Taking all this into consideration, when running the benchmarks for our massiv-based implementations, we get the following:

All
  Vector, life:          OK
    242  ms ± 4.5 ms,  16 MB allocated, 5.2 KB copied, 4.2 GB peak memory
  Stencil, life:         OK
    11.2 ms ± 213 μs,  16 MB allocated, 4.4 KB copied, 4.2 GB peak memory
  Vector, architecture:  OK
    418  ms ± 5.2 ms,  16 MB allocated, 257 KB copied, 4.2 GB peak memory
  Stencil, architecture: OK
    39.0 ms ± 476 μs,  16 MB allocated, 4.8 KB copied, 4.2 GB peak memory

This is a huge improvement. For the Game of Life, we're at about 21x, while for Architecture, we're at just shy of 11x. Notably, we're using the same amount of memory, and do less copy-through, using massiv and stencils. Coming back to our rendering example, the stencil-based Game of Life could now theoretically perform about 90 updates per second, while the stencil-based Architecture would manage something closer to 25. We would have no trouble animating Game of Life at this scale at 30, or even 60, frames per second, and while Architecture wouldn't be fully smooth even at 30 frames per second, it would at least look reasonable.

It is also interesting to see exactly how much of our parallel capabilities we are utilizing. The machine these benchmarks were run on has 16 physical cores and 32 hyperthreads, thus giving us theoretically as much as a 32-fold speedup if we use all our resources. To see this, we can re-run our benchmarks without setting --time-mode=wall. This will show us the sum of all the time taken on all CPUs; we can then compare it to our previous benchmark result to see how much parallel speedup we are seeing.

All
  Vector, life:          OK
    241  ms ± 2.3 ms,  16 MB allocated, 3.8 KB copied, 4.5 GB peak memory
  Stencil, life:         OK
    215  ms ± 1.6 ms,  16 MB allocated, 4.2 KB copied, 4.5 GB peak memory
  Vector, architecture:  OK
    415  ms ± 6.0 ms,  16 MB allocated, 257 KB copied, 4.2 GB peak memory
  Stencil, architecture: OK
    803  ms ±  10 ms,  16 MB allocated, 7.4 KB copied, 4.2 GB peak memory

This suggests that our parallel speedup is about 20 for both stenciled computations. While this technically doesn't saturate our capabilities, this is still extremely good. Both of these results show massiv's strengths when it comes to performance: we had to do essentially zero work to gain these benefits above just making use of massiv. When combined with the added declarativity of stencils, this gives us a double improvement: clearer code and faster code.

Conclusion

We have seen how massiv can make our code both clearer and faster, especially when we have to work with multi-dimensional arrays. massiv's direct support of higher-dimensional indexing, use of the type system to help avoid accidentally materializing intermediate arrays, easy parallelism, and tools like stenciling, make massiv a powerful solution for anything requiring working with arrays. While it is more complex than vector, a lot of this complexity is either necessary to deal with its expanded range of concerns, or is actually a beneficial source of information to us about how our computations will run in practice.

Our examples of cellular automata, while contrived, show exactly how much of a benefit using massiv can be over other array-oriented packages like vector. We showed that not only are cellular automata computations easier to define and clearer in their intent, using massiv, but that we can achieve large speedups with next-to-no effort beyond just using massiv. These benefits extend far beyond cellular automata examples, however. At MLabs, we used massiv to solve a performance bottleneck in client code with almost no friction, and this solution is still in use to this day. We chose to use massiv because we saw that parallelism could address this problem, and massiv offered us the easiest path to parallelizing our work.

As we mentioned at the beginning, the use of temporal and spatial locality, together with parallelism, is how we can obtain performance in 2025. This statement is not limited to Haskell, but in Haskell, we have been somewhat slow adopters of both. massiv as a library makes using arrays easy, convenient and fun, which allows us to benefit from spatial and temporal locality while enjoying the power of Haskell. At the same time, massiv also gives us parallelism with ease. With massiv, getting performance in Haskell is easier than ever: give it a try!

If you like what you’re reading and this sounds relevant to your business, please don’t hesitate to get in touch.

As we discussed in a previous article on ASCII validation.
There are many other possible representation choices here, but we
use row-major ordering for two reasons. Firstly, it is the same as what
massiv uses, which makes our results more comparable. Secondly,
row-major ordering is one of the simpler choices from the point of view of index
arithmetic, which makes our presentation easier to follow.
We've observed that even at -O2, GHC refuses to do
this, costing hugely in both time and space.
Which it isn't, even at -O2.
tasty-bench, our benchmarking framework of choice,
specifically says to avoid -threaded in its documentation for this
reason exactly.

HaskellPerformanceCellular AutomataStencil ComputationBenchmarkingmassivParallelism

Koz Ross

Our Performance is massiv: Getting the Most Out of Your Hardware in Haskell

First Know Thy Self: Understanding String Matching Performance With ASCII