Performance, pull arrays and Plut{arch, us}
Introduction
Many of our previous articles have discussed the capabilities, and performance advantages, of spatially local structures. Arrays in particular have been given significant attention: we have often discussed how they enable us to extract performance in ways that more 'functionally-oriented' structures simply cannot. While most of our examples have used Haskell, we have also demonstrated some packed structure techniques in the context of Plutus. Even in this situation, we were able to show algorithmic improvements.
Unfortunately, until recently, the viability of such techniques onchain has been extremely limited, due to the lack of an equivalent to arrays. In particular, the lack of constant-time indexing[1] makes many techniques over spatially local structures completely unviable onchain. Given their many benefits, and the extreme constraints imposed on scripts in terms of resources, this is a deficiency that was often felt, but seldom addressed. Recently, thanks to the implementation of CIP-138 in Plutus Core, we have finally gained the ability to work with a polymorphic packed structure. However, the API provided by CIP-138 was woefully inadequate to address the needs of applications that would benefit from the kinds of techniques we have discussed.
The goal of Plutarch as a project has always been performance, and we believe that performance should not only be achievable, but convenient too. We thus decided to design an API over CIP-138, using well-known functional programming techniques, which is both more convenient than raw CIP-138 and far more performant than the use of intermediate lists. At the time of publication, this is a first-in-class solution to the problem of working with arrays onchain: no such API is provided by any other solution for writing onchain scripts. We conducted this work as part of our maintenance and enhancement proposal funding for Plutarch's improvements.
In this article, we will discuss CIP-138 and its limitations, before talking about pull arrays: the technique we chose to use for Plutarch's new array API. After describing pull arrays in a more familiar Haskell context, we will describe how we implemented them in Plutarch, by way of an efficient encoding, showing both the representation we used and the operations over it. Lastly, we will demonstrate the performance improvements we were able to obtain, as well as discussing why we made the implementation choices we did, and what limitations these bring.
To enjoy this article fully, some familiarity with Plutarch is helpful. However, anyone with an understanding of the Cardano chain and working with UPLC in that context will find something to enjoy here. If you would like to see this code, as well as the benchmarks, you can find them in the treasury-milestone-2 branch of Plutarch.
The Plutus Core story: CIP-138
Until relatively recently, the Plutus default universe had no support for any polymorphic spatially local type. In particular, there was no linear collection type that supported constant-time indexing: the builtin list, being a singly-linked list, required Θ(n) time for indexing operations. CIP-138 aimed to address this limitation. Specifically, CIP-138 proposed the following additions to Plutus Core:
A new builtin array type in the default universe;
A way of converting a builtin list into a builtin array with the same contents;
A constant-time indexing function, using Integer indexes; and
A length accessor for builtin arrays.
CIP-138 has been implemented in Plutus Core since version 1.42.0.0. While this technically means that we now have a spatially local data structure with constant-time indexing available to use, the API provided by CIP-138 leaves much to be desired in both usability and performance:
Builtin arrays are not Data encoded, which means there is no way to pass them in a datum (or similar). You must use lists, and pay the conversion penalty each time.
Constructing arrays programmatically onchain involves constructing, then converting, a list. Aside from being incredibly tedious, it's also quite inefficient.
No array transformations are possible without an intermediate list. This wastes time and space: not only must we re-build on every change, we effectively have to do it twice, which means we pay the time penalty of reconstruction and the space penalty of an added intermediate that we throw away immediately.
There are no builtin provisions for slicing, permuting, or in general any form of subarray construction. While it is possible to build these atop what CIP-138 provides, this involves a non-trivial penalty: at minimum, we have to track two additional Integers, and keep them current. This doesn't seem like much, but in the space-constrained environment onchain, these costs add up fast.
However, these concerns are all relatively minor compared to the central issue with CIP-138: it completely fails to address, or even note, the difficulties that a pure functional approach has with arrays, or any spatially local structure. The core of the problem stems from the following series of observations:
True constant-time indexing requires a spatially local representation; that is, for us to be able to access elements by index quickly, they must be adjacent in memory.
Modifying a spatially local data structure requires either copying or in-place mutation; functional data structures must sacrifice spatial locality to reduce the general overhead of copying.
In a pure functional language, mutation must either be explicit and referentially transparent (as in the ST monad) or must be entirely invisible (as with functional data structures).
This leads to a trifecta of capabilities, from which we can only choose two:
1. Spatial locality (and thus, true constant-time indexing).
2. Non-linear-copying modification.
3. No visible mutability.
To illustrate these choices, we examine some solutions across different functional languages:
Vectors from the Haskell vector library choose 1 and 3, while trying to use stream fusion to elide as many copies as possible[7].
Vectors from Clojure choose 2 and 3, by relying on HAMTs, which aren't spatially local.
Vectors from R7RS Scheme choose 1 and 2, essentially providing a C-like API where mutation can happen at any time and without warning.
Thanks to this situation, a whole host of techniques for working with spatially local structures (especially arrays) in a functional manner have been developed. These have a range of capabilities and tradeoffs, and while none can be seen as the solution to this problem, in any given situation, one or more of them can be seen as optimal. In this context, the API provided by CIP-138 is lacklustre: the only capability from the trifecta we gain is 1. Modifications to builtin arrays are impossible under its semantics: we can only convert an array to a list (slowly), modify the list (making yet another intermediate copy), then convert back to an array again. This makes 2 an impossibility, while 3 is basically given to us by the chain anyway, as there is no mutability there as such, visible or not[8]. This is a choice that satisfies nobody, significantly lacking in both capabilities and performance, and appears to either disregard, or lack awareness of, existing solutions.
A different perspective: pull arrays
One solution to the trifecta described above is pull arrays. Intuitively, these are a combination of deferral of construction and the builder pattern, both of which are common in functional programming. By this combination, pull arrays are able to automatically eliminate intermediate structures, particularly when performing linear transformations on arrays (that is, modifications where each element of the array is modified exactly once, with none dropped or duplicated). Pull arrays are widely used in functional programming in general, and Haskell in particular: both vector and massiv make use of pull arrays, with massiv opting to be explicit, while vector uses them 'behind the scenes'.
More precisely, pull arrays begin by separating manifest arrays (which actually exist as blocks of spatially local elements in index order) from delayed arrays (which do not). Whereas classically, arrays are manifested as soon as they are introduced, under the pull array approach, arrays are manifested only at the time they are eliminated; when introducing, or transforming, arrays are delayed. This is similar to how builders are used in libraries such as text and bytestring: we first (efficiently) perform modifications in the builder, then 'run' the builder to produce a final result that's easy to query.
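To make the builder analogy concrete, here is a minimal sketch using bytestring's Builder; the renderInts helper is our own illustration, not part of any library. Each modification is a cheap Builder append, and the single expensive step is 'running' the builder with toLazyByteString.

```haskell
-- Sketch of the builder pattern: cheap appends accumulate in the
-- Builder, and we pay the cost of producing a real ByteString only
-- once, when we 'run' it with toLazyByteString.
import Data.ByteString.Builder (Builder, charUtf8, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BL

-- Render a list of Ints as a comma-separated ByteString. No
-- intermediate ByteStrings are allocated along the way.
renderInts :: [Int] -> BL.ByteString
renderInts = toLazyByteString . go
  where
    go :: [Int] -> Builder
    go [] = mempty
    go [x] = intDec x
    go (x : xs) = intDec x <> charUtf8 ',' <> go xs

main :: IO ()
main = print (renderInts [1, 2, 3])  -- "1,2,3"
```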
To see this more clearly, consider the generate function from the vector library:
generate :: forall a . Int -> (Int -> a) -> Vector a
In a conventional view, this would take a length len, and a function f from indexes to elements (an index producer), and construct the array where, at any index i from 0 to len - 1, the element is f i. Under pull arrays, this would instead produce a delayed array, which remembers len and f:
data PullArray (a :: Type) = PullArray Int (Int -> a)
The advantage of this representation is twofold:
On any modification, instead of reconstructing a (potentially large) manifest array, we need only rebuild a relatively small structure; and
We benefit from fusion, eliminating intermediates automatically.
To see the second phenomenon in effect, consider the Functor instance for PullArray:
instance Functor PullArray where
  fmap f (PullArray len g) = PullArray len (f . g)
We can see that the new PullArray 'absorbs' f into its index producer. Thus, if we have a large chain of fmaps, all of their function arguments will combine into a single function, which will be called when manifesting. This happens without users of fmap having to do anything. This property extends to many other operations, including, but not limited to, zipping and slicing.
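As a runnable illustration of this fusion, here is a sketch using the PullArray definition above, with a hypothetical toList eliminator standing in for manifestation:

```haskell
-- The direct pull-array encoding: a length and an index producer.
data PullArray a = PullArray Int (Int -> a)

instance Functor PullArray where
  fmap f (PullArray len g) = PullArray len (f . g)

-- Introduce a delayed array from a length and an index producer.
generate :: Int -> (Int -> a) -> PullArray a
generate = PullArray

-- Manifest the delayed array (as a list, for simplicity).
toList :: PullArray a -> [a]
toList (PullArray len f) = map f [0 .. len - 1]

main :: IO ()
main = do
  -- Three fmaps, but their functions fuse into one index producer:
  -- only a single traversal happens, with no intermediate arrays.
  let arr = fmap (* 2) . fmap (+ 1) . fmap (* 3) $ generate 5 id
  print (toList arr)  -- [2,8,14,20,26]
```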
One minor inefficiency of the implementation of PullArray above is that every modification requires reconstruction of the length field. While this is not a large cost, it is superfluous, especially since we can defer the length to manifestation time as well. We can achieve this by using the Boehm-Berrarducci encoding of PullArray:
newtype PullArrayBB (a :: Type) = PullArrayBB
  (forall (r :: Type) . (Int -> (Int -> a) -> r) -> r)
Intuitively, the Boehm-Berrarducci encoding for a type is a form of inversion of control. When we have a PullArray argument to a function, the function's body gets direct access to both the length and the index producer. However, with a PullArrayBB argument, the function's body instead must provide an 'eliminator', which takes the length and index producer as arguments to produce whatever result the caller likes. While somewhat convoluted, this allows us to eliminate all intermediates, including the length of delayed arrays, no matter how large the number of modifications.
Furthermore, we don't lose any capabilities with this translation: PullArray and PullArrayBB both represent the same information. To see why this is, we can define two sides of an isomorphism between PullArray and PullArrayBB:
toBB :: forall (a :: Type) . PullArray a -> PullArrayBB a
toBB (PullArray len f) = PullArrayBB $ \ k -> k len f
fromBB :: forall (a :: Type) . PullArrayBB a -> PullArray a
fromBB (PullArrayBB k) = k (\len f -> PullArray len f)
We can use equational reasoning to show that fromBB . toBB = id:
1. fromBB . toBB $ x -- initial
2. fromBB . (\(PullArray len f) -> PullArrayBB $ \k -> k len f) $ x -- definition of toBB
3. (\(PullArrayBB k') -> k' (\len' f' -> PullArray len' f')) . (\(PullArray len f) -> PullArrayBB $ \k -> k len f) $ x -- definition of fromBB
4. (\(PullArrayBB k') -> k' (\len' f' -> PullArray len' f')) ((\(PullArray len f) -> PullArrayBB $ \k -> k len f) x) -- definition of (.) and ($)
5. (\(PullArray len f) -> (\k -> k len f) (\len' f' -> PullArray len' f')) x -- eliminate intermediate PullArrayBB newtype
6. (\(PullArray len f) -> PullArray len f) x -- apply lambda
7. x -- apply lambda
We can use a similar approach to show that toBB . fromBB = id. As a concrete example of how we can define the same capabilities for PullArrayBB, we demonstrate its Functor instance:
instance Functor PullArrayBB where
  fmap f (PullArrayBB g) = PullArrayBB $ \ k -> g $ \ len h -> k len (f . h)
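Putting the pieces together, here is a small runnable sketch of the Boehm-Berrarducci encoding with this Functor instance; generateBB and toListBB are hypothetical helpers for introduction and manifestation, not names from any library.

```haskell
{-# LANGUAGE RankNTypes #-}

-- The Boehm-Berrarducci-encoded pull array: everything, including the
-- length, lives behind the stored continuation until elimination.
newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

instance Functor PullArrayBB where
  fmap f (PullArrayBB g) = PullArrayBB $ \k -> g $ \len h -> k len (f . h)

-- Introduce a delayed array; note that no length field is stored
-- separately, only the continuation.
generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

-- Manifest by handing an eliminator to the stored continuation.
toListBB :: PullArrayBB a -> [a]
toListBB (PullArrayBB k) = k (\len f -> map f [0 .. len - 1])

main :: IO ()
main = print (toListBB (fmap (+ 10) (generateBB 4 id)))  -- [10,11,12,13]
```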
The Plutarch story: pull arrays
We implemented pull arrays in Plutarch using the Boehm-Berrarducci encoding described in the previous section. Due to limitations with GHC's type system, we had to resort to an unusual helper from Plutarch, PForall:
type PForall :: (a -> S -> Type) -> S -> Type
newtype PForall (b :: a -> S -> Type) (s :: S) = PForall (forall (x :: a) . Term s (b x))
Intuitively, PForall takes a type argument representing a 'Term shape', which the right-hand side of PForall traps behind a quantifier. The 'Term shape' must have the kind a -> S -> Type, which allows arbitrary Term structure inside. This allows us to define PPullArray. For convenience, we provide a newtype for our 'Term shape' as well:
newtype PArrOf (a :: S -> Type) (r :: S -> Type) (s :: S)
  = PArrOf ((:-->) (PNatural :--> (PInteger :--> a) :--> r) r s)
newtype PPullArray (a :: S -> Type) (s :: S)
  = PPullArray (PForall (PArrOf a) s)
If we unravel and substitute, we see that what we end up with is effectively the following:
-- This is not real Plutarch: won't compile!
newtype PPullArray (a :: S -> Type) (s :: S) =
  PPullArray (forall (r :: S -> Type) . Term s ((PNatural :--> (PInteger :--> a) :--> r) :--> r))
We use PNatural for the length of the array, as this means we can never have a negative value there. We keep PInteger as the type of array indexes for user convenience. This corresponds to (the onchain equivalent of) the Boehm-Berrarducci encoding of a pull array we showed previously: thus, every time you see PForall (PArrOf a) s, you can imagine that it is secretly forall r . Term s ((PNatural :--> (PInteger :--> a) :--> r) :--> r). We will often use punsafeCoerce to move between these two types, and will note this every time we see it.
Before we can go much further, we have to make PPullArray an instance of PlutusType. Unfortunately, the built-in derivation mechanisms fail in the presence of PForall[2], so we have to define the PlutusType instance manually. Luckily, this is straightforward, if a bit mechanical:
instance PlutusType (PPullArray a) where
  type PInner (PPullArray a) = PForall (PArrOf a)
  pcon' (PPullArray t) = pcon t
  pmatch' t f = pmatch t $ \ t' -> f (PPullArray t')
If we squint, we can see that what we're really doing is 'borrowing' the PlutusType instance of :--> with extra steps. This is required both to deal with the PPullArray newtype wrapper and with PForall's higher-rank polymorphism. Indeed, it is informative to see how this compares to the PlutusType instance for :-->:
instance PlutusType (a :--> b) where
  type PInner (a :--> b) = a :--> b
  pcon' (PLam f) = plam' f
  pmatch' f g = plet f $ \ f' -> g (PLam (f' #))
We can see that the only real difference between these two instances is the use of pcon and pmatch versus plam' and plet to deal with the PForall wrapper. Structurally, these instances are basically identical.
Introducing PPullArrays
The most basic introducer we could want is the Plutarch-level equivalent to generate:
pgenerate ::
  forall (a :: S -> Type) (s :: S).
  Term s PNatural ->
  Term s (PInteger :--> a) ->
  Term s (PPullArray a)
pgenerate len f = pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
  k # len # f
Barring the various newtype constructors and the punsafeCoerce to turn the Plutarch lambda into a PArrOf, this looks identical to the Boehm-Berrarducci version of generate. We can also easily 'lift' builtin arrays to PPullArrays:
pfromArray ::
  forall (a :: S -> Type) (s :: S).
  Term s (PArray a) ->
  Term s (PPullArray a)
pfromArray arr = pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
  k # punsafeCoerce (plengthOfArray # arr) # (pindexArray # arr)
There are two uses of punsafeCoerce here: the first to transform our newly built Plutarch lambda into a PArrOf, the second to zero-cost convert the PInteger we get from plengthOfArray to a PNatural. Both of these are safe: PArrOf is just a newtype with no representation change, while the result of plengthOfArray cannot be negative, and is only a PInteger because Plutus builtins must produce types in the Plutus default universe.
We can see that both of these introducers are effectively Θ(1) operations: we do not pay linear construction costs. This is consistent with how pull arrays would work offchain as well, as this cost is deferred to elimination time.
Transforming PPullArrays
The most basic transformation on PPullArray is the equivalent of fmap on offchain pull arrays:
pmapArray ::
  forall (a :: S -> Type) (b :: S -> Type) (s :: S).
  Term s (a :--> b) ->
  Term s (PPullArray a) ->
  Term s (PPullArray b)
pmapArray f arr = pmatch arr $ \ (PPullArray (PForall g)) ->
  pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
    punsafeCoerce g # plam (\len h -> k # len # pcompose h f)

-- Helper
pcompose ::
  forall (a :: S -> Type) (b :: S -> Type) (c :: S -> Type) (s :: S).
  Term s (a :--> b) ->
  Term s (b :--> c) ->
  Term s (a :--> c)
pcompose f g = plam $ \ x -> g #$ f # x
Once again, the structure of pmapArray mirrors the way we wrote fmap for the Boehm-Berrarducci-encoded offchain pull array. As there isn't a built-in Plutarch equivalent to ., we use a helper: while it changes the order of the arguments relative to the 'typical' order for ., the outcome is the same. We also have to make use of two punsafeCoerces here: the first to convert a Plutarch lambda into a PArrOf, the other to convert a PArrOf back into a Plutarch lambda.
We can also define versions of take and drop, which, while not linear transformations, are convenient and efficient with pull arrays. Together, these can be used to make arbitrary slices.
ptakeArray ::
  forall (a :: S -> Type) (s :: S).
  Term s PNatural ->
  Term s (PPullArray a) ->
  Term s (PPullArray a)
ptakeArray limit arr = pmatch arr $ \ (PPullArray (PForall f)) ->
  pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
    punsafeCoerce f # plam (\len g -> k # pmin limit len # g)

pdropArray ::
  forall (a :: S -> Type) (s :: S).
  Term s PNatural ->
  Term s (PPullArray a) ->
  Term s (PPullArray a)
pdropArray dropped arr = pmatch arr $ \ (PPullArray (PForall f)) ->
  pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
    punsafeCoerce f # plam (\len g -> k # pdoz len dropped # plam (\ix -> g # (ix #+ pupcast @PInteger dropped)))

-- Helpers
pdoz :: forall (s :: S) . Term s PNatural -> Term s PNatural -> Term s PNatural
pdoz x y = plet (pupcast @PInteger x #- pupcast y) $ \ result ->
  pif
    (result #<= (-1))
    pzero
    (punsafeCoerce result)

pmin ::
  forall (a :: S -> Type) (s :: S).
  POrd a =>
  Term s a ->
  Term s a ->
  Term s a
pmin x y = pif (x #<= y) x y
These show the advantage of Boehm-Berrarducci encodings. Both ptakeArray and pdropArray require modifying the length of an array. If we used the direct encoding of pull arrays, this would require additional memory for the modified lengths; with the Boehm-Berrarducci encoding, this cost is paid once, at elimination time. While it seems like a minor benefit, the tight memory constraints onchain force us to find all the efficiency we can.
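The offchain analogues make the mechanics easier to see. The following sketch uses hypothetical takeBB and dropBB helpers mirroring ptakeArray and pdropArray: the length adjustment and index offset happen entirely inside the eliminator, so no intermediate length is ever stored.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

toListBB :: PullArrayBB a -> [a]
toListBB (PullArrayBB k) = k (\len f -> map f [0 .. len - 1])

-- Keep at most the first 'limit' elements: only the length changes.
takeBB :: Int -> PullArrayBB a -> PullArrayBB a
takeBB limit (PullArrayBB g) = PullArrayBB $ \k ->
  g $ \len f -> k (min limit len) f

-- Skip the first 'dropped' elements: the length shrinks (never below
-- zero, as with pdoz) and the index producer is offset.
dropBB :: Int -> PullArrayBB a -> PullArrayBB a
dropBB dropped (PullArrayBB g) = PullArrayBB $ \k ->
  g $ \len f -> k (max 0 (len - dropped)) (\ix -> f (ix + dropped))

main :: IO ()
main = do
  let arr = generateBB 6 (* 10)               -- [0,10,20,30,40,50]
  print (toListBB (takeBB 3 arr))             -- [0,10,20]
  print (toListBB (dropBB 4 arr))             -- [40,50]
  print (toListBB (takeBB 2 (dropBB 1 arr)))  -- an arbitrary slice: [10,20]
```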
We can also define the equivalent of zipWith:
pzipWithArray ::
  forall (a :: S -> Type) (b :: S -> Type) (c :: S -> Type) (s :: S).
  Term s (a :--> b :--> c) ->
  Term s (PPullArray a) ->
  Term s (PPullArray b) ->
  Term s (PPullArray c)
pzipWithArray f arr1 arr2 = pmatch arr1 $ \ (PPullArray (PForall g1)) ->
  pmatch arr2 $ \ (PPullArray (PForall g2)) ->
    pcon . PPullArray . PForall $ punsafeCoerce $ plam $ \ k ->
      punsafeCoerce g1
        # plam
          ( \len1 h1 ->
              punsafeCoerce g2
                # plam
                  ( \len2 h2 ->
                      k # pmin @PNatural len1 len2 # pliftIndex f h1 h2
                  )
          )

-- Helper
pliftIndex ::
  forall (a :: S -> Type) (b :: S -> Type) (c :: S -> Type) (s :: S).
  Term s (a :--> b :--> c) ->
  Term s (PInteger :--> a) ->
  Term s (PInteger :--> b) ->
  Term s (PInteger :--> c)
pliftIndex f g1 g2 = plam $ \ i -> f # (g1 # i) # (g2 # i)
This definition again demonstrates the benefits of the Boehm-Berrarducci encoding, as the two intermediate lengths of the input pull arrays to pzipWithArray can be eliminated.
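An offchain sketch makes the shape of this operation clearer; zipWithBB below is a hypothetical analogue of pzipWithArray, with generateBB and toListBB as illustrative helpers. Both input lengths are consumed inside the eliminator, and the result's index producer simply reads both inputs at the same index.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

toListBB :: PullArrayBB a -> [a]
toListBB (PullArrayBB k) = k (\len f -> map f [0 .. len - 1])

zipWithBB :: (a -> b -> c) -> PullArrayBB a -> PullArrayBB b -> PullArrayBB c
zipWithBB f (PullArrayBB g1) (PullArrayBB g2) = PullArrayBB $ \k ->
  g1 $ \len1 h1 ->
    g2 $ \len2 h2 ->
      -- Truncate to the shorter input, as pzipWithArray does with pmin.
      k (min len1 len2) (\i -> f (h1 i) (h2 i))

main :: IO ()
main = print (toListBB (zipWithBB (+) (generateBB 3 id) (generateBB 5 (* 100))))
-- [0,101,202]
```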
Eliminating PPullArrays
As Boehm-Berrarducci forms are a final encoding, we can directly define a fold[3] as an eliminator:
pfoldlArray ::
  forall (a :: S -> Type) (b :: S -> Type) (s :: S).
  Term s (b :--> a :--> b) ->
  Term s b ->
  Term s (PPullArray a) ->
  Term s b
pfoldlArray f acc arr = pmatch arr $ \ (PPullArray (PForall g)) ->
  punsafeCoerce g # go
  where
    go :: Term s (PNatural :--> (PInteger :--> a) :--> b)
    go = plam $ \ len get ->
      phoistAcyclic (pfix # plam goInner) # f # get # pupcast len # 0 # acc
    goInner ::
      forall (s' :: S).
      Term s' ((b :--> a :--> b) :--> (PInteger :--> a) :--> PInteger :--> PInteger :--> b :--> b) ->
      Term s' (b :--> a :--> b) ->
      Term s' (PInteger :--> a) ->
      Term s' PInteger ->
      Term s' PInteger ->
      Term s' b ->
      Term s' b
    goInner self combine get limit currIx acc' =
      pif
        (currIx #== limit)
        acc'
        (self # combine # get # limit # (currIx + 1) #$ combine # acc' #$ get # currIx)
The intent of the function above is obscured a bit by Plutarch requiring explicit fixed points instead of recursion, as well as using a counter, rather than a lazy list as would be the norm in regular Haskell. However, if we look at the offchain equivalent against PullArrayBB, we see that intent more clearly:
pfoldlPullArrayBB :: forall (a :: Type) (b :: Type) .
  (b -> a -> b) ->
  b ->
  PullArrayBB a ->
  b
pfoldlPullArrayBB f acc (PullArrayBB k) = k go
  where
    go :: Int -> (Int -> a) -> b
    -- Note: we could use foldl' here, but don't for clarity
    go len get = goInner acc get 0 len
    goInner :: b -> (Int -> a) -> Int -> Int -> b
    goInner acc' get currIx limit =
      if currIx == limit
        then acc'
        else goInner (f acc' (get currIx)) get (currIx + 1) limit
Here, we see that we have to pay the linear cost of allocation deferred by the Boehm-Berrarducci form from the point of introduction. However, we can also see that the fold doesn't have to allocate an array specifically; instead, it uses the index producer, together with a 'combiner' function, over every valid index of the array that would have been produced, as given by the length.
Due to the universality of folds, we can technically construct any result we want using the right arguments to pfoldlArray. However, for efficiency, we provide dedicated conversion procedures into both builtin lists and builtin arrays. This is needed for two reasons:
As builtin lists must be constructed as a sequence of conses, we want to construct them in reverse index order for efficiency; and
The only current way to construct a builtin array is to convert to a builtin list first, then convert to a builtin array using a CIP-138 primitive.
As these are tedious, we decided to provide them directly.
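The reverse-order construction can be sketched offchain; toListRev below is our own illustrative helper, not Plutarch's actual conversion. Walking the indexes from high to low and consing onto the accumulator puts every element in the right place without a separate reversal pass.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

-- Build a cons list in reverse index order: the last element is consed
-- first, so the finished list reads in forward index order.
toListRev :: PullArrayBB a -> [a]
toListRev (PullArrayBB k) = k $ \len f ->
  let go ix acc
        | ix < 0 = acc
        | otherwise = go (ix - 1) (f ix : acc)
   in go (len - 1) []

main :: IO ()
main = print (toListRev (generateBB 4 (* 2)))  -- [0,2,4,6]
```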
"But are we fast yet?"
One important question to ask of our design is whether it is an improvement on what already exists in Plutarch. In particular, Plutarch has a rich API for manipulating builtin lists by way of the PListLike type class; in theory, we could combine this with the new CIP-138 primitives[4] to do whatever work with arrays we wanted to do.
To demonstrate that our code can out-perform this combination, we wrote some benchmarks using helpers from plutarch-testlib[5]. For data inputs to these benchmarks, we used arrays of PIntegers, of length 20, which we introduced via pconstant. The benchmarks consisted of the following:
Two map operations on a single array in sequence;
Zipping together two arrays, followed by a map on the result;
Mapping an operation over two arrays, followed by zipping the results together.
In all cases, every map and zip operation produced PIntegers, and the final results were builtin Plutus lists. We compared two different implementations of these three benchmarks:
Using a combination of CIP-138 primitives and PIsList-based functions only; and
Using the new PPullArray interface with a final conversion to a builtin list at the end.
The results are striking:
map twice: OK
CPU | 51694858
MEM | 190046
SIZE | 154
with PPullArray: OK
CPU | 34782484 | 0.67x
MEM | 128854 | 0.68x
SIZE | 142 | 0.92x
zip-map: OK
CPU | 56587078
MEM | 196198
SIZE | 194
with PPullArray: OK
CPU | 38730704 | 0.68x
MEM | 129106 | 0.66x
SIZE | 180 | 0.93x
map-zip: OK
CPU | 88124122
MEM | 307680
SIZE | 213
with PPullArray: OK
CPU | 44146864 | 0.50x
MEM | 150346 | 0.49x
SIZE | 194 | 0.91x
In all situations, whether with regard to code size, execution units or memory, the PPullArray implementations out-perform ones using the PIsList interface. The improvements are notable: even the smallest improvement to execution units and memory use is around 33%, and the code size reduction is about 8%. The 'map-zip' benchmark is the most interesting: while it performs more operations (and thus has higher costs), the reduction we obtain relative to the PIsList-based implementation is larger. This shows that the longer the PPullArray-based processing pipeline is, the more performance benefit we can extract. This is not surprising: as PPullArrays eliminate intermediate structures automatically, we benefit more with more operations, as more elimination of intermediates becomes possible.
Limitations, alternatives, and why we did what we did
While our design is both convenient and efficient, it has a multitude of limitations. Most notably, by choosing pull arrays as the implementation, we end up lacking support for affine transformations, as well as being limited in support for indexing and array concatenation (among other operations). Alternatives to our choice exist, namely push arrays and staging: we will discuss these, and why we ultimately chose not to use them.
One major deficiency in our design is the case of affine transformations: modifications to the array that can drop elements at unpredictable positions. Two classic examples of affine transformations are mapMaybe and filter: these can both leave 'holes' in an array in ways that are impossible to predict ahead of time. This is different from slicing operations, as those benefit from two invariants:
Consecutive indexes in an array before it is sliced will remain consecutive in the result of slicing; and
The length of a sliced array is known, or at least predictable, without evaluation.
Affine transformations invalidate both of these invariants, as we don't know whether a given index will be included until runtime. This means they can leave 'holes' in the array, requiring index adjustment, and the length of the result cannot be predicted (other than an upper bound).
This has two primary implications. Firstly, the combination of slicing and affine transformations in the same pull array pipeline requires propagating size bounds. Secondly, index-aware transforms, even linear ones, become more difficult, as we can no longer assume that indexes are consecutive. While it is likely possible to solve both of these problems, doing so efficiently is a difficult question, and we chose to avoid it for now.
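To see why affine transformations sit so poorly with this encoding, consider the following sketch (filterBB is our own illustration, not part of our API): a filter cannot produce a new (length, index producer) pair without first inspecting every element, so the simplest implementation manifests the array and rebuilds it, forfeiting fusion.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

toListBB :: PullArrayBB a -> [a]
toListBB (PullArrayBB k) = k (\len f -> map f [0 .. len - 1])

-- An affine transformation: neither the result's length nor its index
-- producer is knowable until every element has been inspected, so we
-- are forced to manifest, losing the delayed representation entirely.
filterBB :: (a -> Bool) -> PullArrayBB a -> PullArrayBB a
filterBB p arr =
  let kept = filter p (toListBB arr)
   in generateBB (length kept) (kept !!)  -- (!!) on a list: O(n) indexing!

main :: IO ()
main = print (toListBB (filterBB even (generateBB 6 id)))  -- [0,2,4]
```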
Furthermore, a pull array is not identical to an actual array: it can be thought of as a builder, similar to the kinds provided by libraries such as bytestring. Thus, it is intended to act as an intermediate for efficient transformations, then be 'run' to produce an array (or some other structure) suitable for queries (such as indexing operations). Even though pull arrays are efficient for some operations, others are much less so. We will discuss two examples in particular: indexing and array concatenation.
The primary reason for having arrays (or spatially local structures in general) is constant-time indexing[6]. In fact, the major justification given for CIP-138 builtin arrays was constant-time indexing. Theoretically, PPullArray maintains this capability, as we could define something akin to the following:
pindexArray :: forall (a :: S -> Type) (s :: S) .
  Term s (PPullArray a) ->
  Term s PInteger ->
  Term s a
pindexArray arr ix = pmatch arr $ \ (PPullArray (PForall g)) ->
  punsafeCoerce g # go
  where
    go :: Term s (PNatural :--> (PInteger :--> a) :--> a)
    go = plam $ \ _ get -> get # ix
However, careless use of this function can lead to inefficiencies, as observed by massiv. When an array is manifest (that is, actually evaluated), indexing is fast not only because of spatial locality, but because the array acts as a cache: more precisely, we don't have to re-evaluate the computation necessary to produce an element at a particular index every time we access it. However, every time we use a function like pindexArray, the element at the requested index must be recomputed. If this is done often enough, we can lose all the performance benefits we gained from pull arrays in the first place.
Array concatenation is another operation which we would like to be able to do efficiently, but which pull arrays struggle with. The reason for this has to do with how index producers respond to the operations our current API provides, namely linear transformations and slicing. When we use either of these types of operation on an existing PPullArray, the result's index producer does not gain any branching (in the form of pif or similar) if none of the input PPullArrays had it. This is important, as branching has costs in both code size and performance, especially as pull array pipelines become longer. However, to do array concatenation, we must introduce branching: the index producer of the result must know which of the two index producers of its arguments to use when we evaluate at a particular index. Much as with indexing, one or two such operations would likely not be harmful to performance, but many concatenations of pull arrays can easily destroy it. This is not a theoretical problem: this issue led to the development of one of the alternatives to pull arrays we are about to discuss.
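The branching cost can be seen directly in an offchain sketch (appendBB is our own illustration): the result's index producer must test, on every access, which input the index falls into, and chained appends stack these tests.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype PullArrayBB a = PullArrayBB (forall r. (Int -> (Int -> a) -> r) -> r)

generateBB :: Int -> (Int -> a) -> PullArrayBB a
generateBB len f = PullArrayBB $ \k -> k len f

toListBB :: PullArrayBB a -> [a]
toListBB (PullArrayBB k) = k (\len f -> map f [0 .. len - 1])

-- Concatenation: unlike maps and slices, the resulting index producer
-- necessarily contains a branch, paid on every single element access.
appendBB :: PullArrayBB a -> PullArrayBB a -> PullArrayBB a
appendBB (PullArrayBB g1) (PullArrayBB g2) = PullArrayBB $ \k ->
  g1 $ \len1 h1 ->
    g2 $ \len2 h2 ->
      k (len1 + len2) $ \ix ->
        if ix < len1  -- the unavoidable branch
          then h1 ix
          else h2 (ix - len1)

main :: IO ()
main = print (toListBB (appendBB (generateBB 2 id) (generateBB 3 (+ 10))))
-- [0,1,10,11,12]
```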
We can address some of these issues using one of two popular alternatives to pull arrays in the literature. These are:
Push arrays (and their defunctionalized variant); and
Staging.
Briefly, a push array is a code generation monad, which, when run, produces the array using some mutable effect stack behind the scenes, while staging makes use of compile-time metaprogramming to eliminate intermediate array computations with code analysis and transformations. Both of these alternatives offer improvements on some of the limitations of pull arrays:
Push arrays are more effective for operations that can't be described as a linear transformation or slice. For example, push array concatenation is both easy and efficient, while being tricky and inefficient on pull arrays. Their defunctionalized variant brings more advantages, as they are also efficient in similar ways to pull arrays.
Staging gives guaranteed and inspectable fusion similar to that of the pull array, but can perform a host of other optimizations. An example of such an optimization is common subexpression elimination (or CSE), even for arrays under linear transformation, which would be extremely beneficial in the space-constrained environment onchain.
We could have chosen either push arrays (defunctionalized or not) or staging instead of pull arrays for our mechanism, potentially resolving some, or all, of the limitations we previously mentioned. However, we ultimately found that push arrays (defunctionalized or not) were impossible to implement efficiently onchain, while staging would be highly challenging in the context of Plutarch, if theoretically possible.
For push arrays, defunctionalized or not, we would have to operate in a monadic stack which, when 'run', would produce a sequence of instructions that would generate the array computation we want. This is done at runtime, with the improvements being made possible due to the ability to analyze the description of the computation. This is certainly a powerful technique, but would have unacceptable cost onchain: we would not only need to pay the cost of the (optimized) array computation, but the cost of building and analyzing it in the first place. This would be even worse in the defunctionalized variant, due to the additional level of indirection it would require. Thus, even if the theoretical speedups we could achieve were large, the opportunity cost of doing so would completely dwarf them onchain. For this reason, we decided not to use push arrays in either form.
Staging potentially avoids the problems with push arrays by moving the code generation and analysis to compile time. Given that Plutarch is an eDSL, which already functionally acts as a compile-time metaprogramming framework for UPLC, it could in theory support staging of array computations. However, the complexity of this would be significant, especially when we consider that Plutarch's optimization and code generation pipeline is fairly limited at this stage. To support staging of array computations, we would have to modify Plutarch's internals and compilation pipeline significantly: firstly, to give ourselves 'hooks' to describe array computations; and secondly, to perform the necessary analysis to generate optimized UPLC. PPullArray, on the other hand, can be implemented entirely in Plutarch itself, leaving the internals unchanged. While staging could be a practical way forward to gain larger performance improvements and more capabilities, these gains would be hard-won, and given the improvements we already obtained with PPullArray, we felt this work was not warranted at this time.
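The fusion that makes pull arrays worthwhile even without staging can be sketched in a few lines of plain Haskell (illustrative names only, not Plutarch's actual definitions): linear transformations compose into the index function, so an entire pipeline costs a single traversal at materialisation.

```haskell
-- Minimal pull-array model: a length plus an index function.
data Pull a = Pull Int (Int -> a)

-- Mapping composes with the index function; no intermediate array.
mapPull :: (a -> b) -> Pull a -> Pull b
mapPull g (Pull n f) = Pull n (g . f)

-- Slicing just re-indexes; again, no intermediate array.
slicePull :: Int -> Int -> Pull a -> Pull a
slicePull from len (Pull _ f) = Pull len (\i -> f (from + i))

-- Materialisation is the only traversal in the whole pipeline.
toList :: Pull a -> [a]
toList (Pull n f) = map f [0 .. n - 1]

-- Two maps and a slice fuse into a single pass over three elements.
pipeline :: Pull Int -> [Int]
pipeline = toList . slicePull 1 3 . mapPull (* 2) . mapPull (+ 1)
```

Because fusion here is just function composition, it is guaranteed by construction rather than by a compiler optimization pass, which is what makes it viable without touching Plutarch's internals.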
Conclusion
CIP-138 represents a sea change in capabilities for onchain scripts, as for the first time, we have a polymorphic, spatially local structure available for us to use. Having first-hand experience with the kinds of performance improvements that are possible with spatial locality offchain, we are excited about the possibilities that this enables. However, we felt that CIP-138's API was rather lacklustre for real work with arrays onchain, and that a large performance gap was left unaddressed as well.
To that end, we used tried and tested functional techniques, as well as our knowledge of onchain performance, to provide a higher-level array-focused API in Plutarch. Not only is this API far more convenient to work with, it is also much more efficient, especially with larger or more complex processing pipelines. Furthermore, we know that this work is just the beginning: many possibilities for yet more performance, yet more convenience, and yet more expressiveness, remain open to us. We believe strongly that this is a worthwhile improvement to Plutarch, and look forward to the results of the Cardano community using our new API to achieve what they previously couldn't.
While our choices are not perfect, and we could have done more, we believe that the best way to discover how something can improve is by letting others use it. Even in its current state, the benefits of our new pull-array-based API in Plutarch are measurable and significant: with time and refinement, as well as community feedback, we can continue to be first in this class. Ultimately, this is better for Plutarch, better for Cardano, and better for us.
If you are interested in this API, would like to suggest changes or improvements, or are simply curious to see the code in its entirety, you can find our work on the treasury-milestone-2 branch of Plutarch's repository. Finally, if you enjoyed this article and it sounds relevant to your business, don't hesitate to get in touch.
- Insofar as this is possible at all in our current real universe.
- It may be possible to extend those mechanisms to PForall, but due to its low degree of usage, we haven't done this yet. Let us know if that would interest you!
- In this case, a left fold, but we could define a right fold in a similar way.
- Which the newest Plutarch version already has wrappers for.
- These helpers are also now available publicly for the first time, including via CHaP, in the latest Plutarch version.
- There are other reasons besides this one to favour spatially local representations. We have discussed, and demonstrated, these advantages in previous articles; however, in this case, they ultimately don't matter, as these advantages are essentially non-existent onchain.
- Technically, we also have MVectors available, which instead choose options 1 and 2, with an added layer of referential transparency.
- Being slightly pedantic, there is nothing stopping the CEK machine that runs onchain computations from having mutability in its implementation, and indeed, it makes use of it in at least some cases. However, this capability is unavailable to any script even implicitly, so from the point of view of semantics, UPLC is effectively a language without mutation.