Quasiquoting for Fun, Profit, Expressions and Patterns

Haskell

8 May

Introduction

Smart constructors are a well-known, and well-worn, technique in Haskell. While they offer a range of different benefits, their biggest advantage is the separation between data representation and data interfaces. Ideally, data representations should be both efficient and easily replaceable, while data interfaces should be safe and stable. Smart constructors make this kind of separation possible, in a way 'raw' data constructors cannot. One example of a smart constructor is %, which is provided by base, for the Ratio, type. This allows base, to keep the data constructor of Ratio hidden, protecting its internal invariants, while also allowing construction of Ratios safely:

aHalf :: Ratio Integer
aHalf = 1 % 2

-- This will automatically reduce, without us having to do anything
alsoAHalf :: Ratio Integer
alsoAHalf = 40 % 80

However, smart constructors are not as capable as we would like. Their first limitation is that they work at runtime. This means that even if their arguments are known constants, all of whose relevant properties are known at compile time, the constructor must check them at runtime anyway. For example, if we wrote:

aNonsenseFraction = 1 % 0

this would not be checked until runtime. This would cause a runtime error in this case; other smart constructors instead work in Maybe or a similar notion of failure. While for runtime values, this kind of check is unavoidable, for compile-time constants, we would prefer if these checks were done at compile time. This would avoid both unnecessary work at runtime and the inconvenience of having to deal with errors where they're not possible.

Their second limitation is that, unlike regular data constructors, smart constructors are functions, which means they can't be used to pattern match. For example, Just, being a regular data constructor, can be used in both ways:

-- To introduce a value
peopleAtTheParty = Just (you .&. me)

-- To eliminate a value
case (lookup needle haystack) of
    Just x -> ....

However, %, being a smart constructor, can be used only to introduce Ratios, not to eliminate them:

-- can't do this
ifOnly ratio = case ratio of
    1 % 2 -> ...
    _ -> ...

The third limitation of smart constructors is that they cannot overload existing syntax, such as numbers and string literals. This functionality is instead performed by fromInteger, fromRational and fromString, which are unguarded₁. However, because this form of syntax is both concise and familiar, the functionality of fromInteger and similar is often abused, with regrettable results. One of the worst examples of this is the IsString instance of ByteString, which has caused no shortage of confusion and bugs.

Quasiquoters are a (limited) form of Template Haskell, which enables us to have smart constructor–like functionality, but without the limitations above. As quasiquoters operate at compile time, their checks are at compile time as well: this means situations like 1 % 0 are not deferred to runtime anymore. Furthermore, quasiquoters can be designed to provide not only introduction but also elimination via pattern matching. Lastly, due to their use of Template Haskell, they can overload (or more exactly, use) any syntax we like. Thanks to these capabilities, quasiquoters can define an improved version of smart constructors giving us the ability to be both safe and efficient.

In this article, we will describe a problem where quasiquoters have advantages over smart constructors. We will then implement a quasiquoter usable for both construction and pattern matching to solve this problem. By doing this, we will demonstrate the capabilities of quasiquoters to evade all the problems described above in the context of our problem, as well as go into some detail of how they work and can be written. We will introduce as much Template Haskell as we will need, and you don't need to be an expert in Template Haskell to enjoy this article. If you feel you need more of an introduction to Template Haskell in general, we can only recommend Mark Karpov’s tutorial, which is more focused on Template Haskell in general.

A problem playground

ASCII text is a venerable part of computing, and remains useful today despite its age, such as in the HTTP protocol. While Text (and even String) are more than capable of handling ASCII text, we want something with stronger guarantees, as this would not only be more efficient but also allow operations that don't really make sense with non-ASCII text. Thus, we need a data type Ascii that guarantees that it represents an ASCII string, along with ways to work with it₂.

As ASCII strings are packed arrays of bytes, we have an embarrassment of riches in GHC Haskell in terms of ways to represent one:

Strict ByteString from bytestring
ByteArray (or its corresponding unboxed variant)
UArray i w from array, where w is any type with a FiniteBits instance
Unboxed Vectors from vector of any type with a FiniteBits instance (or of Bit from bitvec)
Storable versions of all the above
massiv array versions of all the above

While for demonstration purposes any of these would work, we will use ByteArray here for several reasons:

It is part of base, with enough of an API to be useful.
It is efficient for lots of small allocations₃.
text-ascii uses it, which others can use as an ‘industrial strength’ reference.
It has a mutable counterpart which can be used safely outside of IO for internal operations for speed.

How ByteArrays work exactly is not critical here. The only important interface for us is fromListN, which, when given an Int length and Word8 contents, constructs a ByteArray of the given length, consisting of the given contents.

The definition of our type is straightforward:

newtype Ascii = Ascii ByteArray

The intent is that Ascii is a 'closed' newtype; instead of exporting the data constructor, we instead provide ways to construct and manipulate Ascii values safely. This ensures that our invariant (that any Ascii stores only ASCII strings) is maintained. We can start by defining a smart constructor:

stringToAscii :: String -> Either Int Ascii
stringToAscii input = do
    (verified, len) <- runStateT (traverse countAsciiByte input) 0
    pure . Ascii . fromListN len $ verified
  where
    countAsciiByte :: Char -> StateT Int (Either Int) Word8
    countAsciiByte c = if isAscii c
                       then (fromIntegral . ord $ c) <$ modify (+1)
                       else get >>= throwError

As fromListN requires a length, and lists in Haskell are not counted, we combine the process of length counting and validation. We do this inside of StateT, as this allows us to keep track of the count 'in the background'. By doing this count together with validation, we can also produce a more useful error, indicating not only that we failed, but provide an index into the input where we failed.

While stringToAscii certainly works, it very much suffers from all the problems identified in the introduction. We don't have the useful string syntax for Ascii literals, pattern matching on Ascii is currently not possible, and even if we have a compile-time String constant which we know to be ASCII, we are forced to work in Either or resort to unsafe operations.

Introducing quasiquoters

Quasiquoters are a form of Template Haskell which, when given arbitrary source input (as Strings), can transform them into expressions, patterns, types, or declarations. As they use Template Haskell, they run at compile time, and thus have the ability to fail at compile time; as they take arbitrary Strings as input, they can parse and use whatever syntax you want, and once written, are straightforward to use, requiring no real knowledge of Template Haskell. Furthermore, a single quasiquoter is not limited to only producing one of the four things listed above: you can potentially define a quasiquoter which produces expressions in expression contexts, patterns in pattern contexts, types in type contexts and declarations in declaration contexts if you want!

To define a quasiquoter, we need to construct a value of the QuasiQuoter type, which you can find in the template-Haskell package, in the Language.Haskell.TH.Quote module. Its definition is roughly as follows:

-- Defined in Language.Haskell.TH.Quote
data QuasiQuoter = QuasiQuoter (String -> Q Exp)
                                   (String -> Q Pat)
                                   (String -> Q Type)
                                   (String -> Q [Dec])

Each of the fields corresponds to what the quasiquoter is meant to do in any given context that it's used in. In all cases, the String argument is provided by whoever uses the quasiquoter: we will describe how this happens in a moment. The Q monad is part of Template Haskell, and allows various capabilities, which include looking up information about identifiers and generating fresh names. The result types represent different constructs in the GHC Haskell language in the form of data types:

Exp represents expressions; its data constructors are suffixed with E.
Pat represents patterns; its data constructors are suffixed with P.
Type represents types; its data constructors are suffixed with T.
Dec represents (a list of) declarations. Dec data constructors are suffixed with D.

Lastly, and most importantly for us, Q is also an instance of MonadFail. We use the fail method to provide compile-time error messages when a quasiquoter is used improperly, or given something we can't use.

Before we talk about how we will define our quasiquoter, we will show we can use quasiquoters. If we imported a quasiquoter foo, we use it as follows:

-- Enable the QuasiQuotes extension

aFoo = [foo| your input here |]

In this situation, we are using foo in an expression context. Thus, the first field of the definition of foo (or rather, the QuasiQuoter bound to foo) would receive the String with the contents your input here (with a space either side of the text). Note that whitespace is included; had we written something like this:

anotherFoo = [foo|some other input|]

then the argument contents would be some other input (without any whitespace at the start and the end) instead. In general, whatever is between the | is fed directly to the corresponding field of QuasiQuoter, and can look like absolutely anything, as anything between the | is ignored by the GHC parser. In effect, the QuasiQuoter informs GHC how to interpret the data between the | by translating it into something it understands. In the example here, we would be translating into Q Exp, as the quasiquoter foo is being used in a context expecting an expression.

Quasiquoters can be used in other contexts as well. The following is an example of the use of foo in a pattern context:

aPatternMatch x = case x of
    [foo|a match target|] -> ...

In this case, the String with the contents a match target would be fed to the second field of the QuasiQuoter bound to foo. Here, we would be translating into Q Pat.

Parsing expression

Before we define our quasiquoter, we should consider what we expect the translation to look like. This requires answering two questions:

What visible syntax input to the quasiquoter is considered valid?
What should valid visible syntax input translate into?

To keep with the familiar string literal syntax used for String, Text and similar data types, our visible syntax input will be ASCII text, surrounded by double quotes. Whitespace before the opening double quote, and after the closing double quote, will be ignored, but anything other than whitespace outside of the opening and closing double quotes will be treated as invalid input. Thus, these are examples of valid inputs:

”foo”
”bar “
”baz-quux-1234 ”

Whereas these are examples of invalid inputs:

, as it lacks either an opening or closing double quote
”foo, as it lacks a closing double quote
bar ”, as it lacks an opening double quote
”baz quux” 1, as it has non-whitespace after the closing double quote
”Citroën”, as it has non-ASCII text in the input

We note here that we are considering a simplified case. We don't deal with escape sequences (including embedded double quotes and non-printable characters), which is somewhat unrealistic. We made the choice to not concern ourselves with escape sequences to help focus the article on quasiquoters. If you plan to do this kind of work in your own quasiquoters, we recommend the use of a parsing library or framework₅.

Having decided on the structural checks we need to perform, we can write a helper to handle structural verification. We will work in an arbitrary MonadFail, rather than Q specifically, as we don't need any functionality of Q other than the part provided by MonadFail.

stripQuotesAndVerify :: MonadFail m => String -> m ([Word8], Int)
stripQuotesAndVerify input = do
    stripped <- stripQuotes input
    runStateT (traverse toCountedAsciiByte stripped) 0
    where
        stripQuotes :: String -> m String
        stripQuotes s = case dropWhile isSpace input of 
            [] -> fail "ascii: No input."
            (x : xs) -> case x of 
                '\"' -> case span (/= '\"') xs of 
                    (strippedInput, rest) -> case rest of 
                        [] -> fail "ascii: No closing quotes found."
                        (_ : ys) -> case find (not . isSpace) ys of 
                            Nothing -> pure strippedInput
                            Just _ -> fail "ascii: Non-whitespace after closing quotes."
                _ -> fail "ascii: No opening quotes found."
        toCountedAsciiByte :: Char -> StateT Int m Word8
        toCountedAsciiByte c = if isAscii c
                               then (fromIntegral . ord $ c) <$ modify (+ 1)
                               else fail $ "ascii: Not ASCII: \'" <> [c] <> "\\'."

The use of fail we can see in stripQuotesAndVerify is typical of how we indicate compile-time errors in quasiquoters. We can make use of as much, or as little, formatting as we want, as the String argument to fail corresponds to the error message that GHC will display when the quasiquoter is (mis)used in a way that triggers that code path. We use the same technique for verifying and counting in one pass as we used for stringToAscii previously.

Now that we have data that is verified structurally correct, we now need to consider what we translate this into. If we consider stringToAscii from before, the process is essentially as follows:

Ensure that the input String contains only ASCII characters, while also counting its length.
Convert the result into a ByteArray using both the length and data from 1 with fromListN.
Wrap the result of 2 into the Ascii data constructor.

This process operates at runtime, which means that the String argument to stringToAscii represents the intended ASCII data. However, our quasiquoter will operate at compile time, which means that its String argument represents the raw input given between |s. This means that the equivalent step 1 in the quasiquoter needs additional work, which we perform in stripQuotesAndVerify. However, steps 2 and 3 are effectively the same as for stringToAscii. However, instead of calling the relevant functions directly, we must construct the Haskell code that will perform such a call using Template Haskell.

To build the calls equivalent to steps 2 and 3 requires some effort. First, we need to construct a call to fromListN, using the length and data we produced from stripQuotesAndVerify. As far as GHC is concerned, these are literals, and thus, must be built as such. The type of literals in Template Haskell is Lit, with the data constructors for this type being suffixed with L.

As our first argument to fromListN is an Int literal, and our second argument is a list of Word8 literals, we make use of IntegerL throughout. GHC uses IntegerL for anything that has a fromInteger method, which both Int and Word8 do. To make these into expressions (which we will need them to be), we also have to make use of constructors from the Exp data type. We show how we do this in the following helpers:

intToExp :: Int -> Exp
intToExp = LitE . IntegerL . fromIntegral

bytesToExp :: [Word8] -> Exp
bytesToExp = ListE . fmap (LitE . IntegerL . fromIntegral)

Both of these involve the use of IntegerL and fromIntegral to convert Int and Word8 to integer literals as stated before. The difference is what kind of expression we want in the end. For intToExp, we want the LitE data constructor, which corresponds to non-list literals, whereas for bytesToExp, we instead want ListE, which is used for list literals. This sets a pattern that we will use frequently, and is at the heart of how Template Haskell assembles expressions.

We will also need to assemble applications of both fromListN and the Ascii data constructor. To do this, we first need expressions corresponding to the function fromListN, and the data constructor Ascii, both of which will need to be in scope where our quasiquoter is defined₆. These are defined as follows:

asciiExp :: Exp
asciiExp = ConE 'Ascii

fromListNExp :: Exp
fromListNExp = VarE 'fromListN

In order to construct the needed expressions, we need a Template Haskell Name, which represents an identifier. While we can programmatically construct Names, and even generate completely fresh ones, in Q, in our case, we know what names we need directly. Thus, we can use quoting to obtain them: given an in-scope identifier foo, we can get its Name by writing foo. We do this for both the Ascii data constructor, and the fromListN function, which we then wrap into ConE and VarE respectively to produce the expressions we want. ConE corresponds to a data constructor expression, while VarE is a variable expression.

To complete our work, we need to construct application expressions. These are made using AppE; this data constructor has two fields, the first for whatever we are applying to, while the second is for whatever is being applied. This stems from Haskell's origins in the lambda cube, where all functions take exactly one argument. Thus, in order to apply more than one argument, we must nest uses of AppE. To make our lives easier, we will define the following helper:

appExp :: Exp -> [Exp] -> Exp
appExp f = \case
    [] -> f
    (arg : args) -> foldl' AppE (AppE f arg) args

To more clearly see what this does, here are some examples:

appExp f [] -> f
appExp f [x1] -> AppE f x1
appExp f [x1, x2] -> AppE (AppE f x1) x2
appExp f [x1, x2, x3] -> AppE (AppE (AppE f x1) x2) x3

Essentially, appExp performs currying. Putting all these together, we get the following definition:

asciiExp :: String -> Q Exp
asciiExp input = do
    (verifiedAscii, len) <- stripQuotesAndVerify
    let lenArg = intToExp len
    let bytesArg = bytesToExp verifiedAscii
    pure . appExp asciiExp $ [appExp fromListNExp [lenArg, bytesArg]]

It is worth reviewing all the steps we took to get here. The quasiquoter definition begins by parsing our input, checking whether we have ASCII, and also removing any whitespace we don't need. We did this with similar functionality to the stringToAscii smart constructor, but instead of doing this at runtime, we did it at compile time, replacing failure in Either to failure in fail causing a compile-time error. Then, having obtained both the length of our (stripped) input and the ASCII bytes comprising it, we converted them both into Exps, suitable for assembly into an expression with Template Haskell. Lastly, we put together two function application expressions: one to apply the Ascii data constructor, the other to call fromListN with the length and bytes we parsed.

To see the similarity between stringToAscii and asciiExp, we note that the last line is equivalent to pure . Ascii $ fromListN len verified. The main difference between stringToAscii and asciiExp is when they execute: stringToAscii is done at runtime (and thus, its errors occur at runtime too), but asciiExp is done at compile time instead.

Parsing patterns

Thanks to the asciiExp function we just defined, we now have the means to handle expressions involving Ascii in a convenient way. However, we would also like to pattern match on Ascii using the same syntax, just as we could with regular string-like literals or data constructors. We can use very similar techniques to asciiExp to achieve this, but instead of constructing an Exp, we want to construct a Pat. To understand how to do this, we first need to consider how Ascii is implemented.

As we said previously, Ascii is a closed newtype, as it has internal invariants to preserve. The same also applies to ByteArray, which Ascii wraps, for similar reasons. However, thanks to ByteArray being an instance of IsList, we are able to pattern match on it if we enable the OverloadedLists extension. With OverloadedLists enabled, ByteArray can be treated as if it were a list of Word8s: this is functionality which we could extend to Ascii by making it an instance of IsList as well₇. However, this is neither safe nor convenient:

IsList allows use of list syntax to both construct and pattern match, which would completely invalidate our earlier work on asciiExp.
If we wanted to match on the literal ASCII string "cat", we would have to write the pattern as [99, 97, 116], or its hex equivalent.

What we really want is to have access to only the pattern matching capabilities of the underlying ByteArray's IsList instance, while handling the necessary unwrapping and re-wrapping into Ascii, all with more familiar and pleasant syntax. Luckily for us, quasiquoters are capable of exactly this, by providing a way of constructing a Pat given some input in a pattern context.

Pat in many ways mirrors Exp: we have constructor patterns (ConP), list patterns (ListP) and literal patterns (LitP). The main difference between how we construct Pats and how we construct Exps is that there is no analog to applications: the ConP data constructor takes a list of patterns corresponding to its fields.

Here is what the necessary quasiquoter component would look like:

asciiPat :: String -> Q Pat
asciiPat input = do 
    (verifiedAscii, _) <- stripQuotesAndVerify input
    let fieldsP = [ListP . fmap (LitP . IntegerL . fromIntegral) $ verifiedAscii]
    pure . ConP 'Ascii [] $ fieldsP

We begin similarly to asciiExp, but we no longer need the counted length, as we don't rely on fromListN. Then, to construct a match against the underlying ByteArray, we build a list pattern, consisting of all the ASCII bytes in the verified input as integer literals, corresponding to their numeric codes. We then construct a ConP corresponding to the Ascii constructor applied to that list pattern. We note that ConP takes two lists as arguments to its data constructor, but we are only interested in the second. The first list allows us to provide a list of type bindings, but as we don't need any, we leave it empty.

asciiPat thus allows us to expose only the patternmatching functionality of the IsList instance for ByteArray, as it would only be called in a pattern context. For expression contexts we would instead use asciiExp, which ensures we never construct anything invalid. Furthermore, because the input to asciiPat is parsed as an ASCII string (identically to asciiExp), we now don't have to 'spell' the pattern using lists of ASCII codes. This gives us both safety and convenience, but also consistency: someone who understands proper inputs to asciiExp will also know what proper inputs for asciiPat are.

Putting it all in a quasiquoter

We now have the pieces we need to complete the quasiquoter. To finish the job, we need to define the following, in a separate module from the definition of Ascii₈:

ascii :: QuasiQuoter
ascii = QuasiQuoter asciiExp
                    asciiPat
                    (\_ -> fail "Cannot use 'ascii' in a type context.")
                    (\_ -> fail "Cannot use 'ascii' in a declaration context.")

The name of the identifier here is what will be used by people who want to use the quasiquoter. We chose ascii, as it is both clear and concise. We also added error messages for when the quasiquoter is used in a type or declaration context.

We can now use the quasiquoter if we import it, and enable the QuasiQuotes and OverloadedLists extensions:

aCat :: Ascii
aCat = [ascii| "cat" |]

isACat :: Ascii -> Bool
isACat = \case
    [ascii| "cat" |] -> True
    _ -> False

It is worth examining the pattern context use of ascii in the above example. Technically, we are performing a match against ByteArray, by way of its IsList instance, which requires converting the string literal provided into the corresponding list literal. However, because this work is performed at compile time, this conversion only occurs once per specific string literal, not once per match. This means that we aren't pushing unnecessary work to runtime, on top of not forcing our users to handle failure in cases where we know failure cannot happen. In this way, quasiquoters can also be more efficient, as they don't force us to perform work at runtime we don't need.

You don't have to take our word for it either! If you add {-# OPTIONS_GHC -ddump-splices #-} to the top of the module with the above code snippets, you will see something similar to the following when you compile it:

src/Text/Ascii.hs:28:15-23: Splicing expression
    Language.Haskell.TH.Quote.quoteExp ascii " \"cat\" "
  ======>
    Ascii (fromListN 3 [99, 97, 116])
src/Text/Ascii.hs:32:10-18: Splicing pattern
    Language.Haskell.TH.Quote.quotePat ascii " \"cat\" "
  ======>
    Ascii [99, 97, 116]

We can see that the first quasiquoter use (in aCat) emits a literal call to the Ascii constructor, whose argument is a call to fromListN with two literals as arguments. Meanwhile, the second quasiquoter use (in isACat) doesn't involve any (visible) calls to anything₉. This is an advantage of quasiquoters, and Template Haskell in general: the truth is just one -ddump-splices away.

 Going further

Through our examples with the Ascii type, we have seen how quasiquoters can address three common issues with smart constructors:

Improper constant input now becomes a compile-time error, not a runtime one;
We can pattern match, as well as construct; and
The syntax we can choose becomes much more free.

While quasiquoters can be somewhat confusing to write due to relying on Template Haskell, we have show that using them once written is straightforward. Lastly, if there are any doubts over what quasiquoters are 'really doing', they can be checked easily with the use of -ddump-splices.

The utility of quasiquoters isn't limited just to fixing issues with smart constructors. For an example of just how much they can achieve, we recommend having a look at inline-c, which makes heavy use of quasiquoters to embed C code directly into Haskell₄. We hope that, through this article, quasiquoter use in the Haskell community will grow, even if only to better address the limitations of the popular smart constructor.

While we could call error inside implementations of these methods, this is deeply unsatisfying, and also quite unsafe, as these methods can be used at runtime (and often are).
If you need an industrial strength library for working with ASCII text, we recommend this one. In fact, one of the examples for this article is lifted almost directly from it!
This efficiency stems from the use of unpinned memory, unlike ByteString, which means the garbage collector is allowed to move it.
Or, for an even more extreme example, inline-asm, which does the same, but for assembly.
text-ascii uses megaparsec for this purpose.
But not where it is used, which allows us to ensure that the data constructor of the Ascii type is only called in ways that are safe, whether at compile time via quasiquoter, or runtime via smart constructors.
Technically, we could also use pattern synonyms, but if we wanted to allow the use of string-like syntax in patterns, we would be forced to convert the contents of Ascii into something with an IsString instance on every match, which would get inefficient very fast.
The reason we can't put the quasiquoter in the same module as the definition of Ascii is called the 'Template Haskell staging restriction'. While somewhat annoying, it is usually avoided by having a separate .TH module for quasiquoters, which then gets re-exported.
Technically there is a hidden call to fromList in there, but this would be needed even if we were matching on ByteArray directly, so we're at least not adding costs as a result of our abstraction.

HaskellFunctional ProgrammingQuasiquotersTemplate HaskellMetaprogramming

Koz Ross

Quasiquoting for Fun, Profit, Expressions and Patterns

The 'A' is for 'Accelerated': Checking ASCII with SWAR

An Introduction to the Concepts Behind YTxP Architecture