Micro C, Part 3: Generating LLVM

Lastmod: 2020-05-13

In part 2, we completed the user facing part of the compiler with semantic analysis. Now we're free to focus on the backend, generating LLVM. To do so, we'll use the llvm-hs-pure library, which embeds LLVM into a Haskell ADT, allowing us to manipulate it without having to set up any FFI. We will then pretty print the generated bytecode with llvm-hs-pretty and call clang on it to generate machine code for our preferred target.¹ Some things to keep in mind when writing LLVM generating code:

There is very little type safety. The usual haskell guarantees of "if it typechecks it probably works" do not apply here at all.² After all, we are writing basically writing assembly, so even if we get our code through the LLVM type checker, there's still a good chance that it won't do what we want it to.³ This is where having a comprehensive test suite is invaluable.
Documentation for LLVM in general is quite scant. The official language reference manual is comprehensive, but can be quite terse and provides little in the way of examples, although this gets better all the time and is not nearly as bad now as it was a few years ago. Documentation for llvm-hs specifically is even scarcer: basically just some tiny examples and the haskell port of the kaleidoscope tutorial, which is by now very outdated. This post will hopefully help matters somewhat but it leaves many parts of the LLVM IR and ecosystem unexplored.
Let clang help you. You can run it with -emit-llvm on a C file to see what it generates. This is helpful even when you aren't writing a C-like language as you can see how certain high-level language concepts map onto LLVM.

LLVM Basics

The following is a simplified version of how LLVM works in order to create a basic mental model. There are many omissions.

LLVM programs are split into modules containing toplevel definitions like functions and global variables. For our purposes, we are concerned with emitting a single module containing all of our functions, global variables, and typedefs for our structs. Within functions, code is contained in basic blocks. A basic block is a list of sequential instructions in SSA (Single Static Assignment) form that is terminated in a conditional or unconditional branch to another block or in a return instruction.

LLVM, unlike many forms of assembly, is typed. Its type system resembles C's in many ways, but there are some notable differences. Instead of C's confusing and target dependent integer types short, long, long long, etc., LLVM has arbitrary N bit width integers denoted iN. Also, void* is banned in LLVM, so char* or i8* is used instead. LLVM also has native vector types for SIMD instructions, but those won't factor into our discussion. There are other exotic features in the type system that help optimization like poison values but we won't discuss them here.

Codegen.hs

With that out of the way, we can start writing code. This module has quite a few imports.

{-# LANGUAGE RecursiveDo #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE FlexibleContexts #-}
-- We need these to write a ConvertibleStrings instance for
-- ShortByteString
{-# LANGUAGE MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
module Microc.Codegen
  ( codegenProgram
  )
where

import qualified LLVM.AST.IntegerPredicate     as IP
import qualified LLVM.AST.FloatingPointPredicate
                                               as FP
import           LLVM.AST                       ( Operand )
import qualified LLVM.AST                      as AST
import qualified LLVM.AST.Type                 as AST
import qualified LLVM.AST.Constant             as C
import           LLVM.AST.Name
import           LLVM.AST.Typed                 ( typeOf )

import qualified LLVM.IRBuilder.Module         as L
import qualified LLVM.IRBuilder.Monad          as L
import qualified LLVM.IRBuilder.Instruction    as L
import qualified LLVM.IRBuilder.Constant       as L
import           LLVM.Prelude                   ( ShortByteString )

import qualified Data.Map                      as M
import           Control.Monad.State
import           Data.String                    ( fromString )

import           Microc.Sast
import           Microc.Ast                     ( Type(..)
                                                , Op(..)
                                                , Uop(..)
                                                , Bind(..)
                                                , Struct(..)
                                                )

import           Data.String.Conversions
import qualified Data.Text                     as T
import           Data.Text                      ( Text )
import           Data.Word                      ( Word32 )
import           Data.List                      ( find )

In llvm-hs, values that can be passed as arguments to LLVM instructions have type Operand. This includes basically all variables as well as declared functions. Just as in Semant, we define an Env type to hold information about our generated operands and how they correspond to the names of their corresponding source variables. Besides the operands, we need to keep track of all struct declarations and all string literals in the source, as we emit a unique global variable for each unique string literal.⁴

data Env = Env { operands :: M.Map Text Operand
               , structs :: [ Struct ]
               , strings :: M.Map Text Operand
               }
  deriving (Eq, Show)

registerOperand :: MonadState Env m => Text -> Operand -> m ()
registerOperand name op =
  modify $ \env -> env { operands = M.insert name op (operands env) }

Utilities

Working with LLVM bindings in other languages usually involves passing mutable builder and module context objects to all instruction-emitting functions in order to ensure that all variables have unique names and to maintain the integrity of the module. This is important to ensure that code remains in SSA form.

Of course, since we're not working in other languages, this approach of passing around mutable objects would be severely un-ergonomic, at best. Fortunately, llvm-hs provides us with monads that emulate this behavior, ModuleBuilderT for the module context, and IRBuilderT for the builder object. We'll establish two type synonyms, LLVM for generating top level entities and Codegen ⁵ for generating basic blocks.

type LLVM = L.ModuleBuilderT (State Env)
type Codegen = L.IRBuilderT LLVM

We'll also write some utilities to query struct fields defined in our Env, to convert from MicroC types to LLVM types, and to calculate the sizes of MicroC types. For structs, we emit packed fields, which is pretty bad for performance, but makes calculating sizes very easy. Note that by this phase of the compiler, we no longer report errors to the user, so if anything goes wrong, we'll just crash.

getFields :: MonadState Env m => Text -> m [Bind]
getFields name = do
  ss <- gets structs
  case find (\s -> structName s == name) ss of
    Nothing               -> error "Internal error - struct not found"
    Just (Struct _ binds) -> pure binds

charStar :: AST.Type
charStar = AST.ptr AST.i8

-- llvm-hs uses ShortByteString for names, but we want
-- easy conversion to Text with cs from Data.String.Conversions
instance ConvertibleStrings Text ShortByteString where
  convertString = fromString . T.unpack

ltypeOfTyp :: MonadState Env m => Type -> m AST.Type
ltypeOfTyp = \case
  TyVoid         -> pure AST.void
  TyInt          -> pure AST.i32
  TyChar         -> pure AST.i8
  TyFloat        -> pure AST.double
  TyBool         -> pure AST.i1
  -- (void *) is invalid LLVM
  Pointer TyVoid -> pure $ charStar
  -- special case to handle recursively defined structures
  -- TODO: add real cycle checking so that improperly defined
  -- recursive types case the compiler to hang forever
  Pointer (TyStruct n) ->
    pure $ AST.ptr (AST.NamedTypeReference (mkName $ cs ("struct." <> n)))
  Pointer  t -> fmap AST.ptr (ltypeOfTyp t)
  TyStruct n -> do
    fields <- getFields n
    typs   <- mapM (ltypeOfTyp . bindType) fields
    -- Packed structs aren't great for performance but very easy to code for now
    pure $ AST.StructureType { AST.isPacked = True, AST.elementTypes = typs }

sizeof :: MonadState Env m => Type -> m Word32
sizeof = \case
  TyBool     -> pure 1
  TyChar     -> pure 1
  TyInt      -> pure 4
  TyFloat    -> pure 8
  TyVoid     -> pure 0
  Pointer  _ -> pure 8
  TyStruct n -> do
    fields <- getFields n
    sizes <- mapM (sizeof . bindType) fields
    pure (sum sizes)

Expressions

LVals

Now, we're ready to generate code for expressions. First, LVal s. When generating an LVal, we generate an Operand corresponding to the address of the value. That way, we can use it as an argument to the store instruction. For variables, we simply look up the variable name in the Env.

codegenLVal :: LValue -> Codegen Operand
codegenLVal (SId    name) = gets ((M.! name) . operands)

Since we are generating addresses, dereferencing is essentially the inverse of this, so we just generate code for the underlying expression.

codegenLVal (SDeref e   ) = codegenSexpr e

For struct access, we get to use the fascinating (read: confusing) getelementptr instruction. The instruction only calculates addresses, it doesn't load memory, so it's a perfect fit for the semantics of codegenLVal. We generate the address of the left hand side of the access and then have to pass two arguments to gep, a zero to access the memory pointed to by the address we just calculated, then the offset of the struct field we want to access, which we calculated in semant. Note that getelementptr handles calculating alignment, so we don't need to do it ourselves.

codegenLVal (SAccess e i) = do
  e' <- codegenLVal e
  let zero   = L.int32 0
      offset = L.int32 (fromIntegral i)
  L.gep e' [zero, offset]

Literals

Most literals, as usual, are straightforward.

codegenSexpr :: SExpr -> Codegen Operand
codegenSexpr (TyInt  , SLiteral i ) = pure $ L.int32 (fromIntegral i)
codegenSexpr (TyFloat, SFliteral f) = pure $ L.double f
codegenSexpr (TyBool , SBoolLit b ) = pure $ L.bit (if b then 1 else 0)
codegenSexpr (TyChar , SCharLit c ) = pure $ L.int8 (fromIntegral c)

Strings, however, are not. We look up the string literal in the Env to see if we've generated a global variable for it before. If so, we just return that. Otherwise, we use globalStringPtr ⁶ to generate a pointer to a global string variable. We name each variable "0.str", "1.str" etc., since mkName crashes with non-ASCII input, which we haven't explicitly forbidden in our string literals. Note that globalStringPtr returns a Constant which is distinct from an Operand, so we need to promote it with AST.ConstantOperand.

codegenSexpr (Pointer TyChar, SStrLit s  ) = do
  -- Generate a new unique global variable for every string literal we see
  strs <- gets strings
  case M.lookup s strs of
    Nothing -> do
      let nm = mkName (show (M.size strs) <> ".str")
      op <- L.globalStringPtr (cs s) nm
      modify $ \env -> env { strings = M.insert s (AST.ConstantOperand op) strs }
      pure (AST.ConstantOperand op)
    Just op -> pure op

Null pointers are generated with inttoptr.

codegenSexpr (t, SNull) = L.inttoptr (L.int64 0) =<< ltypeOfTyp t

Sizeof is calculated with the sizeof function we wrote earlier.

codegenSexpr (TyInt, SSizeof t) = L.int32 . fromIntegral <$> sizeof t

The & operator finds the address of an LVal, which is already taken care of by codegenLVal.

codegenSexpr (_, SAddr e) = codegenLVal e

Binary operators

For assignment, we calculate the address of the left hand side, the value of the right hand side, and then store said value at the address, returning the value.

codegenSexpr (_, SAssign lhs rhs) = do
  rhs' <- codegenSexpr rhs
  lhs' <- codegenLVal lhs
  L.store lhs' 0 rhs'
  return rhs'

For the Binop constructor, we begin by generating code for the left and right and sides.

codegenSexpr (t, SBinop op lhs rhs) = do
  lhs' <- codegenSexpr lhs
  rhs' <- codegenSexpr rhs
  case op of

For addition on int s and float s, we simply generate the corresponding machine instruction. For pointer addition, getElementPtr takes care of calculating the offset for each pointer type so we don't have to worry about it.

    Add -> case (fst lhs, fst rhs) of
      (Pointer _, TyInt    ) -> L.gep lhs' [rhs']
      (TyInt    , Pointer _) -> L.gep rhs' [lhs']
      (TyInt    , TyInt    ) -> L.add lhs' rhs'
      (TyFloat  , TyFloat  ) -> L.fadd lhs' rhs'
      _ -> error "Internal error - semant failed"

For pointer subtraction, we do actually have to calculate the pointer width ourselves and divide the difference in addresses by it.

    Sub -> let zero = L.int64 0 in case (fst lhs, fst rhs) of
      (Pointer typ, Pointer typ') ->
        if typ' /= typ then error "Internal error - semant failed" else do
          lhs'' <- L.ptrtoint lhs' AST.i64
          rhs'' <- L.ptrtoint rhs' AST.i64
          diff  <- L.sub lhs'' rhs''
          width <- L.int64 . fromIntegral <$> sizeof typ
          L.sdiv diff width

Subtracting int s from pointers is similar to adding them, except that we negate the int before passing it to getElementPtr.

      (Pointer _, TyInt) -> do
        rhs'' <- L.sub zero rhs'
        L.gep lhs' [rhs'']

For int s and float s, we again dispatch to the corresponding machine instruction.

      (TyInt  , TyInt  ) -> L.sub lhs' rhs'
      (TyFloat, TyFloat) -> L.fsub lhs' rhs'
      _ -> error "Internal error - semant failed"

Multiplication and division are easy.

    Mult -> case t of
      TyInt   -> L.mul lhs' rhs'
      TyFloat -> L.fmul lhs' rhs'
      _       -> error "Internal error - semant failed"
    Div -> case t of
      TyInt   -> L.sdiv lhs' rhs'
      TyFloat -> L.fdiv lhs' rhs'
      _       -> error "Internal error - semant failed"

For the exponentiation operator, all remaining cases are raising int s to int s. We can take this opportunity to write some non-trivial LLVM and implement exponentiation as repeated multiplication directly in the IR. In haskell, the algorithm would be

-- We can obviously be more terse but this form maps better onto LLVM
raise lhs rhs = go 1 rhs where
  go acc expt =
    if expt == 0 then acc
    else let nextAcc = lhs * acc
             nextExpt = expt - 1
         in  go nextAcc nextExpt

In order to marry SSA with conditionals, LLVM uses phi nodes. Phi nodes must all appear at the very beginning of a basic block. There cannot be any non-phi instructions preceding them. The phi instruction takes a list of pairs. The first element of each pair is a value and the second element is the label of a basic block which has an outgoing branch to the block with phi nodes.

First, we need to get the label of the enclosing block so that we can start our new block. We then set acc and expt to phi nodes, such that if control flow proceeds into the loop_pow block from the enclosing scope, they are initialized to 1 and rhs, respectively, and if control flow is from continue, they are set to nextAcc and nextExpt. The if clause is handled by issuing a condBr if the expt has reached 0, at which point we either return the acc or jump back to loop_pow. ⁷

Note that we use mdo, courtesy of {-# LANGUAGE RecursiveDo #-}, instead of do, as we need to forward-reference the doneBlock and continueBlock s in our branch instruction. We can't define our blocks with L.block and then branch to them because calling L.block ends the current block and starts a new one. When using other LLVM bindings, one usually has to create all of the blocks and then manually position the builder at the correct location before emitting instructions. However, haskell's laziness allows us to avoid this inelegance and write branching code much more naturally.

    Power -> mdo
      enclosing <- L.currentBlock
      L.br loop
      loop <- L.block `L.named` "loop_pow"
      acc <- L.phi [(L.int32 1, enclosing), (nextAcc, continueBlock)] `L.named` "acc"
      expt <- L.phi [(rhs', enclosing), (nextExpt, continueBlock)] `L.named` "expt"
      done <- L.icmp IP.EQ expt (L.int32 0)
      L.condBr done doneBlock continueBlock
      continueBlock <- L.block `L.named` "continue"
      nextAcc       <- L.mul acc lhs' `L.named` "next_acc"
      nextExpt      <- L.sub expt (L.int32 1) `L.named` "next_expt"
      L.br loop
      doneBlock <- L.block `L.named` "done"
      pure acc

(It is left as an exercise for the reader to implement a more efficient exponentiation algorithm in LLVM.)

The remaining binary operators all map directly onto their LLVM counterparts.

    Equal -> case fst lhs of
      TyInt     -> L.icmp IP.EQ lhs' rhs'
      TyBool    -> L.icmp IP.EQ lhs' rhs'
      TyChar    -> L.icmp IP.EQ lhs' rhs'
      Pointer _ -> L.icmp IP.EQ lhs' rhs'
      TyFloat   -> L.fcmp FP.OEQ lhs' rhs'
      _         -> error "Internal error - semant failed"
    Neq -> case fst lhs of
      TyInt     -> L.icmp IP.NE lhs' rhs'
      TyBool    -> L.icmp IP.NE lhs' rhs'
      TyChar    -> L.icmp IP.NE lhs' rhs'
      Pointer _ -> L.icmp IP.NE lhs' rhs'
      TyFloat   -> L.fcmp FP.ONE lhs' rhs'
      _         -> error "Internal error - semant failed"
    Less -> case fst lhs of
      TyInt   -> L.icmp IP.SLT lhs' rhs'
      TyBool  -> L.icmp IP.SLT lhs' rhs'
      TyChar  -> L.icmp IP.ULT lhs' rhs'
      TyFloat -> L.fcmp FP.OLT lhs' rhs'
      _       -> error "Internal error - semant failed"
    Leq -> case fst lhs of
      TyInt   -> L.icmp IP.SLE lhs' rhs'
      TyBool  -> L.icmp IP.SLE lhs' rhs'
      TyChar  -> L.icmp IP.ULE lhs' rhs'
      TyFloat -> L.fcmp FP.OLE lhs' rhs'
      _       -> error "Internal error - semant failed"
    Greater -> case fst lhs of
      TyInt   -> L.icmp IP.SGT lhs' rhs'
      TyBool  -> L.icmp IP.SGT lhs' rhs'
      TyChar  -> L.icmp IP.UGT lhs' rhs'
      TyFloat -> L.fcmp FP.OGT lhs' rhs'
      _       -> error "Internal error - semant failed"
    Geq -> case fst lhs of
      TyInt   -> L.icmp IP.SGE lhs' rhs'
      TyBool  -> L.icmp IP.SGE lhs' rhs'
      TyChar  -> L.icmp IP.UGE lhs' rhs'
      TyFloat -> L.fcmp FP.OGE lhs' rhs'
      _       -> error "Internal error - semant failed"
    And    -> L.and lhs' rhs'
    Or     -> L.or lhs' rhs'
    BitAnd -> L.and lhs' rhs'
    BitOr  -> L.or lhs' rhs'

Unary operators

There aren't any negation intrinsics in LLVM, but it's easy enough to implement them ourselves.

codegenSexpr (t, SUnop op e) = do
  e' <- codegenSexpr e
  case op of
    Neg -> case t of
      TyInt   -> L.sub (L.int32 0) e'
      TyFloat -> L.fsub (L.double 0) e'
      _       -> error "Internal error - semant failed"
    Not -> case t of
      TyBool -> L.xor e' (L.bit 1)
      _      -> error "Internal error - semant failed"

Function calls

For function calls, we generate code for each argument, look up the function in our Env, and then emit the call instruction. Note that we add an empty list to each argument. LLVM allows us to emit parameter attributes attached to each argument, which we don't really care about.

codegenSexpr (_, SCall fun es) = do
  es' <- mapM (fmap (, []) . codegenSexpr) es
  f <- gets ((M.! fun) . operands)
  L.call f es'

Casts

For casts from type t to t', we simply use the corresponding instruction.

codegenSexpr (_, SCast t' e@(t, _)) = do
  e'       <- codegenSexpr e
  llvmType <- ltypeOfTyp t'
  case (t', t) of
    (Pointer _, Pointer _) -> L.bitcast e' llvmType
    (Pointer _, TyInt    ) -> L.inttoptr e' llvmType
    (TyInt    , Pointer _) -> L.ptrtoint e' llvmType
    -- Signed Int to Floating Point
    (TyFloat  , TyInt    ) -> L.sitofp e' llvmType
    _ -> error "Internal error - semant failed"

Finally, for SNoexpr we just generate a 0, and if something got by the semantic checker, we crash.

codegenSexpr (_, SNoexpr) = pure $ L.int32 0

-- Final catchall
codegenSexpr sx =
  error $ "Internal error - semant failed. Invalid sexpr " <> show sx

Statements

Codegen for statements isn't too bad. In the case of naked expressions, returns, and blocks, we simply reuse the work from codegenSexpr.

codegenStatement :: SStatement -> Codegen ()
codegenStatement (SExpr   e) = void $ codegenSexpr e
codegenStatement (SReturn e) = case e of
  (TyVoid, SNoexpr) -> L.retVoid
  _ -> L.ret =<< codegenSexpr e
codegenStatement (SBlock ss) = mapM_ codegenStatement ss

For conditionals, we follow a similar strategy as we did in implementing integer exponentiation. We generate the condition, branch on it, generate statements for each alternative in the correct block, and then issue an unconditional branch to a merge block. One subtlety that we have to keep in mind is the possibility of a return inside of the if statement. LLVM only allows one kind of terminator in a block, so we can use the mkTerminator ⁸ helper to check if that is the case and if so, do nothing.

codegenStatement (SIf p cons alt) = mdo
  bool <- codegenSexpr p
  L.condBr bool thenBlock elseBlock

  thenBlock <- L.block `L.named` "then"
  codegenStatement cons
  mkTerminator $ L.br mergeBlock

  elseBlock <- L.block `L.named` "else"
  codegenStatement alt
  mkTerminator $ L.br mergeBlock

  mergeBlock <- L.block `L.named` "merge"
  return ()

/images/compiler-if.png

For do while loops, we immediately branch into the while block, generate the code for the condition and the body, then conditionally branch into either the while or merge blocks.

codegenStatement (SDoWhile p body) = mdo
  L.br whileBlock
  whileBlock <- L.block `L.named` "while_body"
  codegenStatement body
  continue <- codegenSexpr p
  mkTerminator $ L.condBr continue whileBlock mergeBlock

  mergeBlock <- L.block `L.named` "merge"
  return ()

/images/compiler-dowhile.png

Functions

To generate function code, we use the function function (who said haskellers were bad at naming?!). We actually need to insert it into the Env before generating code for it in case it calls itself recursively. Fortunately, we have our trusty mdo. After generating the body, we have to re-insert all the strings we encountered back into the global Env so that they can be reused across functions (this is ugly and should be refactored.)

codegenFunc :: SFunction -> LLVM ()
codegenFunc f = mdo
  registerOperand (sname f) function
  (function, strs) <- locally $ do
    retty <- ltypeOfTyp (styp f)
    params  <- mapM mkParam (sformals f)
    fun <- L.function name params retty genBody
    strings' <- gets strings
    pure (fun, strings')
  modify $ \e -> e { strings = strs }

The L.function call merits further discussion. It has type

:: MonadModuleBuilder m => Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> IRBuilderT m () -> m Operand)

which specializes to

:: Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> Codegen ()) -> LLVM Operand (Now we understand why Codegen and LLVM are defined the way they are.)

The name is easy.

 where
  name = mkName (cs $ sname f)

To make parameters, we just find the corresponding LLVM type and suggest the name as it appears in the source file.

  mkParam (Bind t n) = (,) <$> ltypeOfTyp t <*> pure (L.ParameterName (cs n))

To generate the body, we first create an entry block.

  genBody :: [Operand] -> Codegen ()
  genBody ops = do
    _entry <- L.block `L.named` "entry"

Then, for each of the Operand s that the function takes, we allocate space on the stack with alloca, store that Operand in that memory, and register the memory in our Env. ⁹

    forM_ (zip ops (sformals f)) $ \(op, Bind _ n) -> do
      -- typeOf is defined in LLVM.AST.Typed
      addr <- L.alloca (typeOf op) Nothing 0
      L.store addr 0 op
      registerOperand n addr

Local variables are treated similarly, except that we can leave them as uninitialized memory.

    forM_ (slocals f) $ \(Bind t n) -> do
      ltype <- ltypeOfTyp t
      addr  <- L.alloca ltype Nothing 0
      registerOperand n addr

Finally, we generate the body of the function.

    codegenStatement (sbody f)

For built in functions, we can use extern to indicate to the linker to insert them into the final program.

emitBuiltIn :: (String, [AST.Type], AST.Type) -> LLVM ()
emitBuiltIn (name, argtys, retty) = do
  func <- L.extern (mkName name) argtys retty
  registerOperand (cs name) func

-- Printf has varargs so we treat it separately
builtIns :: [(String, [AST.Type], AST.Type)]
builtIns =
  [ ("printbig"     , [AST.i32]               , AST.void)
  , ("llvm.pow.f64" , [AST.double, AST.double], AST.double)
  , ("llvm.powi.i32", [AST.double, AST.i32]   , AST.double)
  , ("malloc"       , [AST.i32]               , AST.ptr AST.i8)
  , ("free"         , [AST.ptr AST.i8]        , AST.void)
  ]

Globals

For global variables, we simply call global with a dummy 0 initial value and add the variable to our Env.

codegenGlobal :: Bind -> LLVM ()
codegenGlobal (Bind t n) = do
  let name    = mkName $ cs n
      initVal = C.Int 0 0
  typ <- ltypeOfTyp t
  var <- L.global name typ initVal
  registerOperand n var

For toplevel structs, we register typedef s at the module level with L.typedef.

emitTypeDef :: Struct -> LLVM AST.Type
emitTypeDef (Struct name _) = do
  typ <- ltypeOfTyp (TyStruct name)
  L.typedef (mkName (cs ("struct." <> name))) (Just typ)

Finally, we generate the entire SProgram by emitting all of the builtin functions, toplevel structs, global variables, and functions sequentially.

codegenProgram :: SProgram -> AST.Module
codegenProgram (structs, globals, funcs) =
  flip evalState (Env { operands = M.empty, structs = structs, strings = M.empty })
    $ L.buildModuleT "microc"
    $ do
        printf <- L.externVarArgs (mkName "printf") [charStar] AST.i32
        registerOperand "printf" printf
        mapM_ emitBuiltIn builtIns
        mapM_ emitTypeDef   structs
        mapM_ codegenGlobal globals
        mapM_ codegenFunc   funcs

(Full source for Codegen.hs here.)

The runtime

There's really very little runtime to speak of, but for demonstration purposes there's a runtime.c file that we link with all of our executables to provide the builtin functions.

#include <stdio.h>
#include <stdlib.h>
void printbig(int c)
{
  // elided...
}

(Full source for runtime.c here).

Linking

Our "linker" is just a thin wrapper around clang. We'll create a Microc.Toplevel module to handle the details.

module Microc.Toplevel where

import           LLVM.AST
import           LLVM.Pretty

import           Data.String.Conversions
import           Data.Text                      ( Text )
import qualified Data.Text.IO                  as T

import           System.IO
import           System.Directory
import           System.Process
import           System.Posix.Temp

import           Control.Exception              ( bracket )

The compile function, given a Module, generates an executable at the supplied path. We call ppllvm from llvm-hs-pretty to dump the textual representation of the Module into a file so that we can call clang on it. We use bracket to make sure that the build artifacts get cleaned up properly.

compile :: Module -> FilePath -> IO ()
compile llvmModule outfile =
  bracket (mkdtemp "build") removePathForcibly $ \buildDir ->
    withCurrentDirectory buildDir $ do
      -- create temporary file for "output.ll"
      (llvm, llvmHandle) <- mkstemps "output" ".ll"
      let runtime = "../src/runtime.c"
      -- write the llvmModule to a file
      T.hPutStrLn llvmHandle (cs $ ppllvm llvmModule)
      hClose llvmHandle
      -- link the runtime with the assembly
      callProcess
        "clang"
        ["-Wno-override-module", "-lm", llvm, runtime, "-o", "../" <> outfile]

We also provide a run function that simply generates an executable, reads its output, then deletes it.

run :: Module -> IO Text
run llvmModule = do
  compile llvmModule "./a.out"
  result <- cs <$> readProcess "./a.out" [] []
  removePathForcibly "./a.out"
  return result

(Full source for Toplevel.hs here.)

Now we can finally finish writing runOpts in Main.hs.

runOpts :: Options -> IO ()
runOpts (Options action infile ptype) = do
  program <- T.readFile infile
  let parseTree = case ptype of
        Combinator -> runParser programP infile program
        Generator  -> Right $ parse . alexScanTokens $ T.unpack program
  case parseTree of
    Left  err -> putStrLn $ errorBundlePretty err
    Right ast -> case action of
      Ast -> putDoc $ pretty ast <> "\n"
      _   -> case checkProgram ast of
        Left err -> putDoc $ pretty err <> "\n"
        Right sast ->
          let llvm = codegenProgram sast
          in  case action of
                Sast            -> pPrint sast
                LLVM            -> T.putStrLn . cs . ppllvm $ llvm
                Compile outfile -> compile llvm outfile
                Run             -> run llvm >>= T.putStr
                Ast             -> error "unreachable"

(Full source for Main.hs here.)

Testing

For testing, we write a similar runner that takes a filepath and returns what would be the result of calling mcc <path> on it.

runFile :: FilePath -> IO Text
runFile infile = do
  program <- T.readFile infile
  let parseTree = runParser programP (cs infile) program
  case parseTree of
    Left  e   -> return . cs $ errorBundlePretty e
    Right ast -> case checkProgram ast of
      Left err ->
        return . renderStrict $ layoutPretty defaultLayoutOptions (pretty err)
      Right sast -> run (codegenProgram sast)

We now have enough code to write passing tests for our compiler. They look very similar to the failing tests from part 2.

passing :: IO TestTree
passing = do
  mcFiles <- findByExtension [".mc"] "tests/pass"
  return $ testGroup
    "pass"
    [ goldenVsString (takeBaseName mcFile) outfile (cs <$> runFile mcFile)
    | mcFile <- mcFiles
    , let outfile = replaceExtension mcFile ".golden"
    ]

main :: IO ()
main = defaultMain =<< goldenTests

goldenTests :: IO TestTree
goldenTests = testGroup "all" <$> sequence [passing, failing, parsing]

(Full source for Testall.hs here.)

Conclusion and Acknowledgments

The compiler is finished! In just 1500 or so lines of haskell, we've implemented a significant amount of the C programming language! Thanks to everyone who's been reading along. I've had fun writing and revisiting my old code. In particular, I'd like to thank

Théophile Choutri who's been sponsoring the series
Moritz Kiefer for maintaining llvm-hs and approving my PR's to the library
Stephen Edwards and Richard Townsend, whose compiler class I took in college; my team's project from that semester, a small, poorly-named functional array language that compiles to LLVM, is on github

We don't use the FFI here mainly because it's very challenging to set up properly. I've just recently figured out how to get everything to compile with nix, though, so we'll explore using the FFI in a future post.

A typed version of the LLVM haskell bindings exists, but I don't know how to use it, and the repo hasn't been active for a while so I'd hesitate to recommend it.

There's a reason that compilers were some of the earliest programs invented. Writing assembly is difficult and error prone and as humans we like to avoid it.

⁴

Why do we do this, you might ask? Because that's what clang does when it compiles code with string literals. It might be architecturally cleaner to collect all of the unique string literals during semantic analysis instead of doing it during codegen but that can be refactored later.

⁵

For reasons pertaining to the type of function from =LLVM.IRBuilder.Module=, Codegen needs to be a monad transformer over LLVM. I don't have a better explanation for this other than that without it, code generation for function bodies doesn't compile.

⁶

Fun fact, I contributed the original version of globalStringPtr to llvm-hs back in 2018, although it's been since improved upon.

⁷

Using `L.named` for variables and blocks isn't strictly necessary, as L.block will choose fresh, non-conflicting names for the block labels, but it makes debugging the generated output significantly easier if they have meaningful names rather than numbers.

⁸

mkTerminator :: Codegen () -> Codegen ()
mkTerminator instr = do
 check <- L.hasTerminator
 unless check instr

⁹

We'll depend heavily on mem2reg to optimize away stack allocations. It would be very hard for us to store all of our variables in registers because our language allows for unrestricted mutation, which plays poorly with SSA. Loading and storing memory doesn't violate SSA, so it's an easy way to implement mutability, but if we were to use registers, we'd have to insert phi nodes everywhere where mutation happened, which would require very extensive analysis that we don't do. At that point, though, we'd just be duplicating optimizations that LLVM already has in its pipeline.