In part 2, we completed the user facing part of the compiler with semantic analysis. Now we're free to focus on the backend, generating LLVM. To do so, we'll use the llvm-hs-pure library, which embeds LLVM into a Haskell ADT, allowing us to manipulate it without having to set up any FFI. We will then pretty print the generated bytecode with llvm-hs-pretty and call clang
on it to generate machine code for our preferred target.1
Some things to keep in mind when writing LLVM generating code:
There is very little type safety. The usual haskell guarantees of "if it typechecks it probably works" do not apply here at all.2 After all, we are writing basically writing assembly, so even if we get our code through the LLVM type checker, there's still a good chance that it won't do what we want it to.3 This is where having a comprehensive test suite is invaluable.
Documentation for LLVM in general is quite scant. The official language reference manual is comprehensive, but can be quite terse and provides little in the way of examples, although this gets better all the time and is not nearly as bad now as it was a few years ago. Documentation for llvm-hs specifically is even scarcer: basically just some tiny examples and the haskell port of the kaleidoscope tutorial, which is by now very outdated. This post will hopefully help matters somewhat but it leaves many parts of the LLVM IR and ecosystem unexplored.
Let
clang
help you. You can run it with-emit-llvm
on a C file to see what it generates. This is helpful even when you aren't writing a C-like language as you can see how certain high-level language concepts map onto LLVM.
LLVM Basics
The following is a simplified version of how LLVM works in order to create a basic mental model. There are many omissions.
LLVM programs are split into modules containing toplevel definitions like functions and global variables. For our purposes, we are concerned with emitting a single module containing all of our functions, global variables, and typedefs for our structs. Within functions, code is contained in basic blocks. A basic block is a list of sequential instructions in SSA (Single Static Assignment) form that is terminated in a conditional or unconditional branch to another block or in a return instruction.
LLVM, unlike many forms of assembly, is typed. Its type system resembles C's in many ways, but there are some notable differences. Instead of C's confusing and target dependent integer types short
, long
, long long
, etc., LLVM has arbitrary N bit width integers denoted iN
. Also, void*
is banned in LLVM, so char*
or i8*
is used instead. LLVM also has native vector types for SIMD instructions, but those won't factor into our discussion. There are other exotic features in the type system that help optimization like poison values but we won't discuss them here.
Codegen.hs
With that out of the way, we can start writing code. This module has quite a few imports.
{-# LANGUAGE RecursiveDo #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE FlexibleContexts #-}
-- We need these to write a ConvertibleStrings instance for
-- ShortByteString
{-# LANGUAGE MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
module Microc.Codegen
( codegenProgram
)
where
import qualified LLVM.AST.IntegerPredicate as IP
import qualified LLVM.AST.FloatingPointPredicate
as FP
import LLVM.AST ( Operand )
import qualified LLVM.AST as AST
import qualified LLVM.AST.Type as AST
import qualified LLVM.AST.Constant as C
import LLVM.AST.Name
import LLVM.AST.Typed ( typeOf )
import qualified LLVM.IRBuilder.Module as L
import qualified LLVM.IRBuilder.Monad as L
import qualified LLVM.IRBuilder.Instruction as L
import qualified LLVM.IRBuilder.Constant as L
import LLVM.Prelude ( ShortByteString )
import qualified Data.Map as M
import Control.Monad.State
import Data.String ( fromString )
import Microc.Sast
import Microc.Ast ( Type(..)
, Op(..)
, Uop(..)
, Bind(..)
, Struct(..)
)
import Data.String.Conversions
import qualified Data.Text as T
import Data.Text ( Text )
import Data.Word ( Word32 )
import Data.List ( find )
In llvm-hs
, values that can be passed as arguments to LLVM instructions have type Operand
. This includes basically all variables as well as declared functions. Just as in Semant
, we define an Env
type to hold information about our generated operands and how they correspond to the names of their corresponding source variables. Besides the operands, we need to keep track of all struct declarations and all string literals in the source, as we emit a unique global variable for each unique string literal.4
data Env = Env { operands :: M.Map Text Operand
, structs :: [ Struct ]
, strings :: M.Map Text Operand
}
deriving (Eq, Show)
registerOperand :: MonadState Env m => Text -> Operand -> m ()
registerOperand name op =
modify $ \env -> env { operands = M.insert name op (operands env) }
Utilities
Working with LLVM bindings in other languages usually involves passing mutable builder and module context objects to all instruction-emitting functions in order to ensure that all variables have unique names and to maintain the integrity of the module. This is important to ensure that code remains in SSA form.
Of course, since we're not working in other languages, this approach of passing around mutable objects would be severely un-ergonomic, at best. Fortunately, llvm-hs
provides us with monads that emulate this behavior, ModuleBuilderT
for the module context, and IRBuilderT
for the builder object. We'll establish two type synonyms, LLVM
for generating top level entities and Codegen
5 for generating basic blocks.
type LLVM = L.ModuleBuilderT (State Env)
type Codegen = L.IRBuilderT LLVM
We'll also write some utilities to query struct fields defined in our Env
, to convert from MicroC types to LLVM types, and to calculate the sizes of MicroC types. For structs, we emit packed fields, which is pretty bad for performance, but makes calculating sizes very easy. Note that by this phase of the compiler, we no longer report errors to the user, so if anything goes wrong, we'll just crash.
getFields :: MonadState Env m => Text -> m [Bind]
getFields name = do
ss <- gets structs
case find (\s -> structName s == name) ss of
Nothing -> error "Internal error - struct not found"
Just (Struct _ binds) -> pure binds
charStar :: AST.Type
charStar = AST.ptr AST.i8
-- llvm-hs uses ShortByteString for names, but we want
-- easy conversion to Text with cs from Data.String.Conversions
instance ConvertibleStrings Text ShortByteString where
convertString = fromString . T.unpack
ltypeOfTyp :: MonadState Env m => Type -> m AST.Type
ltypeOfTyp = \case
TyVoid -> pure AST.void
TyInt -> pure AST.i32
TyChar -> pure AST.i8
TyFloat -> pure AST.double
TyBool -> pure AST.i1
-- (void *) is invalid LLVM
Pointer TyVoid -> pure $ charStar
-- special case to handle recursively defined structures
-- TODO: add real cycle checking so that improperly defined
-- recursive types case the compiler to hang forever
Pointer (TyStruct n) ->
pure $ AST.ptr (AST.NamedTypeReference (mkName $ cs ("struct." <> n)))
Pointer t -> fmap AST.ptr (ltypeOfTyp t)
TyStruct n -> do
fields <- getFields n
typs <- mapM (ltypeOfTyp . bindType) fields
-- Packed structs aren't great for performance but very easy to code for now
pure $ AST.StructureType { AST.isPacked = True, AST.elementTypes = typs }
sizeof :: MonadState Env m => Type -> m Word32
sizeof = \case
TyBool -> pure 1
TyChar -> pure 1
TyInt -> pure 4
TyFloat -> pure 8
TyVoid -> pure 0
Pointer _ -> pure 8
TyStruct n -> do
fields <- getFields n
sizes <- mapM (sizeof . bindType) fields
pure (sum sizes)
Expressions
LVals
Now, we're ready to generate code for expressions. First, LVal
s. When generating an LVal
, we generate an Operand
corresponding to the address of the value. That way, we can use it as an argument to the store instruction. For variables, we simply look up the variable name in the Env
.
codegenLVal :: LValue -> Codegen Operand
codegenLVal (SId name) = gets ((M.! name) . operands)
Since we are generating addresses, dereferencing is essentially the inverse of this, so we just generate code for the underlying expression.
codegenLVal (SDeref e ) = codegenSexpr e
For struct access, we get to use the fascinating (read: confusing) getelementptr instruction. The instruction only calculates addresses, it doesn't load memory, so it's a perfect fit for the semantics of codegenLVal
. We generate the address of the left hand side of the access and then have to pass two arguments to gep
, a zero to access the memory pointed to by the address we just calculated, then the offset of the struct field we want to access, which we calculated in semant. Note that getelementptr
handles calculating alignment, so we don't need to do it ourselves.
codegenLVal (SAccess e i) = do
e' <- codegenLVal e
let zero = L.int32 0
offset = L.int32 (fromIntegral i)
L.gep e' [zero, offset]
Literals
Most literals, as usual, are straightforward.
codegenSexpr :: SExpr -> Codegen Operand
codegenSexpr (TyInt , SLiteral i ) = pure $ L.int32 (fromIntegral i)
codegenSexpr (TyFloat, SFliteral f) = pure $ L.double f
codegenSexpr (TyBool , SBoolLit b ) = pure $ L.bit (if b then 1 else 0)
codegenSexpr (TyChar , SCharLit c ) = pure $ L.int8 (fromIntegral c)
Strings, however, are not. We look up the string literal in the Env
to see if we've generated a global variable for it before. If so, we just return that. Otherwise, we use globalStringPtr
6 to generate a pointer to a global string variable. We name each variable "0.str", "1.str" etc., since mkName
crashes with non-ASCII input, which we haven't explicitly forbidden in our string literals. Note that globalStringPtr
returns a Constant
which is distinct from an Operand
, so we need to promote it with AST.ConstantOperand
.
codegenSexpr (Pointer TyChar, SStrLit s ) = do
-- Generate a new unique global variable for every string literal we see
strs <- gets strings
case M.lookup s strs of
Nothing -> do
let nm = mkName (show (M.size strs) <> ".str")
op <- L.globalStringPtr (cs s) nm
modify $ \env -> env { strings = M.insert s (AST.ConstantOperand op) strs }
pure (AST.ConstantOperand op)
Just op -> pure op
Null pointers are generated with inttoptr
.
codegenSexpr (t, SNull) = L.inttoptr (L.int64 0) =<< ltypeOfTyp t
Sizeof
is calculated with the sizeof
function we wrote earlier.
codegenSexpr (TyInt, SSizeof t) = L.int32 . fromIntegral <$> sizeof t
The &
operator finds the address of an LVal
, which is already taken care of by codegenLVal
.
codegenSexpr (_, SAddr e) = codegenLVal e
Binary operators
For assignment, we calculate the address of the left hand side, the value of the right hand side, and then store said value at the address, returning the value.
codegenSexpr (_, SAssign lhs rhs) = do
rhs' <- codegenSexpr rhs
lhs' <- codegenLVal lhs
L.store lhs' 0 rhs'
return rhs'
For the Binop
constructor, we begin by generating code for the left and right and sides.
codegenSexpr (t, SBinop op lhs rhs) = do
lhs' <- codegenSexpr lhs
rhs' <- codegenSexpr rhs
case op of
For addition on int
s and float
s, we simply generate the corresponding machine instruction. For pointer addition, getElementPtr
takes care of calculating the offset for each pointer type so we don't have to worry about it.
Add -> case (fst lhs, fst rhs) of
(Pointer _, TyInt ) -> L.gep lhs' [rhs']
(TyInt , Pointer _) -> L.gep rhs' [lhs']
(TyInt , TyInt ) -> L.add lhs' rhs'
(TyFloat , TyFloat ) -> L.fadd lhs' rhs'
_ -> error "Internal error - semant failed"
For pointer subtraction, we do actually have to calculate the pointer width ourselves and divide the difference in addresses by it.
Sub -> let zero = L.int64 0 in case (fst lhs, fst rhs) of
(Pointer typ, Pointer typ') ->
if typ' /= typ then error "Internal error - semant failed" else do
lhs'' <- L.ptrtoint lhs' AST.i64
rhs'' <- L.ptrtoint rhs' AST.i64
diff <- L.sub lhs'' rhs''
width <- L.int64 . fromIntegral <$> sizeof typ
L.sdiv diff width
Subtracting int
s from pointers is similar to adding them, except that we negate the int
before passing it to getElementPtr
.
(Pointer _, TyInt) -> do
rhs'' <- L.sub zero rhs'
L.gep lhs' [rhs'']
For int
s and float
s, we again dispatch to the corresponding machine instruction.
(TyInt , TyInt ) -> L.sub lhs' rhs'
(TyFloat, TyFloat) -> L.fsub lhs' rhs'
_ -> error "Internal error - semant failed"
Multiplication and division are easy.
Mult -> case t of
TyInt -> L.mul lhs' rhs'
TyFloat -> L.fmul lhs' rhs'
_ -> error "Internal error - semant failed"
Div -> case t of
TyInt -> L.sdiv lhs' rhs'
TyFloat -> L.fdiv lhs' rhs'
_ -> error "Internal error - semant failed"
For the exponentiation operator, all remaining cases are raising int
s to int
s. We can take this opportunity to write some non-trivial LLVM and implement exponentiation as repeated multiplication directly in the IR. In haskell, the algorithm would be
-- We can obviously be more terse but this form maps better onto LLVM
raise lhs rhs = go 1 rhs where
go acc expt =
if expt == 0 then acc
else let nextAcc = lhs * acc
nextExpt = expt - 1
in go nextAcc nextExpt
In order to marry SSA with conditionals, LLVM uses phi nodes. Phi nodes must all appear at the very beginning of a basic block. There cannot be any non-phi instructions preceding them. The phi instruction takes a list of pairs. The first element of each pair is a value and the second element is the label of a basic block which has an outgoing branch to the block with phi nodes.
First, we need to get the label of the enclosing block so that we can start our new block. We then set acc
and expt
to phi nodes, such that if control flow proceeds into the loop_pow
block from the enclosing scope, they are initialized to 1 and rhs
, respectively, and if control flow is from continue
, they are set to nextAcc
and nextExpt
. The if
clause is handled by issuing a condBr
if the expt
has reached 0, at which point we either return the acc
or jump back to loop_pow
. 7
Note that we use mdo
, courtesy of {-# LANGUAGE RecursiveDo #-}
, instead of do
, as we need to forward-reference the doneBlock
and continueBlock
s in our branch instruction. We can't define our blocks with L.block
and then branch to them because calling L.block
ends the current block and starts a new one. When using other LLVM bindings, one usually has to create all of the blocks and then manually position the builder at the correct location before emitting instructions. However, haskell's laziness allows us to avoid this inelegance and write branching code much more naturally.
Power -> mdo
enclosing <- L.currentBlock
L.br loop
loop <- L.block `L.named` "loop_pow"
acc <- L.phi [(L.int32 1, enclosing), (nextAcc, continueBlock)] `L.named` "acc"
expt <- L.phi [(rhs', enclosing), (nextExpt, continueBlock)] `L.named` "expt"
done <- L.icmp IP.EQ expt (L.int32 0)
L.condBr done doneBlock continueBlock
continueBlock <- L.block `L.named` "continue"
nextAcc <- L.mul acc lhs' `L.named` "next_acc"
nextExpt <- L.sub expt (L.int32 1) `L.named` "next_expt"
L.br loop
doneBlock <- L.block `L.named` "done"
pure acc
(It is left as an exercise for the reader to implement a more efficient exponentiation algorithm in LLVM.)
The remaining binary operators all map directly onto their LLVM counterparts.
Equal -> case fst lhs of
TyInt -> L.icmp IP.EQ lhs' rhs'
TyBool -> L.icmp IP.EQ lhs' rhs'
TyChar -> L.icmp IP.EQ lhs' rhs'
Pointer _ -> L.icmp IP.EQ lhs' rhs'
TyFloat -> L.fcmp FP.OEQ lhs' rhs'
_ -> error "Internal error - semant failed"
Neq -> case fst lhs of
TyInt -> L.icmp IP.NE lhs' rhs'
TyBool -> L.icmp IP.NE lhs' rhs'
TyChar -> L.icmp IP.NE lhs' rhs'
Pointer _ -> L.icmp IP.NE lhs' rhs'
TyFloat -> L.fcmp FP.ONE lhs' rhs'
_ -> error "Internal error - semant failed"
Less -> case fst lhs of
TyInt -> L.icmp IP.SLT lhs' rhs'
TyBool -> L.icmp IP.SLT lhs' rhs'
TyChar -> L.icmp IP.ULT lhs' rhs'
TyFloat -> L.fcmp FP.OLT lhs' rhs'
_ -> error "Internal error - semant failed"
Leq -> case fst lhs of
TyInt -> L.icmp IP.SLE lhs' rhs'
TyBool -> L.icmp IP.SLE lhs' rhs'
TyChar -> L.icmp IP.ULE lhs' rhs'
TyFloat -> L.fcmp FP.OLE lhs' rhs'
_ -> error "Internal error - semant failed"
Greater -> case fst lhs of
TyInt -> L.icmp IP.SGT lhs' rhs'
TyBool -> L.icmp IP.SGT lhs' rhs'
TyChar -> L.icmp IP.UGT lhs' rhs'
TyFloat -> L.fcmp FP.OGT lhs' rhs'
_ -> error "Internal error - semant failed"
Geq -> case fst lhs of
TyInt -> L.icmp IP.SGE lhs' rhs'
TyBool -> L.icmp IP.SGE lhs' rhs'
TyChar -> L.icmp IP.UGE lhs' rhs'
TyFloat -> L.fcmp FP.OGE lhs' rhs'
_ -> error "Internal error - semant failed"
And -> L.and lhs' rhs'
Or -> L.or lhs' rhs'
BitAnd -> L.and lhs' rhs'
BitOr -> L.or lhs' rhs'
Unary operators
There aren't any negation intrinsics in LLVM, but it's easy enough to implement them ourselves.
codegenSexpr (t, SUnop op e) = do
e' <- codegenSexpr e
case op of
Neg -> case t of
TyInt -> L.sub (L.int32 0) e'
TyFloat -> L.fsub (L.double 0) e'
_ -> error "Internal error - semant failed"
Not -> case t of
TyBool -> L.xor e' (L.bit 1)
_ -> error "Internal error - semant failed"
Function calls
For function calls, we generate code for each argument, look up the function in our Env
, and then emit the call
instruction. Note that we add an empty list to each argument. LLVM allows us to emit parameter attributes attached to each argument, which we don't really care about.
codegenSexpr (_, SCall fun es) = do
es' <- mapM (fmap (, []) . codegenSexpr) es
f <- gets ((M.! fun) . operands)
L.call f es'
Casts
For casts from type t
to t'
, we simply use the corresponding instruction.
codegenSexpr (_, SCast t' e@(t, _)) = do
e' <- codegenSexpr e
llvmType <- ltypeOfTyp t'
case (t', t) of
(Pointer _, Pointer _) -> L.bitcast e' llvmType
(Pointer _, TyInt ) -> L.inttoptr e' llvmType
(TyInt , Pointer _) -> L.ptrtoint e' llvmType
-- Signed Int to Floating Point
(TyFloat , TyInt ) -> L.sitofp e' llvmType
_ -> error "Internal error - semant failed"
Finally, for SNoexpr
we just generate a 0, and if something got by the semantic checker, we crash.
codegenSexpr (_, SNoexpr) = pure $ L.int32 0
-- Final catchall
codegenSexpr sx =
error $ "Internal error - semant failed. Invalid sexpr " <> show sx
Statements
Codegen for statements isn't too bad. In the case of naked expressions, returns, and blocks, we simply reuse the work from codegenSexpr
.
codegenStatement :: SStatement -> Codegen ()
codegenStatement (SExpr e) = void $ codegenSexpr e
codegenStatement (SReturn e) = case e of
(TyVoid, SNoexpr) -> L.retVoid
_ -> L.ret =<< codegenSexpr e
codegenStatement (SBlock ss) = mapM_ codegenStatement ss
For conditionals, we follow a similar strategy as we did in implementing integer exponentiation. We generate the condition, branch on it, generate statements for each alternative in the correct block, and then issue an unconditional branch to a merge
block. One subtlety that we have to keep in mind is the possibility of a return
inside of the if statement. LLVM only allows one kind of terminator in a block, so we can use the mkTerminator
8 helper to check if that is the case and if so, do nothing.
codegenStatement (SIf p cons alt) = mdo
bool <- codegenSexpr p
L.condBr bool thenBlock elseBlock
thenBlock <- L.block `L.named` "then"
codegenStatement cons
mkTerminator $ L.br mergeBlock
elseBlock <- L.block `L.named` "else"
codegenStatement alt
mkTerminator $ L.br mergeBlock
mergeBlock <- L.block `L.named` "merge"
return ()
For do while loops, we immediately branch into the while
block, generate the code for the condition and the body, then conditionally branch into either the while
or merge
blocks.
codegenStatement (SDoWhile p body) = mdo
L.br whileBlock
whileBlock <- L.block `L.named` "while_body"
codegenStatement body
continue <- codegenSexpr p
mkTerminator $ L.condBr continue whileBlock mergeBlock
mergeBlock <- L.block `L.named` "merge"
return ()
Functions
To generate function code, we use the function
function (who said haskellers were bad at naming?!). We actually need to insert it into the Env
before generating code for it in case it calls itself recursively. Fortunately, we have our trusty mdo
. After generating the body, we have to re-insert all the strings we encountered back into the global Env
so that they can be reused across functions (this is ugly and should be refactored.)
codegenFunc :: SFunction -> LLVM ()
codegenFunc f = mdo
registerOperand (sname f) function
(function, strs) <- locally $ do
retty <- ltypeOfTyp (styp f)
params <- mapM mkParam (sformals f)
fun <- L.function name params retty genBody
strings' <- gets strings
pure (fun, strings')
modify $ \e -> e { strings = strs }
The L.function
call merits further discussion. It has type
:: MonadModuleBuilder m => Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> IRBuilderT m () -> m Operand)
which specializes to
:: Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> Codegen ()) -> LLVM Operand
(Now we understand why Codegen
and LLVM
are defined the way they are.)
The name
is easy.
where
name = mkName (cs $ sname f)
To make parameters, we just find the corresponding LLVM type and suggest the name as it appears in the source file.
mkParam (Bind t n) = (,) <$> ltypeOfTyp t <*> pure (L.ParameterName (cs n))
To generate the body, we first create an entry
block.
genBody :: [Operand] -> Codegen ()
genBody ops = do
_entry <- L.block `L.named` "entry"
Then, for each of the Operand
s that the function takes, we allocate space on the stack with alloca
, store that Operand
in that memory, and register the memory in our Env
. 9
forM_ (zip ops (sformals f)) $ \(op, Bind _ n) -> do
-- typeOf is defined in LLVM.AST.Typed
addr <- L.alloca (typeOf op) Nothing 0
L.store addr 0 op
registerOperand n addr
Local variables are treated similarly, except that we can leave them as uninitialized memory.
forM_ (slocals f) $ \(Bind t n) -> do
ltype <- ltypeOfTyp t
addr <- L.alloca ltype Nothing 0
registerOperand n addr
Finally, we generate the body of the function.
codegenStatement (sbody f)
For built in functions, we can use extern
to indicate to the linker to insert them into the final program.
emitBuiltIn :: (String, [AST.Type], AST.Type) -> LLVM ()
emitBuiltIn (name, argtys, retty) = do
func <- L.extern (mkName name) argtys retty
registerOperand (cs name) func
-- Printf has varargs so we treat it separately
builtIns :: [(String, [AST.Type], AST.Type)]
builtIns =
[ ("printbig" , [AST.i32] , AST.void)
, ("llvm.pow.f64" , [AST.double, AST.double], AST.double)
, ("llvm.powi.i32", [AST.double, AST.i32] , AST.double)
, ("malloc" , [AST.i32] , AST.ptr AST.i8)
, ("free" , [AST.ptr AST.i8] , AST.void)
]
Globals
For global variables, we simply call global
with a dummy 0 initial value and add the variable to our Env
.
codegenGlobal :: Bind -> LLVM ()
codegenGlobal (Bind t n) = do
let name = mkName $ cs n
initVal = C.Int 0 0
typ <- ltypeOfTyp t
var <- L.global name typ initVal
registerOperand n var
For toplevel structs, we register typedef
s at the module level with L.typedef
.
emitTypeDef :: Struct -> LLVM AST.Type
emitTypeDef (Struct name _) = do
typ <- ltypeOfTyp (TyStruct name)
L.typedef (mkName (cs ("struct." <> name))) (Just typ)
Finally, we generate the entire SProgram
by emitting all of the builtin functions, toplevel structs, global variables, and functions sequentially.
codegenProgram :: SProgram -> AST.Module
codegenProgram (structs, globals, funcs) =
flip evalState (Env { operands = M.empty, structs = structs, strings = M.empty })
$ L.buildModuleT "microc"
$ do
printf <- L.externVarArgs (mkName "printf") [charStar] AST.i32
registerOperand "printf" printf
mapM_ emitBuiltIn builtIns
mapM_ emitTypeDef structs
mapM_ codegenGlobal globals
mapM_ codegenFunc funcs
(Full source for Codegen.hs
here.)
The runtime
There's really very little runtime to speak of, but for demonstration purposes there's a runtime.c
file that we link with all of our executables to provide the builtin functions.
#include <stdio.h>
#include <stdlib.h>
void printbig(int c)
{
// elided...
}
(Full source for runtime.c
here).
Linking
Our "linker" is just a thin wrapper around clang
. We'll create a Microc.Toplevel
module to handle the details.
module Microc.Toplevel where
import LLVM.AST
import LLVM.Pretty
import Data.String.Conversions
import Data.Text ( Text )
import qualified Data.Text.IO as T
import System.IO
import System.Directory
import System.Process
import System.Posix.Temp
import Control.Exception ( bracket )
The compile
function, given a Module
, generates an executable at the supplied path. We call ppllvm
from llvm-hs-pretty
to dump the textual representation of the Module
into a file so that we can call clang
on it. We use bracket
to make sure that the build artifacts get cleaned up properly.
compile :: Module -> FilePath -> IO ()
compile llvmModule outfile =
bracket (mkdtemp "build") removePathForcibly $ \buildDir ->
withCurrentDirectory buildDir $ do
-- create temporary file for "output.ll"
(llvm, llvmHandle) <- mkstemps "output" ".ll"
let runtime = "../src/runtime.c"
-- write the llvmModule to a file
T.hPutStrLn llvmHandle (cs $ ppllvm llvmModule)
hClose llvmHandle
-- link the runtime with the assembly
callProcess
"clang"
["-Wno-override-module", "-lm", llvm, runtime, "-o", "../" <> outfile]
We also provide a run
function that simply generates an executable, reads its output, then deletes it.
run :: Module -> IO Text
run llvmModule = do
compile llvmModule "./a.out"
result <- cs <$> readProcess "./a.out" [] []
removePathForcibly "./a.out"
return result
(Full source for Toplevel.hs
here.)
Now we can finally finish writing runOpts
in Main.hs
.
runOpts :: Options -> IO ()
runOpts (Options action infile ptype) = do
program <- T.readFile infile
let parseTree = case ptype of
Combinator -> runParser programP infile program
Generator -> Right $ parse . alexScanTokens $ T.unpack program
case parseTree of
Left err -> putStrLn $ errorBundlePretty err
Right ast -> case action of
Ast -> putDoc $ pretty ast <> "\n"
_ -> case checkProgram ast of
Left err -> putDoc $ pretty err <> "\n"
Right sast ->
let llvm = codegenProgram sast
in case action of
Sast -> pPrint sast
LLVM -> T.putStrLn . cs . ppllvm $ llvm
Compile outfile -> compile llvm outfile
Run -> run llvm >>= T.putStr
Ast -> error "unreachable"
(Full source for Main.hs
here.)
Testing
For testing, we write a similar runner that takes a filepath and returns what would be the result of calling mcc <path>
on it.
runFile :: FilePath -> IO Text
runFile infile = do
program <- T.readFile infile
let parseTree = runParser programP (cs infile) program
case parseTree of
Left e -> return . cs $ errorBundlePretty e
Right ast -> case checkProgram ast of
Left err ->
return . renderStrict $ layoutPretty defaultLayoutOptions (pretty err)
Right sast -> run (codegenProgram sast)
We now have enough code to write passing tests for our compiler. They look very similar to the failing tests from part 2.
passing :: IO TestTree
passing = do
mcFiles <- findByExtension [".mc"] "tests/pass"
return $ testGroup
"pass"
[ goldenVsString (takeBaseName mcFile) outfile (cs <$> runFile mcFile)
| mcFile <- mcFiles
, let outfile = replaceExtension mcFile ".golden"
]
main :: IO ()
main = defaultMain =<< goldenTests
goldenTests :: IO TestTree
goldenTests = testGroup "all" <$> sequence [passing, failing, parsing]
(Full source for Testall.hs
here.)
Conclusion and Acknowledgments
The compiler is finished! In just 1500 or so lines of haskell, we've implemented a significant amount of the C programming language! Thanks to everyone who's been reading along. I've had fun writing and revisiting my old code. In particular, I'd like to thank
Théophile Choutri who's been sponsoring the series
Moritz Kiefer for maintaining llvm-hs and approving my PR's to the library
Stephen Edwards and Richard Townsend, whose compiler class I took in college; my team's project from that semester, a small, poorly-named functional array language that compiles to LLVM, is on github
We don't use the FFI here mainly because it's very challenging to set up properly. I've just recently figured out how to get everything to compile with nix, though, so we'll explore using the FFI in a future post.
A typed version of the LLVM haskell bindings exists, but I don't know how to use it, and the repo hasn't been active for a while so I'd hesitate to recommend it.
There's a reason that compilers were some of the earliest programs invented. Writing assembly is difficult and error prone and as humans we like to avoid it.
Why do we do this, you might ask? Because that's what clang does when it compiles code with string literals. It might be architecturally cleaner to collect all of the unique string literals during semantic analysis instead of doing it during codegen but that can be refactored later.
For reasons pertaining to the type of function
from =LLVM.IRBuilder.Module=, Codegen
needs to be a monad transformer over LLVM
. I don't have a better explanation for this other than that without it, code generation for function bodies doesn't compile.
Fun fact, I contributed the original version of globalStringPtr
to llvm-hs
back in 2018, although it's been since improved upon.
Using `L.named`
for variables and blocks isn't strictly necessary, as L.block
will choose fresh, non-conflicting names for the block labels, but it makes debugging the generated output significantly easier if they have meaningful names rather than numbers.
mkTerminator :: Codegen () -> Codegen ()
mkTerminator instr = do
check <- L.hasTerminator
unless check instr
We'll depend heavily on mem2reg to optimize away stack allocations. It would be very hard for us to store all of our variables in registers because our language allows for unrestricted mutation, which plays poorly with SSA. Loading and storing memory doesn't violate SSA, so it's an easy way to implement mutability, but if we were to use registers, we'd have to insert phi nodes everywhere where mutation happened, which would require very extensive analysis that we don't do. At that point, though, we'd just be duplicating optimizations that LLVM already has in its pipeline.