Jasmine Tang

My blog

Jasmine's llvm playground - 1

2024-07-13

Edit:
My resume is here (and downloadable here). If you know of a compiler related job posting, please feel free to contact me or refer me at either tanghocle456@gmail.com or jjasmine@berkeley.edu.

Hi everyone, it's Jasmine here, or badumbatish.

Inspired by mcyoung's llvm ir blog and sha4dy's learning llvm blog, I think it'd be great to write some kind of similar blog posts, albeit I'm still learning :)

In this blog post, let's try to get a minimal example of LLVM running.

Throughout the blog post, I'll try to help explain what each piece of functions do, as well as connect the dots between the LLVM API calls and the resulting LLVM IR.

We'll get started by installing LLVM, write a CMakeLists.txt to build our code, and create a main.cpp that uses LLVM API to generate a main function that returns 0.

Installing llvm

LLVM is a very well-known compiler framework so chances are you wouldn't have to build it from source and just install it from your favorite package manager.

On my MacOS, I'll use homebrew to install llvm

brew install llvm

On MacOS, for some reason CMake fails to recognize and find LLVM via find_package() so you would have to configure it with some extra steps

Run

brew info llvm

It'll spit out some extra information about linking flags for you

  badumbatish.github.io git:(main) brew info llvm
==> llvm: stable 18.1.8 (bottled), HEAD [keg-only]
Next-gen compiler infrastructure
https://llvm.org/
Installed
/opt/homebrew/Cellar/llvm/18.1.8 (7,722 files, 1.8GB)
  Poured from bottle using the formulae.brew.sh API on 2024-07-01 at 16:36:01
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/l/llvm.rb
License: Apache-2.0 with LLVM-exception
==> Dependencies
Build: cmake ✘, ninja ✔, swig
Required: python@3.12 ✔, xz ✔, z3 ✔, zstd
==> Options
--HEAD
        Install HEAD version
==> Caveats
To use the bundled libc++ please add the following LDFLAGS:
  LDFLAGS="-L/opt/homebrew/opt/llvm/lib/c++ -Wl,-rpath,/opt/homebrew/opt/llvm/lib/c++"

llvm is keg-only, which means it was not symlinked into /opt/homebrew,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

If you need to have llvm first in your PATH, run:
  echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc

For compilers to find llvm you may need to set:
  export LDFLAGS="-L/opt/homebrew/opt/llvm/lib"
  export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"
==> Analytics
install: 74,089 (30 days), 216,506 (90 days), 602,102 (365 days)
install-on-request: 34,994 (30 days), 109,493 (90 days), 314,945 (365 days)
build-error: 727 (30 days)

Build system(s)

The official build system language used by LLVM is CMake, and thus that is what the playground introduces.

Make sure you have your CMake installed and create a CMakeLists.txt file.

Now, by running the following code, CMake will be able to detect your llvm installation (on MacOS) (this is from before)

echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc \
&&  export LDFLAGS="-L/opt/homebrew/opt/llvm/lib" \
&&  export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"

It's not the cleanest way to handle these things, but when you're hacking around, it should be good enough.

Now for your CMakeLists.txt

cmake_minimum_required(VERSION 3.22)
project(llvm_playground) 
find_package(LLVM REQUIRED CONFIG)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include(AddLLVM)

SET(CMAKE_EXPORT_COMPILE_COMMANDS ON)

Here we define the minimum cmake version and the project name with the first two line.

The third line asks CMake to find LLVM and crash if it can't be found.

We'll use C++17. The include command loads the AddLLVM.cmake file so we can later use some of its utility function for our executable.

Verify your cmake script is good

It helps to verify small steps by small steps that everything works. Run

cmake -S . -B build && cmake --build build

to use cmake to build using the CMakelists.txt script we have create in the current folder and store build artifacts in the build folder before adding these two lines that will help us build our main.cpp

add_llvm_executable(main main.cpp) # This is from AddLLVM
target_include_directories(main ${LLVM_INCLUDE_DIRS})

add_llvm_executable is from AddLLVM and LLVM_INCLUDE_DIRS is a list of include paths to directories containing LLVM headers (more details at llvm cmake)

We include that variable to our target so that main.cpp can include all the necessary header for usage later.

For your main.cpp, we'll try to use llvm to create a function main() that returns an int of 0 to generate LLVM IR and then use LLVM's llc to compile it into an object file and use clang to compile that object file to an executable :)

It will be fun :)

cpp

First create your main.cpp,

Include headers

We'll first include our necessary header in order to use our helper functions

#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/Verifier.h>

using namespace llvm;

The first header, with the class LLVMContext, is

... an important class for using LLVM in a threaded context. It (opaquely) owns and manages the core "global" data of LLVM's core infrastructure, including the type and constant uniquing tables.

The second one, IRBuilder (Intermediate Representation Builder) is an important class in LLVM

This provides a uniform API for creating instructions and inserting them into a basic block: either at the end of a BasicBlock, or at a specific iterator location in a block.

This brings us into one of the most important building (basic) blocks (haha see what i did there)

I really think sh4dy's explanation of Modules, BasicBlock is much more succint and helpful than mine, please check out their blog here

A basic block is a straight-line sequence of instructions with no branches, meaning that execution starts at a single entry point and proceeds sequentially to a single exit point, where it then continues to the next basic block. Basic blocks belong to functions and cannot have jumps into their middle, ensuring that once execution starts, it will proceed through all instructions in the block. The first instruction of a basic block is known as the leader.

What we'll do is we'll create a basic block in our main function, and then wrap that basic block around this IRBuilder to more easily do stuff.

Necesary functions includes the 4th constructor of IRBuilder

llvm::IRBuilder< FolderTy, InserterTy >::IRBuilder  (
        BasicBlock *TheBB,
        MDNode * FPMathTag = nullptr,
        ArrayRef< OperandBundleDef >OpBundles = std::nullopt )		

We choose this constructor since it only requires 1 input of BasicBlock*; without going into much detail (because I also don't know anything), MDNode stands for a metadata node and the OpBundles stands for ... something.

and the construction of BasicBlock

static BasicBlock * llvm::BasicBlock::Create (
        LLVMContext &Context,
        const Twine &Name = "",
        Function *Parent = nullptr,
        BasicBlock *InsertBefore = nullptr)	

Let's walk through what each of these things mean besides LLVMContext.

Twine is "A lightweight data structure for efficiently representing the concatenation of temporary values as strings." It would entertain readers (for me at least) that this meaning carries over from the dictionary:

twine (noun) (u) : strong string made by twisting together two or more lengths of string

In our use, we name our basic block through this parameter. Twine also has very nice string-based constructors that allow us to pass a const char* or a std::string in its place.

The last two parameters is related to each other through the insertion position of the newly created basic block. LLVM documentation mandates:

If the Parent parameter is specified, the basic block is automatically inserted at either the end of the function (if InsertBefore is 0), or before the specified basic block.

The third one, Module.h, represents either a translation unit or a collection of translation unit, think of it as a representation of a .cpp source file

The last one, Verifier, we'll use after setting up everything to make sure our code is correct.

Lastly, we then do

using namespace llvm

to ease our namespace typing.

Using LLVM functions

First we create a Module pointer to operate on, the Module class constructor takes in a Context as well as a string reference StringRef called Module ID.

It should help to know that StringRef has a constructor from various string-like types, thus hiding away its explicit constructor with "top" in the following code, in the Module's first argument in its constructor.

When you create a compiler and compile your language's source file x, you can name your source file x in your Module, here, we name it top, signaling this is the top level module

int main() {
    LLVMContext Context;
    auto M = std::unique_ptr<Module>(new Module("top", Context));
}

After we got our module, we can put create our function main inside a basic block and put it inside Module.

Our first playground are fairly simple, we will name our basic block MainEntry and our function Main

auto *FT = FunctionType::get(Type::getInt32Ty(M->getContext()), false);
auto *F = Function::Create(FT, Function::ExternalLinkage, "Main", M.get());
auto *block = BasicBlock::Create(M->getContext(), "MainEntry", F);

First we create a type for our function with FunctionType::get(). We'll use the 2nd overloaded static constructor, creating "a FunctionType taking no parameters."

FunctionType * FunctionType::get ( 
        Type *Result,
        bool isVarArg )

Second, we create our function with Function::Create().

static Function * llvm::Function::Create (
        FunctionType * 	Ty,
        LinkageTypes 	Linkage,
        const Twine & 	N = "",
        Module * 	M = nullptr )

The first argument we get from *FT .We create our function with external linkage, this means that the function is externally visible. Note that in LLVM IR, if the function has no linkage keyword, it is implicitly meant that it has external linkage. Implementation-wise, I think it makes sense that we create the main function with this linkage, it would be crazy there is multiple main function in each source file that can all be run :) But of course, what joy do we have if we limit ourselves.

We have already introduced the last two parameters in this Create() function before. We will name our function with "Main" and this will be associated to the Module "top." This, like all things, should be reflected when we output our LLVM IR.

Now that we got our basic block, we can use the IRBuilder that we talked about from before to create an instruction that returns 0. Wrapping the basic block with the builder, we can call methods on it to create instructions.

IRBuilder<> Builder(block);
Builder.CreateRet(Builder.getInt32(0));

Since IRBuilder inherits from IRBuilderBase, you can see the avaiable helper methods to create instructions via IRBuilderBase's public member functions

As an exercise, see if you can modify so that our main function returns void instead of Int32(0) :)

We then verify our module for error, and then print out the llvm ir for us

verifyModule(*M, &errs());
M->print(outs(), nullptr);

Running and Reading LLVM IR

With our CMake and our main.cpp ready, we can start cooking.

cmake -S . -B build # Use the CMakeLists.txt in current folder to build and store in build/
cmake --build build # Build the executable from material in build/
./build/main # Run the main executable

When you run this, the LLVM IR will be outputted since we asked the Module to print out the IR in main.cpp

; ModuleID = 'top'
source_filename = "top"

define i32 @Main() {
MainEntry:
  ret i32 0
}

It is hard to read LLVM IR since my website hasn't supported LLVM IR yet :( But I'll try to guide you through it. mcyoung's gentle introduction to LLVM IR should be much more in depth than mine. Mine guiding is just to point out the correspondence between the LLVM API call and the LLVM IR itself.

; stands for the beginning of a comment in LLVM IR. The next line indicates the source file name in LLVM IR, which is also top, indicated by the comment on the first line. You can see how this correspond to how we create a Module from before.

The next line starts with a function definition keyword define followed by the return type as well as the function name. Note that the name of a function is followed after the @. It is also used for global.

Then the word MainEntry stands for the entrance of the basic block we just create, having the ret 0 instruction inside it. Since we created our basic block with the Parent being Main (with LLVM API from before), we can see that the basic block is indeed inserted at the end of the function.

Compiling LLVM IR

It would be a sad state of affair if we can't run this beautiful LLVM IR code that we just generated.

We can instead put our LLVM IR inside an .ll file and run it through llc, the LLVM static compiler.

This will give us a .o file, which we can put it through a compiler such as clang or gcc.

 cmake -S . -B build && cmake --build build \
&& ./build/main > temp.ll \
&& llc -filetype=obj temp.ll && clang temp.o -o temp && ./temp

It is interesting to note that LLVM also provides a way to output object code via its API. Readers can refer to this article when implementing their compiler.

Conclusion

This blog post helps the reader to run and create a bare minimum example with llvm: a main function that returns 0

While it is small, readers can now create a compiler albeit useless that when inputted something like

fn main() -> int32 {
    ret 0;
}

, correctly spits out an actual running executable :)

All they need is a small lexer and parser to support the small example :)

Heck, they can even do something like

if (str.startsWith("fn main()")) { .... }

to just start create their own little compiler :)

The next blog post will focus on structs, enums and how to return a custom type. Stay tune!

Readers for more inquiries can contribute with a PR @ my website repo or email me via the website's contact.

The rest of the blog post details the code listing for copy-and-paste-and-hacking enjoyers

Source code

CMake

cmake_minimum_required(VERSION 3.22)
project(llvm_playground) 
find_package(LLVM REQUIRED CONFIG)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include(AddLLVM)

SET(CMAKE_EXPORT_COMPILE_COMMANDS ON)

add_llvm_executable(main main.cpp)
target_include_directories(main ${LLVM_INCLUDE_DIRS})

main.cpp

#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/Verifier.h>

using namespace llvm;

int main() {
    LLVMContext Context;
    auto M = std::unique_ptr<Module>(new Module("top", Context));

    auto *FT = FunctionType::get(Type::getInt32Ty(M->getContext()), false);
    auto *F = Function::Create(FT, Function::ExternalLinkage, "Main", M.get());
    auto *block = BasicBlock::Create(M->getContext(), "MainEntry", F);

    IRBuilder<> Builder(block);
    Builder.CreateRet(Builder.getInt32(0));

    verifyModule(*M, &errs());
    M->print(outs(), nullptr);

    return 0;
}

Full build instructions

 cmake -S . -B build && cmake --build build \
&& ./build/main > temp.ll \
&& llc -filetype=obj temp.ll && clang temp.o -o temp && ./temp