Jasmine Tang

My blog

[ONGOING] Extending LLVM's Kaleidoscope: Code-gen edition

8888-08-08

Prologue

Hey everyone, how's everyone doing? I've graduated from Berkeley and have been back at my parents' place for a while now, getting ready to start at Igalia :) I'm still doing good hahhaa :)

I always thought getting back to OC after being far far away in Berkeley wonderland would mean that everything's gonna change for me: I won't have to do homework anymore :) I'll go to bed early, bla bla bla. But here I am writing this blog at 2 AM hahhaha. I realized that changing the environment doesn't necessarily change me personally; I'd need to change myself on my own :) Being next to my family makes me really grateful, but now I'm starting to miss my high school teacher as well as my nanny in Viet Nam. I really wanna go back soon :)

Anyways, let's talk business hahah :) This article documents what I've learned about the basics of LLVM's code generation process, limited to the lowering from AST to LLVM IR. It ranges from basic stack variables, addition, and subtraction to strings, structs, and garbage collection.

As is tradition, here are three songs for you by Zedd: Papercut, Addicted To A Memory, and Done With Love. All three songs showcase extremely well the emotional depth of a person in and out of love.

I hope you enjoy the songs (and the blog post as well!) :)

Introduction

Codegen in LLVM is an extremely well-thought-out process. For a simple stack-centric codegen process, framework users can set SSA form aside: variables live in stack slots created with alloca, and the mem2reg pass later promotes those slots to SSA registers, so even a primitive stack-based lowering can still end up as fast assembly.

In this article, I'll report on the codegen process of sammine-lang, an extension of the Kaleidoscope tutorial. Besides code generation for scalar values, functions, and control flow, the article also touches on codegen for aggregate data types such as structs :) I hope everyone enjoys :)

The picture below demonstrates sammine's ability to generate code for fibonacci, a classic mathematical problem :) [TODO: show that sammine can indeed code fibonacci and output assembly for it]

Aggregate data types are also generated with sammine: [TODO: show that sammine can indeed code record]

Codegening

Alloca(tion of the stack) and mem2reg

todo: talk about allocation on the stack; first draw out an example of a global var in LLVM IR

Variables

Variable codegen (i32, i64, double) has three modes:

  • creation: in a var def, an alloca creates the stack slot, and a map from variable name to slot keeps track of it
  • modification: in the binary op =, the new value is stored to the address of the alloca
  • read: the value is loaded from the address of the alloca
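The three modes above can be sketched with a toy emitter that prints textual LLVM IR. This is only an illustration in Python, not sammine's actual code: the `Emitter` class, the `named_values` map, and the `%x.addr` naming are all hypothetical, and the IR uses the opaque `ptr` syntax of recent LLVM versions.

```python
class Emitter:
    """Toy IR builder that collects textual LLVM IR lines (hypothetical helper)."""

    def __init__(self):
        self.lines = []
        self.named_values = {}  # variable name -> (stack slot, LLVM type)
        self.counter = 0

    def fresh(self):
        # Hand out unique temporary names: %t0, %t1, ...
        name = f"%t{self.counter}"
        self.counter += 1
        return name

    def create(self, name, ty, init):
        # creation: reserve a stack slot, remember it in the map, store the initial value
        slot = f"%{name}.addr"
        self.lines.append(f"{slot} = alloca {ty}")
        self.named_values[name] = (slot, ty)
        self.lines.append(f"store {ty} {init}, ptr {slot}")

    def assign(self, name, value):
        # modification: `name = value` stores to the slot's address
        slot, ty = self.named_values[name]
        self.lines.append(f"store {ty} {value}, ptr {slot}")

    def read(self, name):
        # read: load the current value out of the slot into a fresh temporary
        slot, ty = self.named_values[name]
        tmp = self.fresh()
        self.lines.append(f"{tmp} = load {ty}, ptr {slot}")
        return tmp


e = Emitter()
e.create("x", "i32", 1)   # let x = 1
e.assign("x", 2)          # x = 2
loaded = e.read("x")      # use of x
ir_text = "\n".join(e.lines)
print(ir_text)
```

Every use of a variable goes through memory, so the front end never has to rename values or track which definition is live; that cleanup is left to mem2reg.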

todo: talk about this in relation to the walk and visitor pattern

Control flow

discuss how alloca makes this easier.
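To make that concrete, here is a rough sketch (in Python, emitting textual LLVM IR) of lowering an if/else where both branches assign to the same variable. The function name `lower_if_assign`, the block labels, and the `%x.addr` slot naming are assumptions made for illustration, not sammine's real lowering.

```python
def lower_if_assign(cond, var, ty, then_val, else_val):
    """Lower `if cond { var = then_val } else { var = else_val }` plus a read of var.

    Assumes `var` already has a stack slot named %<var>.addr (hypothetical naming).
    """
    slot = f"%{var}.addr"
    return [
        f"  br i1 {cond}, label %then, label %else",
        "then:",
        f"  store {ty} {then_val}, ptr {slot}",
        "  br label %merge",
        "else:",
        f"  store {ty} {else_val}, ptr {slot}",
        "  br label %merge",
        "merge:",
        # Both branches wrote to the same slot, so a single load is enough here.
        f"  %{var}.val = load {ty}, ptr {slot}",
    ]


ir_lines = lower_if_assign("%cond", "x", "i32", 1, 2)
print("\n".join(ir_lines))
```

Without the stack slot, the merge block would need a phi node choosing between the two incoming values. With alloca, the front end only emits stores and a load, and mem2reg can insert the phi later when it promotes the slot.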

Functions and Extern (via PrototypeAST)

todo: talk about the caveat that we need to emit the allocas at the start of the function (its entry block). todo: relate this to how strong the visitor pattern is
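Here is a small sketch of that caveat, again in Python with textual LLVM IR. The Kaleidoscope tutorial handles it with a helper called CreateEntryBlockAlloca; the `FunctionBuilder` class below is a hypothetical stand-in that mimics the idea by keeping allocas in a separate list that is always printed first.

```python
class FunctionBuilder:
    """Toy builder (hypothetical) that keeps allocas apart from the rest of the body."""

    def __init__(self, name, ret_ty):
        self.name = name
        self.ret_ty = ret_ty
        self.entry_allocas = []  # always emitted first, in the entry block
        self.body = []           # everything else, in program order

    def create_entry_block_alloca(self, var, ty):
        # Wherever the variable is declared, its slot lands at the top of the entry block.
        slot = f"%{var}.addr"
        self.entry_allocas.append(f"  {slot} = alloca {ty}")
        return slot

    def emit(self, line):
        self.body.append(line)

    def finish(self):
        header = f"define {self.ret_ty} @{self.name}() {{"
        return "\n".join([header, "entry:", *self.entry_allocas, *self.body, "}"])


fb = FunctionBuilder("f", "i32")
fb.emit("  br label %loop")
fb.emit("loop:")
# Declared inside the loop block, but the alloca still goes to the entry block.
slot = fb.create_entry_block_alloca("i", "i32")
fb.emit(f"  store i32 0, ptr {slot}")
fb.emit(f"  %v = load i32, ptr {slot}")
fb.emit("  ret i32 %v")
func_ir = fb.finish()
print(func_ir)
```

Two reasons make the entry block the right home: an alloca inside a loop would grow the stack on every iteration, and mem2reg only looks at allocas in the entry block when deciding what to promote.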

Case study: printf

Back in Berk, I sometimes would play Factorio in my free time. The game is intuitive and interesting. But for some reason, the game really stresses me out. "Wait a minute, I love problem-solving, don't I? But I feel so stressed trying to build the factory with these new science packs and these new different energy types." Now, in this summer of 2025, when I'm writing my own compiler, I suddenly realize the reason. I love problem-solving, but I was solving the wrong problems. I wasn't interested in trying to build and maintain factories. Rather, compilers have always had a soft spot in my heart. That and blog writing, hence I'm writing this :)

Ah, I still don't know what I'm rambling about. I guess what I'm trying to say is that, in my experience, while I enjoy problem-solving in general, it also matters what type of problems I'm solving. Factorio, leaning hard into resource management and logistics, just isn't my forte. My forte is creative expression (writing), coding, and abstraction (compiler engineering). If front-end and/or back-end development isn't what you love, maybe you should consider switching to compiler engineering, ahhahahha :P

Anyways, let's now talk about the lowering of printing.

Right now, in sammine, I'm adopting a Python-esque way of printing:

print(x);  # x and y are variables in this case
println(y);
println(2.4 + 5.3); # printing should also support expressions

With int and double powered by alloca, how should we go about lowering this?

We know that in libc, the signature for printf() is int printf(const char *format, ...): a format string followed by a variable number of arguments hahaha

  • print(2)
  • print("x")
  • println(2.4)

What we'll do is, depending on the type via the AST, we'll access the
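One possible shape for this type-dependent lowering is sketched below in Python. It is a guess at the general idea, not sammine's implementation: the `FORMATS` table, the `lower_print` function, and the `@.fmt` global name are all hypothetical. The printf declaration and the variadic call syntax are standard LLVM IR.

```python
# Hypothetical mapping from the argument's type to a printf conversion.
FORMATS = {"i32": "%d", "double": "%f", "string": "%s"}
# LLVM type used to pass each kind of value to printf.
LLVM_TYPES = {"i32": "i32", "double": "double", "string": "ptr"}

# printf is variadic, so it is declared once with a trailing `...`.
PRINTF_DECL = "declare i32 @printf(ptr, ...)"


def lower_print(arg_type, arg_value, newline=False):
    """Return (format string, call instruction) for print/println of one argument."""
    fmt = FORMATS[arg_type] + ("\n" if newline else "")
    llvm_ty = LLVM_TYPES[arg_type]
    # @.fmt stands in for a global constant holding `fmt`; its definition is omitted.
    call = f"call i32 (ptr, ...) @printf(ptr @.fmt, {llvm_ty} {arg_value})"
    return fmt, call


# println(2.4 + 5.3), where %sum is assumed to hold the already-lowered expression
fmt, call = lower_print("double", "%sum", newline=True)
print(PRINTF_DECL)
print(repr(fmt))
print(call)
```

The type tag has to come from the AST because printf itself has no idea what it was handed; picking the wrong conversion for an argument is undefined behavior in C.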

Arrays and bounds checking :)

Closures and 2 different ways to get them working

TODO: Talk about the general structure

Generics

Initially I was unsure where to put this part, but I guess I can mention generics in both sections.

In short, when we generate code in the codegen phase, there won't be any generics anymore; the generics would've been dissolved away by the type checker. For a function f that is generic over a type T, the type checker sets the boolean behind is_generic() to true. Then when the type checker sees a call expr of f with a concrete type like i32, it literally creates another AST clone of the function f, called f.i32 (this is called monomorphization).

When the codegen stage comes in, for a function AST, if it's generic (via the boolean is_generic), then codegen won't generate code for it. But if it's monomorphized (non-generic), then codegen goes ahead and generates the code for it normally :)
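The clone-then-skip flow can be sketched roughly like this in Python. Only the `is_generic` flag and the `f.i32` naming come from the description above; the `FunctionAST` fields, `monomorphize`, and `codegen` are hypothetical, and the sketch emits a bare declaration instead of a full function body to stay short.

```python
import copy
from dataclasses import dataclass


@dataclass
class FunctionAST:
    """Toy function node; the field names are hypothetical except is_generic."""
    name: str
    param_types: list
    is_generic: bool = False


def monomorphize(generic_fn, type_param, concrete):
    # Type-checker side: clone the generic function and substitute the concrete type.
    clone = copy.deepcopy(generic_fn)
    clone.name = f"{generic_fn.name}.{concrete}"
    clone.param_types = [concrete if t == type_param else t for t in clone.param_types]
    clone.is_generic = False
    return clone


def codegen(functions):
    # Codegen side: skip generic templates, emit only concrete functions.
    emitted = []
    for fn in functions:
        if fn.is_generic:
            continue
        params = ", ".join(fn.param_types)
        emitted.append(f"declare void @{fn.name}({params})")
    return emitted


f = FunctionAST("f", ["T"], is_generic=True)
f_i32 = monomorphize(f, "T", "i32")   # triggered by a call like f(1)
emitted = codegen([f, f_i32])
print(emitted)
```

The generic f stays in the AST untouched, so another call with, say, double can clone it again into f.double, while codegen only ever sees functions whose types are fully known.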

I'll have to save the type-checking side of generics for another time, since the topic of this article is codegen :) But if you'd like to support me in pumping out articles faster, maybe you can buy me a Red Bull? :)

Turning on optimizations

Epilogue

Remarks

The article, as you can tell, is directed towards new grads and/or beginners in LLVM. If you've benefited from the article or if you'd like to support my writing these blogs, please consider getting me a Red Bull :)

I also want to extend my thanks to all the developers who contributed to Kaleidoscope. Without them, the journey to generate code, as well as the creation of this article, would be ten times harder. I realize that we all stand on the shoulders of giants and of previous generations, and I would like to pay my tribute to them.

Claude

You'll realize that in this edition of extending Kaleidoscope, a lot of work's been done. This is due in no small part to Claude, allowing me to loo