[HIATUS] Going to the gym with MLIR: Linear register allocation for ... a stackless virtual machine?

2025-01-16

Please, please, please don't read it now. It's unfinished, and I'm not sure if it'll go into impl or design doc :)

Prologue

Hi everyone, I hope everybody's been great :) I hope you like my last report for my graduate class here. This article is also related to the same project. It discusses the SSA linear scan register allocation algorithm and the steps taken to implement it.

Just like the last MLIR report post, I also include a supporting section (SS) right before the introduction for unfamiliar reader in the compiler space.

Without reading the SS, I assume interested readers are familiar with some basic middle-end IR concepts (e.g $\phi$ notation, liveness analysis, loops).

With reading the SS, there is some context from the last report's SS that I think would be helpful.

And as is tradition, I also wanna share a song for you to listen to while reading. Initially it was between this song and that song. What can I say? I guess this is a long form way to say I want you to listen to both :)

I hope everybody enjoys :)

Reader supporting section

Open Reader supporting section

Dominance: In a CFG, with the entry basic block $BB_0$ and any two basic block A and B, if all paths from $BB_0$ to B goes through A, then A is said to dominate B. The notation for dominance is A $\gg$ B and every node is said to dominate itself.
Dataflow analysis: Abstractly, dataflow analysis is a way to understand how data moves and changes throughout a computer program. It helps us track what happens to different values (like variables) as the program runs from start to finish.

For example, when reaching definition is run, the compiler can detect if some variable is possibly uninitialized and warn us if such cases appear:

// Just an example for reaching definition.
fn reaching_def_example(condition: bool) {
    let x: i32; // x is declared but not initialized
 
    if condition { // if this `if` condition is `false`,
       x = 5;      // `x` is not initialized
    }
 
    println!("{}", x); // Warning: x might be uninitialized
                      // if condition is false
}

Loop header: In hand with dominance, a block is said to be a loop header if it dominates one of its predecessors (a block that points to it). Knowing what a loop header is, and later identifying it would be of great help in our live intervals (of our linear register allocator) construction. As in the paper by Wimmer and Franz, their live intervals construction bypasses the dataflow analysis cost itself, simply on the concept of dominance and loop header.

Before heading further, let's look at an example for the two concepts.

In this control flow graph, A dominates every BB, including itself. B dominates D; even though there is a path from D to B, it must have first passed from B first. Since B dominates its predecessor, B is a loop header; hence, there is a loop :)

D is:

Live interval and Live ranges pre-notes: Different papers derive the meaning of a live range differently. For example, in Engineering a Compiler 2nd Edition pg. 690, live range is a closed set of pairs of use and def locations. In Mössenböck & Pfeiffer, the roughly same definition is dedicated to live intervals, leaving the meaning of range to just mean a pair of use and def. I chose to follow the definition laid out by Mössenböck & Pfeiffer where a live interval contains an ordered list of live ranges. [todo, add some example here, or the next two section?]

For example:

// Add an example here pleaseeeeeeeee >,<

Overall, the concepts were born to ease the job of register allocators. If some variables are not live anymore (since their definitions), their registers can be reused for another variable instead. In this view, contention for a register (a scarce resource) between some variables can instead be seen as the intersection (or lack thereof) between different live ranges. In this effect, at the register allocation phase, the compiler effective translates the tradition SSA variable naming scheme into a set of live intervals.

Hopefully, after reading the two (closely related) definitions and seeing how they're defined to solve the same problem, we've cleared up the confusion.

Terminologies: in this article, sometimes I refer to SSA variables as mlir::Value. This boils down to the same concept: a variable that is defined only once (and cannot be changed).

Introduction

todo: refer to this https://ieeexplore.ieee.org/document/7924389 Register allocation is an important aspect in the compiler construction phase. Registers being the fastest (but most limited) in the memory hierarchy, the compiler's register allocator must be strategic about this resource, more usages in registers instead of the stack means reduced latency and improved execution speed.

In the case of our recompiler, the main motivation related to obfuscation. Edu's employer, Quarkslab is interested in applying obfuscation method to DEX files. With the methods generating a lot of virtual registers, there were concerns about it exceeding the limit of $2^{16} = 65,536$ virtual registers. In this case, the linear register allocation project helps bring down the number of virtual registers under the limit (of course they can just run less obfuscation hahahah).

This article then draws the learning mainly from three sources: the text book Engineer a Compiler, and two papers from Mössenböck & Pfeiffer MP02 and Wimmer & Franz WF10.

Personally, this is also another opportunity for me to improve my compiler arsenal. Some undergraduate compiler classes leave this part out or only as part of a pen and paper exercise; knowing how to perform register allocation (and writing about it) would be a big push in the direction.

All discussed concepts are driven from the links in the resources section.

Differences

We're planning to improve the lowering of MjolnIR to smali. What this means is that instead of lowering directly from SSA variables to virtual registers with template<class Aspect> class SmaliCounter, we lower indirectly with our register allocation, first translating them into the namespace of live intervals first, then down to virtual registers.

More specifically, we'll :

Produce live intervals from ssa values and their uses and defs.
Optionally merge compatible live intervals together (the process is called coalescing).
Lower down to virtual registers.

This new process introduces new stages that are clearly described and compared through this diagram.

[todo add mermaid here]

Steps

To say that linear register allocation simply comprises 3 steps is a simplification. The three big steps eventually break down into substeps when a reader starts asking these questions of the sort: Where and how do we get our live intervals? How do we map SSA value to a said interval? What mechanics helps us merge compatible? How is MLIR helping here? We'll

First discuss the problem of $\phi$ in the context of live intervals.
Number the mlir operation via the topological ordering of basic blocks.
Construct our live intervals.
Coalesce
Translate to virtual registers, during this time we might split intervals to produce what is called live holes.

It is important to note that although I follow the MP02 paper, some recaps for WF10 will also be noted.

Phis in register allocation (4.1)

The (natural) nuisance of $\phi$ s is rooted in the idea of false registers contention. If two basic blocks $B_1$ and $B_2$ with their respective SSA variables $b_1$ and $b_2$ both connect at basic block $C$ with the $\phi_{c}$ , the two SSA variables (and the $\phi$ ) all refers to the same entity. In this case, even though the live range of $b_1$ and $b_2$ clashes together, they should still share the same register.

To navigate this, we often perform a preprocessing step called copy generation or move generation: add a proxy move instruction and introduce new SSA variables to make sure that $b_1$ and $b_2$ doesn't clash anymore, then perform a step called register coalesce to join them together.

In this context, different papers (MP02 and WF10) handle this problem differently in terms of preprocessing the graph. Expectedly, the two papers made no mention of MLIR (of course hahaa).

Idea: The idea is that in MLIR, we can use MjolnIR's move op on these.

these $b_1$ and $b_2$ is the resulting operands of $B_1$ and $B_2$ 's terminators and that when the control flows to $C$ , their representation turns into $\phi_{c}$ . Simply put, we can focus on generating the virtual registers for $\phi_{c}$ and produce redundant moves as the first step for the algorithm.

todo: show the graph of B1 and B2 flowing into C. todo: discuss Brock and then wimmer.

Topo blocks and numbering (4.2)

In order for a live range to make sense, we need to know where (or when) does an SSA value start living and where (or when) does an SSA value ends its life. To do this, we need to number the instruction topologically.

In linear scan register allocation, after we have handled our $\phi$ s as prescribed above, we then flatten the CFG into a list of basic blocks topologically. In a DAG, this reduces to Reverse Post Order traversal, or RPO for short.

The reason for this is todo: numbering the op

In both the MP02 and the WF10 paper, this is achieved with an ordered list based on the BB's dominance. I should also note here that ordered (in layperson terms) means that if block A arrives before block B, then block B cannot dominate block A.

Since the topological ordering of the graph naturally lends itself to a dominance ordered list, MLIR itself shows its definition for mlir::getBlocksSortedByDominance as reverse-post-order DFS:

SetVector<Block *> mlir::getBlocksSortedByDominance(Region &region) {
    // For each block that has not been visited yet (i.e. that has no
    // predecessors), add it to the list as well as its successors.
    SetVector<Block *> blocks;
    for (Block &b : region) {
         if (blocks.count(&b) == 0) {
               llvm::ReversePostOrderTraversal<Block *> traversal(&b);
               blocks.insert(traversal.begin(), traversal.end());
         }
    }
    assert(blocks.size() == region.getBlocks().size() &&
          "some blocks are not sorted");
 
    return blocks;
}

At this point, we've gotten our hands on the block list. But Wimmer and Franz introduce a stronger condition: all blocks belonging to the same loop must be continuous and that "TODO".

Personally, I think the API provided by MLIR won't satisfy the condition; there is a chance that a loop header block might be followed by its exit, non-looping block, and then, the blocks that belong to the loop body.

The write up here follows the MP02 paper more. Thus the need for loop detection is left as an afterthought

                                        // the arg is from mlir::getBlocksSortedByDominance
SmaliCounter<Operation*> numberOperation(SetVector<Block*> blocks) {
    SmaliCounter op_counter;
 
    for (auto &block : blocks)
        for (auto &op : block)
           op_counter.get_counter(op);
 
    return op_counter;
}

Construct the live ranges (with live holes).
Construct the

Building live intervals (4.3)

In MP02, the paper starts off with what it means for an SSA variable to be live. Since we know that for a variable to be used in block b, it must have been defined in either b or one of its predecessors p.

Let's call $r_{v,b}$ the live range of an SSA variable in a block. MP02 describes different intricacies in handling this:

An algorithm on calculating live ranges
How the live range of a $\phi$ variable should be.

TODO: say how phi's hard to handle TODO: you really should define a live range class. TODO: Talk about how the MP02 paper conflates the live range of an instruction with an SSA value.

using LRStart = uint32_t;
using LREnd= uint32_t;
using LiveRange = std::pair<LRStart, LREnd>; // From above description
 
// todo: do add range here
// todo: readers probably find the algorithm too complicated. you might wanna draw some kind of diagram.
void add_range(ins i, ssa value, block b, end: int32_t) {
    // assume that ssa value ends its living at `end`
    // and that the instruction i uniquely defines the ssa value (of course cause its ssa duh hahaha)
    // jasmine plz do this
 
    if (b.contains_instruction(i)) {
        live_interval[value].combine_range(numbering_of(i), end);
    } else
        live_interval[value].combine_range(numbering_of(b.begin()), end);
 
    // For example, the ranges [1,3[, [3,7[ are merged into a single range [1,7[
}
DenseMap<Value, Vector<LiveRange>> build_intervals(Region &region) {
    // TODO: i know my ass will need the smali counter again soon. Maybe split the function here.
 
    // TODO: We get the value from the operand like this
 
    for b in blocks:
        live = {}
 
    for b in blocks:
        for s in b.get_successors():
        b.live = b.live.union(s.live)
        for each argument arg of the block s:
            b.live = b.live - arg
        for each terminator t  in b:
            b.live = b.live union t's jump argument input
        for each ssa value v in live do:
            add_range(get_defining_instruction(v), v, b, numbering_of(b.end()) + 1)
        for instruction i in b.reverse():
            b.live = b.live - i.defining_result()
            for each op in i:
            if op not in live:
                b.live = b.live union op
                add_range(get_defining_instruction(op), op, b, numbering_of(i))
 
    // todo: we do this because although we visualize ssa as having an inherent counter, it is not visible in the
mlir api
}

todo: Note that we will not directly output any register. What we'll do is we'll have a 2D hash map that given an instruction and an SSA value, we get out some new virtual registers that is not dependent on the smali counter.

Coalescing arguments

mlir::Value has hasOneUse()

if op = move a <- b && b.hasOneUse():
    union(a, b, uf)
 
 
mlir::Value find(mlir::Value v
void union(mlir::Value a, mlir::Value b, UF uf) {
 
}

Assigning registers

In greedily assigning SSA variables to registers -> needs priority queue to always get the lowest starting pointer interval available.

note: need a register counter class, need composition for customized set.

note: need dedicated class for live range, it needs to optionally store a reg. It needs to have bool ends_before (LiveRange other) and bool no_overlap()

note: set in C++ is binary tree, we can compromise speed of pq for ease of algo impl

4 different collections:

unhandled (set): all intervals that start after cur.beg
handled (set) : all intervals end before cur.beg
active (set): all intervlas where one of the ranges overlap cur.beg
inactive (set): all intervals where cur.beg falls into one of their holes

todo: explain to readers meaning of "ends before" versus "does not overlap"

LinearScan() {
    auto unhandled = get_ranges_from(build_intervals(method))
    DenseSet active, inactive, handled;
 
 
    while (!unhanlded.empty()) {
        auto curr = unhandled.top(); unhandled.pop();
 
        for (auto i : active) {
            // todo
        }
        for (auto i : inactive) {
            // todo
        }
    }
}

todo: no concept of spilled.

todo: if we don't assign mem loc to an SSA, we need to prove that assign mem loc is the same as not assigning mem loc.

Obstacles

Resources

Papers

MP02 - Linear Scan Register Allocation in the Context of SSA Form and Register Constraints - Hanspeter Mössenböck and Michael Pfeiffer.
WF10 Linear Scan Register Allocation on SSA Form - Christian Wimmer and Michael Franz.
Linear Scan Register Allocation - Massimiliano Poletto and Vivek Sarkar.

Textbooks

Engineering a Compiler - Keith D. Cooper and Linda Torczon.

AI tools

Hahahaha you're really funny.

Jasmine Tang

My blog