User Tools

Site Tools

Return to Home page

From Go to LLVM


This page shows the implementation of a compiler which recognizes and translates part of the Go programming language into the LLVM IR syntax (more information about LLVM can be found here).

The LLVM Intermediate Representation (IR) is placed between the source code in a given programming language and the machine code for a specific architecture. It is independent from both the programming language and the architecture.

Implemented parts

List of Go parts supported by the compiler:

  • Data types:
    • Basic data types:
      • int
      • float64
    • Aggregate data types:
      • array (1-dimension)
    • Reference data types:
      • pointers
      • functions
  • Variables:
    • global variables
      • declaration only
      • declaration + inline assignment
    • local variables
      • declaration only
      • declaration + inline assignment
      • short variable declaration
  • Constants:
    • global constants
  • Operators:
    • arithmetic operators:
      • addition (+)
      • subtraction (-)
      • multiplication (*)
      • division (/)
    • relational operators:
      • equal to (==)
      • not equal to (!=)
      • greater than (>)
      • less than (<)
      • greater than equal to (>=)
      • less than equal to (<=)
    • bitwise operators:
      • AND (&)
      • OR (|)
      • XOR (^)
    • assignment operators:
      • simple assignment (=)
    • unary operators:
      • post increment (++)
      • post decrement ()
      • reference (&): return address of a variable
      • dereference (*): return the value pointed by a pointer variable
  • Control statements:
    • decision making:
      • if statement
      • if - else if - else stetement
      • nested if statement
    • loops:
      • for loop
      • for loop as infinite loop
      • for loop as while loop
  • Functions:
    • support parameter list
      • simple variables, arrays, pointers
        • by value
        • by reference
    • return type
      • int
      • float64
      • void
  • Output:
    • fmt.Printf instruction
      • support C-like parameters print
  • Instructions:
    • declaration
    • assignment
    • return statement
    • function call
    • expression


The compiler is made up of two parts: a scanner and a parser.


The Scanner, created using jflex, scans an input file and generates a specific token each time that a stream of characters matches with a given rule (that can be a regular expression). In that case, it recognizes all useful symbols of a Go source code, generating tokens which are later used by the Parser.

Snippet of Go Scanner

  1. id = [A-Za-z_][A-Za-z0-9_]*
  2. integer = ([1-9][0-9]*|0)
  3. double = (([0-9]+\.[0-9]*) | ([0-9]*\.[0-9]+)) (e|E('+'|'-')?[0-9]+)?
  4. package = package[ ]+{id}
  5. import = import [ ]+\"{id}\"|import[ ]+\(~\)
  6. comment = "//"~{nl}|"/*"~"*/"
  7. string = \"~\"
  9. %%
  11. "func" {return symbol(sym.FUNC);}
  12. "return" {return symbol(sym.RET);}
  14. "fmt.Printf" {return symbol(sym.PRINT);}
  16. "int" {return symbol(sym.INT_TYPE);}
  17. "float64" {return symbol(sym.FLOAT64_TYPE);}
  19. "var" {return symbol(sym.VAR);}
  20. "const" {return symbol(sym.CONST);}
  22. "for" {return symbol(sym.FOR);}
  23. "if" {return symbol(sym.IF);}
  24. "else" {return symbol(sym.ELSE);}
  26. ...
  28. {id} {return symbol(sym.ID, yytext());}
  29. {string} {return symbol(sym.STRING, yytext());}
  30. {integer} {return symbol(sym.INT, new Integer(yytext()));}
  31. {double} {return symbol(sym.DOUBLE, new Double(yytext()));}
  33. ...
  35. {package}|{import}|{comment} {;}

The first 7 lines define grammar rules. For example, the first line is used to recognize all sequences of characters which starts with a letter and are followed by an undefined number of letters or digits. Then, when the scanned input matches with this rule, the symbol ID is generated (line 28). In this way, parts of the input code like variables, constants, function names, are “transformed” into an ID symbol.

As shown above, other rules are used to recognize integer and double numbers, strings enclosed between quotes, package and import statements, as well as comments (last three mentioned rules ignore the input, so they don't generate any token, as they are useless for the generation of the output program). All remaining rules (some of them are represented in lines 11-24) recognize all important keywords, like var, const, printf statement, func, return, etc.


The parser, created using CUP, takes as input all symbols generated by the scanner, analizes a specific sequence of input and can associate an action each time that a sequence of symbols is reduced.

To show how everything works, let's consider a trivial example related to the declaration of an integer variable:

var a int

This sequence of words is turned by the scanner into the following tokens:


This is, instead, a snippet from the parser:

declaration ::= VAR var_stmt_list type:t {:
var_stmt_list ::= var_stmt_list CM ID:name {:
:} | ID:name {:
type ::= INT_TYPE 
| SO INT:x SC type:t 
| STAR type:t 

In this case, ID token is reduced into var_stmt_list and INT_TYPE into type. Then, symbols VAR var_stmt_list type are reduced into declaration, which indicates that this sequence of symbols is related to a delcaration statement, and the corresponding LLVM IR instruction can be generated.

Data structures

This snippet shows all variables and classes used to support the parser on the creation of the output program:

    public StringBuffer outputBuffer;
    public StringBuffer errorBuffer;
    public int semErrors;
    public int synWarnings;
    public ArrayList<String> stringList;
    public ArrayList<ValueObj> globalVarList;
    public HashMap<String, TypeItem> funcList;
    public HashMap<String, ValueObj> globalVarTable;
    public int indexString;
    public int globalVarCount;
    public int loopCount;
    public int totLoopCount;
    public class SymbolTable{
        public ArrayList<ValueObj> varList;
        public HashMap<String, ValueObj> varTable;
        public int nargs;
        SymbolTable prev;
        public SymbolTable(SymbolTable p){
            this.varTable = new HashMap<String, ValueObj>();
            this.varList = new ArrayList<ValueObj>();
            this.nargs = 0;
            this.prev = p;
        public ValueObj get(String s){ //search a variable traversing all "ancestors" to find 
            for (SymbolTable sym = this; sym != null; sym = sym.prev){
                ValueObj found = sym.varTable.get(s);
                if (found != null)
                    return found;
            return null;
    public class FuncObj {
        public String name;
        public int nargsTot;
        public ArrayList<ValueObj> paramsList;
        public int varCount;
        public boolean ret;
        public SymbolTable currentSymbolTable;
        public FuncObj(String name){
   = name;
            this.paramsList = new ArrayList<>();
            this.varCount = 1;
            this.nargsTot = 0;
            this.ret = false;
            this.currentSymbolTable = null;
    public class ValueObj {
        public String value;
        public String index;
        public TypeItem typeItem;
        public ValueObj(String value, TypeItem typeItem){
            this.value = value;
            this.typeItem = typeItem;
         public ValueObj(String value, TypeItem typeItem, String index){
            this.value = value;
            this.typeItem = typeItem;
            this.index = index;
        public ValueObj(String value){
            this.value = value;
        public ValueObj(String value, String index){
            this.value = value;
            this.index = index;
        public void setTypeItem(TypeItem typeItem){
            this.typeItem = typeItem;
     public class TypeItem {
        public Tag tag;
        public String type;
        public int align;
        public int size;
        public boolean isPointer;
        public TypeItem(Tag tag, String type){
            this.tag = tag;
            this.type = type;
            this.size = -1;
            this.isPointer = false;
        public TypeItem(Tag tag, String type, int align){
            this.tag = tag;
            this.type = type;
            this.align = align;
            this.isPointer = false;
        public void setSize(int size){
            this.size = size;
        public void setPointer(boolean isPointer){
            this.isPointer = isPointer;
        public void setTag(Tag tag){
            this.tag = tag;
    public FuncObj currentFunc;
  • outputBuffer: stores generated LLVM IR instructions
  • errorBuffer: stores error messages which can be found by the compiler
  • semErrors: count number of semantic errors
  • synWarnings: count number of syntactic warnings
  • stringList: stores strings used by printf instructions
  • globalVarList: temporary stores variables of a single (global) declaration statement
  • funcList: table which stores the list of defined functions
  • globalVarTable: table which stores the list of global variables
  • indexString: index used to generate unique labels to store string constants
  • globalVarCount: counter used to access to right element of globalVarList during var assignments
  • loopCount: counter used to correctly deal with condition and loop statements, expecially when they are nested
  • totLoopCount: count total number of condition and loop statements
  • SymbolTable: class which defines a symbol table
    • nargs: count number of parameters of a given type (used to access to the right element on a inline assignment)
    • varTable: table which stores the list of variables
    • varList: temporary stores variables of a single declaration statement
    • prev: stores the parent symbol table
  • FuncObj: class which stores all useful properties of a function
    • name: is the function name
    • nargsTot: count total number of parameters
    • paramsList: contains the list of parameters of the function
    • varCount: counter used to make a unique register for each operation
    • ret: indicates if the function contains a return instruction or not
    • currentSymbolTable: stores the symbol table related to che current scope
  • ValueObj: class which represents an entity (like a variable, an immediate value, the results of an expression, etc.)
    • value: is the entity value
    • index: is the index of the element to access (used with arrays)
    • typeItem: class which contains details about value type (also used for type checking)
      • tag: stores the kind of element (base variable, array, pointer, etc.)
      • type: indicates the data type (i32, double)
      • align: is the alignment, in bytes, of the memory operation (4 for i32, 8 for double)
      • size: indicates the total size of the entity, like its number of elements (used with arrays)
      • isPointer: flag which indicates if the value is treated as a pointer

In the following paragraphs, some pieces of the parser code are shown to illustrate few examples of generation of the output program.

Grammar start

The root element of the grammar is the last symbol that is reduced. A reduction of the start symbol indicates that the code has been parsed correctly.

start with prog;
prog ::= global_decl_list func_list {: 
    //check if main function is declared
    if (funcList.get("main") == null)
        genSemWarning("'main' function not found");
    //print code if there are no syntactic warnings e/o semantic errors
    if (semErrors == 0 && synWarnings == 0){
        System.err.println("\nOutput cannot be generated due to following errors:\n");
    //print error buffer

As shown in the prog rule, this parser is able to recognize first a list of global delcarations, and then a sequence of functions. When the rule is reduced, the program checks if the main function is declared. After that, it detects if there are semantic errors or syntactic warnings: if not, the output program is printed, otherwise an error message is generated. In any case, the list of error is printed.

Global delcaration

The following code shows a part of the implementation of global variables/constants declaration:

global_decl ::= global_decl_keyword:k global_var_list type:t {:
    for (int i = 0; i < globalVarList.size(); i++){
        ValueObj p = globalVarList.get(i);
        ValueObj val = new ValueObj("@" + p.value, t);
        globalVarTable.put(p.value, val);
        if (t.tag == Tag.BASE)
            append(("@" + p.value + " = " + k + " " + t.type + " 0, align " + val.typeItem.align), true);
        else if (t.tag == Tag.ARRAY)
            append(("@" + p.value + " = " + k + " [" + t.size + " x " + t.type + "] zeroinitializer, align " + val.typeItem.align), true);
        else if (t.tag == Tag.POINTER)
             append(("@" + p.value + " = " + k + " " + t.type + "* null, align " + val.typeItem.align), true);   
global_decl_keyword ::= VAR {: RESULT = "global"; :} | CONST {: RESULT = "constant"; :};
global_var_list ::= global_var_list CM ID:name {:
    globalVarList.add(new ValueObj(name));
| ID:name {:
    globalVarList.add(new ValueObj(name));
type ::= INT_TYPE {: RESULT = new TypeItem(Tag.BASE, "i32", 4); :}
| FLOAT64_TYPE {: RESULT = new TypeItem(Tag.BASE, "double", 8); :}
| SO INT:x SC type:t {: t.setTag(Tag.ARRAY); t.setSize(x); RESULT = t; :}
| STAR type:t {: t.setTag(Tag.POINTER); RESULT = t; :};

This piece of code is able to recognize something like this:

var v1, v2, v3 int

In this case, var keyword is reduced into global_decl_keyword (passing global as result); variables v1, v2 and v3 are stored into globalVarList array, and then they are reuduced into global_var_list; the type int is reduced into type, returning a TypeItem describing the specified data type (BASE as tag, “i32” as type and align equals to 4).

After that, global_decl_keyword:k global_var_list type:t are reduced into global_decl, which scans the globalVarList containing the declared variables, creates a ValueObj for each of them, assigning them the type returned by type reduction, and stores them into the globalVarTable hashMap.

Finally, it generates the corresponding assembly instruction for each declared value, obtaining the following output in LLVM IR syntax:

@v1 = global i32 0, align 4
@v2 = global i32 0, align 4
@v3 = global i32 0, align 4

Function implementation

The following code shows the strategy adopted to recognize a function definition:

func ::= FUNC ID:name {:
    currentFunc = new FuncObj(name);
:} RO param_list RC ret_type:ret BO {: 
    parser.funcList.put(name, ret);
    append(("define " + ret.type + " @" + name + "("), false);
    FuncObj func = currentFunc; 
    if (func.paramsList.size() > 0){
        for (int i = 0; i < func.paramsList.size(); i++){
            TypeItem t = func.paramsList.get(i).typeItem;
            if (t.tag == Tag.ARRAY)
                t.tag = Tag.POINTER_ARRAY;
            if (i == func.paramsList.size()-1){
                if (t.tag == Tag.POINTER_ARRAY || t.tag == Tag.POINTER)
                    append((t.type + "*"), false);
                    append((t.type), false);
                if (t.tag == Tag.POINTER_ARRAY || t.tag == Tag.POINTER)
                    append((t.type + "*, "), false);
                    append((t.type + ", "), false);
        append((") {"), true); 
        func.varCount = func.nargsTot + 1;
        for (int i = 0; i < func.paramsList.size(); i++){
            ValueObj p = func.paramsList.get(i);
            ValueObj code = new ValueObj("%" + func.varCount, p.typeItem);
            func.varTable.put(p.value, code);
            if (p.typeItem.tag == Tag.POINTER_ARRAY || p.typeItem.tag == Tag.POINTER){
                append(("%" + func.varCount + " = alloca " + p.typeItem.type + "*, align " + code.typeItem.align), true);
                append(("store " + p.typeItem.type + "* %" + i + ", " + p.typeItem.type + "** " + "%" + func.varCount + ", align " + code.typeItem.align), true);
                append(("%" + func.varCount + " = alloca " + p.typeItem.type + ", align " + code.typeItem.align), true);
                append(("store " + p.typeItem.type + " %" + i + ", " + p.typeItem.type + "* " + "%" + func.varCount + ", align " + code.typeItem.align), true);
        append((") {"), true); 
:} stmt_list BC {:
    if (!currentFunc.ret && ret.type.equals("void"))
        append(("ret void"), true); //add void return if function is void and there is not any return instruction
    else if (!currentFunc.ret && !ret.type.equals("void"))
        genSemError("Missing return statement in function '" + + "'");
    append(("}"), true);
param_list ::= param_list CM param | param;
param ::= var_list:x type:t {:
     FuncObj func = currentFunc;
     int size = func.paramsList.size();
     int tot = func.nargsTot;
    for (int i = func.nargsTot; i<func.currentSymbolTable.nargs + tot; i++){
    func.currentSymbolTable.nargs = 0;
:} | ;

Let's consider the following function declaration in Go:

func test(a, b int, c *int) int {

When the function name test is recognized by the parser, a FuncObj object is created. Then, parameters a, b are stored in func.paramsList array and the TypeItem related to the int type is associated to them. In the same way, the parameter c is stored with the type related to the int pointer (TypeItem object with POINTER as Tag, “i32” as value, 4 as align). After that, all parameters have been reduced into the token param_list.

When the func rule have recognized FUNC keyword, function name, param_list, ret_type (obtained by reducing the int return value) and the open brace ({), it starts generating the output code. First of all, it saves the return value of the function in the funcList table, using function name as key. Then, it iterates over the function parameters, generating the params list in LLVM IR syntax. At this point, it has generated something like this:

define i32 @division(i32, i32, i32*) {

The aim of the second loop is to iterate again over parameters to generate the corresponding variables to access to parameters value. For each parameter, it allocates a new variable with the same type, and stores the correspondance parameter name - variable id in the varTable of the currentFunc object.

After that it recognizes the function body (reduced into stmt_list token), it checks the return value before closing the function scope. In particular, if the function body doesn't contain any return statement and the function return type is void, it manually inserts a ret void instruction. It's worth to notice that, if the function doesn't contain a return statement and the expected return type is not void, it generates a “Missing return statement” error.

The final result is the following:

define i32 @division(i32, i32, i32*) {
%4 = alloca i32, align 4
store i32 %0, i32* %4, align 4
%5 = alloca i32, align 4
store i32 %1, i32* %5, align 4
%6 = alloca i32*, align 4
store i32* %2, i32** %6, align 4
//function body

A string, in order to be printed, should be declared as a global constant array of characters, and then passed as parameter of the printf function. This code shows the rule to recognize a print instruction with only text:

print ::= PRINT RO STRING:value RC {: 
    int label = genLabelString();
    String s = value;
    s = s.substring(1, s.length() - 1);
    int length = s.length();
    if (!s.contains("\\n"))
    s = s.replace("\\n", "\\0A");
    s = s + "\\00";
    append(("%" + currentFunc.varCount + " = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([" + length + " x i8], [" + length + " x i8]* @.str." + label + ", i32 0, i32 0))"), true);

It is able to detect an instruction like the following one:

fmt.Printf("Hello, world\n")

When the instruction is recognized, a new label (used to create a global unique name) is generated. Then, the string is manipulated in order to be supported by the assembly code: Each \n is replaced by \0A, and the string should terminate with the characters \00. Then, the instruction to call the @printf assembly function is generated, passing the string label (which starts with @.str.[label]) as parameter. Finally, the string value is stored into the stringList array, which is printed at the end of the program, when the start symbol is recognized, calling the function printStrings defined in the action code of the parser. This function, after declaring the @printf defined by LLVM, creates a private constant for each array's entry.

The output is the following:

declare i32 @printf(i8*, ...)
@.str.0 = private constant [14 x i8] c"Hello, world\0A\00", align 1
%1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str.0, i32 0, i32 0))

Short variable declaration

Go provides a short statement to declare and assign variables, using the keyword :=. Let's consider the following short variable declaration:

a, b := 1, 2

This statement declares two variables, a and b, and assigns them values 1 and 2, so that the data type is deduced by the assigned value.

This code shows the rules to detect a short variable declaration statement:

short_declaration ::= var_stmt_list COL EQ ass_list {: currentFunc.currentSymbolTable.nargs = 0; currentFunc.currentSymbolTable.varList.clear(); :}
| var_stmt_list COL EQ error {: genSynWarning("Error in short assignment"); :};
ass_list ::= ass_list CM ass_exp | ass_exp;
ass_exp ::= exp:x {:
    ValueObj p = currentFunc.currentSymbolTable.varList.get(currentFunc.currentSymbolTable.nargs);
    ValueObj code;
    if (p.typeItem != null){
        code = new ValueObj("%" + currentFunc.varCount, p.typeItem);
        CheckTypeAssignment(p.typeItem, x.typeItem);
        code = new ValueObj("%" + currentFunc.varCount, new TypeItem(Tag.BASE, x.typeItem.type, x.typeItem.align));
    currentFunc.currentSymbolTable.varTable.put(p.value, code);
    append((code.value + " = alloca " + x.typeItem.type + ", align " + code.typeItem.align), true);
    append(("store " + x.typeItem.type + " " + x.value + ", " + code.typeItem.type + "* " + code.value + ", align " + code.typeItem.align), true);

First of all, it detects and stores a list of variables in currentFunc.varList array. Then, after having detected symbols :=, it reads the assigned values. For each value (obtained by the exp rule), it creates a new variable, assigning it the data type of the value itself, and associates this new variable to the name of the declared variable, storing this correspondence in currentFunc.varTable. Finally, it generates the assembly instructions to allocate the variable and to store its value.

Regarding the short variable declaration statement defined at the beginning of this paragraph, the corresponding output is like this:

%1 = alloca i32, align 4 ;allocate a variable
store i32 1, i32* %1, align 4 ;store the value
%2 = alloca i32, align 4
store i32 2, i32* %2, align 4

For instruction

Implementing loops and condition instructions in LLVM IR requires jumping among the code. Let's think about the following for instruction:

for i := 0; i<10; i++ {
  //do something

The for instruction is divided into three parts:

  • init statement (i := 0), executed at the beginning of the cycle
  • condition statement (i<10), evaluated at every loop cycle
  • post statement (i++), computed at the end of a cycle, before looping again

This kind of for is implemented by the compiler in the following way:

for_loop ::= FOR init_statement {:
    loopCount = ++totLoopCount;
    append(("br label %for.cond." + loopCount), true);
:} S {:
    append(("for.cond." + loopCount + ":"), true);
:} condition:x {:
    append(("br i1 " + x.value + ", label %for.body." + loopCount + ", label %for.exit." + loopCount), true);
:} S {:
    append(("" + loopCount + ":"), true);
:} assignment {:
    append(("br label %for.cond." + loopCount), true);
    append(("for.body." + loopCount + ":"), true);
:} stmt {:
    append(("br label" + loopCount), true);
    append(("for.exit." + loopCount + ":"), true);
init_statement ::= short_declaration | assignment;
condition ::= exp:x {: RESULT = x; :};

Following these rules, the compiler performs the following operations:

  • recognizes FOR keyword and the init_statement (which can be an assignment or a short variable declaration)
  • generates a label for.cond.[loopCount]: and jumps to it (br label &for.cond.[loopCount])
  • evaluates the condition and jumps to label for.body.[loopCount] if true, otherwise go to label for.exit.[loopCount]
  • generates the label[loopCount] (used to enter to the post statement)
  • reduces the post statement (assignment) and go to for.cond.[loopCount], in order to re-evalute the condition
  • generates a label for.body.[loopCount], at the beginning of loop body
  • when loop body end, jumps to label[loopCount]
  • generates the label for.exit.[loopCount] to allow leaving the loop if the condition is false

The loopCount variable is decremented to correctly deal with nested loops, so that the compiler generates the beginning of the outer loop, then increases loopCount to generate the entire inner loop, and finally decrements it to generate the final part of the outer loop.

The output generated to implement the for defined above is the following:

%1 = alloca i32, align 4
store i32 0, i32* %1, align 4
br label %for.cond.1
%2 = load i32, i32* %1, align 4
%3 = icmp slt i32 %2, 10
br i1 %3, label %for.body.1, label %for.exit.1
%4 = load i32, i32* %1, align 4
%5 = add nsw i32 %4, 1
store i32 %5, i32* %1, align 4
br label %for.cond.1
//function body
br label

Error handling

The compiler is able to recognize the following kind of errors:

  • Variable not declared
  • Assignment of uncompatible type
    • between i32/double to double/i32 variable
    • between i32/double to a pointer variable
  • Operation between different types
  • main function not found
  • Missing return statement in function
  • Array index exceeds array size
  • Function not declared
  • Wrong return type in function

The program also detects some kinds of syntactic errors/warnings using the error special token of CUP parser.

All detected errors are stored in the errorBuffer and printed when the start symbol is reduced.

A simple example

Let's consider the “Hello, world” written in Go language:

package main
import "fmt"
func main() {
    fmt.Printf("Hello, world\n")

Giving this code as input of the parser, the following LLVM IR code is generated:

declare i32 @printf(i8*, ...)
@.str.0 = private constant [14 x i8] c"Hello, world\0A\00", align 1
define void @main() {
%1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str.0, i32 0, i32 0))
ret void

This output, in order to be executed, should be stored in a [filename].ll file, and launched with the command lli [filename].ll (requires llvm package).

Execution of program will produce the following output:

Hello, world

Download and installation

Compiler: go_compiler


Each example contains the go source code and the corresponding output in LLVM IR, generated by the compiler. For sure, each source code can be given as input of the compiler, which will produce the same output contained in the example.

How to run it

  1. Install, if you have not already done so, the llvm package: sudo apt install llvm
  2. Download the go compiler and extract it
  3. Open the terminal, go to the folder where the compiler is extracted and run: make
  4. Take a .go file (from the examples or write a code supported by the compiler)
  5. Run the compiler, passing the .go file as input and redirecting the output in a .ll file
    1. java Main [filename].go > [filename].ll
  6. Run the output file with the command lli: lli [filename].ll

If you found any error, or if you want to partecipate to the editing of this wiki, please contact: admin [at]

You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url:
/web/htdocs/ · Last modified: 2024/04/08 22:34 by

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki
Privacy Policy