From Julia to LLVM

From Julia to LLVM

Background

This page shows the implementation of a compiler which recognizes and translates part of the Julia programming language into the LLVM IR syntax (more information about LLVM can be found here).

Implemented parts

List of Julia parts supported:

Data types:
- Basic data types:
  - int
  - doubles
  - 1-dimension array
(Local scope for variables is implemented)

Operators:
- arithmetic:
  - sum (+)
  - difference (-)
  - multiplication (*)
  - division (/)
- compound:
  - sum (+=)
  - difference (-=)
  - multiplication (*=)
  - division (/=)
- comparison:
  - equality(==)
  - greater than (>)
  - less than (<)
  - greater than equal to (>=)
  - less than equal to (<=)
- logic:
  - AND (&&)
  - OR (||)
Control flow instructions:
- if instruction
- if-then-else instruction
- for loop over a range
- while loop
Function details (partial implementation):
- double input parameters
- double return value
Output:
- print/println
- with and without string interpolation (“x value is: $x”)

Compiler

The compiler is made up of two parts: a scanner and a parser.

Scanner

The scanner is able to recognize and retrieve tokens (terminal symbols) to the parser coupled with an object containing a value that the token represents. It identifies integer, doubles and ids (that will be used for variables, function names, …) and other significant Julia keywords like:

else
end
for
function
if
in
print
println
return
while

And other syntax elements like punctuation and other symbols.

Snippet of Julia Scanner

d = [A-Za-z_][A-Za-z0-9_]*
integer =  ([1-9][0-9]*|0)
double = (([0-9]+\.[0-9]*) | ([0-9]*\.[0-9]+)) (e|E('+'|'-')?[0-9]+)?
 
%%
"("     {return symbol(sym.RO);}
")"     {return symbol(sym.RC);}
"="     {return symbol(sym.EQU);}
"+"     {return symbol(sym.PLUS);}
"-"     {return symbol(sym.MINUS);}
"*"     {return symbol(sym.STAR);}
"/"     {return symbol(sym.DIV);}
 
...
 
"&&"     {return symbol(sym.AND);}
"||"     {return symbol(sym.OR);}
"!"     {return symbol(sym.NOT);}
"["     {return symbol(sym.SO);}
"]"     {return symbol(sym.SC);}
 
 
"else"    {return symbol(sym.ELSE);}
"end"     {return symbol(sym.END);}
"for"    {return symbol(sym.FOR);}
"function"    {return symbol(sym.FUNCT);}
"if"      {return symbol(sym.IF);}
"in"      {return symbol(sym.IN);}
"print"   {return symbol(sym.PRINT);}
"println"   {return symbol(sym.PRINTLN);}
"return"  {return symbol(sym.RET);}
"while"   {return symbol(sym.WHILE);}
 
... 
 
{id}      {return symbol(sym.ID, yytext());}
 
{integer} {return symbol(sym.INT, new Integer(yytext()));}
 
{double}  {return symbol(sym.DOUBLE, new Double(yytext()));}

Parser

The parser is able to take as input the tokens that are provided by the scanner and recognize the main grammatical rules of Julia language. Consequently, the correspondend LLVM IR code is produced.

Data structures

This snippet shows all variables and classes used to support the parser on the creation of the output program:

    public HashMap <String, TypeVar> symbolTable;
	public HashMap <String, TypeFun> functionTable;
	public boolean isCorrect = true; 
	public StringBuffer printBuff;
	public ArrayList<String> stringDef;
 
	public int var_label = 0;
	public int str_label = 0; 
 
	public int flow_label = 0;
 
	public int tot_flow_label = 0;
 
	public BufferedWriter bwr;
 
	public String retType;
	public int genVarLabel(){
 
		var_label++;
 
		return var_label; 
	};
 
	public int genStrLabel(){
 
		str_label++;
 
		return str_label; 
	};
 
	public class TypeVar{
 
		public String reg_id;
		public String type; 
		public String value;
		public Integer align;
		public Integer size1;
		public Integer size2;
		public Integer l;
 
		public TypeVar()
		{
			reg_id = Integer.toString(genVarLabel());
			size1 = size2 = -1;
		}
		TypeVar(Integer value, String type, Integer align)
		{
			this();
			this.value = Integer.toString(value);
			this.type = type;
			this.align = align;
		}
 
		TypeVar(Double value, String type, Integer align)
		{
			this();
			this.value = Double.toString(value);
			this.type = type;
			this.align = align;
		}
 
 
	}
 
	public class TypeFun{
 
		ArrayList<String> parT; 
		Integer nPar; 
		String retT; 
 
		public TypeFun(ArrayList<String> parT, String retT)
		{
			this.parT = parT;
			this.nPar = parT.size(); 
			this.retT = retT;
		}
 
	}

Class TypeVar: is a class used to represent every scalar value, variable or array in the source code
- reg_id: represents the register in which the variable is stored
- type: represents the type of the variable
- value: if it’s a scalar, this is the value of the variable
- align: the needed align for the variable
- size1: -1 if a scalar, any (>0) otherwise
- size2: was though for matrices, but it's not used since matrices are not supported
- l: it represents the length of some instructions to allocate the correspondent variable used only for while and if constructs in order to place some labels properly
Class TypeFun: is a class used to represent functions
- parT: list of string representing parameters type
- nPar: number of input parameters
- retT: return type
Hashmap<String, TypeVar> symbolTable: is an hashmap containing the correspondence between a variable ID and its TypeVar
Hashmap<String, TypeFun> functionTable: is an hashmap containing the correspondence between a fuction ID and its TypeFun
Stringbuffer printBuff: is the buffer used to store all the outputs and then write on output.ll file all at once
ArrayList<String> stringDef: it's an array of strings, each of them contains the string definition that’s evaluated when a string is met during the parsing
var_label: counter used for register names in LLVM IR
str_label: counter for string labels
flow_label: counter for control flow labels
tot_flow_label: counter for total control flow labels

Grammar start

The grammar defined for the parser assumes that there are function definitions first and then there are the instructions that will be included in the @main of the LLVM IR code. printBuff is first used to store (and eventually write on output.ll) the recognized instructions. After that the non terminal function_defs is recognized, the parser write the instructions correspondent to the functions defined. Then it clears the printBuffer and var_label counter that will be then used for writing instructions of the @main. BufferedWriter bwr is the buffer used to create an output.ll file and write the LLVM IR code in it.

prog ::= function_defs {:
    if(parser.isCorrect)
    {
        bwr.write("declare i32 @printf(i8*, ...)\n");
        bwr.write(printBuff.toString());
 
    }
    else
        System.out.println("Program contains errors.");
    var_label = 0; 
    printBuff.setLength(0);
 
:}statements {:
    if(parser.isCorrect)
    {
        for(String s : stringDef)
        {
            bwr.write(s+"\n");
        }
 
        bwr.write("define void @main(){");
 
        bwr.write(printBuff.toString());
 
        bwr.write("ret void\n}");
        bwr.flush();
 
        //close the stream
        bwr.close();
    }
    else
        System.out.println("Program contains errors.");
 
:};

Some practical examples

Recognition of scalars and variable IDs

val token is used to represent any scalar entity in the parser: it represents integer, double and variable names. Each time an INT or DOUBLE token is recognized, the parser defines a new TypeVar and allocate the needed memory locations for that scalars and the stores them. While when an ID token is recognized, the parser checks if the ID is already defined in the symbolTable, if it is, then assign to the val token the TypeVar that the ID is coupled to.

val ::= ID:x {:
    if(!parser.symbolTable.containsKey(x))
    {
        pSemError("Variable "+x+" not declared.");  
    }else{
        RESULT = parser.symbolTable.get(x);
 
    }
 
:}
| INT:x {:
    RESULT = new TypeVar(x, "i32", new Integer(4));
    append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true);
    append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true);
:}
| DOUBLE:x {:
    RESULT = new TypeVar(x, "double", new Integer(8));
    append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true);
    append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true);
:}
;

Array definition

This rules uses elem_list non terminal symbol that represents a list of square brackets enclosed val tokens. elem_list is coupled with an ArrayList<TypeVar> storing the values stated in the brackets. The parser allocates the needed memory for the array elements and then it stores each array element in the location at proper position within the just allocated memory locations.

assignment ::= ID:id EQU SO elem_list:x SC{:
    TypeVar nTypeVar = new TypeVar(); 
    Integer arrReg = Integer.parseInt(nTypeVar.reg_id);
    append("%"+arrReg+" = alloca ["+x.size()+" x "+x.get(0).type+"], align "+x.get(0).align, true);
    for(int i = 0; i<x.size(); i++)
    {
        TypeVar xTy = x.get(i);     
        append("%"+genVarLabel()+" = getelementptr inbounds ["+x.size()+" x "+x.get(i).type+"], ["+x.size()+" x "+x.get(i).type+"]* %"+arrReg+", "+x.get(i).type+" 0, "+x.get(i).type+" "+i,true);
        append("%"+genVarLabel()+" = load "+xTy.type+", "+xTy.type+"* %"+xTy.reg_id+", align "+xTy.align,true); 
        append("store "+xTy.type+" %"+var_label+", "+xTy.type+"* %"+(var_label-1)+", align "+xTy.align,true);
    } 
 
    nTypeVar.type = x.get(0).type;
    nTypeVar.align = x.get(0).align;
    nTypeVar.size1 = x.size();
    addEntry(id, nTypeVar );
 
:}

For example, given the following code:

a = [1,2,3]

Here's the correspondent LLVM IR code:

%1 = alloca i32, align 4
store i32 1, i32* %1
%2 = alloca i32, align 4
store i32 2, i32* %2
%3 = alloca i32, align 4
store i32 3, i32* %3
%4 = alloca [3 x i32], align 4
%5 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 0
%6 = load i32, i32* %1, align 4
store i32 %6, i32* %5, align 4
%7 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 1
%8 = load i32, i32* %2, align 4
store i32 %8, i32* %7, align 4
%9 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 2
%10 = load i32, i32* %3, align 4
store i32 %10, i32* %9, align 4

Function implementation

The following code shows how LLVM IR code is produced for function definition. Remember: function definition implementation only supports doubles as input and return parameters.

When the function definition is recognized, the parser produces the function definition in LLVM IR code. Before parsing the function statements, the input parameters are stored in new registers.

function_def ::= FUNCT ID:f RO ID_list:par{:
 
	var_label = par.size()+1;
	append("define @"+f+"(",false); 
	for(int i = 0; i<par.size(); i++)
	{
		append("double", false); 
		if(i != (par.size()-1))
			append(", ", false); 
		else 
			append(") {", true); 
 
	}
	for(int i = 0; i<par.size(); i++)
	{
		Integer reg; 
		if(i==0)
			reg = var_label;
		else	
			reg = genVarLabel();
		append("%"+reg+" = alloca double, align 8", true);
		append("store double %"+i+", double* %"+reg, true); 
		TypeVar p = new TypeVar();
		var_label--;
		p.reg_id = Integer.toString(reg); 
		p.type = new String("double"); 
		p.align = new Integer(8);  
		addEntry(par.get(i), p);
 
	}
 
 
 
 
:} RC statements END{:
	append("}", true);
	String fName = parser.stack(-6).toString();
	ArrayList<String> pType= new ArrayList<String>();
	for(int i = 0; i<par.size(); i++)
	{
		pType.add("double");
	}
	TypeFun funct = new TypeFun(pType,retType);
	functionTable.put(fName,funct);
 
	pType.set((pType.size()-1), "double*");
	funct = new TypeFun(pType, retType);
 
	Integer index = printBuff.indexOf(fName)-2;
	printBuff.insert(index, " "+retType);
 
 
	retType = "void"; 
	var_label = 0;
	symbolTable.clear(); 
:};

After function statements are recognized, the parser defines the correspondent entry in the functionTable hashmap.

Let's consider the following function definition in Julia:

function division(a,b)
 
...
 
end

The correspondent LLVM IR code will be:

define double @division(double, double) {
%3 = alloca double, align 8
store double %0, double* %3
%4 = alloca double, align 8
store double %1, double* %4
...
}

Print instruction

When a string is recognized by the parser, it calls the CreateString functions that defines the string declaration as a global constant. This function is also able to recognize if there’s interpolation of variables within the string (e.g. “$variable_name”). Then, for simplicity and for exploit this CreateString functionality, when a variable is printed the parser encapsulates that variable in a string and pre-place a $ symbol to it. If a numerical value is printed, the operation of pre-placing a $ symbol is obviously not done. If an array element is printed, CreateString is not called but the parser directly produces the needed LLVM IR code.

print_instr ::= print_keyw:nl RO STRING:x RC{:
    CreateString(nl,x);
:} 
| print_keyw:nl RO val:x RC {:
 
    if(!x.value.matches("\\d+"))
    {
        x.value = "$"+x.value;
    }
    CreateString(nl, x.value);
:} 
| print_keyw:nl RO dest_var:x RC
{:
    String s = new String("\\00");
    if(nl)
        s = "\\0A" + s;
    s = "%d"+s;
    Integer length = s.length()-(nl?4:2);
    append("%"+genVarLabel()+" = load "+x.type+", "+x.type+"* %"+x.reg_id+", align "+x.align,true); 
    parser.stringDef.add("@.str." + genStrLabel() + " = private constant [" + length + " x i8] c\"" + s + "\", align 1");
    append(("%" + genVarLabel() + " = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([" + length + " x i8], [" + length + " x i8]* @.str." + str_label + ", i32 0, i32 0), i32 %"+(var_label-1)+")"), true);
 
 
:}

Error handling

The compiler is able to recognize the following kind of errors:

Variable not declared
Variable is not an array
Function not defined
Generic error in assignment
Missing ] in array definition
Missing ] in matrix definition
General error in while condition
Missing ) in while condition
General error in if condition
General error in print instruction

Missing functionalities, partial implementations

Matrix and touples grammar rules defined but their usage is not supported
Functions work only with only doubles as input and return parameters
String assignment to variable not supported
Doubles are supported for arithmetic operations but not for other code functionalities (e.g. no support for array of double,…)
Print instruction doesn't support printing the result of operations (e.g. print(1+2))
No support for global variables

Download and installation

Compiler: julia_compiler.zip

Examples

Division: division.zip
Bubble sort: bubble.zip
Floyd triangle: triangle.zip

Examples contains a Julia source code and the correspondent LLVM IR code.

How to run it

Install the llvm package: sudo apt install llvm
Download the julia_compiler and extract it
Open the terminal, go to the folder where the compiler is extracted and run the following commands:
- jflex julia_scanner.jflex
- java java_cup.Main -expect 8 julia_parser.cup
- javac *.java
- java Main source.jl
This will produce an output.ll file
Run output.ll file with lli: lli output.ll

Table of Contents