Table of Contents
From Julia to LLVM
Background
This page shows the implementation of a compiler which recognizes and translates part of the Julia programming language into the LLVM IR syntax (more information about LLVM can be found here).
Implemented parts
List of Julia parts supported:
- Data types:
- Basic data types:
- int
- doubles
- 1-dimension array
- (Local scope for variables is implemented)
- Operators:
- arithmetic:
- sum (+)
- difference (-)
- multiplication (*)
- division (/)
- compound:
- sum (+=)
- difference (-=)
- multiplication (*=)
- division (/=)
- comparison:
- equality(==)
- greater than (>)
- less than (<)
- greater than equal to (>=)
- less than equal to (<=)
- logic:
- AND (&&)
- OR (||)
- Control flow instructions:
- if instruction
- if-then-else instruction
- for loop over a range
- while loop
- Function details (partial implementation):
- double input parameters
- double return value
- Output:
- print/println
- with and without string interpolation (“x value is: $x”)
Compiler
The compiler is made up of two parts: a scanner and a parser.
Scanner
The scanner is able to recognize and retrieve tokens (terminal symbols) to the parser coupled with an object containing a value that the token represents. It identifies integer, doubles and ids (that will be used for variables, function names, …) and other significant Julia keywords like:
- else
- end
- for
- function
- if
- in
- print
- println
- return
- while
And other syntax elements like punctuation and other symbols.
Snippet of Julia Scanner
d = [A-Za-z_][A-Za-z0-9_]* integer = ([1-9][0-9]*|0) double = (([0-9]+\.[0-9]*) | ([0-9]*\.[0-9]+)) (e|E('+'|'-')?[0-9]+)? %% "(" {return symbol(sym.RO);} ")" {return symbol(sym.RC);} "=" {return symbol(sym.EQU);} "+" {return symbol(sym.PLUS);} "-" {return symbol(sym.MINUS);} "*" {return symbol(sym.STAR);} "/" {return symbol(sym.DIV);} ... "&&" {return symbol(sym.AND);} "||" {return symbol(sym.OR);} "!" {return symbol(sym.NOT);} "[" {return symbol(sym.SO);} "]" {return symbol(sym.SC);} "else" {return symbol(sym.ELSE);} "end" {return symbol(sym.END);} "for" {return symbol(sym.FOR);} "function" {return symbol(sym.FUNCT);} "if" {return symbol(sym.IF);} "in" {return symbol(sym.IN);} "print" {return symbol(sym.PRINT);} "println" {return symbol(sym.PRINTLN);} "return" {return symbol(sym.RET);} "while" {return symbol(sym.WHILE);} ... {id} {return symbol(sym.ID, yytext());} {integer} {return symbol(sym.INT, new Integer(yytext()));} {double} {return symbol(sym.DOUBLE, new Double(yytext()));}
Parser
The parser is able to take as input the tokens that are provided by the scanner and recognize the main grammatical rules of Julia language. Consequently, the correspondend LLVM IR code is produced.
Data structures
This snippet shows all variables and classes used to support the parser on the creation of the output program:
public HashMap <String, TypeVar> symbolTable; public HashMap <String, TypeFun> functionTable; public boolean isCorrect = true; public StringBuffer printBuff; public ArrayList<String> stringDef; public int var_label = 0; public int str_label = 0; public int flow_label = 0; public int tot_flow_label = 0; public BufferedWriter bwr; public String retType; public int genVarLabel(){ var_label++; return var_label; }; public int genStrLabel(){ str_label++; return str_label; }; public class TypeVar{ public String reg_id; public String type; public String value; public Integer align; public Integer size1; public Integer size2; public Integer l; public TypeVar() { reg_id = Integer.toString(genVarLabel()); size1 = size2 = -1; } TypeVar(Integer value, String type, Integer align) { this(); this.value = Integer.toString(value); this.type = type; this.align = align; } TypeVar(Double value, String type, Integer align) { this(); this.value = Double.toString(value); this.type = type; this.align = align; } } public class TypeFun{ ArrayList<String> parT; Integer nPar; String retT; public TypeFun(ArrayList<String> parT, String retT) { this.parT = parT; this.nPar = parT.size(); this.retT = retT; } }
Class TypeVar
: is a class used to represent every scalar value, variable or array in the source codereg_id
: represents the register in which the variable is storedtype
: represents the type of the variablevalue
: if it’s a scalar, this is the value of the variablealign
: the needed align for the variablesize1
: -1 if a scalar, any (>0) otherwisesize2
: was though for matrices, but it's not used since matrices are not supportedl
: it represents the length of some instructions to allocate the correspondent variable used only for while and if constructs in order to place some labels properly
Class TypeFun
: is a class used to represent functionsparT
: list of string representing parameters typenPar
: number of input parametersretT
: return type
Hashmap<String, TypeVar> symbolTable
: is an hashmap containing the correspondence between a variable ID and its TypeVarHashmap<String, TypeFun> functionTable
: is an hashmap containing the correspondence between a fuction ID and its TypeFunStringbuffer printBuff
: is the buffer used to store all the outputs and then write on output.ll file all at onceArrayList<String> stringDef
: it's an array of strings, each of them contains the string definition that’s evaluated when a string is met during the parsingvar_label
: counter used for register names in LLVM IRstr_label
: counter for string labelsflow_label
: counter for control flow labelstot_flow_label
: counter for total control flow labels
Grammar start
The grammar defined for the parser assumes that there are function definitions first and then there are the instructions that will be included in the @main
of the LLVM IR code.
printBuff
is first used to store (and eventually write on output.ll
) the recognized instructions. After that the non terminal function_defs
is recognized, the parser write the instructions correspondent to the functions defined. Then it clears the printBuffer
and var_label
counter that will be then used for writing instructions of the @main
.
BufferedWriter bwr
is the buffer used to create an output.ll
file and write the LLVM IR code in it.
prog ::= function_defs {: if(parser.isCorrect) { bwr.write("declare i32 @printf(i8*, ...)\n"); bwr.write(printBuff.toString()); } else System.out.println("Program contains errors."); var_label = 0; printBuff.setLength(0); :}statements {: if(parser.isCorrect) { for(String s : stringDef) { bwr.write(s+"\n"); } bwr.write("define void @main(){"); bwr.write(printBuff.toString()); bwr.write("ret void\n}"); bwr.flush(); //close the stream bwr.close(); } else System.out.println("Program contains errors."); :};
Some practical examples
Recognition of scalars and variable IDs
val token
is used to represent any scalar entity in the parser: it represents integer, double and variable names.
Each time an INT
or DOUBLE
token is recognized, the parser defines a new TypeVar
and allocate the needed memory locations for that scalars and the stores them.
While when an ID
token is recognized, the parser checks if the ID
is already defined in the symbolTable
, if it is, then assign to the val
token the TypeVar
that the ID
is coupled to.
val ::= ID:x {: if(!parser.symbolTable.containsKey(x)) { pSemError("Variable "+x+" not declared."); }else{ RESULT = parser.symbolTable.get(x); } :} | INT:x {: RESULT = new TypeVar(x, "i32", new Integer(4)); append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true); append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true); :} | DOUBLE:x {: RESULT = new TypeVar(x, "double", new Integer(8)); append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true); append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true); :} ;
Array definition
This rules uses elem_list
non terminal symbol that represents a list of square brackets enclosed val
tokens.
elem_list
is coupled with an ArrayList<TypeVar>
storing the values stated in the brackets. The parser allocates the needed memory for the array elements and then it stores each array element in the location at proper position within the just allocated memory locations.
assignment ::= ID:id EQU SO elem_list:x SC{: TypeVar nTypeVar = new TypeVar(); Integer arrReg = Integer.parseInt(nTypeVar.reg_id); append("%"+arrReg+" = alloca ["+x.size()+" x "+x.get(0).type+"], align "+x.get(0).align, true); for(int i = 0; i<x.size(); i++) { TypeVar xTy = x.get(i); append("%"+genVarLabel()+" = getelementptr inbounds ["+x.size()+" x "+x.get(i).type+"], ["+x.size()+" x "+x.get(i).type+"]* %"+arrReg+", "+x.get(i).type+" 0, "+x.get(i).type+" "+i,true); append("%"+genVarLabel()+" = load "+xTy.type+", "+xTy.type+"* %"+xTy.reg_id+", align "+xTy.align,true); append("store "+xTy.type+" %"+var_label+", "+xTy.type+"* %"+(var_label-1)+", align "+xTy.align,true); } nTypeVar.type = x.get(0).type; nTypeVar.align = x.get(0).align; nTypeVar.size1 = x.size(); addEntry(id, nTypeVar ); :}
For example, given the following code:
a = [1,2,3]
Here's the correspondent LLVM IR code:
%1 = alloca i32, align 4 store i32 1, i32* %1 %2 = alloca i32, align 4 store i32 2, i32* %2 %3 = alloca i32, align 4 store i32 3, i32* %3 %4 = alloca [3 x i32], align 4 %5 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 0 %6 = load i32, i32* %1, align 4 store i32 %6, i32* %5, align 4 %7 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 1 %8 = load i32, i32* %2, align 4 store i32 %8, i32* %7, align 4 %9 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 2 %10 = load i32, i32* %3, align 4 store i32 %10, i32* %9, align 4
Function implementation
The following code shows how LLVM IR code is produced for function definition. Remember: function definition implementation only supports doubles as input and return parameters.
When the function definition is recognized, the parser produces the function definition in LLVM IR code. Before parsing the function statements, the input parameters are stored in new registers.
function_def ::= FUNCT ID:f RO ID_list:par{: var_label = par.size()+1; append("define @"+f+"(",false); for(int i = 0; i<par.size(); i++) { append("double", false); if(i != (par.size()-1)) append(", ", false); else append(") {", true); } for(int i = 0; i<par.size(); i++) { Integer reg; if(i==0) reg = var_label; else reg = genVarLabel(); append("%"+reg+" = alloca double, align 8", true); append("store double %"+i+", double* %"+reg, true); TypeVar p = new TypeVar(); var_label--; p.reg_id = Integer.toString(reg); p.type = new String("double"); p.align = new Integer(8); addEntry(par.get(i), p); } :} RC statements END{: append("}", true); String fName = parser.stack(-6).toString(); ArrayList<String> pType= new ArrayList<String>(); for(int i = 0; i<par.size(); i++) { pType.add("double"); } TypeFun funct = new TypeFun(pType,retType); functionTable.put(fName,funct); pType.set((pType.size()-1), "double*"); funct = new TypeFun(pType, retType); Integer index = printBuff.indexOf(fName)-2; printBuff.insert(index, " "+retType); retType = "void"; var_label = 0; symbolTable.clear(); :};
After function statements are recognized, the parser defines the correspondent entry in the functionTable
hashmap.
Let's consider the following function definition in Julia:
function division(a,b) ... end
The correspondent LLVM IR code will be:
define double @division(double, double) { %3 = alloca double, align 8 store double %0, double* %3 %4 = alloca double, align 8 store double %1, double* %4 ... }
Print instruction
When a string is recognized by the parser, it calls the CreateString
functions that defines the string declaration as a global constant. This function is also able to recognize if there’s interpolation of variables within the string (e.g. “$variable_name”).
Then, for simplicity and for exploit this CreateString functionality, when a variable is printed the parser encapsulates that variable in a string and pre-place a $ symbol to it.
If a numerical value is printed, the operation of pre-placing a $ symbol is obviously not done. If an array element is printed, CreateString
is not called but the parser directly produces the needed LLVM IR code.
print_instr ::= print_keyw:nl RO STRING:x RC{: CreateString(nl,x); :} | print_keyw:nl RO val:x RC {: if(!x.value.matches("\\d+")) { x.value = "$"+x.value; } CreateString(nl, x.value); :} | print_keyw:nl RO dest_var:x RC {: String s = new String("\\00"); if(nl) s = "\\0A" + s; s = "%d"+s; Integer length = s.length()-(nl?4:2); append("%"+genVarLabel()+" = load "+x.type+", "+x.type+"* %"+x.reg_id+", align "+x.align,true); parser.stringDef.add("@.str." + genStrLabel() + " = private constant [" + length + " x i8] c\"" + s + "\", align 1"); append(("%" + genVarLabel() + " = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([" + length + " x i8], [" + length + " x i8]* @.str." + str_label + ", i32 0, i32 0), i32 %"+(var_label-1)+")"), true); :}
Error handling
The compiler is able to recognize the following kind of errors:
- Variable not declared
- Variable is not an array
- Function not defined
- Generic error in assignment
- Missing ] in array definition
- Missing ] in matrix definition
- General error in while condition
- Missing ) in while condition
- General error in if condition
- General error in print instruction
Missing functionalities, partial implementations
- Matrix and touples grammar rules defined but their usage is not supported
- Functions work only with only doubles as input and return parameters
- String assignment to variable not supported
- Doubles are supported for arithmetic operations but not for other code functionalities (e.g. no support for array of double,…)
- Print instruction doesn't support printing the result of operations (e.g. print(1+2))
- No support for global variables
Download and installation
Compiler: julia_compiler.zip
Examples
- Division: division.zip
- Bubble sort: bubble.zip
- Floyd triangle: triangle.zip
Examples contains a Julia source code and the correspondent LLVM IR code.
How to run it
- Install the llvm package:
sudo apt install llvm
- Download the julia_compiler and extract it
- Open the terminal, go to the folder where the compiler is extracted and run the following commands:
jflex julia_scanner.jflex
java java_cup.Main -expect 8 julia_parser.cup
javac *.java
java Main source.jl
- This will produce an
output.ll
file - Run output.ll file with lli:
lli output.ll
If you found any error, or if you want to partecipate to the editing of this wiki, please contact: admin [at] skenz.it
You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url: https://www.skenz.it/compilers/julia_to_llvm