This page shows the implementation of a compiler which recognizes and translates part of the Julia programming language into the LLVM IR syntax (more information about LLVM can be found here).
List of Julia parts supported:
The compiler is made up of two parts: a scanner and a parser.
The scanner is able to recognize and retrieve tokens (terminal symbols) to the parser coupled with an object containing a value that the token represents. It identifies integer, doubles and ids (that will be used for variables, function names, …) and other significant Julia keywords like:
And other syntax elements like punctuation and other symbols.
d = [A-Za-z_][A-Za-z0-9_]* integer = ([1-9][0-9]*|0) double = (([0-9]+\.[0-9]*) | ([0-9]*\.[0-9]+)) (e|E('+'|'-')?[0-9]+)? %% "(" {return symbol(sym.RO);} ")" {return symbol(sym.RC);} "=" {return symbol(sym.EQU);} "+" {return symbol(sym.PLUS);} "-" {return symbol(sym.MINUS);} "*" {return symbol(sym.STAR);} "/" {return symbol(sym.DIV);} ... "&&" {return symbol(sym.AND);} "||" {return symbol(sym.OR);} "!" {return symbol(sym.NOT);} "[" {return symbol(sym.SO);} "]" {return symbol(sym.SC);} "else" {return symbol(sym.ELSE);} "end" {return symbol(sym.END);} "for" {return symbol(sym.FOR);} "function" {return symbol(sym.FUNCT);} "if" {return symbol(sym.IF);} "in" {return symbol(sym.IN);} "print" {return symbol(sym.PRINT);} "println" {return symbol(sym.PRINTLN);} "return" {return symbol(sym.RET);} "while" {return symbol(sym.WHILE);} ... {id} {return symbol(sym.ID, yytext());} {integer} {return symbol(sym.INT, new Integer(yytext()));} {double} {return symbol(sym.DOUBLE, new Double(yytext()));}
The parser is able to take as input the tokens that are provided by the scanner and recognize the main grammatical rules of Julia language. Consequently, the correspondend LLVM IR code is produced.
This snippet shows all variables and classes used to support the parser on the creation of the output program:
public HashMap <String, TypeVar> symbolTable; public HashMap <String, TypeFun> functionTable; public boolean isCorrect = true; public StringBuffer printBuff; public ArrayList<String> stringDef; public int var_label = 0; public int str_label = 0; public int flow_label = 0; public int tot_flow_label = 0; public BufferedWriter bwr; public String retType; public int genVarLabel(){ var_label++; return var_label; }; public int genStrLabel(){ str_label++; return str_label; }; public class TypeVar{ public String reg_id; public String type; public String value; public Integer align; public Integer size1; public Integer size2; public Integer l; public TypeVar() { reg_id = Integer.toString(genVarLabel()); size1 = size2 = -1; } TypeVar(Integer value, String type, Integer align) { this(); this.value = Integer.toString(value); this.type = type; this.align = align; } TypeVar(Double value, String type, Integer align) { this(); this.value = Double.toString(value); this.type = type; this.align = align; } } public class TypeFun{ ArrayList<String> parT; Integer nPar; String retT; public TypeFun(ArrayList<String> parT, String retT) { this.parT = parT; this.nPar = parT.size(); this.retT = retT; } }
Class TypeVar
: is a class used to represent every scalar value, variable or array in the source codereg_id
: represents the register in which the variable is storedtype
: represents the type of the variable value
: if it’s a scalar, this is the value of the variablealign
: the needed align for the variable size1
: -1 if a scalar, any (>0) otherwise size2
: was though for matrices, but it's not used since matrices are not supported l
: it represents the length of some instructions to allocate the correspondent variable used only for while and if constructs in order to place some labels properly Class TypeFun
: is a class used to represent functionsparT
: list of string representing parameters typenPar
: number of input parametersretT
: return typeHashmap<String, TypeVar> symbolTable
: is an hashmap containing the correspondence between a variable ID and its TypeVarHashmap<String, TypeFun> functionTable
: is an hashmap containing the correspondence between a fuction ID and its TypeFunStringbuffer printBuff
: is the buffer used to store all the outputs and then write on output.ll file all at onceArrayList<String> stringDef
: it's an array of strings, each of them contains the string definition that’s evaluated when a string is met during the parsingvar_label
: counter used for register names in LLVM IR str_label
: counter for string labelsflow_label
: counter for control flow labels tot_flow_label
: counter for total control flow labels
The grammar defined for the parser assumes that there are function definitions first and then there are the instructions that will be included in the @main
of the LLVM IR code.
printBuff
is first used to store (and eventually write on output.ll
) the recognized instructions. After that the non terminal function_defs
is recognized, the parser write the instructions correspondent to the functions defined. Then it clears the printBuffer
and var_label
counter that will be then used for writing instructions of the @main
.
BufferedWriter bwr
is the buffer used to create an output.ll
file and write the LLVM IR code in it.
prog ::= function_defs {: if(parser.isCorrect) { bwr.write("declare i32 @printf(i8*, ...)\n"); bwr.write(printBuff.toString()); } else System.out.println("Program contains errors."); var_label = 0; printBuff.setLength(0); :}statements {: if(parser.isCorrect) { for(String s : stringDef) { bwr.write(s+"\n"); } bwr.write("define void @main(){"); bwr.write(printBuff.toString()); bwr.write("ret void\n}"); bwr.flush(); //close the stream bwr.close(); } else System.out.println("Program contains errors."); :};
val token
is used to represent any scalar entity in the parser: it represents integer, double and variable names.
Each time an INT
or DOUBLE
token is recognized, the parser defines a new TypeVar
and allocate the needed memory locations for that scalars and the stores them.
While when an ID
token is recognized, the parser checks if the ID
is already defined in the symbolTable
, if it is, then assign to the val
token the TypeVar
that the ID
is coupled to.
val ::= ID:x {: if(!parser.symbolTable.containsKey(x)) { pSemError("Variable "+x+" not declared."); }else{ RESULT = parser.symbolTable.get(x); } :} | INT:x {: RESULT = new TypeVar(x, "i32", new Integer(4)); append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true); append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true); :} | DOUBLE:x {: RESULT = new TypeVar(x, "double", new Integer(8)); append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true); append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true); :} ;
This rules uses elem_list
non terminal symbol that represents a list of square brackets enclosed val
tokens.
elem_list
is coupled with an ArrayList<TypeVar>
storing the values stated in the brackets. The parser allocates the needed memory for the array elements and then it stores each array element in the location at proper position within the just allocated memory locations.
assignment ::= ID:id EQU SO elem_list:x SC{: TypeVar nTypeVar = new TypeVar(); Integer arrReg = Integer.parseInt(nTypeVar.reg_id); append("%"+arrReg+" = alloca ["+x.size()+" x "+x.get(0).type+"], align "+x.get(0).align, true); for(int i = 0; i<x.size(); i++) { TypeVar xTy = x.get(i); append("%"+genVarLabel()+" = getelementptr inbounds ["+x.size()+" x "+x.get(i).type+"], ["+x.size()+" x "+x.get(i).type+"]* %"+arrReg+", "+x.get(i).type+" 0, "+x.get(i).type+" "+i,true); append("%"+genVarLabel()+" = load "+xTy.type+", "+xTy.type+"* %"+xTy.reg_id+", align "+xTy.align,true); append("store "+xTy.type+" %"+var_label+", "+xTy.type+"* %"+(var_label-1)+", align "+xTy.align,true); } nTypeVar.type = x.get(0).type; nTypeVar.align = x.get(0).align; nTypeVar.size1 = x.size(); addEntry(id, nTypeVar ); :}
For example, given the following code:
a = [1,2,3]
Here's the correspondent LLVM IR code:
%1 = alloca i32, align 4 store i32 1, i32* %1 %2 = alloca i32, align 4 store i32 2, i32* %2 %3 = alloca i32, align 4 store i32 3, i32* %3 %4 = alloca [3 x i32], align 4 %5 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 0 %6 = load i32, i32* %1, align 4 store i32 %6, i32* %5, align 4 %7 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 1 %8 = load i32, i32* %2, align 4 store i32 %8, i32* %7, align 4 %9 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 2 %10 = load i32, i32* %3, align 4 store i32 %10, i32* %9, align 4
The following code shows how LLVM IR code is produced for function definition. Remember: function definition implementation only supports doubles as input and return parameters.
When the function definition is recognized, the parser produces the function definition in LLVM IR code. Before parsing the function statements, the input parameters are stored in new registers.
function_def ::= FUNCT ID:f RO ID_list:par{: var_label = par.size()+1; append("define @"+f+"(",false); for(int i = 0; i<par.size(); i++) { append("double", false); if(i != (par.size()-1)) append(", ", false); else append(") {", true); } for(int i = 0; i<par.size(); i++) { Integer reg; if(i==0) reg = var_label; else reg = genVarLabel(); append("%"+reg+" = alloca double, align 8", true); append("store double %"+i+", double* %"+reg, true); TypeVar p = new TypeVar(); var_label--; p.reg_id = Integer.toString(reg); p.type = new String("double"); p.align = new Integer(8); addEntry(par.get(i), p); } :} RC statements END{: append("}", true); String fName = parser.stack(-6).toString(); ArrayList<String> pType= new ArrayList<String>(); for(int i = 0; i<par.size(); i++) { pType.add("double"); } TypeFun funct = new TypeFun(pType,retType); functionTable.put(fName,funct); pType.set((pType.size()-1), "double*"); funct = new TypeFun(pType, retType); Integer index = printBuff.indexOf(fName)-2; printBuff.insert(index, " "+retType); retType = "void"; var_label = 0; symbolTable.clear(); :};
After function statements are recognized, the parser defines the correspondent entry in the functionTable
hashmap.
Let's consider the following function definition in Julia:
function division(a,b) ... end
The correspondent LLVM IR code will be:
define double @division(double, double) { %3 = alloca double, align 8 store double %0, double* %3 %4 = alloca double, align 8 store double %1, double* %4 ... }
When a string is recognized by the parser, it calls the CreateString
functions that defines the string declaration as a global constant. This function is also able to recognize if there’s interpolation of variables within the string (e.g. “$variable_name”).
Then, for simplicity and for exploit this CreateString functionality, when a variable is printed the parser encapsulates that variable in a string and pre-place a $ symbol to it.
If a numerical value is printed, the operation of pre-placing a $ symbol is obviously not done. If an array element is printed, CreateString
is not called but the parser directly produces the needed LLVM IR code.
print_instr ::= print_keyw:nl RO STRING:x RC{: CreateString(nl,x); :} | print_keyw:nl RO val:x RC {: if(!x.value.matches("\\d+")) { x.value = "$"+x.value; } CreateString(nl, x.value); :} | print_keyw:nl RO dest_var:x RC {: String s = new String("\\00"); if(nl) s = "\\0A" + s; s = "%d"+s; Integer length = s.length()-(nl?4:2); append("%"+genVarLabel()+" = load "+x.type+", "+x.type+"* %"+x.reg_id+", align "+x.align,true); parser.stringDef.add("@.str." + genStrLabel() + " = private constant [" + length + " x i8] c\"" + s + "\", align 1"); append(("%" + genVarLabel() + " = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([" + length + " x i8], [" + length + " x i8]* @.str." + str_label + ", i32 0, i32 0), i32 %"+(var_label-1)+")"), true); :}
The compiler is able to recognize the following kind of errors:
Compiler: julia_compiler.zip
Examples
Examples contains a Julia source code and the correspondent LLVM IR code.
sudo apt install llvm
jflex julia_scanner.jflex
java java_cup.Main -expect 8 julia_parser.cup
javac *.java
java Main source.jl
output.ll
file lli output.ll