Table of Contents

From Julia to LLVM

Background

This page shows the implementation of a compiler which recognizes and translates part of the Julia programming language into the LLVM IR syntax (more information about LLVM can be found here).

Implemented parts

List of Julia parts supported:

Compiler

The compiler is made up of two parts: a scanner and a parser.

Scanner

The scanner is able to recognize and retrieve tokens (terminal symbols) to the parser coupled with an object containing a value that the token represents. It identifies integer, doubles and ids (that will be used for variables, function names, …) and other significant Julia keywords like:

And other syntax elements like punctuation and other symbols.

Snippet of Julia Scanner

  1. d = [A-Za-z_][A-Za-z0-9_]*
  2. integer = ([1-9][0-9]*|0)
  3. double = (([0-9]+\.[0-9]*) | ([0-9]*\.[0-9]+)) (e|E('+'|'-')?[0-9]+)?
  4.  
  5. %%
  6. "(" {return symbol(sym.RO);}
  7. ")" {return symbol(sym.RC);}
  8. "=" {return symbol(sym.EQU);}
  9. "+" {return symbol(sym.PLUS);}
  10. "-" {return symbol(sym.MINUS);}
  11. "*" {return symbol(sym.STAR);}
  12. "/" {return symbol(sym.DIV);}
  13.  
  14. ...
  15.  
  16. "&&" {return symbol(sym.AND);}
  17. "||" {return symbol(sym.OR);}
  18. "!" {return symbol(sym.NOT);}
  19. "[" {return symbol(sym.SO);}
  20. "]" {return symbol(sym.SC);}
  21.  
  22.  
  23. "else" {return symbol(sym.ELSE);}
  24. "end" {return symbol(sym.END);}
  25. "for" {return symbol(sym.FOR);}
  26. "function" {return symbol(sym.FUNCT);}
  27. "if" {return symbol(sym.IF);}
  28. "in" {return symbol(sym.IN);}
  29. "print" {return symbol(sym.PRINT);}
  30. "println" {return symbol(sym.PRINTLN);}
  31. "return" {return symbol(sym.RET);}
  32. "while" {return symbol(sym.WHILE);}
  33.  
  34. ...
  35.  
  36. {id} {return symbol(sym.ID, yytext());}
  37.  
  38. {integer} {return symbol(sym.INT, new Integer(yytext()));}
  39.  
  40. {double} {return symbol(sym.DOUBLE, new Double(yytext()));}

Parser

The parser is able to take as input the tokens that are provided by the scanner and recognize the main grammatical rules of Julia language. Consequently, the correspondend LLVM IR code is produced.

Data structures

This snippet shows all variables and classes used to support the parser on the creation of the output program:

    public HashMap <String, TypeVar> symbolTable;
	public HashMap <String, TypeFun> functionTable;
	public boolean isCorrect = true; 
	public StringBuffer printBuff;
	public ArrayList<String> stringDef;
 
	public int var_label = 0;
	public int str_label = 0; 
 
	public int flow_label = 0;
 
	public int tot_flow_label = 0;
 
	public BufferedWriter bwr;
 
	public String retType;
	public int genVarLabel(){
 
		var_label++;
 
		return var_label; 
	};
 
	public int genStrLabel(){
 
		str_label++;
 
		return str_label; 
	};
 
	public class TypeVar{
 
		public String reg_id;
		public String type; 
		public String value;
		public Integer align;
		public Integer size1;
		public Integer size2;
		public Integer l;
 
		public TypeVar()
		{
			reg_id = Integer.toString(genVarLabel());
			size1 = size2 = -1;
		}
		TypeVar(Integer value, String type, Integer align)
		{
			this();
			this.value = Integer.toString(value);
			this.type = type;
			this.align = align;
		}
 
		TypeVar(Double value, String type, Integer align)
		{
			this();
			this.value = Double.toString(value);
			this.type = type;
			this.align = align;
		}
 
 
	}
 
	public class TypeFun{
 
		ArrayList<String> parT; 
		Integer nPar; 
		String retT; 
 
		public TypeFun(ArrayList<String> parT, String retT)
		{
			this.parT = parT;
			this.nPar = parT.size(); 
			this.retT = retT;
		}
 
	}

Grammar start

The grammar defined for the parser assumes that there are function definitions first and then there are the instructions that will be included in the @main of the LLVM IR code. printBuff is first used to store (and eventually write on output.ll) the recognized instructions. After that the non terminal function_defs is recognized, the parser write the instructions correspondent to the functions defined. Then it clears the printBuffer and var_label counter that will be then used for writing instructions of the @main. BufferedWriter bwr is the buffer used to create an output.ll file and write the LLVM IR code in it.

prog ::= function_defs {:
    if(parser.isCorrect)
    {
        bwr.write("declare i32 @printf(i8*, ...)\n");
        bwr.write(printBuff.toString());
 
    }
    else
        System.out.println("Program contains errors.");
    var_label = 0; 
    printBuff.setLength(0);
 
:}statements {:
    if(parser.isCorrect)
    {
        for(String s : stringDef)
        {
            bwr.write(s+"\n");
        }
 
        bwr.write("define void @main(){");
 
        bwr.write(printBuff.toString());
 
        bwr.write("ret void\n}");
        bwr.flush();
 
        //close the stream
        bwr.close();
    }
    else
        System.out.println("Program contains errors.");
 
:};

Some practical examples

Recognition of scalars and variable IDs

val token is used to represent any scalar entity in the parser: it represents integer, double and variable names. Each time an INT or DOUBLE token is recognized, the parser defines a new TypeVar and allocate the needed memory locations for that scalars and the stores them. While when an ID token is recognized, the parser checks if the ID is already defined in the symbolTable, if it is, then assign to the val token the TypeVar that the ID is coupled to.

val ::= ID:x {:
    if(!parser.symbolTable.containsKey(x))
    {
        pSemError("Variable "+x+" not declared.");  
    }else{
        RESULT = parser.symbolTable.get(x);
 
    }
 
:}
| INT:x {:
    RESULT = new TypeVar(x, "i32", new Integer(4));
    append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true);
    append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true);
:}
| DOUBLE:x {:
    RESULT = new TypeVar(x, "double", new Integer(8));
    append("%"+RESULT.reg_id+" = alloca "+RESULT.type+", align "+RESULT.align, true);
    append("store "+RESULT.type+" "+RESULT.value+", "+RESULT.type+"* %"+RESULT.reg_id, true);
:}
;
Array definition

This rules uses elem_list non terminal symbol that represents a list of square brackets enclosed val tokens. elem_list is coupled with an ArrayList<TypeVar> storing the values stated in the brackets. The parser allocates the needed memory for the array elements and then it stores each array element in the location at proper position within the just allocated memory locations.

assignment ::= ID:id EQU SO elem_list:x SC{:
    TypeVar nTypeVar = new TypeVar(); 
    Integer arrReg = Integer.parseInt(nTypeVar.reg_id);
    append("%"+arrReg+" = alloca ["+x.size()+" x "+x.get(0).type+"], align "+x.get(0).align, true);
    for(int i = 0; i<x.size(); i++)
    {
        TypeVar xTy = x.get(i);     
        append("%"+genVarLabel()+" = getelementptr inbounds ["+x.size()+" x "+x.get(i).type+"], ["+x.size()+" x "+x.get(i).type+"]* %"+arrReg+", "+x.get(i).type+" 0, "+x.get(i).type+" "+i,true);
        append("%"+genVarLabel()+" = load "+xTy.type+", "+xTy.type+"* %"+xTy.reg_id+", align "+xTy.align,true); 
        append("store "+xTy.type+" %"+var_label+", "+xTy.type+"* %"+(var_label-1)+", align "+xTy.align,true);
    } 
 
    nTypeVar.type = x.get(0).type;
    nTypeVar.align = x.get(0).align;
    nTypeVar.size1 = x.size();
    addEntry(id, nTypeVar );
 
:}

For example, given the following code:

a = [1,2,3]

Here's the correspondent LLVM IR code:

%1 = alloca i32, align 4
store i32 1, i32* %1
%2 = alloca i32, align 4
store i32 2, i32* %2
%3 = alloca i32, align 4
store i32 3, i32* %3
%4 = alloca [3 x i32], align 4
%5 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 0
%6 = load i32, i32* %1, align 4
store i32 %6, i32* %5, align 4
%7 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 1
%8 = load i32, i32* %2, align 4
store i32 %8, i32* %7, align 4
%9 = getelementptr inbounds [3 x i32], [3 x i32]* %4, i32 0, i32 2
%10 = load i32, i32* %3, align 4
store i32 %10, i32* %9, align 4

Function implementation

The following code shows how LLVM IR code is produced for function definition. Remember: function definition implementation only supports doubles as input and return parameters.

When the function definition is recognized, the parser produces the function definition in LLVM IR code. Before parsing the function statements, the input parameters are stored in new registers.

function_def ::= FUNCT ID:f RO ID_list:par{:
 
	var_label = par.size()+1;
	append("define @"+f+"(",false); 
	for(int i = 0; i<par.size(); i++)
	{
		append("double", false); 
		if(i != (par.size()-1))
			append(", ", false); 
		else 
			append(") {", true); 
 
	}
	for(int i = 0; i<par.size(); i++)
	{
		Integer reg; 
		if(i==0)
			reg = var_label;
		else	
			reg = genVarLabel();
		append("%"+reg+" = alloca double, align 8", true);
		append("store double %"+i+", double* %"+reg, true); 
		TypeVar p = new TypeVar();
		var_label--;
		p.reg_id = Integer.toString(reg); 
		p.type = new String("double"); 
		p.align = new Integer(8);  
		addEntry(par.get(i), p);
 
	}
 
 
 
 
:} RC statements END{:
	append("}", true);
	String fName = parser.stack(-6).toString();
	ArrayList<String> pType= new ArrayList<String>();
	for(int i = 0; i<par.size(); i++)
	{
		pType.add("double");
	}
	TypeFun funct = new TypeFun(pType,retType);
	functionTable.put(fName,funct);
 
	pType.set((pType.size()-1), "double*");
	funct = new TypeFun(pType, retType);
 
	Integer index = printBuff.indexOf(fName)-2;
	printBuff.insert(index, " "+retType);
 
 
	retType = "void"; 
	var_label = 0;
	symbolTable.clear(); 
:};

After function statements are recognized, the parser defines the correspondent entry in the functionTable hashmap.

Let's consider the following function definition in Julia:

function division(a,b)
 
...
 
end

The correspondent LLVM IR code will be:

define double @division(double, double) {
%3 = alloca double, align 8
store double %0, double* %3
%4 = alloca double, align 8
store double %1, double* %4
...
}

When a string is recognized by the parser, it calls the CreateString functions that defines the string declaration as a global constant. This function is also able to recognize if there’s interpolation of variables within the string (e.g. “$variable_name”). Then, for simplicity and for exploit this CreateString functionality, when a variable is printed the parser encapsulates that variable in a string and pre-place a $ symbol to it. If a numerical value is printed, the operation of pre-placing a $ symbol is obviously not done. If an array element is printed, CreateString is not called but the parser directly produces the needed LLVM IR code.

print_instr ::= print_keyw:nl RO STRING:x RC{:
    CreateString(nl,x);
:} 
| print_keyw:nl RO val:x RC {:
 
    if(!x.value.matches("\\d+"))
    {
        x.value = "$"+x.value;
    }
    CreateString(nl, x.value);
:} 
| print_keyw:nl RO dest_var:x RC
{:
    String s = new String("\\00");
    if(nl)
        s = "\\0A" + s;
    s = "%d"+s;
    Integer length = s.length()-(nl?4:2);
    append("%"+genVarLabel()+" = load "+x.type+", "+x.type+"* %"+x.reg_id+", align "+x.align,true); 
    parser.stringDef.add("@.str." + genStrLabel() + " = private constant [" + length + " x i8] c\"" + s + "\", align 1");
    append(("%" + genVarLabel() + " = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([" + length + " x i8], [" + length + " x i8]* @.str." + str_label + ", i32 0, i32 0), i32 %"+(var_label-1)+")"), true);
 
 
:}

Error handling

The compiler is able to recognize the following kind of errors:

Missing functionalities, partial implementations

Download and installation

Compiler: julia_compiler.zip

Examples

Examples contains a Julia source code and the correspondent LLVM IR code.

How to run it

  1. Install the llvm package: sudo apt install llvm
  2. Download the julia_compiler and extract it
  3. Open the terminal, go to the folder where the compiler is extracted and run the following commands:
    • jflex julia_scanner.jflex
    • java java_cup.Main -expect 8 julia_parser.cup
    • javac *.java
    • java Main source.jl
  4. This will produce an output.ll file
  5. Run output.ll file with lli: lli output.ll