====== From Haskell to LLVM ======
===== Introduction =====

In this presentation, we illustrate LHC, a compiler that translates a subset of Haskell into the LLVM Intermediate Representation.

=== The High-end Language ===

Haskell is a general-purpose, purely functional programming language.

Being a Pure Functional Language, Haskell has many interesting properties:

  * Referential Transparency
  * Single Assignment
  * Non-Strictness
  * Type Inference
  * Functions as first-class entities
  * Lexical Scoping
  * ...

LHC tries to follow in the footsteps of this Language by implementing some of its peculiar features (Referential Transparency, Single Assignment, Lexical Scoping, Indentation-based Parsing).

=== The Core ===

LHC is based on:

  * a [[haskell_to_llvm#scanner|Scanner]] written in JFlex
  * an LALR(1) [[haskell_to_llvm#parser|Parser]] written in Cup
  * additional classes for [[haskell_to_llvm#semantic_analysis|Semantic Analysis]] and IR Generation

=== The Target Language ===

The Target Language of LHC is LLVM, or more precisely the LLVM Intermediate Representation (IR).

Born as a research project at the University of Illinois, LLVM has developed into an umbrella project consisting of several subprojects, among which the LLVM Core libraries and the Clang compiler.

==== Features ====

Currently, LHC supports the following subset of Haskell:

  * Supported Types:
    * Basic Types: Int, Double, Char, Bool
    * Aggregate Types: Lists (not recursive), Strings
    * Other Types: Functions (not as first-class entities)
  * Supported operations:
    * Addition (+)
    * Subtraction (-)
    * Multiplication (*)
    * Division (/)
    * Integer Division (div)
    * Remainder (rem)
    * Logical operators:
      * Logical and (&&)
      * Logical or (||)
      * Logical not (not)
    * Relational operators:
      * Greater than (>)
      * Greater or equal to (>=)
      * Less than (<)
      * Less or equal to (<=)
      * Equal to (==)
      * Not equal to (/=)
    * Element extraction from List/String (!!)
    * Function call
  * Constructs:
    * If-Then-Else (the only possible Control Flow Statement: loops do not exist in Haskell)
    * Let Bindings (both in the Functional and Imperative Part)
    * Do Block
  * I/O:
    * Print Function (accepts Int, Double, Char, String)
  * Language Properties:
    * Indentation-based Parsing
    * Referential Transparency
    * Single Assignment ("values", not variables)
    * Lexical Scoping

=== Partial/Missing Features ===

  * Functions cannot return Lists/Strings
  * Global Lists/Strings are not supported
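
To make the supported subset concrete, here is a small program of the kind LHC is meant to accept; this is an illustrative sketch based on the feature list above, not an official test case of the project:

```haskell
-- functional part: type declarations, a value and a function
x :: Int
x = 10

addSq :: Int -> Int -> Int
addSq a b = a * a + b * b

-- imperative part: the main, with a do block and prints
main = do
    print x
    print (addSq 3 4)
```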

===== The Compiler =====

**Note**: all debug code is not shown at all; we use "[...]" to mark code that has been omitted for clarity.
==== Scanner ====

In this section, we illustrate how Tokens are scanned in LHC using a Scanner written in JFlex.

=== General Scanning Algorithm ===

The Scanning Algorithm implemented in LHC uses a FIFO ''tokenQueue'' in which scanned Tokens are buffered. \\
In particular, every time that a Token is requested by the Parser:
  - if the ''tokenQueue'' is not empty, the first Token in the queue is returned;
  - otherwise:
    - look for the next token(s) (launches the default scanning method ''next_token()'')
    - manage the token(s) (in the post-scanning method ''manageToken()'')
    - return the first token to the Parser (in ''next_token_custom()'')
Managing basic Tokens, i.e. all those not related to indentation, simply consists of adding them to the ''tokenQueue'' and returning the head of the queue.

**About code**: when a method is reported, it assumes a form like ''methodName()''; arguments are omitted when they are not relevant.

Snippet of ''next_token_custom()'':
<code java>
public Symbol next_token_custom() throws IOException {
    // if the token Queue is not empty, extract the next item
    if (this.tokenQueue.size() > 0) {
        return this.tokenQueue.remove();
    }
    [...] // Additional code for Indentation-based Parsing, we will cover this later
    return this.next_token();
}
</code>

Snippet of ''manageToken()'':
<code java>
private Symbol manageToken(Symbol... symbols) {
    [...] // Additional code for Indentation-based Parsing
    for (Symbol sym : symbols) {
        tokenQueue.add(sym);
    }
    return this.tokenQueue.remove();
}
</code>

It is always up to the Parser to ask the Scanner for a new token; by default, the Parser calls the ''next_token()'' method of the Scanner, but this behaviour can be overridden in the ''scan with'' directive of Cup:

<code java>
scan with {:
    if (true) { // Required because of a bug in this enhanced implementation of Cup
        Scanner scanner = (Scanner) this.getScanner();
        Symbol s = scanner.next_token_custom();
        [...] // Additional code for tracking current line and column and for drawing the Parse Tree
        return s;
    }
:}
</code>

=== Basic Token management ===

The Scanner can recognize several keywords and regular expressions that identify specific terminals of the Grammar. \\
Here is a snippet of the Scanner code for identifying those Tokens:

<code>
int      = 0|[1-9][0-9]*
rational = {int}\.[0-9]+
expform  = {rational}[eE]-?{int}
double   = {rational}|{expform}
bool     = True|False
char     = \'[^\']\'
string   = \"[^\"]*\"
id       = [a-z][a-zA-Z0-9_\']*
nl       = \r|\n|\r\n
ws       = [ \t]

%%

"="      {return manageToken(createSymbol(sym.eq));}
"::"     {return manageToken(createSymbol(sym.clns));}
","      {return manageToken(createSymbol(sym.cm));}
"->"     {return manageToken(createSymbol(sym.arrow));}
"("      {return manageToken(createSymbol(sym.ro));}
")"      {return manageToken(createSymbol(sym.rc));}
"["      {return manageToken(createSymbol(sym.bo));}
"]"      {return manageToken(createSymbol(sym.bc));}
"+"      {return manageToken(createSymbol(sym.plus));}
"-"      {return manageToken(createSymbol(sym.minus));}
"*"      {return manageToken(createSymbol(sym.times));}
"/"      {return manageToken(createSymbol(sym.div));}
div      {return manageToken(createSymbol(sym.intdiv));}
rem      {return manageToken(createSymbol(sym.rem));}
"&&"     {return manageToken(createSymbol(sym.and));}
"||"     {return manageToken(createSymbol(sym.or));}
not      {return manageToken(createSymbol(sym.not));}
"/="     {return manageToken(createSymbol(sym.relnoteq));}
"=="     {return manageToken(createSymbol(sym.releq));}
">="     {return manageToken(createSymbol(sym.relge));}
">"      {return manageToken(createSymbol(sym.relgt));}
"<="     {return manageToken(createSymbol(sym.relle));}
"<"      {return manageToken(createSymbol(sym.rellt));}
";"      {return manageToken(createSymbol(sym.sep));}
in       {[...]
          return manageToken(createSymbol(sym.in));}
main     {return manageToken(createSymbol(sym.main));}
do       {[...]
          return manageToken(createSymbol(sym.do_begin));}
"!!"     {return manageToken(createSymbol(sym.index));}
if       {return manageToken(createSymbol(sym.if_begin));}
then     {return manageToken(createSymbol(sym.then));}
else     {return manageToken(createSymbol(sym.else_begin));}
let      {[...]
          return manageToken(createSymbol(sym.let));}
print    {return manageToken(createSymbol(sym.print));}
Int      {return manageToken(createSymbol(sym.type_int));}
Double   {return manageToken(createSymbol(sym.type_double));}
Bool     {return manageToken(createSymbol(sym.type_bool));}
Char     {return manageToken(createSymbol(sym.type_char));}
String   {return manageToken(createSymbol(sym.type_string));}
{int}    {return manageToken(createSymbol(sym.val_int, yytext()));}
{double} {return manageToken(createSymbol(sym.val_double, yytext()));}
{bool}   {return manageToken(createSymbol(sym.val_bool, yytext()));}
{char}   {return manageToken(createSymbol(sym.val_char, yytext()));}
{string} {return manageToken(createSymbol(sym.val_string, yytext()));}
// LLVM does not support the single quote as a valid character for identifiers, so it is replaced here
{id}     {return manageToken(createSymbol(sym.id, yytext().replace('\'', '_')));}
{ws}     {;}
{nl}     {[...]}

/* Single-line Comment */
"--".*   {;}
</code>

=== Indentation-based Parsing ===

For nested blocks, Haskell supports both a coding style that uses brackets and an indentation-based one, which LHC supports.
However, Haskell uses [[wp>Off-side_rule|off-side rule]] parsing, while JFlex and Cup follow a context-free approach: indentation must therefore be handled explicitly by the Scanner.

How can we recognize Indentation?
A possible approach, used in Python, consists of defining additional tokens on purpose: Indent, Dedent and Separator; Indent starts a block, Dedent terminates it, Separator divides statements inside the same one. \\
We also define additional variables and data structures:
  * an ''indentEnable'' flag, set when the last scanned keyword ("do" or "let") opens an indentation-sensitive block
  * an ''indentStack'', storing the starting columns of the currently open blocks
  * a ''foundNewline'' flag, set when a newline has just been scanned
  * a ''scanIndent'' flag, set when the next scanned token defines the column of a new block
  * a ''dedentForce'' flag, set when the innermost block must be closed regardless of indentation (e.g. by the ''in'' keyword)
The concept is the following: when an indentation-sensitive block is recognized, i.e. its keyword is scanned ("do" or "let"), the column of the next scanned token defines the indentation level of the block and is pushed onto the ''indentStack''. \\
__Beware__: nested blocks can only be more to the right than the current one.

In case we are in an indentation-sensitive block:
  * if a newline is recognized and the following line is more to the left than the current block, a Dedent is scanned: the block is closed; actually, this is a loop: we may scan multiple Dedents at once. The loop halts when a block is aligned more to the left than the line. In the particular case where a Dedent is forced by the ''in'' keyword, the innermost block is closed unconditionally
  * if the following and the current line are aligned the same, then a Separator is scanned
  * if the following line is more to the right than the block, no additional symbol is scanned: we are still in that block
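
The stack discipline described above can also be sketched in isolation. The following is a simplified, self-contained model with hypothetical class and method names, independent of the actual JFlex scanner; it handles a single block opened at the column of its first line:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IndentSketch {
    // Given the starting column of each line of a block, emit the
    // pseudo-tokens INDENT, SEP and DEDENT as described above.
    // The block starts at the column of its first line.
    static List<String> tokens(int[] columns) {
        Deque<Integer> indentStack = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        indentStack.push(columns[0]);
        out.add("INDENT");
        for (int i = 1; i < columns.length; i++) {
            int col = columns[i];
            // close every block that lies strictly to the right of this line
            while (!indentStack.isEmpty() && indentStack.peek() > col) {
                indentStack.pop();
                out.add("DEDENT");
            }
            // same column as the innermost block: new statement in that block
            if (!indentStack.isEmpty() && indentStack.peek() == col) {
                out.add("SEP");
            }
            // col greater than the top of the stack: still inside the block
        }
        // at end of input, close all the remaining blocks
        while (!indentStack.isEmpty()) {
            indentStack.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        // a block at column 3 with two statements, then a line back at column 1
        System.out.println(tokens(new int[]{3, 3, 1}));
    }
}
```

For the lines at columns ''{3, 3, 1}'' the sketch emits ''INDENT, SEP, DEDENT'': the second line is a new statement in the same block, and the third line closes it.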

Having defined the principle, we can illustrate the code.

Here are the additional instructions for indentation control, for specific Tokens:

<code>
in    {dedentForce = true;
       return manageToken(createSymbol(sym.in));}
[...]
do    {indentEnable = true;
       return manageToken(createSymbol(sym.do_begin));}
[...]
let   {indentEnable = true;
       return manageToken(createSymbol(sym.let));}
[...]
{nl}  {foundNewline = true;}

/* Single-line Comment */
"--".* {;}
</code>

Code snippet to enable ''scanIndent'' once an indentation-sensitive keyword has been scanned (in ''manageToken()''):
<code java>
if (indentEnable) {
    indentEnable = false;
    scanIndent = true;
}
</code>

Code snippet to scan a new Indent, define the column level of the nested block and store it (in ''manageToken()''):

<code java>
if (!endOfCode && scanIndent) {
    indentColumn = yycolumn + 1;
    if (indentStack.size() == 0 || indentColumn > indentStack.peek()) {
        indentStack.push(indentColumn);
        tokenQueue.add(createSymbol(sym.indent));
    }
    else {
        report_error([...]); // notify an indentation error
        tokenQueue.add(createSymbol(sym.error));
    }
    scanIndent = false;
    foundNewline = false;
}
</code>

Code snippet to scan one or more Dedents (in ''manageToken()''):

<code java>
if (!endOfCode && !indentStack.empty() && (foundNewline || dedentForce)) {
    dedentColumn = yycolumn + 1;
    boolean thereIsDedentToInsert = true;
    while (thereIsDedentToInsert && !indentStack.empty()) {
        indentColumn = indentStack.peek();
        if (indentColumn > dedentColumn || dedentForce) { // check if we are out of the innermost block
            dedentForce = false;
            indentStack.pop();
            [...]
            tokenQueue.add(createSymbol(sym.dedent));
        }
        else {
            thereIsDedentToInsert = false;
            if (indentColumn.equals(dedentColumn)) {
                // the following statement is not outside of the block: add only a Separator
                tokenQueue.add(createSymbol(sym.sep));
            }
        }
    }
}
</code>

=== Error Handling ===

If no Token can be recognized, a Lexical Error is notified by calling the ''report_error()'' method of the Scanner:
<code java>
public void report_error(String msg) {
    System.err.print("ERROR: Lexical error");
    System.err.print(" (line " + (yyline + 1) + ", column " + (yycolumn + 1) + ")");
    System.err.println(": " + msg);
}
</code>

If no Token can be recognized, then the text matches the following catch-all regular expression, defined at the end of the code:

<code>
.  {report_error("Unrecognized token: " + yytext());
    return this.manageToken(createSymbol(sym.error));}
</code>

==== Parser ====

The Parser, written in Cup, recognizes the syntax of the program and handles possible occurring errors. \\
Here is the full Grammar of the Language recognized by LHC (Semantic Rules are explored in the section about [[haskell_to_llvm#semantic_analysis|Semantic Analysis]]):

<code>
PROGRAM ::= /* empty program */
          | indent FUNCT_PART IMPER_PART dedent

/// ////////////////// ///
///  IMPERATIVE PART  ///
/// ////////////////// ///

IMPER_PART ::= main eq IO_ACTION

IO_ACTION ::= PRINT
            | DO_BLOCK
            | IF_BLOCK_IMPER

IO_ACTIONS ::= IO_ACTIONS sep IO_ACTION
             | IO_ACTIONS sep LET_BLOCK_IMPER
             | IO_ACTION
             | LET_BLOCK_IMPER

PRINT ::= print ACTARG

DO_BLOCK ::= do_begin indent IO_ACTIONS dedent

IF_BLOCK_IMPER ::= if_begin COND then IO_ACTION else_begin IO_ACTION

LET_BLOCK_IMPER ::= let indent LET_STMTS dedent

LET_STMTS ::= LET_STMTS sep DECL_TYPE
            | LET_STMTS sep DEF_VALUE
            | DECL_TYPE
            | DEF_VALUE

/// ////////////////// ///
///  FUNCTIONAL PART  ///
/// ////////////////// ///

FUNCT_PART ::= GLOBAL_STMTS

GLOBAL_STMTS ::= /* empty List of Statements section */
               | GLOBAL_STMTS STMT sep

/// STATEMENTS ///

STMT ::= DECL_TYPE
       | DEF_VALUE
       | DEF_FUNCT

DECL_TYPE ::= id cm DECL_TYPE
            | id clns TYPE

DEF_VALUE ::= id eq EXPR

/* no nullary functions */
DEF_FUNCT ::= id LFORMARG eq EXPR

LFORMARG ::= LFORMARG id
           | id

/* special expression management for boolean conditions */
COND ::= EXPR

/// EXPRESSIONS ///

EXPR ::= EXPR plus EXPR
       | EXPR minus EXPR
       | EXPR times EXPR
       | EXPR div EXPR
       | EXPR intdiv EXPR
       | EXPR rem EXPR
       | EXPR and EXPR
       | EXPR or EXPR
       | EXPR relnoteq EXPR
       | EXPR releq EXPR
       | EXPR relgt EXPR
       | EXPR relge EXPR
       | EXPR rellt EXPR
       | EXPR relle EXPR
       | EXPR index EXPR
       | ro EXPR rc
       | not EXPR
       | minus EXPR %prec uminus
       | LET_BLOCK_FUNC
       | IF_BLOCK_FUNC
       | FUNCT_CALL
       | VALUE
       | EXPR_LIST

FUNCT_CALL ::= id LACTARG

LET_BLOCK_FUNC ::= let indent LET_STMTS dedent in EXPR

IF_BLOCK_FUNC ::= if_begin COND then EXPR else_begin EXPR

/// args for function call ///

ACTARG ::= id
         | VALUE
         | ro EXPR rc

// list of input arguments for function call
LACTARG ::= /* empty list of args: we call a Value */
          | LACTARG ACTARG

/// VALUES ///

// Represents constants
VALUE ::= VALUE_BASIC

// VALUE_BASIC represents basic-type constants
VALUE_BASIC ::= val_int
              | val_double
              | val_bool
              | val_char
              | val_string

EXPR_LIST ::= bo LEXPR bc

LEXPR ::= EXPR
        | LEXPR cm EXPR

/// TYPES ///

TYPE ::= TYPE_VALUE
       | TYPE_FUNC

// both basic types and compound ones
TYPE_VALUE ::= TYPE_LIST
             | TYPE_BASIC

TYPE_BASIC ::= type_int
             | type_double
             | type_bool
             | type_char

TYPE_LIST ::= bo TYPE_BASIC bc
            | type_string

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE arrow TYPE_FUNC
            | TYPE_VALUE arrow TYPE_VALUE
</code>

To avoid Shift-Reduce conflicts, operators are given precedence, shown below:

<code>
/// ////////////////// ///
///  PRECEDENCE RULES ///
/// ////////////////// ///

precedence left or;
precedence left and;
precedence nonassoc releq, relnoteq, relgt, relge, rellt, relle;
precedence left plus, minus;
precedence left times, div, intdiv, rem;
precedence nonassoc index;
precedence nonassoc not, uminus;
</code>

We can notice that Haskell has some peculiarities, which are reflected in the Grammar above.

Any Haskell program can be split into two parts: a functional part and an imperative one; the latter corresponds to the main and is the only region of the program where it is possible to access I/O devices via IO Actions; in general, it is where we can modify the state of the machine. Instead, all functions in the first part are stateless: this property is fundamental for making Haskell a Purely Functional programming language and is also known as [[wp>Referential_transparency|Referential Transparency]].

To declare local variables, the language defines Let Bindings that can be placed inside the function body; here, they are represented by the ''LET_BLOCK_FUNC'' and ''LET_BLOCK_IMPER'' non-terminals.

Strangely, the If-Then-Else construct and the Let Bindings defined in the functional part are Expressions. In particular, If-Then-Else is an expression whose return value corresponds to that of the Then/Else Branch; since expressions always have one and only one type, the two branches must return a value of the same type. Instead, Let Bindings have the same type and return value as the expression following the ''in'' keyword.

In Haskell, the If-Then-Else construct imposes defining both branches; this is due to a particular rule of the language:
>> Expressions must always return something

The If-Then-Else construct is an expression, so it must always return something __regardless of the result of the condition__: for this reason, the Else branch can never be omitted.
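
A close analogue in mainstream languages is the conditional operator: it is an expression, so both branches are mandatory and must produce a value of one compatible type. A minimal Java illustration (not LHC code):

```java
public class IfExprSketch {
    public static void main(String[] args) {
        int x = -5;
        // like Haskell's if-then-else, ?: must yield a value of a single type,
        // so the "else" branch cannot be omitted
        int abs = (x >= 0) ? x : -x;
        System.out.println(abs); // prints 5
    }
}
```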

Calling a value is like calling a nullary function, which is in fact how Haskell interprets values.

An n-ary function, i.e. a function taking n arguments, is equivalent to a sequence of unary functions taking one argument at a time. This concept is known as [[wp>Currying|Currying]].
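
Currying can be mimicked in Java with nested function objects; a hypothetical sketch, not part of LHC:

```java
import java.util.function.Function;

public class CurrySketch {
    // a "binary" function as a chain of unary ones: Int -> (Int -> Int)
    static Function<Integer, Function<Integer, Integer>> add = x -> y -> x + y;

    public static void main(String[] args) {
        // applying one argument at a time: partial application comes for free
        Function<Integer, Integer> add3 = add.apply(3);
        System.out.println(add3.apply(4)); // prints 7
    }
}
```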

Finally, strings are lists of chars: there exists no Array type.
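
The semantics of element extraction (''!!'') on a String can thus be sketched by treating the string as a list of characters; this is an illustration in Java, not LHC code:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexSketch {
    // "abc" !! 1 == 'b': a String is a list of chars, and !! is list indexing
    static char index(String s, int i) {
        List<Character> chars = new ArrayList<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        return chars.get(i);
    }

    public static void main(String[] args) {
        System.out.println(index("abc", 1)); // prints b
    }
}
```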

=== Error Handling ===

The Parser is also able to recognize syntax errors and, in some cases, to recover from them. For this purpose, Cup allows using an ''error'' terminal that matches erroneous input inside a production.

Every time an error is detected, the Parser launches the ''report_error()'' method, which is overridden as follows:

<code java>
public void report_error(Object info) {
    noCompileErrors = false;
    System.err.print("ERROR: Syntax error");
    if (info instanceof Symbol)
        if (((Symbol) info).left != -1) {
            int line = (((Symbol) info).left);
            int column = (((Symbol) info).right);
            System.err.print(" (line " + line + ", column " + column + ")");
        } else System.err.print(":");
    else System.err.print(":");
}
</code>

The ''report_error()'' method prints the position of the Symbol that caused the error, when it is available.

We can observe that ''report_error()'' prints only the first part of the error message. \\
To complete the message, we need to wait until the Parser recovers from the error, i.e. reduces a rule containing the ''error'' terminal; the semantic action of that rule prints the rest of the message.

These are the additional rules, included in the compiler, to handle some syntax errors:
<code>
IMPER_PART ::= main eq error

IO_ACTIONS ::= IO_ACTIONS sep error

DO_BLOCK ::= do_begin indent error dedent

IF_BLOCK_IMPER ::=
    if_begin error then IO_ACTION else_begin IO_ACTION
  | if_begin COND then error else_begin IO_ACTION
  | if_begin COND error else_begin IO_ACTION

GLOBAL_STMTS ::= GLOBAL_STMTS error sep

LET_BLOCK_FUNC ::= let error in EXPR

IF_BLOCK_FUNC ::= if_begin error else_begin EXPR
  | if_begin COND then error else_begin EXPR
  | if_begin COND error else_begin EXPR
</code>

This is the list of possible syntax errors that LHC can recognize:
  * Error in Main
  * Error in Imperative Part Statement
  * Error in Do Block
  * Error in Condition of If-Then-Else block
  * Error in Then Block of If-Then-Else block
  * Missing ''then'' in If-Then-Else block
  * Error in Functional Statement
  * Error in Statement inside a Let Block

==== Semantic Analysis ====

Together with lexicon and syntax, semantics is one of the main components of a program. Thanks to semantics, it is possible to define context-sensitive information, such as types and scopes, that cannot be expressed by a context-free grammar alone.

**Note**: in the following rules, intermediate semantic actions are used.

In Cup, it is possible to define intermediate actions, like:

  A ::= x { something } y

which is equivalent to

  A ::= M y

  M ::= x { something }

In particular, Cup can define implicit rules for the markers and translate the grammar from the first representation to the second one; however, this may raise conflicts if the parser cannot be sure that it will reduce exactly that rule.

=== The typing system ===

Like any other programming language, Haskell has a typing system. It is very complex and uses very high-level concepts such as typeclasses; LHC implements only a small part of it.
In LHC, only pretty simple types are supported: Int, Double, Bool, Char, non-recursive Lists, Strings and Functions.
Each of these types has a standard representation as a type tree. Basic types are represented as a simple tree node, whose label corresponds to the name of the type itself. For aggregate types, it is different: Lists are polymorphic, so their type tree also holds the type of the elements; Functions are characterized by their arguments and their return type.
So, we can represent them as a binary tree where:
  * the root node is labelled as "function"
  * the left child holds the type of the first argument
  * the right child holds the type of the current function without the first argument: the tree depth is equal to the number of arguments
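
The binary-tree encoding of a function type can be sketched as follows; this uses a hypothetical ''TypeNode'' class, not the project's actual ''Type'' class:

```java
public class TypeTreeSketch {
    // minimal type-tree node: basic types are leaves, while "function" nodes
    // hold the first argument on the left and the rest of the type on the right
    static class TypeNode {
        final String label;
        final TypeNode left, right;
        TypeNode(String label) { this(label, null, null); }
        TypeNode(String label, TypeNode left, TypeNode right) {
            this.label = label; this.left = left; this.right = right;
        }
        // number of arguments = depth of the chain of "function" nodes
        int arity() {
            return label.equals("function") ? 1 + right.arity() : 0;
        }
    }

    public static void main(String[] args) {
        // Int -> Int -> Bool, i.e. function(Int, function(Int, Bool))
        TypeNode t = new TypeNode("function",
                new TypeNode("Int"),
                new TypeNode("function", new TypeNode("Int"), new TypeNode("Bool")));
        System.out.println(t.arity()); // prints 2
    }
}
```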

In this compiler, type trees are represented by the ''Type'' class, which contains:
  * constructors, to build both basic and compound type trees
  * ''getTypeName()'', returning the label of the root node
  * ''getTypeParam()'', returning the type tree of one of the children
  * ''isSubtype()'', checking whether the type belongs to a given family of types
  * ''isEquivalent()'', checking whether two types match
  * additional helper methods, omitted here

In the type map, two additional types are used:
  * ''
  * ''

=== The symbol table stack ===

During semantic analysis, one of the most important data structures is the symbol table: it contains all the information about declared values and functions, providing all those context-sensitive functionalities that empower the language.

Every symbol table is a hashmap containing many symbol table entries. Each symbol table entry has a name, a type and a flag to check whether it has already been assigned.

In this compiler, we do not use a unique symbol table; instead, we use a symbol table stack: in LHC, it is possible to define nested blocks with different contexts, e.g. nested let bindings.

The symbol table stack is implemented via the ''SymTableStack'' class, which provides methods such as ''pushSymTable()'', ''popSymTable()'', ''putEntry()'', ''getEntry()'', ''isDeclared()'', ''isLocallyDeclared()'', ''isAssigned()'', ''isGlobal()'' and ''createUniqueId()''.

The ''isAssigned'' flag of an entry can take two values:
  * ''false'', if the value/function has only been declared
  * ''true'', if it has also been assigned

Thanks to this flag, values and functions can be assigned only once. Single assignment is one peculiar characteristic of functional languages, where values and functions are intended in a mathematical way: if you define cos(x) as a certain function, why should you define it again?

Where is the entry name? Since a symbol table is a hashmap, we use the name of the entry as a key and a ''SymTableEntry'' object as its value.
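
The stack-of-hashmaps lookup can be sketched like this; it is a simplified model where names such as ''isLocallyDeclared'' mirror the methods used in the rules below, but the class itself is hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SymTableSketch {
    // each scope is a hashmap from entry name to its type name
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    void pushSymTable() { stack.push(new HashMap<>()); }
    void popSymTable() { stack.pop(); }
    void putEntry(String id, String type) { stack.peek().put(id, type); }

    // declared in the innermost scope only
    boolean isLocallyDeclared(String id) { return stack.peek().containsKey(id); }

    // declared in any enclosing scope: this is what gives lexical scoping
    boolean isDeclared(String id) {
        for (Map<String, String> scope : stack)
            if (scope.containsKey(id)) return true;
        return false;
    }

    public static void main(String[] args) {
        SymTableSketch s = new SymTableSketch();
        s.pushSymTable();            // global scope
        s.putEntry("x", "Int");
        s.pushSymTable();            // nested let binding
        s.putEntry("y", "Double");
        System.out.println(s.isDeclared("x") + " " + s.isLocallyDeclared("x"));
        // prints: true false
    }
}
```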

This is the information contained in a ''SymTableEntry'':
  * the type tree of the value/function, a ''Type'' object
  * the ''isAssigned'' flag
  * getters and setters, e.g. ''getType()'' and ''setIsAssigned()''
  * additional fields and methods for debug and IR Generation (explored in the IR Generation section)

In the Parser, the symbol table stack is allocated here:
<code java>
action code {:
    /* The Symbol Table Stack to be used */
    private SymTableStack symTableStack = new SymTableStack();
:};
</code>

=== The AST ===

The [[wp>Abstract_syntax_tree|Abstract Syntax Tree (AST)]] is a tree representation of the source code, where each node corresponds to a construct of the language. \\
The AST is very useful for top-down IR generation, as reported in the IR Generation section.

In this compiler, the AST nodes are defined as nested classes of the wrapper class ''AST''; the main ones are:
  * ''AST.Program'', the root of the tree
  * ''AST.FunctPart'' and ''AST.ImperPart'', for the two parts of the program
  * ''AST.IOAction'', ''AST.Print'' and ''AST.DoBlock'', for the imperative constructs
  * ''AST.Expr'', ''AST.ExprList'' and ''AST.FunctCall'', for expressions
  * ''AST.LetBlockFunc'', ''AST.LetBlockImper'', ''AST.IfBlockFunc'' and ''AST.IfBlockImper'', for Let Bindings and If-Then-Else blocks
  * ''AST.Stmt'', ''AST.DeclType'', ''AST.DefValue'' and ''AST.DefFunct'', for statements
  * ''AST.LFormArg'', for the list of formal arguments of a function
  * ''AST.Value'', for constants

Now that we have defined the AST nodes, we may ask how they are interconnected. It is rather simple: they are connected according to the production rules of the corresponding non-terminals, e.g.:
<code>
PROGRAM ::= indent FUNCT_PART IMPER_PART dedent
</code>

Here is the relation between non-terminals and AST nodes:

<code>
non terminal AST.Program PROGRAM;
non terminal AST.FunctPart FUNCT_PART;
non terminal AST.ImperPart IMPER_PART;
non terminal LinkedList<AST.IOAction> IO_ACTIONS;
non terminal AST.IOAction IO_ACTION;
non terminal AST.Print PRINT;
non terminal AST.Expr COND, EXPR;
non terminal AST.ExprList EXPR_LIST, LEXPR;
non terminal AST.FunctCall FUNCT_CALL;
non terminal AST.Expr ACTARG;
non terminal LinkedList<AST.Expr> LACTARG;
non terminal AST.LFormArg LFORMARG;
non terminal LinkedList<AST.Stmt> LET_STMTS, GLOBAL_STMTS;
non terminal AST.DoBlock DO_BLOCK;
non terminal AST.LetBlockFunc LET_BLOCK_FUNC;
non terminal AST.LetBlockImper LET_BLOCK_IMPER;
non terminal AST.IfBlockFunc IF_BLOCK_FUNC;
non terminal AST.IfBlockImper IF_BLOCK_IMPER;
non terminal AST.Stmt STMT;
non terminal AST.Stmt DEF_VALUE;
non terminal AST.DefFunct DEF_FUNCT;
non terminal AST.DeclType DECL_TYPE;
non terminal Type TYPE, TYPE_VALUE, TYPE_FUNC, TYPE_LIST, TYPE_BASIC;
non terminal AST.Value VALUE, VALUE_BASIC;
</code>

=== Semantic rules ===

After exploring the additional classes for semantic analysis, we can explore the semantic rules, at least the most interesting ones. In particular, we focus extensively on how the symbol table stack is used and much less on building the AST, which is rather straightforward. Error detection and recovery are covered in the section about error handling, so they are not repeated here.

== The start and the end of parsing ==

The ''PROGRAM'' non-terminal is the start symbol of the Grammar: its semantic actions push and pop the top-level symbol table and report the outcome of the compilation. \\
This is a code snippet of the semantic rules related to the ''PROGRAM'' non-terminal:

<code java>
PROGRAM ::= /* empty program */
    {:
        if (noCompileErrors) {
            [...] // Report successful compilation
        }
        RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart());
        [...] // Code generation
    :}
    | indent
    {: // push the top-level symtable
        symTableStack.pushSymTable();
    :}
    FUNCT_PART:funct_part IMPER_PART:imper_part
    {: // pop the top-level symtable
        symTableStack.popSymTable();
    :}
    dedent
    {:
        if (noCompileErrors) {
            [...] // Report successful compilation
            RESULT = new AST.Program(funct_part, imper_part);
            [...] // Code generation
        }
        else {
            [...] // Report failed compilation
            RESULT = new AST.Program(null, null);
        }
    :}
    ;
</code>

== The Print function ==

The ''PRINT'' rule checks that the type of the actual argument is printable (Int, Double, Char or String):

<code java>
PRINT ::= print ACTARG:actarg
    {:
        Type argType = actarg.getType();
        if (argType.isSubtype([...])) { // the argument must have a printable type
            RESULT = new AST.Print(actarg);
        }
        else {
            [...] // Error handling
        }
    :}
    ;
</code>

== Do blocks ==

A do block consists of a sequence of IO Actions; this construct also defines its own context, i.e. its own symbol table. Like for all other constructs that create symbol tables, the semantic rules both push and pop the symbol table, so that contexts are well defined.
Here is the code:

<code java>
DO_BLOCK ::=
    do_begin indent
    {:
        symTableStack.pushSymTable();
    :}
    IO_ACTIONS:io_actions dedent
    {:
        symTableStack.popSymTable();
        RESULT = new AST.DoBlock(io_actions);
    :}
    | [...] // additional rule for detecting syntax errors
    ;
</code>

== Declarations and definitions ==

Declarations and definitions are among the most important constructs: they allow us to declare and instantiate values. There is a difference between the two concepts:
  * Declarations only state that a value/function with a given name and type exists
  * Definitions actually bind a value to that name

There are three constructs for this purpose:
  * Type Declaration
  * Value definition
  * Function definition

Type Declarations are responsible for inserting new entries in the current symbol table; given an id and a type, the rule checks that the id has not already been declared in the current symbol table and that, in case it is a function, it does not return a list, as this is not possible in this language. Since the type is on the right of the rule, it is easy to propagate the type to multiple declarations using a synthesized attribute. \\
Here is the code:
<code java>
DECL_TYPE ::= id:id cm DECL_TYPE:decl_type
    {:
        Type type = decl_type.getType();
        // the id must not be declared yet and, if it is a function, it must not return a list
        if (!symTableStack.isLocallyDeclared(id) && !(type.isSubtype([...]))) {
            symTableStack.putEntry(id, [...]);
        }
        else [...] // Error handling
        RESULT = new AST.DeclType(new Type(type));
    :}
    | id:id clns TYPE:type
    {:
        if (!symTableStack.isLocallyDeclared(id) && !(type.isSubtype([...]))) {
            symTableStack.putEntry(id, [...]);
        }
        else [...] // Error handling
        RESULT = new AST.DeclType(type);
    :}
    ;
</code>

Value definitions allow defining local and global values. Once we recognize that we are parsing a value definition, an intermediate action marks the value as assigned, allowing recursion on the value, which is possible in Haskell. Before this, the action checks that the value is locally declared and not assigned yet. Finally, when the rule is reduced, the first check verifies that the type of the expression and the type of the value match; then, we check whether the value is global or local. In case it is global, the value is treated as a nullary function, as global values are here intended as functions with no arguments; we will see why during IR generation. Otherwise, it is treated as a value. \\
Here is the code:

<code java>
DEF_VALUE ::= id:id
    {:
        if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id) && !symTableStack.isAssigned(id)) {
            symTableStack.getEntry(id).setIsAssigned(true);
        }
        else [...] // Error handling
    :}
    eq EXPR:expr
    {:
        if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id) && symTableStack.isAssigned(id) && !expr.type.isEquivalent(symTableStack.getEntry(id).getType().getTypeName())) {
            [...] // Error handling
        }
        if (symTableStack.isGlobal(id)) {
            // Global Values are actually Nullary Functions
            ArrayList<String> argNames = new ArrayList<>();
            ArrayList<Type> argTypes = new ArrayList<>();
            AST.LFormArg lformarg = new AST.LFormArg(argNames, argTypes);
            RESULT = new AST.DefFunct(id, lformarg, expr);
        }
        else {
            RESULT = new AST.DefValue(symTableStack.createUniqueId(id), expr);
        }
    :}
    ;
</code>

Finally, function definitions allow defining functions. As soon as we recognize that we are parsing a function definition, an intermediate action marks the function as assigned, allowing recursion.
Before that, the action checks that the function is locally declared and not assigned yet. Functions define their own context, i.e. their own symbol table, so we need to push a symbol table before parsing the rest of the function. At the end, the symbol table is popped and the AST node for the function definition is created.

<code java>
DEF_FUNCT ::= id:id
        {:
            if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id) && !symTableStack.isAssigned(id)) {
                symTableStack.getEntry(id).setIsAssigned(true);
            }
            else [...] // Error handling
            symTableStack.pushSymTable();
        :}
    LFORMARG:lformarg eq EXPR:expr
        {:
            [...] // Error handling
            symTableStack.popSymTable();
            RESULT = new AST.DefFunct(symTableStack.createUniqueId(id), lformarg, expr);
        :}
    ;
</code>
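
The declare-before-define discipline used by both rules is what makes recursion possible: by marking the entry as assigned *before* the body is parsed, any use of the function inside its own body passes the "declared and assigned" check. The sketch below illustrates the mechanism with a minimal single-scope table; the class and method names are illustrative, not LHC's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a single-scope symbol table with the
// declared/assigned distinction used by DEF_FUNCT and DEF_VALUE.
public class RecursionCheck {
    static class Entry { boolean assigned = false; }
    private final Map<String, Entry> table = new HashMap<>();

    public void declare(String id) { table.put(id, new Entry()); }

    // Called by the intermediate action, BEFORE the body is parsed.
    public boolean markAssigned(String id) {
        Entry e = table.get(id);
        if (e == null || e.assigned) return false; // undeclared or doubly defined
        e.assigned = true;
        return true;
    }

    // A use inside the body succeeds only if the id is declared AND assigned.
    public boolean canCall(String id) {
        Entry e = table.get(id);
        return e != null && e.assigned;
    }

    public static void main(String[] args) {
        RecursionCheck st = new RecursionCheck();
        st.declare("fact");                     // fact :: Int -> Int
        st.markAssigned("fact");                // intermediate action fires here
        System.out.println(st.canCall("fact")); // recursive call in the body: true
    }
}
```

Had the flag been set only on reduction of the whole rule, the recursive call inside the body would have been rejected as "declared but not assigned".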

To handle formal arguments, represented by the ''LFORMARG'' non-terminal, the parser walks the (curried) type declared for the function: each argument takes the first type parameter of the propagated function type, the remaining type is propagated to the next argument, and an argument must not be already declared in the local scope. \\
Here is the code:
<code java>
LFORMARG ::= LFORMARG:lformarg id:id
        {:
            Type propType = lformarg.getPropType();
            if (!symTableStack.isLocallyDeclared(id) && propType != null && propType.isSubtype("Function")) {
                Type argTree = new Type(propType.getTypeParam(0));
                symTableStack.putEntry(id, argTree);
                lformarg.getArgNames().add(symTableStack.createUniqueId(id));
                lformarg.getArgTypes().add(argTree);
                RESULT = new AST.LFormArg(lformarg.getArgNames(), lformarg.getArgTypes(), propType.getTypeParam(1));
            }
            else [...] // Error handling
        :}
    | id:id
        {:
            String funcName = (String) parser.stack(-2);
            if (symTableStack.isDeclared(funcName)) {
                Type type = symTableStack.getEntry(funcName).getType();
                if (!symTableStack.isLocallyDeclared(id) && type != null && type.isSubtype("Function")) {
                    Type argTree = new Type(type.getTypeParam(0));
                    symTableStack.putEntry(id, argTree);
                    ArrayList<String> argNames = new ArrayList<>();
                    ArrayList<Type> argTypes = new ArrayList<>();
                    argNames.add(symTableStack.createUniqueId(id));
                    argTypes.add(argTree);
                    RESULT = new AST.LFormArg(argNames, argTypes, type.getTypeParam(1));
                }
                else [...] // Error handling
            }
            else [...] // Error handling
        :}
    ;
</code>

== Expressions and conditions ==

Expressions are fundamental for performing computation, and they come in two flavours: ordinary operations (arithmetic, logical, relational) and special constructs. \\
In particular, the special constructs are:
  * let bindings;
  * if-then-else blocks;
  * function calls;
  * constants;
  * lists of expressions.

For all these expressions, dedicated semantic rules and AST nodes exist, since each of them carries its own additional information to check and store. \\
Instead, basic expressions are all handled in the same manner: we check the types of the operands and then produce the type of the result. The type is the only information we can use for semantic analysis in this case.

Conditions are special expressions, whose type is required to be Bool: they appear, for instance, as the guard of if-then-else blocks.

Example of operation: addition.

<code java>
EXPR ::= EXPR:expr1 plus EXPR:expr2
        {:
            String opType = "Addition"; // used to build the error message
            AST.BasicExpr[] subExpressions = { expr1, expr2 };
            if (expr1.type.isSubtype("Number") && expr2.type.isSubtype("Number") && expr1.type.isEquivalent(expr2.type)) {
                RESULT = new AST.Expr(new Type(expr1.type), AST.ExprKind.PLUS, subExpressions);
            }
            else [...] // Error handling
        :}
</code>

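Stripped of the parser plumbing, the check this rule performs is tiny: both operands must share a numeric type, and that type is also the type of the result. A self-contained sketch of just that check (type names as plain strings, purely illustrative):

```java
// Illustrative sketch of the type check behind EXPR plus EXPR:
// both operands must share a numeric type, which becomes the result type.
public class AddTypeCheck {
    public static String resultType(String lhs, String rhs) {
        boolean numeric = lhs.equals("Int") || lhs.equals("Double");
        if (numeric && lhs.equals(rhs)) return lhs; // type of the result
        return null;                                // semantic error
    }

    public static void main(String[] args) {
        System.out.println(resultType("Int", "Int"));    // Int
        System.out.println(resultType("Int", "Double")); // null: mismatching operands
    }
}
```

Note there is no implicit ''Int''/''Double'' promotion at this level: mixing the two operand types is reported as an error, exactly as the error list below states.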

== Function calls ==

In this compiler, function calls refer both to functions and to values, as in Haskell.
The semantic rule proceeds as follows: first, we check that the function or value is declared and assigned; then, depending on the number of actual arguments, we treat the call either as a value call (zero arguments) or as a function call, checking each argument against the corresponding parameter of the (curried) function type; the residual type becomes the type of the call.

<code java>
FUNCT_CALL ::= id:id LACTARG:lactarg
        {:
            int nArgs = lactarg.size();
            Type funcType = null;
            boolean failedArgsTypeChecking = false;
            if (symTableStack.isDeclared(id) && symTableStack.isAssigned(id)) {
                funcType = symTableStack.getEntry(id).getType();
                // This is a Value Call
                if (nArgs == 0) {
                    if (funcType != null && !funcType.isSubtype("Function")) {
                        RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                    }
                    else [...] // Error handling
                }
                // This is a Function Call
                else if (nArgs > 0) {
                    if (funcType != null && funcType.isSubtype("Function")) {
                        for (AST.Expr actarg : lactarg) {
                            if (!actarg.type.isEquivalent(funcType.getTypeParam(0)))
                                failedArgsTypeChecking = true;
                            funcType = funcType.getTypeParam(1);
                        }
                        if (!failedArgsTypeChecking) {
                            RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                        }
                        else [...] // Error handling
                    }
                    else [...] // Error handling
                }
            }
            else [...] // Error handling
        :}
    ;
</code>
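
Because Haskell function types are curried (''Int -> Int -> Bool'' really is ''Int -> (Int -> Bool)''), the loop above consumes one argument at a time: it compares each actual argument with the first type parameter and then descends into the second. The sketch below models that walk with a simplified stand-in for LHC's type trees (not the real ''Type'' class):

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for LHC's type trees: either a basic type
// or a curried "Function" node with exactly two type parameters.
public class CurriedCall {
    final String name;          // "Int", "Bool", ... or "Function"
    final CurriedCall[] params;

    CurriedCall(String name, CurriedCall... params) {
        this.name = name;
        this.params = params;
    }

    boolean isFunction() { return name.equals("Function"); }

    // Apply actual argument types one by one, as the FUNCT_CALL rule does.
    static CurriedCall apply(CurriedCall funcType, List<String> argTypes) {
        for (String arg : argTypes) {
            if (!funcType.isFunction()) return null;               // too many arguments
            if (!funcType.params[0].name.equals(arg)) return null; // wrong argument type
            funcType = funcType.params[1];                         // peel one parameter
        }
        return funcType; // the residual type is the type of the call
    }

    public static void main(String[] args) {
        CurriedCall INT = new CurriedCall("Int"), BOOL = new CurriedCall("Bool");
        // Int -> Int -> Bool, curried: Function(Int, Function(Int, Bool))
        CurriedCall f = new CurriedCall("Function", INT,
                            new CurriedCall("Function", INT, BOOL));
        System.out.println(apply(f, Arrays.asList("Int", "Int")).name); // Bool
    }
}
```

This also explains how the arity check falls out for free: supplying too many arguments eventually hits a non-Function residual type, which is exactly the "mismatching arity" error.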

An important point: the ''symTableStack'' keeps all the enclosing scopes visible, so an expression can freely use values defined in outer let bindings. \\
This means that the following Haskell code works:

<code haskell>
value :: Int
value = let x :: Int; x = 3
        in
          let y :: Int; y = 4
          in
            x + y
</code>

== Let bindings (functional part) ==

Let bindings exist for one purpose: declaring and defining new values in a nested context. For this reason, the semantic rules only create and destroy the symbol table where the nested statements are stored.

<code java>
LET_BLOCK_FUNC ::= let indent
        {:
            symTableStack.pushSymTable();
        :}
    LLETSTMT:letStmts dedent in EXPR:expr
        {:
            symTableStack.popSymTable();
            RESULT = new AST.LetBlockFunc(new Type(expr.type), letStmts, expr);
        :}
    | [...] // Other variants of the rule
    ;
</code>
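
The push/pop pair is the whole trick: each let introduces a scope, and lookups walk the stack from the innermost table outwards, which is what makes the nested-let example above work. A minimal sketch of such a scope stack (illustrative names, mapping identifiers to type names):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative scope stack: lookups search from the innermost table outwards.
public class ScopeStack {
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    public void push() { stack.push(new HashMap<>()); }
    public void pop()  { stack.pop(); }
    public void put(String id, String type) { stack.peek().put(id, type); }

    public String lookup(String id) {
        for (Map<String, String> scope : stack)   // innermost first
            if (scope.containsKey(id)) return scope.get(id);
        return null; // not declared
    }

    public static void main(String[] args) {
        ScopeStack st = new ScopeStack();
        st.push(); st.put("x", "Int");      // outer let
        st.push(); st.put("y", "Int");      // inner let
        System.out.println(st.lookup("x")); // outer value still visible: Int
        st.pop();
        System.out.println(st.lookup("y")); // inner scope gone: null
    }
}
```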

== The if-then-else block (functional part) ==

As for let bindings, if-then-else blocks are rather simple to analyze: the parser only needs to check that the types of the values returned by the two branches are the same; otherwise, it is an error.

<code java>
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else EXPR:elseBody
        {:
            if (thenBody.type.isEquivalent(elseBody.type)) {
                RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
            }
            else [...] // Error handling
        :}
    | [...] // Other variants of the rule
    ;
</code>

== Building types ==

How are type trees built? By mounting the subtrees according to the production rules, as the code clearly shows.

<code java>
TYPE ::= TYPE_VALUE:type {: RESULT = type; :}
    | TYPE_FUNC:type {: RESULT = type; :}
    ;

// both basic types and compound
TYPE_VALUE ::= TYPE_LIST:type {: RESULT = type; :}
    | TYPE_BASIC:type {: RESULT = type; :}
    ;

TYPE_BASIC ::= type_int {: RESULT = Type.getType("Int"); :}
    | type_double {: RESULT = Type.getType("Double"); :}
    | type_bool {: RESULT = Type.getType("Bool"); :}
    | type_char {: RESULT = Type.getType("Char"); :}
    ;

TYPE_LIST ::= bo TYPE_BASIC:elemType bc
        {:
            Type type = Type.getType("List");
            type.setTypeParam(0, elemType);
            RESULT = type;
        :}
    | type_string {: RESULT = Type.getType("String"); :}
    ;

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE:type1 arrow TYPE_FUNC:type2
        {:
            Type type = Type.getType("Function");
            type.setTypeParam(0, type1);
            type.setTypeParam(1, type2);
            RESULT = type;
        :}
    | TYPE_VALUE:type1 arrow TYPE_VALUE:type2
        {:
            Type type = Type.getType("Function");
            type.setTypeParam(0, type1);
            type.setTypeParam(1, type2);
            RESULT = type;
        :}
    ;
</code>

=== Error handling ===

As with the lexicon and the syntax, the semantics of a program can be wrong: these errors must be handled, too.

Errors are recognized inside the semantic rules, so error detection and recovery are defined inside the Parser.

Every time an error is detected, the parser calls the ''report_semantic_error'' method, which records that compilation failed and prints a message:
<code java>
public void report_semantic_error(String msg) {
    noCompileErrors = false;
    System.err.print("Semantic error: ");
    System.err.println(msg);
}
</code>
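
The key design choice is flag-and-continue: the rule reports the error, then still produces *some* node, so one semantic error does not abort the rest of the analysis. A sketch of that behaviour for the if-then-else check (illustrative class, not the generated parser's real API):

```java
// Illustrative flag-and-continue recovery: report the error,
// then return a placeholder type so parsing can proceed.
public class Recovery {
    private boolean noCompileErrors = true;

    public void reportSemanticError(String msg) {
        noCompileErrors = false;
        System.err.println("Semantic error: " + msg);
    }

    // Returns the common branch type, or a dummy one after reporting.
    public String checkBranches(String thenType, String elseType) {
        if (thenType.equals(elseType)) return thenType;
        reportSemanticError("Multiple return types for if-then-else");
        return "Int"; // dummy recovery type, keeps the AST well-formed
    }

    public boolean compiledCleanly() { return noCompileErrors; }

    public static void main(String[] args) {
        Recovery r = new Recovery();
        System.out.println(r.checkBranches("Int", "Bool")); // Int (recovered)
        System.out.println(r.compiledCleanly());            // false
    }
}
```

Because the flag survives to the end of parsing, code generation can simply be skipped when ''compiledCleanly()'' is false, which is exactly what the ''PROGRAM'' rule does later.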

Then, the parser creates a valid AST node for the production rule anyway: in this way it recovers from the error and can move forward.

Example: semantic error in a functional if-then-else block:

<code java>
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else EXPR:elseBody
        {:
            if (thenBody.type.isEquivalent(elseBody.type)) {
                RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
            }
            else {
                report_semantic_error("Multiple return types for if-then-else");
                RESULT = new AST.IfBlockFunc(Type.getType("Int"), cond, thenBody, elseBody); // dummy type to recover
            }
        :}
    | [...] // Other variants of the rule
</code>

This is the list of all the possible semantic errors that LHC can recognize:
  * Print does not support the argument type
  * No support for local functions
  * Global lists not supported
  * Lists as a return value are not supported
  * Missing value declaration
  * Missing function declaration
  * Multiple value declaration
  * Value is not locally declared
  * Function is not locally declared
  * Multiple value assignment
  * Multiple function assignment
  * Mismatching type on assignment
  * Not a function (but you expect one)
  * Not a value (but you expect one)
  * Mismatching arity in function (wrong number of arguments)
  * Return type is not equivalent to the expression type (in Function Definition)
  * Multiple argument declaration
  * Wrong type in condition (Expected type: Bool)
  * Wrong type in LHS/RHS of [expression_kind] expression ([expression_kind] stands for the kind of operation involved, e.g. "Addition")
  * Mismatching operand type on [expression_kind] Expression
  * Wrong type in arguments applied to function
  * Value declared but not assigned (when using a non-initialized variable)
  * Multiple return types for if-then-else
  * Wrong type in list expression (subexpressions in a list expression must have the same type)

==== IR Generation ====

After lexical, syntactic and semantic analysis, we can assume that the program is correct, so we can proceed with the next step: generating the intermediate representation.
For this purpose, we need additional fields and methods, especially in the AST.

Besides its type information, every AST node now carries what it needs for code generation:
  * a ''code'' list, holding the IR lines generated for the node;
  * a ''codegen()'' method, which generates the IR of the node and returns the name of the register holding its result;
  * a ''mountCode()'' method, which appends the code of a child node to the code of the current node;
  * a ''getCode()'' method, to retrieve the generated code at the end.

The ''codegen'' method works bottom-up: it first invokes ''codegen'' on the children, mounts their code, and only then emits the instructions of the node itself. \\
The ''LLVM'' helper class provides static methods (''createAlloca'', ''createStore'', ''createBranchCond'', ...) that each build a single LLVM instruction as a string, together with the counters used to generate fresh SSA registers and labels (''getCounterSSA'', ''getCounterIf'').

In this section, we focus on the same constructs seen in the previous section, this time with respect to IR generation.
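
The interplay of ''codegen'' and ''mountCode'' can be sketched in a few lines: each node owns its list of IR lines, and mounting a child splices the child's lines into the parent before the parent adds its own (a simplified model, not LHC's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the codegen/mountCode pattern: every node owns
// a list of IR lines; mountCode splices a child's lines into the parent.
public class NodeModel {
    final List<String> code = new ArrayList<>();

    void mountCode(NodeModel child) { code.addAll(child.code); }

    public static void main(String[] args) {
        NodeModel expr = new NodeModel();       // child: computes x + y
        expr.code.add("%1 = add i32 %x, %y");

        NodeModel stmt = new NodeModel();       // parent: stores the result
        stmt.mountCode(expr);                   // child code first...
        stmt.code.add("store i32 %1, i32* %z"); // ...then the node's own line
        System.out.println(stmt.code);
    }
}
```

This is why the final IR comes out in the right order without any global emission buffer: each parent decides exactly where its children's code lands.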

== The start and the end of parsing ==

The ''Program'' node is the root of the AST: its ''codegen'' generates and mounts the code of the functional part first, then the code of the imperative part.

<code java>
public static class Program extends BasicNode {
    private FunctPart functPart;
    private ImperPart imperPart;

    Program(FunctPart functPart, ImperPart imperPart) {
        this.functPart = functPart;
        this.imperPart = imperPart;
    }

    @Override
    String codegen() {
        functPart.codegen();
        mountCode(functPart);
        imperPart.codegen();
        mountCode(imperPart);
        return null;
    }
}
</code>

== The Print function ==

The ''Print'' node first generates and mounts the code that evaluates its argument; then it creates a fresh register and emits the call to the print routine, passing the argument's register.

<code java>
public static class Print extends IOAction {
    private Expr actarg;

    Print(Expr actarg) {
        this.actarg = actarg;
    }

    @Override
    String codegen() {
        String actargIndex = actarg.codegen();
        mountCode(actarg);
        String result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPrintCall(result, actargIndex, actarg.type));
        return null;
    }
}
</code>

== Do blocks ==

Do blocks are lists of statements, so it is sufficient to generate the code of all the statements composing the block.

<code java>
public static class DoBlock extends IOAction {
    private LinkedList<DoStmt> ioActions;

    DoBlock(LinkedList<DoStmt> ioActions) {
        this.ioActions = ioActions;
    }

    @Override
    String codegen() {
        for (DoStmt ioAction: ioActions) {
            ioAction.codegen();
            mountCode(ioAction);
        }
        return null;
    }
}
</code>

== Declarations and definitions ==

Type declarations do not generate code: since in Haskell we cannot use a value or function before declaring it, declarations matter only during semantic analysis.

Value definitions are rather simple: first, we evaluate the expression to assign; then we allocate the value and store the expression result in it. Values here are intended as __local__: global values are treated as nullary functions, so their definition generates a function instead.

<code java>
public static class DefValue extends Stmt {
    private String id; // must be a "unique id"
    private Expr expr;

    DefValue(String id, Expr expr) {
        this.id = id;
        this.expr = expr;
    }

    @Override
    String codegen() {
        String exprIndex = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createAlloca(LLVM.createRegister(id), expr.type));
        code.add(LLVM.createStore(LLVM.createRegister(id), exprIndex, expr.type));
        return null;
    }
}
</code>
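
As a sketch, drastically simplified versions of the two helpers used above could render the instructions like this (the real ''LLVM'' class signatures are not shown in full on this page, so the bodies below are assumptions; only ''Int''/''Double'' are mapped):

```java
// Hypothetical, simplified renderings of LLVM.createAlloca/createStore:
// map a Haskell type name to an LLVM type and build the text lines.
public class DefValueIR {
    static String llvmType(String haskellType) {
        return haskellType.equals("Int") ? "i32"
             : haskellType.equals("Double") ? "double" : "i8";
    }

    static String createAlloca(String reg, String type) {
        return reg + " = alloca " + llvmType(type);
    }

    static String createStore(String reg, String value, String type) {
        String t = llvmType(type);
        return "store " + t + " " + value + ", " + t + "* " + reg;
    }

    public static void main(String[] args) {
        // x :: Int; x = 3  -->  allocate %x, then store the expression result
        System.out.println(createAlloca("%x", "Int")); // %x = alloca i32
        System.out.println(createStore("%x", "3", "Int")); // store i32 3, i32* %x
    }
}
```

Building instructions as plain strings keeps the code generator simple at the cost of no structural checking; LLVM itself validates the result when the ''.ll'' file is run with ''lli''.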

Function definitions define global functions. To generate a function we need to:
  * create its definition (return type, name and arguments);
  * open the function; the first basic block of a function is conventionally labelled ''entry'';
  * store all the values of the arguments and give them their names as in the native Haskell code;
  * generate the code of the body;
  * create the return statement;
  * close the function.

All these steps can be recognized in the code:

<code java>
public static class DefFunct extends Stmt {
    private String id; // must be a "unique id"
    private LFormArg lformarg;
    private Expr expr;

    DefFunct(String id, LFormArg lformarg, Expr expr) {
        this.id = id;
        this.lformarg = lformarg;
        this.expr = expr;
    }

    @Override
    String codegen() {
        ArrayList<String> argNames = lformarg.getArgNames();
        ArrayList<Type> argTypes = lformarg.getArgTypes();
        LLVM.resetCounterSSA();
        code.add(LLVM.createFunctionDefinition(id, expr.type, argNames, argTypes));
        code.addAll(LLVM.openFunction());
        setBasicBlock("entry");
        for (int i = 0; i < argNames.size(); i++) {
            code.add(LLVM.createAlloca(LLVM.createRegister(argNames.get(i)), argTypes.get(i)));
            code.add(LLVM.createStore(LLVM.createRegister(argNames.get(i)), LLVM.createVariable(i), argTypes.get(i)));
            LLVM.getCounterSSA(); // keep the SSA counter in sync with the implicit argument registers
        }
        String result = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createReturn(result, expr.type));
        code.add(LLVM.closeFunction());
        return null;
    }
}
</code>

== Expressions and conditions ==

To evaluate an expression, it is necessary to evaluate its subexpressions first; their result registers are stored in an array. Then, we can build the code for the expression itself: first, we create a new register for the result; then, we create the operation instruction; finally, we use the result register as the return value of the ''codegen'' method. \\
The same distinction between operations and special constructs still holds during code generation. While operations build code using LLVM instructions for arithmetic and more complex operations, special constructs are managed on their own. \\
Conditions are boolean expressions, so they are generated exactly like the other expressions.

<code java>
public enum ExprKind {
    PLUS, MINUS, TIMES, DIV, INTDIV, REM, AND, OR,
    RELNOTEQ, RELEQ, RELGE, RELGT, RELLE, RELLT,
    INDEX, NOT, UMINUS, LET_BLOCK_FUNC,
    IF_BLOCK_FUNC, FUNCT_CALL, VALUE, EXPR_LIST
}

public static class Expr extends BasicExpr {

    private ExprKind exprKind;
    private BasicExpr[] subExpressions;

    Expr (Type type, ExprKind exprKind, BasicExpr[] subExpressions) {
        this.type = type;
        this.exprKind = exprKind;
        this.subExpressions = subExpressions;
    }

    @Override
    String codegen() {
        String[] subExpResults = new String[subExpressions.length];
        String result = null;
        for (int i = 0; i < subExpressions.length; i++) {
            subExpResults[i] = subExpressions[i].codegen();
            mountCode(subExpressions[i]);
        }
        switch (exprKind) {
            case PLUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAdd(result, subExpResults[0], subExpResults[1], type));
                break;
            case MINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createSub(result, subExpResults[0], subExpResults[1], type));
                break;
            case TIMES:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createMult(result, subExpResults[0], subExpResults[1], type));
                break;
            case DIV:
                // fdiv needs floating-point operands: promote Int operands first
                String[] oldExpResults = new String[subExpressions.length];
                for (int i = 0; i < subExpressions.length; i++) {
                    oldExpResults[i] = subExpResults[i];
                    if (subExpressions[i].getType().isEquivalent("Int")) {
                        subExpResults[i] = LLVM.createVariable(LLVM.getCounterSSA());
                        code.add(LLVM.createSItoFP(subExpResults[i], oldExpResults[i]));
                    }
                }
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createDiv(result, subExpResults[0], subExpResults[1], type));
                break;
            case INTDIV:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createIntDiv(result, subExpResults[0], subExpResults[1], type));
                break;
            case REM:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRem(result, subExpResults[0], subExpResults[1], type));
                break;
            case AND:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAnd(result, subExpResults[0], subExpResults[1]));
                break;
            case OR:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createOr(result, subExpResults[0], subExpResults[1]));
                break;
            case RELNOTEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelNotEQ(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case RELEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelEQ(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case RELGE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGE(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case RELGT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGT(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case RELLE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLE(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case RELLT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLT(result, subExpResults[0], subExpResults[1], subExpressions[0].getType()));
                break;
            case INDEX:
                String elemPtr = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createGEP(elemPtr, subExpResults[0], subExpResults[1], type));
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createLoad(result, elemPtr, type));
                break;
            case NOT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createNot(result, subExpResults[0]));
                break;
            case UMINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createUMinus(result, subExpResults[0], type));
                break;
            // special constructs already generated their own code:
            // just forward the result register
            case LET_BLOCK_FUNC:
            case IF_BLOCK_FUNC:
            case FUNCT_CALL:
            case VALUE:
            case EXPR_LIST:
                result = subExpResults[0];
                break;
        }
        return result;
    }
}
</code>
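
The ''DIV'' case deserves a note: LLVM's ''fdiv'' works on floating-point operands, so any ''Int'' operand is first converted with ''sitofp''. The sketch below isolates that promotion step (the helper names are illustrative, not the real ''LLVM'' class):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Int->Double promotion performed by the DIV case:
// every i32 operand is converted with sitofp before the fdiv.
public class DivPromotion {
    static int counter = 0;
    static String fresh() { return "%" + (++counter); }

    static String emitDiv(List<String> out, String[] ops, String[] types) {
        for (int i = 0; i < ops.length; i++) {
            if (types[i].equals("Int")) {      // this operand needs promotion
                String conv = fresh();
                out.add(conv + " = sitofp i32 " + ops[i] + " to double");
                ops[i] = conv;                 // use the converted register instead
            }
        }
        String result = fresh();
        out.add(result + " = fdiv double " + ops[0] + ", " + ops[1]);
        return result;
    }

    public static void main(String[] args) {
        List<String> code = new ArrayList<>();
        emitDiv(code, new String[]{"%a", "%b"}, new String[]{"Int", "Double"});
        code.forEach(System.out::println);
    }
}
```

Keeping the original registers around (''oldExpResults'' in the real code) matters because the conversion must read the unconverted value while the division reads the converted one.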

== Function calls ==

To perform a function call, we first evaluate all its arguments. If the callee is actually a value, we create a register, load the value's content into it and return that register. Otherwise, we create a register for the result, emit the function call with the function name and all its arguments one after the other, and finally return the result register.

<code java>
public static class FunctCall extends BasicExpr {
    private String id; // must be a unique id
    private LinkedList<Expr> actargs;

    FunctCall(Type type, String id, LinkedList<Expr> actargs) {
        this.type = type;
        this.id = id;
        this.actargs = actargs;
    }

    @Override
    String codegen() {
        ArrayList<String> argIds = new ArrayList<>();
        ArrayList<Type> argTypes = new ArrayList<>();
        String result;
        for (Expr actarg: actargs) {
            argIds.add(actarg.codegen());
            argTypes.add(actarg.type);
            mountCode(actarg);
        }
        if (id.contains([...])) { // the unique id tells local values apart from global functions
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createLoad(result, LLVM.createRegister(id), type));
        }
        else {
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createFunctionCall(result, id, type, argIds, argTypes));
        }
        return result;
    }
}
</code>

__About lexical scoping__: if we have two values with the same name, how can we refer to the right one? We may distinguish them by observing in which symbol table the value is defined, but symbol tables are destroyed when parsing terminates. So, ''createUniqueId'' appends to the name of a value an identifier of the symbol table that owns it, encoding the scope information directly in the name.

When we reduce a function call, such a unique id is stored in the AST node of the expression, thus preserving both the scope information and name uniqueness. In this way, the unique identifier can be used during code generation.
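
A sketch of the unique-id idea: embed the index of the owning symbol table in the generated name, so two same-named values in different scopes map to distinct registers. The ''name_index'' format shown here is illustrative, not necessarily LHC's exact scheme:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative unique-id scheme: append the index of the owning
// symbol table, so shadowed names map to distinct registers.
public class UniqueIds {
    private final List<Map<String, String>> tables = new ArrayList<>();

    public void push() { tables.add(new HashMap<>()); }
    public void pop()  { tables.remove(tables.size() - 1); }

    public void declare(String id) {
        int scope = tables.size() - 1;
        tables.get(scope).put(id, id + "_" + scope);
    }

    // Resolve from the innermost scope outwards.
    public String uniqueId(String id) {
        for (int i = tables.size() - 1; i >= 0; i--)
            if (tables.get(i).containsKey(id)) return tables.get(i).get(id);
        return null;
    }

    public static void main(String[] args) {
        UniqueIds st = new UniqueIds();
        st.push(); st.declare("x");           // outer x -> x_0
        st.push(); st.declare("x");           // inner x -> x_1 (shadows x_0)
        System.out.println(st.uniqueId("x")); // x_1
        st.pop();
        System.out.println(st.uniqueId("x")); // x_0
    }
}
```

Since the unique id is computed while the symbol tables still exist and then frozen into the AST, code generation never needs the tables at all.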

== Let bindings (functional part) ==

A let binding consists of a sequence of statements, together with a final expression. So, to generate its code we need to generate all the statements and then evaluate the expression.

<code java>
public static class LetBlockFunc extends BasicExpr {
    private LinkedList<Stmt> letStmts;
    private Expr expr;

    LetBlockFunc(Type type, LinkedList<Stmt> letStmts, Expr expr) {
        this.type = type;
        this.letStmts = letStmts;
        this.expr = expr;
    }

    @Override
    String codegen() {
        for (Stmt letStmt: letStmts) {
            letStmt.codegen();
            mountCode(letStmt);
        }
        String result = expr.codegen();
        mountCode(expr);
        return result;
    }
}
</code>

== If-then-else block (functional part) ==

The if-then-else block is one of the most peculiar in terms of IR generation, because it exploits the basic block concept, which is used in LLVM to define the control flow graph. \\
Before generating code, we fetch an index for this if-then-else block: every if-then-else has a different index, so homonymy conflicts on the labels are impossible.
To generate an if-then-else block, we evaluate its condition first. Then, we create a conditional branch depending on the condition result. After this, the body of the then branch can be generated. Since we define a new label, we enter a new basic block; for this reason, we call ''setBasicBlock'' to keep track of the current one. \\
Then, we create an unconditional branch towards the end of the block, to skip the else expression. At this point, we can generate the else expression and branch to the exit of the block: this latter step is mandatory because in LLVM every basic block must end with either a branch or a return statement. \\
Finally, we create a register for the result and define a phi node: the phi node is a sort of merge operator which takes the results of the two branches and selects which one to consider depending on which branch was actually executed. For phi nodes to work correctly, it is mandatory to record the exact basic block where the result of each branch is defined, which explains the calls to ''getBasicBlock'' right after generating each branch.

<code java>
public static class IfBlockFunc extends BasicExpr {
    private Expr cond, thenBody, elseBody;

    IfBlockFunc(Type type, Expr cond, Expr thenBody, Expr elseBody) {
        this.type = type;
        this.cond = cond;
        this.thenBody = thenBody;
        this.elseBody = elseBody;
    }

    @Override
    String codegen() {
        final int ifIndex;
        String condIndex, thenIndex, elseIndex, thenLabel, elseLabel, exitLabel, thenBB, elseBB, result;
        ifIndex = LLVM.getCounterIf();
        thenLabel = "then" + ifIndex;
        elseLabel = "else" + ifIndex;
        exitLabel = "exit" + ifIndex;
        condIndex = cond.codegen();
        mountCode(cond);
        code.add(LLVM.createBranchCond(condIndex, thenLabel, elseLabel));
        code.add(LLVM.createLabel(thenLabel));
        setBasicBlock(thenLabel);
        thenIndex = thenBody.codegen();
        mountCode(thenBody);
        thenBB = getBasicBlock();
        code.add(LLVM.createBranchNotCond(exitLabel));
        code.add(LLVM.createLabel(elseLabel));
        setBasicBlock(elseLabel);
        elseIndex = elseBody.codegen();
        mountCode(elseBody);
        elseBB = getBasicBlock();
        code.add(LLVM.createBranchNotCond(exitLabel));
        code.add(LLVM.createLabel(exitLabel));
        setBasicBlock(exitLabel);
        result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPHINode(result, type, thenIndex, thenBB, elseIndex, elseBB));
        return result;
    }
}
</code>
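
Putting the pieces together, the control-flow skeleton emitted for one if-then-else with index ''N'' looks like the sketch below; the essential part is that the ''phi'' pairs each incoming value with the basic block it really comes from (simplified string emitter, illustrative helper names):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the control-flow skeleton emitted for an if-then-else:
// conditional branch, two labelled blocks, and a phi merging the results.
public class IfSkeleton {
    static List<String> emit(int idx, String cond,
                             String thenVal, String elseVal, String type) {
        List<String> code = new ArrayList<>();
        code.add("br i1 " + cond + ", label %then" + idx + ", label %else" + idx);
        code.add("then" + idx + ":");
        code.add("  br label %exit" + idx);      // then body would go here
        code.add("else" + idx + ":");
        code.add("  br label %exit" + idx);      // else body would go here
        code.add("exit" + idx + ":");
        code.add("  %res" + idx + " = phi " + type + " [ " + thenVal
               + ", %then" + idx + " ], [ " + elseVal + ", %else" + idx + " ]");
        return code;
    }

    public static void main(String[] args) {
        emit(0, "%c", "%t", "%e", "i32").forEach(System.out::println);
    }
}
```

If a branch body itself contains another if-then-else, the value reaches the ''phi'' from that inner block's exit, not from ''thenN''/''elseN'': this is exactly why the real code records ''thenBB''/''elseBB'' with ''getBasicBlock'' after generating each body.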

To conclude, the parser needs additional fields and methods, too: in particular, it defines the ''noCompileErrors'' flag (set to false by ''report_semantic_error'') and a ''setCode'' method used to hand the generated code over to the main program. Code generation is triggered at the root of the grammar, once the whole program has been parsed:

<code java>
PROGRAM ::= /* empty program */
        {:
            if (noCompileErrors) {
                System.out.println("Compilation successful");
            }
            RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart());
            // Code generation
            RESULT.codegen();
            // Retrieve the code
            setCode(RESULT.getCode());
        :}
    | indent
        {: // push the top-level symtable
            symTableStack.pushSymTable();
        :}
    FUNCT_PART:funct_part IMPER_PART:imper_part
        {: // pop the top-level symtable
            symTableStack.popSymTable();
        :}
    dedent
        {:
            if (noCompileErrors) {
                System.out.println("Compilation successful");
                RESULT = new AST.Program(funct_part, imper_part);
                // Code generation
                RESULT.codegen();
                // Retrieve the code
                setCode(RESULT.getCode());
            }
            else {
                System.out.println("Compilation failed");
                RESULT = new AST.Program(null, null);
            }
        :}
    ;
</code>



===== Installation and usage =====

LHC is compatible with:

  * Ubuntu 20.04 LTS (but it should work with less recent versions or other Linux distributions, too)

=== Prerequisites ===

  * JDK (Java Development Kit) - Version 16 or above (successfully tested with JDK 16, but it should work with less recent releases, too)
  * JFlex and CUP (the scanner and parser generators the compiler is built with)
  * ''make''
  * LLVM (providing the ''lli'' interpreter used to run the generated code)

=== Download and Commands ===

Download link: see the GitHub repository below. \\
After downloading and accessing the main folder, you can use the following commands to play with the compiler:

  * **make install**: install everything
  * **make uninstall**: uninstall everything
  * **make %.ll**: build the corresponding %.hs file and produce the resulting LLVM code
  * **lli %.ll**: run the output code

In the examples folder you can find some ready-made Haskell programs to try the compiler with.

== Example ==

Let us suppose that we have just downloaded and unzipped the compiler and that we want to compile the GCD example program shipped with it. \\
First, go to the main folder (LHC).
Then, run the following commands to install the compiler, build the code and run it:
  - to install everything: //make install//
  - to build the code: //make ./gcd.ll// (assuming the source is //./gcd.hs//; adapt the path to where the example actually lives)
  - to run it: //lli ./gcd.ll//
  - finally, to uninstall everything: //make uninstall//

=== Other links ===

Github link to the original repo: https://