====== Ply ======
PLY is a parsing tool written purely in Python. It is a re-implementation of Lex and Yacc, which were originally written in the C language. PLY uses the same LALR parsing technique as Lex and Yacc, and it includes support for empty productions, precedence rules, error recovery, and ambiguous grammars.

PLY has two main modules, both part of the ply package; these modules are used to extend the classes used to develop the Laboratories (myLexer.py and myParser.py), as we'll see later. The modules are:

ply.lex - A re-implementation of Lex for lexical analysis

ply.yacc - A re-implementation of Yacc for parser creation

The two tools are meant to work together. Specifically, ''ply.lex'' provides an external interface in the form of a token() function that returns the next valid token on the input stream, and ''ply.yacc'' calls this function repeatedly to retrieve tokens and invoke the grammar rules. Finally, the PLY parser generates an LALR(1) parsing table.
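
As a minimal, self-contained sketch of this interface (a toy one-token lexer, not one of the laboratory files), the lexer can be driven by hand exactly the way the parser drives it:

<code python>
import ply.lex as lex

tokens = ('NUMBER',)

t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("12 34")

# token() is the same call that ply.yacc uses to pull tokens one at a time
while True:
    tok = lexer.token()
    if not tok:
        break  # None marks the end of the input
    print(tok.type, tok.value)
</code>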
+ | |||
+ | In summary, PLY consists of two separate modules: '' | ||
+ | |||
+ | In the rest of this page you can find a description of the main functionalities of '' | ||
+ | |||
+ | **MANY EXAMPLES ABOUT PLY:** [[https:// | ||
+ | |||
===== Execution =====

To execute all the examples developed, you have to run the following command from the command line:

<code bash>
python exercise_name.py myFile.txt
</code>

Remember to ALWAYS pass the input file (''myFile.txt'') as the first command-line argument, since its content is given to the lexer.

When some output files have to be generated, they are created inside the current directory of the exercise itself.
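
Since every exercise reads its input file from ''sys.argv[1]'', a small guard at the top of the main file avoids a confusing traceback when the argument is missing (a sketch, not part of the original exercises):

<code python>
import sys

# check that the input file was passed on the command line
if len(sys.argv) < 2:
    print("usage: python exercise_name.py myFile.txt")
    sys.exit(1)

# this content will be handed to the lexer (and possibly the parser)
data = open(sys.argv[1]).read()
</code>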
+ | |||
+ | ====== Lexer ====== | ||
+ | |||
+ | === Solution 1 === | ||
+ | |||
+ | For simple projects or problems, it is possible to define the lexer (and its components) inside the main function itself. | ||
+ | |||
+ | <file python Laboratories/ | ||
+ | import ply.lex as lex | ||
+ | import ply.yacc as yacc | ||
+ | import sys | ||
+ | |||
+ | # list of TOKENS | ||
+ | |||
+ | tokens = [ | ||
+ | |||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | |||
+ | ] | ||
+ | |||
+ | # tokens DEFINITION | ||
+ | |||
+ | t_nl = r' | ||
+ | |||
+ | t_letter = r' | ||
+ | t_digit | ||
+ | t_id = r' | ||
+ | |||
+ | t_path = r' | ||
+ | |||
+ | t_ignore = r' | ||
+ | |||
+ | def t_error(t): | ||
+ | print(" | ||
+ | t.lexer.skip(1) | ||
+ | |||
+ | # reading INPUT FILE | ||
+ | |||
+ | myFile = open(sys.argv[1]) | ||
+ | |||
+ | lexer = lex.lex() | ||
+ | |||
+ | with myFile as fp: | ||
+ | for line in fp: | ||
+ | try: | ||
+ | lexer.input(line) | ||
+ | |||
+ | for token in lexer: | ||
+ | if token.type == ' | ||
+ | print(" | ||
+ | |||
+ | except EOFError: | ||
+ | break | ||
+ | </ | ||
+ | |||
+ | **Text File to execute the code above:** [[https:// | ||
+ | |||
+ | For execution: | ||
+ | <code bash> | ||
+ | python ex1.py myFile.txt | ||
+ | </ | ||
+ | |||
+ | As you can see the code above is kind of simple and it doesn' | ||
+ | |||
=== Solution 2 ===

If you have to develop more complex solutions, putting the lexer (and also the parser)
inside the main function isn't a smart solution.

The module option can be used to define lexers from instances of a class.
For example:

<file python Practices/example.py>
import ply.lex as lex
import ply.yacc as yacc
import sys

from myLexer import *

# create object MY LEXER
myLex = MyLexer()

lexer = myLex.lexer

# reading INPUT FILE

myFile = open(sys.argv[1])

with myFile as fp:
    for line in fp:
        try:
            lexer.input(line)

            for token in lexer:
                pass

        except EOFError:
            break
</file>

<file python Practices/myLexer.py>
import ply.lex as lex
import ply.yacc as yacc
import sys
from ply.lex import TOKEN

class MyLexer():

    # CONSTRUCTOR
    def __init__(self):
        print('Lexer constructor called.')
        self.lexer = lex.lex(module=self)

    # DESTRUCTOR
    def __del__(self):
        print('Lexer destructor called.')

    # list of TOKENS
    tokens = [

        'NUMBER',
        'PLUS', 'MINUS',
        'STAR', 'DIV',
        'OPEN_BRACKET', 'CLOSE_BRACKET',
        'EQUAL',

    ]

    # tokens DEFINITION

    def t_NUMBER(self, t):
        r'\d+'
        print("NUMBER token: " + str(t.value))
        return t

    def t_PLUS(self, t):
        r'\+'
        print("PLUS token")
        return t

    def t_MINUS(self, t):
        r'-'
        print("MINUS token")
        return t

    def t_STAR(self, t):
        r'\*'
        print("STAR token")
        return t

    def t_DIV(self, t):
        r'/'
        print("DIV token")
        return t

    def t_OPEN_BRACKET(self, t):
        r'\('
        print("OPEN_BRACKET token")
        return t

    def t_CLOSE_BRACKET(self, t):
        r'\)'
        print("CLOSE_BRACKET token")
        return t

    def t_EQUAL(self, t):
        r'='
        print("EQUAL token")
        return t

    def t_nl(self, t):
        r'\n+'
        pass

    t_ignore = ' \t'

    # every symbol that doesn't match the previous tokens is considered an error
    def t_error(self, t):
        print("ERROR token: " + str(t.value[0]))
        t.lexer.skip(1)
</file>
+ | |||
+ | **Text File to execute the code above:** [[https:// | ||
+ | |||
+ | For execution: | ||
+ | <code bash> | ||
+ | python example.py myFile.txt | ||
+ | </ | ||
+ | |||
+ | As you can see the exercise was divided in two different files: the example.py which only istantiate the lexer object and pass to it the input file; the '' | ||
+ | |||
+ | In the first solution, to instantiate the lexer, you can simply use the following instruction | ||
+ | |||
+ | <code python> | ||
+ | lexer = lex.lex() | ||
+ | </ | ||
+ | |||
+ | In the second solution the initialization of the lexer happens inside the class constructor | ||
+ | |||
+ | <code python> | ||
+ | def __init__(self): | ||
+ | print(' | ||
+ | self.lexer = lex.lex(module=self) | ||
+ | </ | ||
+ | |||
+ | In most laboratories the '' | ||
+ | anyway, with more complex problems, it is a good practice to create a Lexer (and, if necessary, a Parser) class in order to extend the modules given by the PLY library because, as we'll see later during the Laboratories, | ||
+ | |||
Inside the lexer class you can find:

== Token Array ==

<code python>
# It contains all the tokens defined inside the class; it is especially important to let
# the parser recognize all the tokens used in the grammar rules.
# Tokens are usually given names to indicate what they are. For example:

'NUMBER', 'PLUS', 'MINUS',
'TIMES', 'DIVIDE'
</code>

== List of Tokens ==
<code python>
# The identification of tokens is typically done by writing a series of regular expression rules.
# A token can be defined in two different ways:

# 1. Regular expression rules for simple tokens

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# 2. A regular expression rule with some action code

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

</code>

== List of states ==
<code python>
# a list of the states that can be matched by the lexer

# list of STATES -> used only the one to catch comments
states = (
    ('COMMENT', 'exclusive'),
)
</code>

== Rule to track line numbers ==
<code python>
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
</code>
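
PLY only tracks line numbers; if you also need the column of a token, it can be computed from ''lexpos'' and the raw input, as in this helper taken from the PLY documentation:

<code python>
def find_column(input, token):
    # look backwards from the token position for the last newline
    line_start = input.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1
</code>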
+ | |||
+ | == Ignored Characters == | ||
+ | <code python> | ||
+ | t_ignore | ||
+ | </ | ||
+ | |||
+ | == Error handling Rule == | ||
+ | <code python> | ||
+ | def t_error(t): | ||
+ | | ||
+ | | ||
+ | </ | ||
+ | |||
+ | To use the lexer, you first need to give to him input text using its input() method. | ||
+ | After that, repeated calls to token () produce tokens; for example: | ||
+ | |||
+ | <code python> | ||
+ | # reading INPUT FILE | ||
+ | |||
+ | ... | ||
+ | |||
+ | myFile = open(sys.argv[1]) | ||
+ | |||
+ | with myFile as fp: | ||
+ | for line in fp: | ||
+ | try: | ||
+ | lexer.input(line) | ||
+ | |||
+ | for token in lexer: | ||
+ | pass | ||
+ | |||
+ | except EOFError: | ||
+ | break | ||
+ | |||
+ | ... | ||
+ | |||
+ | </ | ||
+ | |||
+ | It is important to remember that all lexers must provide a list " | ||
+ | of the possible token names that can be produced by the lexer. The tokens list is also | ||
+ | used by the '' | ||
+ | |||
+ | For example, the list of tokens to be defined can be written as follows: | ||
+ | <code python> | ||
+ | tokens = ( | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ' | ||
+ | ) | ||
+ | </ | ||
===== Tokens =====

As we said before, each token is specified by writing a regular expression rule
compatible with Python's ''re'' module. Each of these rules is defined by making a declaration
with the prefix ''t_'', to indicate that it defines a token.

I remind you that simple tokens can be specified as strings, as follows:

<code python>
t_PLUS = r'\+'
</code>

If some kind of action needs to be performed, a token rule can be specified as a function.
For example:

<code python>
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
</code>

This rule matches numbers and converts the string into a Python integer.

When a function is used, the regular expression rule is specified in the
function documentation string. The function always takes a single argument which
is an instance of LexToken. This object has different attributes:

- t.type which is the token type (as a string)
- t.value which is the lexeme (the actual text matched)
- t.lineno which is the current line number
- t.lexpos which is the position of the token relative to the beginning of the input text

By default, t.type is set to the name following the ''t_'' prefix.

It is important to remember that all tokens defined by functions are added in the same
order as they appear in the lexer file.
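
Running a tiny lexer over a short string makes these attributes visible (a self-contained sketch, not one of the laboratory files):

<code python>
import ply.lex as lex

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' '

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("3 + 41")
for tok in lexer:
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

# output:
# NUMBER 3 1 0
# PLUS + 1 2
# NUMBER 41 1 4
</code>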
+ | |||
+ | In some applications, | ||
+ | complex regular expression rules. In this case is important to use the '' | ||
+ | which can be used by importing the token module: | ||
+ | |||
+ | <code python> | ||
+ | from ply.lex import TOKEN | ||
+ | </ | ||
+ | |||
+ | In the following example (from the exercise #2 of the Laboratory #2) we are using a '' | ||
+ | |||
+ | <code python> | ||
+ | data_table = [' | ||
+ | @TOKEN("" | ||
+ | def t_table(self, | ||
+ | self.numTables = self.numTables + 1 | ||
+ | self.writeOut(t.value) | ||
+ | </ | ||
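
The same technique appears in the PLY documentation for building an identifier pattern out of smaller pieces:

<code python>
from ply.lex import TOKEN

digit      = r'([0-9])'
nondigit   = r'([_A-Za-z])'
identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'

@TOKEN(identifier)
def t_ID(t):
    # t.type is 'ID'; the regular expression is attached by the decorator
    return t
</code>
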
===== States =====

In advanced parsing applications, it may be useful to have different lexing states.
For instance, you may want the occurrence of a certain token to trigger a
different kind of lexing. PLY supports a feature that allows the underlying lexer
to be put into a series of different states. Each state can have its own tokens,
lexing rules, etc.

To define a new lexing state, it must first be declared inside the "states" variable
in your lex file or class. For example:

<code python>
states = (
    ('foo', 'exclusive'),
    ('bar', 'inclusive'),
)
</code>

This declaration declares two states, 'foo' and 'bar'.
States may be of two types: 'exclusive' and 'inclusive'.

== Exclusive ==

An ''exclusive'' state completely overrides the default behavior of the lexer.
That is, lex will only return tokens and apply rules defined specifically for that state.

== Inclusive ==

An ''inclusive'' state adds additional tokens and rules to the default set of rules.
Thus, lex will return both the tokens defined by default and those defined for the inclusive state.

By default, lexing operates in the ''INITIAL'' state. This state includes all of
the normally defined tokens. For users who aren't using different states,
this fact is completely transparent. If, during lexing or parsing, you want
to change the lexing state, use the begin() method. For example:

<code python>
def t_begin_foo(t):
    r'start_foo'
    t.lexer.begin('foo')  # starts the 'foo' state
</code>

To get out of a state, you use begin() to switch back to the initial state.
For example:

<code python>
def t_foo_end(t):
    r'end_foo'
    t.lexer.begin('INITIAL')  # back to the initial state
</code>

The "COMMENT" state is used in the following lexer (taken from the Practices) to recognize and discard comments:

<file python Practices/myLexer.py>
import ply.lex as lex
import ply.yacc as yacc
import sys
from ply.lex import TOKEN

class MyLexer():

    # CONSTRUCTOR
    def __init__(self):
        print('Lexer constructor called.')
        self.lexer = lex.lex(module=self)
        self.lexer.begin('INITIAL')

    # DESTRUCTOR
    def __del__(self):
        print('Lexer destructor called.')

    # list of TOKENS
    tokens = [

        'ID', 'EURO', 'INT',
        'CM', 'S', 'C', 'SEP',

    ]

    # list of STATES -> used only the one to catch comments
    states = (
        ('COMMENT', 'exclusive'),
    )

    # tokens DEFINITION
    # (the regular expressions below are illustrative)

    def t_ID(self, t):
        r'[a-zA-Z_][a-zA-Z0-9_]*'
        return t

    def t_EURO(self, t):
        r'\d+\.\d+€'
        return t

    def t_INT(self, t):
        r'\d+'
        return t

    def t_CM(self, t):
        r','
        return t

    def t_S(self, t):
        r';'
        return t

    def t_C(self, t):
        r':'
        return t

    def t_SEP(self, t):
        r'%'
        return t

    def t_nl(self, t):
        r'\n+'
        pass

    t_ignore = ' \t'

    # every symbol that doesn't match the previous tokens is considered an error
    def t_error(self, t):
        print("ERROR token: " + str(t.value[0]))
        t.lexer.skip(1)

    # COMMENT STATE

    def t_INITIAL_comm(self, t):
        r'/\*'
        self.lexer.begin('COMMENT')   # enter the COMMENT state

    def t_COMMENT_end(self, t):
        r'\*/'
        self.lexer.begin('INITIAL')   # back to the default state

    def t_COMMENT_body(self, t):
        r'.'
        pass                          # discard everything inside the comment

    def t_COMMENT_nl(self, t):
        r'\n+'
        pass

    t_COMMENT_ignore = ''

    def t_COMMENT_error(self, t):
        print("ERROR in COMMENT state: " + str(t.value[0]))
        t.lexer.skip(1)
</file>
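
A quick usage sketch (the /* ... */ comment delimiters are the assumption made in the reconstruction above): anything between the delimiters produces no tokens.

<code python>
myLex = MyLexer()
lexer = myLex.lexer

lexer.input("alpha /* this text is discarded */ beta")
for tok in lexer:
    print(tok.type, tok.value)

# output:
# ID alpha
# ID beta
</code>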
+ | |||
+ | ====== Parser ====== | ||
+ | '' | ||
+ | |||
+ | It represents the class used to recognize the grammar rules defined to develop | ||
+ | the laboratories. | ||
+ | |||
+ | Inside the class you can find: | ||
+ | |||
+ | == List of grammar rules == | ||
+ | <code python> | ||
+ | # Each grammar rule is defined by a Python function where the docstring to that function | ||
+ | # contains the appropriate context-free grammar specification. | ||
+ | # The statements that make up the function body implement the semantic actions of the rule. | ||
+ | # Each function accepts a single argument p that is a sequence containing the values | ||
+ | # of each grammar symbol in the corresponding rule. | ||
+ | |||
+ | # The values of p[i] are mapped to grammar symbols as shown here: | ||
+ | |||
+ | def p_expression_plus(p): | ||
+ | ' | ||
+ | # | ||
+ | # p[0] | ||
+ | p[0] = p[1] + p[3] | ||
+ | |||
+ | </ | ||
+ | |||
+ | == List of precedence rules == | ||
+ | |||
+ | <code python> | ||
+ | # For your grammar rules in some case you have to define precedence rules in order to avoid conflicts; | ||
+ | # to reach this purpose you have to define a list of precedences. For example: | ||
+ | |||
+ | precedence = ( | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | ) | ||
+ | </ | ||
+ | |||
+ | == Error handling grammar rule == | ||
+ | |||
+ | <code python> | ||
+ | def p_error(p): | ||
+ | if p: | ||
+ | print(" | ||
+ | # Just discard the token and tell the parser it's okay. | ||
+ | parser.errok() | ||
+ | else: | ||
+ | print(" | ||
+ | </ | ||
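
Besides ''errok()'', yacc also supports resynchronization directly in the grammar through the special ''error'' symbol (example adapted from the PLY documentation; the SEMI token is an assumption):

<code python>
def p_statement_error(p):
    'statement : error SEMI'
    # everything up to the next ';' is discarded
    print("Syntax error in statement. Bad statement discarded.")
    p[0] = None
</code>
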
===== Grammar Rules =====

Every grammar rule must begin with the prefix ''p_'', following the same
mechanism used for the lexer to recognize the tokens, whose names start with the prefix ''t_''.
We take into consideration the parser class definition of one of the Practices:

<file python Practices/myParser.py>
from myLexer import *
import ply.yacc as yacc
import sys


class MyParser:

    # CONSTRUCTOR
    def __init__(self, lexer):
        print("Parser constructor called.")
        self.parser = yacc.yacc(module=self)
        self.lexer = lexer

    # DESTRUCTOR
    def __del__(self):
        print('Parser destructor called.')

    # the token list is taken from the lexer
    tokens = MyLexer.tokens


    # GRAMMAR START
    def p_expr_list(self, p):
        '''
        expr_list : expr_list expr
                  | expr
        '''

    def p_expr(self, p):
        '''
        expr : e EQUAL
        '''

        print("CORRECT EXPRESSION")

    def p_e(self, p):
        '''
        e : e PLUS t
          | e MINUS t
          | t
        '''

    def p_t(self, p):
        '''
        t : OBRACKET e CBRACKET
          | NUMBER
        '''

    def p_error(self, p):
        print("SYNTAX ERROR!")
</file>
+ | |||
+ | The first rule defined in the yacc specification determines the starting grammar | ||
+ | symbol (in this case, the '' | ||
+ | Whenever the starting rule is reduced by the parser and no more input is | ||
+ | available, parsing stops and the final value is returned | ||
+ | (this value will be whatever the top-most rule placed in p[0]). | ||
+ | |||
+ | For tokens, the " | ||
+ | attribute assigned in the lexer module. For non-terminals, | ||
+ | determined by whatever is placed in p[0] when rules are reduced. | ||
+ | This value can be anything at all. | ||
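
The following self-contained sketch (not one of the laboratory parsers) shows values flowing bottom-up through p[0]:

<code python>
import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' '

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)  # this becomes the token's p[i] value
    return t

def t_error(t):
    t.lexer.skip(1)

def p_expr_plus(p):
    'expr : expr PLUS term'
    p[0] = p[1] + p[3]      # non-terminal value built from sub-values

def p_expr_term(p):
    'expr : term'
    p[0] = p[1]

def p_term_num(p):
    'term : NUMBER'
    p[0] = p[1]

def p_error(p):
    print("Syntax error")

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("1 + 2 + 3"))  # prints 6
</code>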
+ | |||
+ | '' | ||
+ | |||
+ | <code python> | ||
+ | def p_empty(p): | ||
+ | 'empty :' | ||
+ | pass | ||
+ | </ | ||
+ | |||
+ | Now to use the empty production, simply use ' | ||
+ | |||
+ | <code python> | ||
+ | def p_optitem(p): | ||
+ | ' | ||
+ | ' | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | ===== Precedence rules ===== | ||
+ | |||
+ | In many situations, it is extremely difficult or awkward to write grammars | ||
+ | without conflicts. | ||
+ | |||
+ | When an ambiguous grammar is given to '' | ||
+ | '' | ||
+ | A shift/ | ||
+ | whether or not to reduce a rule or shift a symbol on the parsing stack. | ||
+ | |||
+ | By default, all shift/ | ||
+ | |||
+ | To resolve ambiguity, especially in expression grammars, '' | ||
+ | individual tokens to be assigned a precedence level and associativity. | ||
+ | This is done by adding a variable precedence to the grammar file of the parser like this: | ||
+ | |||
+ | <code python> | ||
+ | precedence = ( | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | ) | ||
+ | </ | ||
+ | |||
+ | This declaration specifies that PLUS/MINUS have the same precedence level and | ||
+ | are left-associative and that TIMES/ | ||
+ | left-associative. Within the precedence declaration, | ||
+ | lowest to highest precedence. Thus, this declaration specifies that TIMES/ | ||
+ | have higher precedence than PLUS/MINUS (since they appear later in the | ||
+ | precedence specification). | ||
+ | |||
+ | Now, in the grammar file, we can write our unary minus rule like this: | ||
+ | |||
+ | <code python> | ||
+ | def p_expr_uminus(p): | ||
+ | ' | ||
+ | p[0] = -p[2] | ||
+ | </ | ||
+ | |||
+ | In this case, '' | ||
+ | it to that of '' | ||
+ | |||
+ | In the following example taken from the '' | ||
+ | precedence rules and also the usage of the keyword " | ||
+ | |||
+ | <code python> | ||
+ | from myLexer import * | ||
+ | import ply.yacc as yacc | ||
+ | |||
+ | class MyParser: | ||
+ | |||
+ | # CONSTRUCTOR | ||
+ | def __init__(self): | ||
+ | print(" | ||
+ | self.parser = yacc.yacc(module=self) | ||
+ | |||
+ | # DESTRUCTOR | ||
+ | def __del__(self): | ||
+ | print(' | ||
+ | | ||
+ | tokens = MyLexer.tokens | ||
+ | |||
+ | precedence = ( | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | (' | ||
+ | ) | ||
+ | |||
+ | # GRAMMAR START | ||
+ | |||
+ | ... | ||
+ | |||
+ | def p_if_stmt(self, | ||
+ | ''' | ||
+ | if_stmt : IF RBOPEN cond RBCLOSED stmt %prec IFX | ||
+ | | IF RBOPEN cond RBCLOSED stmt ELSE stmt | ||
+ | ''' | ||
+ | | ||
+ | </ | ||
+ | |||
+ | ====== Compilation ====== | ||
+ | |||
+ | To compile correctly your programs you have to create the lexer and parser objects (when they | ||
+ | are used, if they don't everything is inside the main function so you don't have | ||
+ | to instantiate anything); for example: | ||
+ | |||
+ | <code python> | ||
+ | # create objects MY LEXER and MY PARSER | ||
+ | myLex = MyLexer() | ||
+ | myPars = MyParser(myLex) | ||
+ | </ | ||
+ | |||
+ | In the example above after creating it, the lexer is passed in the parser | ||
+ | constructor (this is important to let the parser see through the lexer tokens). | ||
+ | |||
+ | To build the lexer, the function '' | ||
+ | |||
+ | This function uses Python reflection (or introspection) to read the | ||
+ | regular expression rules out of the calling context and build the lexer. | ||
+ | |||
+ | <code python> | ||
+ | |||
+ | # create object MY LEXER inside the main function | ||
+ | myLex = MyLexer() | ||
+ | |||
+ | # build the lexer by calling the lex.lex() method inside the myLexer class | ||
+ | def __init__(self): | ||
+ | print(' | ||
+ | |||
+ | self.lexer = lex.lex(module=self) | ||
+ | # initialize all parameters | ||
+ | self.file = open(' | ||
+ | |||
+ | </ | ||
+ | |||
+ | Once the lexer has been built, two methods can be used to control the lexer. | ||
+ | |||
+ | <code python> | ||
+ | lexer.input(data) # Reset the lexer and store a new input string. | ||
+ | |||
+ | lexer.token() # Return the next token. Returns a special LexToken instance | ||
+ | # on success or None if the end of the input text has been reached. Sometimes | ||
+ | # is used indirectly in the for cycle "for token in lexer:" | ||
+ | </ | ||
+ | |||
+ | The following example shows how the main function looks like using only the lexer: | ||
+ | |||
+ | <file python Laboratories/ | ||
+ | import ply.lex as lex | ||
+ | import ply.yacc as yacc | ||
+ | import sys | ||
+ | |||
+ | from myLexer import * | ||
+ | |||
+ | # create object MY LEXER | ||
+ | |||
+ | myLex = MyLexer() | ||
+ | |||
+ | # reading INPUT FILE | ||
+ | |||
+ | myFile = open(sys.argv[1]) | ||
+ | |||
+ | lexer = myLex.lexer | ||
+ | |||
+ | with myFile as fp: | ||
+ | for line in fp: | ||
+ | try: | ||
+ | lexer.input(line) | ||
+ | |||
+ | for token in lexer: | ||
+ | pass | ||
+ | |||
+ | except EOFError: | ||
+ | break | ||
+ | </ | ||
+ | |||
+ | To build the parser, call the '' | ||
+ | |||
+ | <code python> | ||
+ | # create object MY PARSER inside the main function | ||
+ | myLex = MyLexer() | ||
+ | myPars = MyParser() | ||
+ | |||
+ | # build the parser by calling the yacc.yacc() method inside the myParser class | ||
+ | def __init__(self): | ||
+ | print(" | ||
+ | self.parser = yacc.yacc(module=self) | ||
+ | </ | ||
+ | |||
+ | This function looks at the module and attempts to construct all of the LR | ||
+ | parsing tables for the grammar you have specified. The first time '' | ||
+ | is invoked, you will get a message such as this: | ||
+ | |||
+ | < | ||
+ | Generating LALR tables | ||
+ | calc > | ||
+ | </ | ||
+ | |||
+ | Since table construction is relatively expensive (especially for large grammars), | ||
+ | the resulting parsing table is written to a file called '' | ||
+ | In addition, a debugging file called '' | ||
+ | On subsequent executions, yacc will reload the table from '' | ||
+ | it has detected a change in the underlying grammar (in which case the tables | ||
+ | and parsetab.py file are regenerated). Both of these files are written to the | ||
+ | same directory as the module in which the parser is specified. The name of the | ||
+ | parsetab module can be changed using the tabmodule keyword argument to '' | ||
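
For instance, a class-based parser can redirect its tables with keyword arguments that are part of the standard ''yacc.yacc()'' signature (the module name here is just an example):

<code python>
self.parser = yacc.yacc(module=self,
                        tabmodule='my_parsetab',  # name of the generated table module
                        debug=True)               # also write the parser.out debug file
</code>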
+ | |||
+ | Once the parser has been built you can use the parse method in order to control | ||
+ | the parser. | ||
+ | <code python> | ||
+ | parser.parse(data) | ||
+ | </ | ||
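
When the lexer is a custom object, as in the class-based solutions above, it can be passed explicitly; ''parse()'' accepts a ''lexer'' keyword argument:

<code python>
result = myPars.parser.parse(data, lexer=myLex.lexer)
</code>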
+ | |||
+ | The following example shows how the main function looks like using both lexer and parser: | ||
+ | |||
+ | <file python Laboratories/ | ||
+ | import ply.lex as lex | ||
+ | import ply.yacc as yacc | ||
+ | import sys | ||
+ | |||
+ | from myLexer import * | ||
+ | from myParser import * | ||
+ | |||
+ | # create objects MY LEXER and MY PARSER | ||
+ | myLex = MyLexer() | ||
+ | myPars = MyParser() | ||
+ | |||
+ | lex = myLex.lexer | ||
+ | parser = myPars.parser | ||
+ | |||
+ | # reading INPUT FILE | ||
+ | |||
+ | myFile = open(sys.argv[1]) | ||
+ | parser.parse(myFile.read()) | ||
+ | |||
+ | </ |