Ply

PLY is a parsing tool written purely in Python. It is a re-implementation of Lex and Yacc, which were originally written in C. PLY uses the same LALR(1) parsing technique as Yacc and includes support for empty productions, precedence rules, error recovery, and ambiguous grammars.

PLY provides two main modules as part of the ply package. These modules are extended by the classes developed for the Laboratories (myLexer.py and myParser.py), as we'll see later:

  ply.lex - A re-implementation of Lex for lexical analysis 
  ply.yacc - A re-implementation of Yacc for parser creation 

The two tools are meant to work together. Specifically, lex.py provides an external interface in the form of a token() function that returns the next valid token on the input stream. yacc.py calls this function repeatedly to retrieve tokens and invoke grammar rules; to do so, it builds an LALR(1) parsing table.

In summary, PLY consists of two separate modules: lex.py and yacc.py.
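As a minimal sketch of how the two modules interact (the token set and input below are illustrative, not taken from the laboratories), lex.py can be driven by hand in exactly the way yacc.py drives it, by calling token() until it returns None:

import ply.lex as lex

tokens = ('NUMBER', 'PLUS')

t_PLUS   = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("3 + 4")

# yacc.py does essentially this behind the scenes: it keeps calling token()
# until None is returned and feeds each token to the grammar rules.
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok.type, tok.value)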

In the rest of this page you can find a description of the main functionalities of PLY.

Many more examples about PLY can be found here: PLY EXAMPLES

Execution

To execute any of the examples developed, run the following command from the command line:

python exercise_name.py myFile.txt

Always remember to pass the myFile.txt file: it contains the input strings to be fed to the lexer.

When an exercise generates output files, they are created inside the exercise's own directory.
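The exercises read the file name directly from sys.argv[1]; a small guard like the one below (not part of the original exercises, just a suggestion) makes a missing argument fail with a clear message instead of an IndexError:

import sys

# hypothetical guard: the original exercises open sys.argv[1] directly
if len(sys.argv) < 2:
    sys.exit("usage: python exercise_name.py myFile.txt")

with open(sys.argv[1]) as fp:
    for line in fp:
        pass  # feed each line to the lexer/parser here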

Lexer

Solution 1

For simple projects or problems, it is possible to define the lexer (and its components) directly inside the main script.

Laboratories/Lab1/Lab1_ex1/ex1.py
import ply.lex as lex
import ply.yacc as yacc
import sys
 
# list of TOKENS
 
tokens = [
 
    'nl',
    'letter',
    'digit',
    'id',
    'path'
 
]
 
# tokens DEFINITION
 
t_nl = r'\n|\r|\r\n'
 
t_letter = r'[^\n\r\\/:*?\"<> |0-9]'
t_digit  = r'[0-9]'
t_id     = r'('+t_digit+'|'+t_letter+')+'
 
t_path = r'^(' + t_letter + r':)?(\\)?(' + t_id + r'\\)*' + t_id + r'(\.' + t_id + r')?$'
 
# NOTE: t_ignore is a plain string of characters to skip, not a regex;
# as written, the characters '^', '.', '+' and '$' are ignored
t_ignore = r'^.+$'
 
def t_error(t):
    print("Illegal characters: ", t.value)
    t.lexer.skip(1)
 
# reading INPUT FILE
 
myFile = open(sys.argv[1])
 
lexer = lex.lex()
 
with myFile as fp:
    for line in fp:
        try:
            lexer.input(line)
 
            for token in lexer:
                if token.type == 'path':
                    print("Path Correct: ", token.value)
 
        except EOFError:
            break

Text File to execute the code above: myFile.txt

For execution:

python ex1.py myFile.txt

As you can see, the code above is quite simple and does not require a dedicated Lexer class to define the tokens it needs.

Solution 2

If you have to develop more complex solutions, putting the lexer (and the parser) inside the main script is not a smart choice.

The module option can be used to define lexers from instances of a class. For example:

Practices/Lexer/example.py
import ply.lex as lex
import ply.yacc as yacc
import sys
 
from myLexer import *
 
# create objects MY LEXER and MY PARSER
myLex = MyLexer()
 
lexer = myLex.lexer
 
# reading INPUT FILE
 
myFile = open(sys.argv[1])
 
with myFile as fp:
    for line in fp:
        try:
            lexer.input(line)
 
            for token in lexer:
                pass
 
        except EOFError:
            break
Practices/Lexer/myLexer.py
import ply.lex as lex
import ply.yacc as yacc
import sys
from ply.lex import TOKEN
 
class MyLexer():
 
    # CONSTRUCTOR
    def __init__(self):
        print('Lexer constructor called.')
        self.lexer = lex.lex(module=self)
 
    # DESTRUCTOR
    def __del__(self):
        print('Lexer destructor called.')
 
    # list of TOKENS
    tokens = [
 
        'NUMBER',
        'DIV', 'STAR', 'PLUS', 'MINUS',
        'OPEN_BRACKET', 'CLOSE_BRACKET',
        'EQUAL', 'error'
 
    ]
 
    # tokens DEFINITION
 
    def t_NUMBER(self,t):
        r'[1-9][0-9]*'
        print("NUMBER:", t.value)
 
    def t_PLUS(self,t):
        r'\+'
        print("PLUS")
 
    def t_MINUS(self, t):
        r'\-'
        print("MINUS")
 
    def t_STAR(self,t):
        r'\*'
        print("STAR")
 
    def t_DIV(self,t):
        r'\/'
        print("DIV")
 
    def t_OPEN_BRACKET(self,t):
        r'\('
        print("OPEN_BRACKET")
 
    def t_CLOSE_BRACKET(self,t):
        r'\)'
        print("CLOSE_BRACKET")
 
    def t_EQUAL(self,t):
        r'\='
        print("EQUAL\n")
 
    def t_nl(self,t):
        r'(\n|\r|\r\n)|\s|\t'
 
    # every symbol that doesn't match at least one of the previous tokens is considered an error
    def t_error(self,t):
        r'.'
        print("ERROR:", t.value)
        t.lexer.skip(1)

Text File to execute the code above: myFile.txt

For execution:

python example.py myFile.txt

As you can see, the exercise is divided into two files: example.py, which only instantiates the lexer object and feeds it the input file, and myLexer.py, which contains the token definitions of the lexer.

In the first solution, to instantiate the lexer, you can simply use the following instruction

lexer = lex.lex()

In the second solution the initialization of the lexer happens inside the class constructor

def __init__(self):
    print('Lexer constructor called.')
    self.lexer = lex.lex(module=self)

In most laboratories a myLexer.py class was defined; it represents the lexer used to tokenize an input string. As you can see, this class wasn't used in the first exercises because it isn't necessary there. However, for more complex problems it is good practice to create a Lexer class (and, if necessary, a Parser class) that extends the modules provided by the PLY library because, as we'll see later during the Laboratories, in some cases the amount of variables and structures used by the lexer and the parser is considerable.

Inside the lexer class you can find the following elements (a consolidated sketch is shown after the list):

Token Array
# It contains all the token names defined inside the class; it is especially important to let
# the parser recognize all the tokens used in the grammar rules.
# Tokens are usually given names that indicate what they are. For example, an input string
# such as x = 3 + 42 * (s - t) is broken into the token stream:
 
    'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
    'LPAREN','ID','MINUS','ID','RPAREN' 
List of Tokens
# The identification of tokens is typically done by writing a series of regular expression rules.
# A token can be defined in two different ways:
 
# 1. Regular expression rules for simple tokens
 
    t_PLUS    = r'\+'
    t_MINUS   = r'-'
    t_TIMES   = r'\*'
    t_DIVIDE  = r'/'
    t_LPAREN  = r'\('
    t_RPAREN  = r'\)'
 
# 2. A regular expression rule with some action code
 
def t_NUMBER(t):
     r'\d+'
     t.value = int(t.value)    
     return t
List of states
a list of the additional lexing states the lexer can be put into
 
# list of STATES -> used only the one to catch comments
states = (
    ('COMMENT','exclusive'),
)
Rule to track line numbers
def t_newline(t):
     r'\n+'
     t.lexer.lineno += len(t.value)
Ignored Characters
t_ignore  = ' \t'
Error handling Rule
def t_error(t):
     print("Illegal character '%s'" % t.value[0])
     t.lexer.skip(1)	
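Putting these pieces together, a minimal class-based lexer might look like the following sketch (the token set is illustrative and not taken from a specific laboratory):

import ply.lex as lex

class CalcLexer():

    # CONSTRUCTOR: build the lexer from this instance
    def __init__(self):
        self.lexer = lex.lex(module=self)

    # list of TOKENS
    tokens = ['NUMBER', 'PLUS', 'TIMES']

    # simple tokens defined as strings
    t_PLUS  = r'\+'
    t_TIMES = r'\*'

    # ignored characters
    t_ignore = ' \t'

    # token rule with action code
    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t

    # rule to track line numbers
    def t_newline(self, t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # error handling rule
    def t_error(self, t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

myLex = CalcLexer()
myLex.lexer.input("3 + 4 * 5")
for token in myLex.lexer:
    print(token.type, token.value)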

To use the lexer, you first need to feed it input text using its input() method. After that, repeated calls to token() produce tokens; for example:

# reading INPUT FILE
 
...
 
myFile = open(sys.argv[1])
 
with myFile as fp:
    for line in fp:
        try:
            lexer.input(line)
 
            for token in lexer:
                pass
 
        except EOFError:
            break
 
...

It is important to remember that all lexers must provide a list “tokens” that defines all of the possible token names that can be produced by the lexer. The tokens list is also used by the yacc.py module to identify terminals.

For example, the list of tokens to be defined can be written as follows:

tokens = (
  'NUMBER',
  'PLUS',
  'MINUS',
  'TIMES',
  'DIVIDE',
  'LPAREN',
  'RPAREN',
)

Tokens

As we said before, each token is specified by writing a regular expression rule compatible with Python's re module. Each of these rules is defined by a declaration whose name has the prefix t_, so that it is recognized as a token rule.

I remind you that simple tokens can be specified as strings, as follows:

t_PLUS = r'\+'

If some kind of action needs to be performed, a token rule can be specified as a function. For example:

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

This rule matches numbers and converts the matched string into a Python integer.

When a function is used, the regular expression rule is specified in the function's documentation string. The function always takes a single argument, which is an instance of LexToken. This object has the following attributes:

  1. t.type which is the token type (as a string)
  2. t.value which is the lexeme (the actual text matched)
  3. t.lineno which is the current line number
  4. t.lexpos which is the position of the token relative to the beginning of the input text.

By default, t.type is set to the name following the t_ prefix.
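The following sketch (with an illustrative ID token) prints the four attributes described above for every token produced by the lexer:

import ply.lex as lex

tokens = ('ID',)

t_ignore = ' \t'

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("alpha beta\ngamma")

for tok in lexer:
    # type, value, lineno and lexpos, as described above
    print(tok.type, tok.value, tok.lineno, tok.lexpos)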

It is important to remember that all tokens defined by functions are matched in the same order in which they appear in the lexer file.

In some applications, you may want to build tokens from a series of more complex regular expression rules. In this case, it is important to use the @TOKEN decorator, which can be imported from the lex module:

from ply.lex import TOKEN

In the following example (from exercise #2 of Laboratory #2) a data_table array stores three different regexes that are joined together to form a larger regex, which is the one used to recognize the “table” token.

data_table = ['(\<((TABLE)|(table)))(\ +', t_att, ')*(\>)']
@TOKEN("".join(data_table))
def t_table(self, t):
    self.numTables = self.numTables + 1
    self.writeOut(t.value)
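Since the snippet above depends on other parts of that exercise (t_att and writeOut are defined elsewhere in the class), here is a self-contained sketch of the same @TOKEN technique with made-up sub-expressions:

import ply.lex as lex
from ply.lex import TOKEN

tokens = ('DATE',)

t_ignore = ' \t'

# sub-expressions assembled into one larger regex (illustrative)
digits  = r'[0-9]+'
sep     = r'[-/]'
date_re = digits + sep + digits + sep + digits

@TOKEN(date_re)
def t_DATE(t):
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("2024-05-17")
print(lexer.token())   # LexToken(DATE,'2024-05-17',1,0)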

States

In advanced parsing applications, it may be useful to have different lexing states. For instance, you may want the occurrence of a certain token to trigger a different kind of lexing. PLY supports a feature that allows the underlying lexer to be put into a series of different states. Each state can have its own tokens, lexing rules, etc.

To define a new lexing state, it must first be declared inside the “states” list in your lex file or class. For example:

states = (
    ('foo','exclusive'),
    ('bar','inclusive'),
)

This declaration defines two states, 'foo' and 'bar'. States may be of two types: 'exclusive' and 'inclusive'.

Exclusive

An exclusive state completely overrides the default behavior of the lexer: lex will only return tokens and apply rules defined specifically for that state.

Inclusive

An inclusive state adds tokens and rules to the default set. Thus, lex will return both the tokens defined by default and those defined for the inclusive state.

By default, lexing operates in the INITIAL state. This state includes all of the normally defined tokens. For users who aren't using different states, this fact is completely transparent. If, during lexing or parsing, you want to change the lexing state, use the begin() method. For example:

     def t_begin_foo(t):
         r'start_foo'
         t.lexer.begin('foo')             # Starts 'foo' state

To get out of a state, you use begin() to switch back to the initial state. For example:

     def t_foo_end(t):
         r'end_foo'
         t.lexer.begin('INITIAL')        # Back to the initial state
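Token rules can be bound to a specific state by including the state name in the rule name (t_foo_NUMBER, t_foo_error, and so on), while the special prefix t_ANY_ applies a rule to every state. A small self-contained sketch, with an illustrative 'foo' state:

import ply.lex as lex

tokens = ('WORD', 'NUMBER')

states = (
    ('foo', 'exclusive'),
)

t_ignore = ' \t'

def t_begin_foo(t):
    r'start_foo'
    t.lexer.begin('foo')          # switch to the 'foo' state

def t_WORD(t):
    r'[a-z]+'
    return t

def t_error(t):
    t.lexer.skip(1)

# rules valid only inside the 'foo' state (exclusive states need
# their own ignore and error rules)
t_foo_ignore = ' \t'

def t_foo_NUMBER(t):
    r'\d+'
    return t

def t_foo_end(t):
    r'end_foo'
    t.lexer.begin('INITIAL')      # back to the default state

def t_foo_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("abc start_foo 123 end_foo xyz")
for tok in lexer:
    print(tok.type, tok.value)    # WORD abc, NUMBER 123, WORD xyz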

The “state” mechanism was used in the laboratories to recognize the comment sections of the code. As we'll see later, PLY doesn't recognize a pattern like

 "/*" ~ "*/"  

so a good way to work around the problem is to use a state to “enter” the comment section. In the following example we can see that, inside the lexer class, there are two different sections: one for the INITIAL state and one for the COMMENT state.

Practices/Synthetized_Attributes/myLexer.py
import ply.lex as lex
import ply.yacc as yacc
import sys
from ply.lex import TOKEN
 
class MyLexer():
 
 
    # CONSTRUCTOR
    def __init__(self):
        print('Lexer constructor called.')
        self.lexer = lex.lex(module=self)
        self.lexer.begin('INITIAL')
 
    # DESTRUCTOR
    def __del__(self):
        print('Lexer destructor called.')
 
    # list of TOKENS
    tokens = [
 
        'ID', 'EURO', 'INT',
        'CM', 'S', 'C', 'SEP',
 
    ]
 
    # list of STATES -> used only the one to catch comments
    states = (
        ('COMMENT','exclusive'),
    ) 
 
    # tokens DEFINITION
 
    def t_ID(self,t):
        r'[a-zA-Z]+'
        return t
 
    def t_EURO(self,t):
        r'[0-9]*\.[0-9][0-9]'
        return t
 
    def t_INT(self,t):
        r'([1-9][0-9]*)|0'
        return t
 
    def t_CM(self,t):
        r'\,'
        return t
 
    def t_S(self, t):
        r'\;'
        return t
 
    def t_C(self,t):
        r'\:'
        return t
 
    def t_SEP(self,t):
        r'\%'
        return t
 
    def t_nl(self,t):
        r'(\n|\r|\r\n)|\s|\t'
        pass
 
    # every symbol that doesn't match at least one of the previous tokens is considered an error
    def t_error(self,t):
        r'.'
        print("ERROR:", t.value)
        t.lexer.skip(1)   # advance past the offending character, otherwise lex raises a LexError
        return t
 
    # COMMENT STATE
 
    def t_INITIAL_comm(self,t):
        r'\/\*'
        self.lexer.begin('COMMENT')
 
    def t_COMMENT_end(self,t):
        r'\*\/'
        self.lexer.begin('INITIAL')
 
    def t_COMMENT_body(self,t):
        r'.'
        pass
 
    def t_COMMENT_nl(self,t):
        r'(\n|\r|\r\n)|\s|\t'
        pass
 
    def t_COMMENT_error(self,t):
        r'.'
        print("ERROR:", t.value)
        t.lexer.skip(1)   # advance past the offending character, otherwise lex raises a LexError
        return t

Parser

yacc.py is used to parse language syntax.

In the laboratories it is used through the myParser.py class, which recognizes the grammar rules defined for each exercise.

Inside the class you can find:

List of grammar rules
# Each grammar rule is defined by a Python function where the docstring to that function 
# contains the appropriate context-free grammar specification. 
# The statements that make up the function body implement the semantic actions of the rule. 
# Each function accepts a single argument p that is a sequence containing the values 
# of each grammar symbol in the corresponding rule.
 
# The values of p[i] are mapped to grammar symbols as shown here: 
 
def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]
    p[0] = p[1] + p[3]
List of precedence rules
# In some cases you have to define precedence rules for your grammar in order to avoid conflicts; 
# to do so, you have to define a list of precedences. For example:
 
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative 
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)	
Error handling grammar rule
def p_error(p):
    if p:
        print("Syntax error at token", p.type)
        # Just discard the token and tell the parser it's okay.
        parser.errok()
    else:
        print("Syntax error at EOF")

Grammar Rules

Every grammar rule must begin with the prefix p_ in order to be identified by the parser: it is the same mechanism the lexer uses to recognize token rules, whose names start with the prefix t_. Let's take into consideration the parser class definition of practice #2:

Practices/Parser/myParser.py
from myLexer import *
import ply.yacc as yacc
import sys
 
 
class MyParser:
 
    # CONSTRUCTOR
    def __init__(self,lexer):
        print("Parser constructor called")
        self.parser = yacc.yacc(module=self)
        self.lexer = lexer
 
    # DESTRUCTOR
    def __del__(self):
        print('Parser destructor called.')
 
 
    tokens = MyLexer.tokens
 
 
    # GRAMMAR START
    def p_expr_list(self,p):
        '''
        expr_list : expr_list expr
                  | expr
        '''
 
    def p_expr(self,p):
        '''
        expr : e EQUAL
        '''
 
        print("PARSER: expression recognized")
 
    def p_e(self,p):
        '''
        e : e PLUS t
          | e MINUS t
          | t
        '''      
 
    def p_t(self,p):
        '''
        t : OBRACKET e CBRACKET
          | NUMBER
        '''
 
    def p_error(self,p):
        '''
        '''

The first rule defined in the yacc specification determines the starting grammar symbol (in this case, the p_expr_list rule appears first). Whenever the starting rule is reduced by the parser and no more input is available, parsing stops and the final value is returned (this value will be whatever the top-most rule placed in p[0]).

For tokens, the “value” of the corresponding p[i] is the same as the p.value attribute assigned in the lexer module. For non-terminals, the value is determined by whatever is placed in p[0] when rules are reduced. This value can be anything at all.
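The following self-contained sketch (an illustrative expression grammar, not one of the laboratory grammars) shows how token values flow into p[i] and how the p[0] of the start rule becomes the value returned by parse():

import ply.lex as lex
import ply.yacc as yacc

# ----- lexer -----
tokens = ('NUMBER', 'PLUS', 'TIMES')

t_PLUS   = r'\+'
t_TIMES  = r'\*'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)        # this value is what p[i] will contain
    return t

def t_error(t):
    t.lexer.skip(1)

# ----- parser -----
precedence = (
    ('left', 'PLUS'),
    ('left', 'TIMES'),
)

def p_expression_plus(p):
    'expression : expression PLUS expression'
    p[0] = p[1] + p[3]

def p_expression_times(p):
    'expression : expression TIMES expression'
    p[0] = p[1] * p[3]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    print("Syntax error")

lexer = lex.lex()
parser = yacc.yacc()

print(parser.parse("2 + 3 * 4"))  # prints 14: the p[0] of the start rule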

yacc.py can handle empty productions by defining a rule like this:

def p_empty(p):
    'empty :'
    pass

Now to use the empty production, simply use 'empty' as a symbol. For example:

def p_optitem(p):
    '''optitem : item
               | empty'''
    ...

Precedence rules

In many situations, it is extremely difficult or awkward to write grammars without conflicts.

When an ambiguous grammar is given to yacc.py, it will print messages about shift/reduce conflicts or reduce/reduce conflicts. A shift/reduce conflict is caused when the parser generator can't decide whether to reduce a rule or shift a symbol onto the parsing stack.

By default, all shift/reduce conflicts are resolved in favor of shifting.

To resolve ambiguity, especially in expression grammars, yacc.py allows individual tokens to be assigned a precedence level and associativity. This is done by adding a variable precedence to the grammar file of the parser like this:

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)

This declaration specifies that PLUS/MINUS have the same precedence level and are left-associative and that TIMES/DIVIDE have the same precedence and are left-associative. Within the precedence declaration, tokens are ordered from lowest to highest precedence. Thus, this declaration specifies that TIMES/DIVIDE have higher precedence than PLUS/MINUS (since they appear later in the precedence specification).

Now, in the grammar file, we can write our unary minus rule like this:

def p_expr_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]

In this case, %prec UMINUS overrides the default rule precedence, setting it to that of UMINUS in the precedence specifier.

In the following example, taken from exercise #3 of Laboratory #3, we can see how to use precedence rules and the %prec keyword to override the default rule precedence.

from myLexer import *
import ply.yacc as yacc
 
class MyParser:
 
    # CONSTRUCTOR
    def __init__(self):
        print("Parser called")
        self.parser = yacc.yacc(module=self)
 
    # DESTRUCTOR
    def __del__(self):
        print('Parser destructor called.')
 
    tokens = MyLexer.tokens
 
    precedence = (
        ('left', 'LOGOP'),
        ('left', 'PLUS', 'MINUS'),
        ('left', 'DIVIDE', 'TIMES', 'MOD'),
        ('left', 'INCR', 'DECR'),
        ('nonassoc', 'IFX'),
    )
 
    # GRAMMAR START
 
    ...
 
    def p_if_stmt(self,p):
        '''
        if_stmt : IF RBOPEN cond RBCLOSED stmt  %prec IFX
                | IF RBOPEN cond RBCLOSED stmt ELSE stmt
        '''
 

Compilation

To compile your programs correctly you have to create the lexer and parser objects (when they are used; if they are not, everything is inside the main script and there is nothing to instantiate). For example:

# create objects MY LEXER and MY PARSER
myLex = MyLexer()
myPars = MyParser(myLex)

In the example above, the lexer, once created, is passed to the parser constructor (this is important to let the parser access the lexer's tokens).

To build the lexer, the function lex.lex() is used.

This function uses Python reflection (or introspection) to read the regular expression rules out of the calling context and build the lexer.

    # create object MY LEXER inside the main function
    myLex = MyLexer()
 
    # build the lexer by calling the lex.lex() method inside the myLexer class
    def __init__(self):
        print('Lexer constructor called.')
 
        self.lexer = lex.lex(module=self)
        # initialize all parameters
        self.file = open('output.html', 'w')

Once the lexer has been built, two methods can be used to control the lexer.

    lexer.input(data) # Reset the lexer and store a new input string.
 
    lexer.token() # Return the next token. Returns a special LexToken instance 
    # on success or None if the end of the input text has been reached. It is often
    # used indirectly through the loop "for token in lexer:"

The following example shows what the main function looks like when using only the lexer:

Laboratories/Lab2/Lab2_ex2/ex2.py
import ply.lex as lex
import ply.yacc as yacc
import sys
 
from myLexer import *
 
# create object MY LEXER
 
myLex = MyLexer()
 
# reading INPUT FILE
 
myFile = open(sys.argv[1])
 
lexer = myLex.lexer
 
with myFile as fp:
    for line in fp:
        try:
            lexer.input(line)
 
            for token in lexer:
                pass
 
        except EOFError:
            break

To build the parser, call the yacc.yacc() function.

    # create object MY PARSER inside the main function
    myLex = MyLexer()
    myPars = MyParser()
 
    # build the parser by calling the yacc.yacc() method inside the myParser class
    def __init__(self):
        print("Parser called")
        self.parser = yacc.yacc(module=self)

This function looks at the module and attempts to construct all of the LR parsing tables for the grammar you have specified. The first time yacc.yacc() is invoked, you will get a message such as this:

Generating LALR tables

Since table construction is relatively expensive (especially for large grammars), the resulting parsing table is written to a file called parsetab.py. In addition, a debugging file called parser.out is created. On subsequent executions, yacc will reload the table from parsetab.py unless it has detected a change in the underlying grammar (in which case the tables and parsetab.py file are regenerated). Both of these files are written to the same directory as the module in which the parser is specified. The name of the parsetab module can be changed using the tabmodule keyword argument to yacc().
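If you need to control where these generated files go or how the table module is named, yacc() accepts a few keyword arguments for that purpose (a sketch based on the options documented by PLY; check the documentation of your PLY version for the exact set):

    # inside the MyParser constructor, for example
    def __init__(self):
        self.parser = yacc.yacc(module=self,
                                tabmodule='myparser_tab',  # default module name is 'parsetab'
                                outputdir='.',             # directory for parsetab.py / parser.out
                                debug=False)               # don't generate the parser.out debug file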

Once the parser has been built, you can use its parse() method to run the parser on the input data:

parser.parse(data)
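By default parse() uses the most recently created lexer; when the lexer lives in its own class, as in the laboratories, it can also be passed explicitly through the lexer keyword argument (a usage sketch assuming the MyLexer and MyParser classes shown earlier):

    myLex = MyLexer()
    myPars = MyParser()

    data = open(sys.argv[1]).read()
    result = myPars.parser.parse(data, lexer=myLex.lexer)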

The following example shows what the main function looks like when using both the lexer and the parser:

Laboratories/Lab3/Lab3_ex1/ex1.py
import ply.lex as lex
import ply.yacc as yacc
import sys
 
from myLexer import *
from myParser import *
 
# create objects MY LEXER and MY PARSER
myLex = MyLexer()
myPars = MyParser()
 
lex = myLex.lexer
parser = myPars.parser
 
# reading INPUT FILE
 
myFile = open(sys.argv[1])
parser.parse(myFile.read())