PLY is a parsing tool written purely in Python. It is a re-implementation of Lex and Yacc originally in C-language. PLY uses the same LALR parsing technique as Lex and Yacc. It includes support for empty productions, precedence rules, error recovery, and ambiguous grammars.
PLY has two main modules which are part of the ply package; these modules are used to extend the classes used to develop the Laboratories (myLexer.py and myParser.py) as we’ll see later; the modules used are:
ply.lex - A re-implementation of Lex for lexical analysis
ply.yacc - A re-implementation of Yacc for parser creation
The two tools are meant to work together. Specifically, lex.py provides an external
interface in the form of a token() function that returns the next valid token on the
yacc.py calls this repeatedly to retrieve tokens and invoke grammar rules.
Finally the PLY parser will generate an LALR(1) parsing table.
In summary, PLY consists of two separate modules:
In the rest of this page you can find a description of the main functionalities of
MANY EXAMPLES ABOUT PLY: PLY EXAMPLES
To execute all the examples developed you have to run from the command line the following command:
python exercise_name.py myFile.txt
Remember to pass ALWAYS the
myFile.txt file: it contains all the input strings to be
passed to the lexer.
When some output files have to be generated, these are created inside the current directory of the exercise itself.
For simple projects or problems, it is possible to define the lexer (and its components) inside the main function itself.
import ply.lex as lex import ply.yacc as yacc import sys # list of TOKENS tokens = [ 'nl', 'letter', 'digit', 'id', 'path' ] # tokens DEFINITION t_nl = r'\n|\r|\r\n' t_letter = r'[^\n\r\\/:*?\"<> |0-9]' t_digit = r'[0-9]' t_id = r'('+t_digit+'|'+t_letter+')+' t_path = r'^(' + t_letter + r':)?(\\)?(' + t_id + r'\\)*' + t_id + r'("."' + t_id + r')?$' t_ignore = r'^.+$' def t_error(t): print("Illegal characters: ", t.value) t.lexer.skip(1) # reading INPUT FILE myFile = open(sys.argv) lexer = lex.lex() with myFile as fp: for line in fp: try: lexer.input(line) for token in lexer: if token.type == 'path': print("Path Correct: ", token.value) except EOFError: break
Text File to execute the code above: myFile.txt
python ex1.py myFile.txt
As you can see the code above is kind of simple and it doesn't need to specify a Lexer class to define all the tokens needed.
If you have to develop more complex solutions, putting the lexer (and also the parser) inside the main function isn't a smart solution.
The module option can be used to define lexers from instances of a class. For example:
import ply.lex as lex import ply.yacc as yacc import sys from myLexer import * # create objects MY LEXER and MY PARSER myLex = MyLexer() lexer = myLex.lexer # reading INPUT FILE myFile = open(sys.argv) with myFile as fp: for line in fp: try: lexer.input(line) for token in lexer: pass except EOFError: break
import ply.lex as lex import ply.yacc as yacc import sys from ply.lex import TOKEN class MyLexer(): # CONSTRUCTOR def __init__(self): print('Lexer constructor called.') self.lexer = lex.lex(module=self) # DESTRUCTOR def __del__(self): print('Lexer destructor called.') # list of TOKENS tokens = [ 'NUMBER', 'DIV', 'STAR', 'PLUS', 'MINUS', 'OPEN_BRACKET', 'CLOSE_BRACKET', 'EQUAL', 'error' ] # tokens DEFINITION def t_NUMBER(self,t): r'[1-9][0-9]*' print("NUMBER:", t.value) def t_PLUS(self,t): r'\+' print("PLUS") def t_MINUS(self, t): r'\-' print("MINUS") def t_STAR(self,t): r'\*' print("STAR") def t_DIV(self,t): r'\/' print("DIV") def t_OPEN_BRACKET(self,t): r'\(' print("OPEN_BRACKET") def t_CLOSE_BRACKET(self,t): r'\)' print("CLOSE_BRACKET") def t_EQUAL(self,t): r'\=' print("EQUAL\n") def t_nl(self,t): r'(\n|\r|\r\n)|\s|\t' # every symbol that doesn't match with almost one of the previous tokens is considered an error def t_error(self,t): r'.' print("ERROR:", t.value) t.lexer.skip(1)
Text File to execute the code above: myFile.txt
python example.py myFile.txt
As you can see the exercise was divided in two different files: the example.py which only istantiate the lexer object and pass to it the input file; the
myLexer.py file which contains the token definitions of the lexer.
In the first solution, to instantiate the lexer, you can simply use the following instruction
lexer = lex.lex()
In the second solution the initialization of the lexer happens inside the class constructor
def __init__(self): print('Lexer constructor called.') self.lexer = lex.lex(module=self)
In most laboratories the
myLexer.py class was defined; it was used to represents the lexer in order to tokenize an input string: as you can see, during the first exercises this class wasn’t used because it isn't necessary;
anyway, with more complex problems, it is a good practice to create a Lexer (and, if necessary, a Parser) class in order to extend the modules given by the PLY library because, as we'll see later during the Laboratories, in some cases the amount of variables and structures used by the lexer and the parser is enormous.
Inside the lexer class you can find:
# It contains all the tokens defined inside the class; it is especially important to let # the parser recognize all the tokens used in the grammar rules. # Tokens are usually given names to indicate what they are. For example: 'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES', 'LPAREN','ID','MINUS','ID','RPAREN'
# The identification of tokens is typically done by writing a series of regular expression rules. # The token could be defined in two different ways: # 1. Regular expression rules for simple tokens t_PLUS = r'\+' t_MINUS = r'-' t_TIMES = r'\*' t_DIVIDE = r'/' t_LPAREN = r'\(' t_RPAREN = r'\)' # 2. A regular expression rule with some action code def t_NUMBER(t): r'\d+' t.value = int(t.value) return t
a list of the states that can be matched by the lexer # list of STATES -> used only the one to catch comments states = ( ('COMMENT','exclusive'), )
def t_newline(t): r'\n+' t.lexer.lineno += len(t.value)
t_ignore = ' \t'
def t_error(t): print("Illegal character '%s'" % t.value) t.lexer.skip(1)
To use the lexer, you first need to give to him input text using its input() method. After that, repeated calls to token () produce tokens; for example:
# reading INPUT FILE ... myFile = open(sys.argv) with myFile as fp: for line in fp: try: lexer.input(line) for token in lexer: pass except EOFError: break ...
It is important to remember that all lexers must provide a list “tokens” that defines all
of the possible token names that can be produced by the lexer. The tokens list is also
used by the
yacc.py module to identify terminals.
For example, the list of tokens to be defined can be written as follows:
tokens = ( 'NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN', )
As we said before, each token is specified by writing a regular expression rule
compatible with python's regex module. Each of these rules are defined by making declarations
with the prefix
t_, in order to be recognized as tokens.
I remind you that simple tokens, can be specified as strings such as follows:
t_PLUS = r'\+'
If some kind of action needs to be performed, a token rule can be specified as a function. For example:
def t_NUMBER(t): r'\d+' t.value = int(t.value) return t
this rule matches number and converts the string into a Python integer.
When a function is used, the regular expression rule is specified in the function documentation string. The function always takes a single argument which is an instance of LexToken. This object has different attributes:
By default, t.type is set to the name following the
It is important to remember that all tokens defined by functions are added in the same order as they appear in the lexer file
In some applications, you may want to define build tokens from as a series of more
complex regular expression rules. In this case is important to use the
which can be used by importing the token module:
from ply.lex import TOKEN
In the following example (from the exercise #2 of the Laboratory #2) we are using a
data_table array to store three different regex to be assembled all together in order to form a larger regex, which is the one used to recognize the “table” token.
data_table = ['(\<((TABLE)|(table)))(\ +', t_att, ')*(\>)'] @TOKEN("".join(data_table)) def t_table(self, t): self.numTables = self.numTables + 1 self.writeOut(t.value)
In advanced parsing applications, it may be useful to have different lexing states. For instance, you may want the occurrence of a certain token to trigger a different kind of lexing. PLY supports a feature that allows the underlying lexer to be put into a series of different states. Each state can have its own tokens, lexing rules, ecc…
To define a new lexing state, it must first be declared inside the “states” list in your lex file or class. For example:
states = ( ('foo','exclusive'), ('bar','inclusive'), )
This declaration declares two states, 'foo' and 'bar'. States may be of two types; 'exclusive' and 'inclusive'.
Exclusive state completely overrides the default behavior of the lexer.
That is, lex will only return tokens and apply rules defined specifically for that state.
Inclusive state adds additional tokens and rules to the default set of rules.
Thus, lex will return both the tokens defined by default in addition to those defined for the inclusive state.
By default, lexing operates in the
INITIAL state. This state includes all of
the normally defined tokens. For users who aren't using different states,
this fact is completely transparent. If, during lexing or parsing, you want
to change the lexing state, use the begin() method. For example:
def t_begin_foo(t): r'start_foo' t.lexer.begin('foo') # Starts 'foo' state
To get out of a state, you use begin() to switch back to the initial state. For example:
def t_foo_end(t): r'end_foo' t.lexer.begin('INITIAL') # Back to the initial state
The “state” mechanism was used in the laboratories to recognize the comment section of the code. As we'll see later PLY doesn't recognize the regex
"/*" ~ "*/"
so, a good way to work around the problem is to use the State to “enter” the comment section. In the following example we can see that, inside the lexer class, there are two different sections: the one regarding the INITIAL state and the COMMENT state.
import ply.lex as lex import ply.yacc as yacc import sys from ply.lex import TOKEN class MyLexer(): # CONSTRUCTOR def __init__(self): print('Lexer constructor called.') self.lexer = lex.lex(module=self) self.lexer.begin('INITIAL') # DESTRUCTOR def __del__(self): print('Lexer destructor called.') # list of TOKENS tokens = [ 'ID', 'EURO', 'INT', 'CM', 'S', 'C', 'SEP', ] # list of STATES -> used only the one to catch comments states = ( ('COMMENT','exclusive'), ) # tokens DEFINITION def t_ID(self,t): r'[a-zA-Z]+' return t def t_EURO(self,t): r'[0-9]*\.[0-9][0-9]' return t def t_INT(self,t): r'([1-9][0-9]*)|0' return t def t_CM(self,t): r'\,' return t def t_S(self, t): r'\;' return t def t_C(self,t): r'\:' return t def t_SEP(self,t): r'\%' return t def t_nl(self,t): r'(\n|\r|\r\n)|\s|\t' pass # every symbol that doesn't match with almost one of the previous tokens is considered an error def t_error(self,t): r'.' print("ERROR:", t.value) return t # COMMENT STATE def t_INITIAL_comm(self,t): r'\/\*' self.lexer.begin('COMMENT') def t_COMMENT_end(self,t): '\*\/' self.lexer.begin('INITIAL') def t_COMMENT_body(self,t): r'.' pass def t_COMMENT_nl(self,t): r'(\n|\r|\r\n)|\s|\t' pass def t_COMMENT_error(self,t): r'.' print("ERROR:", t.value) return t
yacc.py is used to parse language syntax.
It represents the class used to recognize the grammar rules defined to develop the laboratories.
Inside the class you can find:
# Each grammar rule is defined by a Python function where the docstring to that function # contains the appropriate context-free grammar specification. # The statements that make up the function body implement the semantic actions of the rule. # Each function accepts a single argument p that is a sequence containing the values # of each grammar symbol in the corresponding rule. # The values of p[i] are mapped to grammar symbols as shown here: def p_expression_plus(p): 'expression : expression PLUS term' # ^ ^ ^ ^ # p p p p p = p + p
# For your grammar rules in some case you have to define precedence rules in order to avoid conflicts; # to reach this purpose you have to define a list of precedences. For example: precedence = ( ('nonassoc', 'LESSTHAN', 'GREATERTHAN'), # Nonassociative ('left', 'PLUS', 'MINUS'), ('left', 'TIMES', 'DIVIDE'), ('right', 'UMINUS'), # Unary minus operator )
def p_error(p): if p: print("Syntax error at token", p.type) # Just discard the token and tell the parser it's okay. parser.errok() else: print("Syntax error at EOF")
Every grammar rule must begin with the prefix
p_ in order to be identified by the parser itself: it is the same
mechanism used for the lexer to recognize the token whose name starts with prefix
We take into consideration the parser class definition of the
from myLexer import * import ply.yacc as yacc import sys class MyParser: # CONSTRUCTOR def __init__(self,lexer): print("Parser constructor called") self.parser = yacc.yacc(module=self) self.lexer = lexer # DESTRUCTOR def __del__(self): print('Parser destructor called.') tokens = MyLexer.tokens # GRAMMAR START def p_expr_list(self,p): ''' expr_list : expr_list expr | expr ''' def p_expr(self,p): ''' expr : e EQUAL ''' print("PARSER: expression recognized") def p_e(self,p): ''' e : e PLUS t | e MINUS t | t ''' def p_t(self,p): ''' t : OBRACKET e CBRACKET | NUMBER ''' def p_error(self,p): ''' '''
The first rule defined in the yacc specification determines the starting grammar
symbol (in this case, the
p_expr_list rule appears first).
Whenever the starting rule is reduced by the parser and no more input is
available, parsing stops and the final value is returned
(this value will be whatever the top-most rule placed in p).
For tokens, the “value” of the corresponding p[i] is the same as the p.value attribute assigned in the lexer module. For non-terminals, the value is determined by whatever is placed in p when rules are reduced. This value can be anything at all.
yacc.py can handle empty productions by defining a rule like this:
def p_empty(p): 'empty :' pass
Now to use the empty production, simply use 'empty' as a symbol. For example:
def p_optitem(p): 'optitem : item' ' | empty' ...
In many situations, it is extremely difficult or awkward to write grammars without conflicts.
When an ambiguous grammar is given to
yacc.py it will print messages about
shift/reduce conflicts or
A shift/reduce conflict is caused when the parser generator can't decide
whether or not to reduce a rule or shift a symbol on the parsing stack.
By default, all shift/reduce conflicts are resolved in favor of shifting.
To resolve ambiguity, especially in expression grammars,
individual tokens to be assigned a precedence level and associativity.
This is done by adding a variable precedence to the grammar file of the parser like this:
precedence = ( ('left', 'PLUS', 'MINUS'), ('left', 'TIMES', 'DIVIDE'), ('right', 'UMINUS'), # Unary minus operator )
This declaration specifies that PLUS/MINUS have the same precedence level and are left-associative and that TIMES/DIVIDE have the same precedence and are left-associative. Within the precedence declaration, tokens are ordered from lowest to highest precedence. Thus, this declaration specifies that TIMES/DIVIDE have higher precedence than PLUS/MINUS (since they appear later in the precedence specification).
Now, in the grammar file, we can write our unary minus rule like this:
def p_expr_uminus(p): 'expression : MINUS expression %prec UMINUS' p = -p
In this case,
%prec UMINUS overrides the default rule precedence–setting
it to that of
UMINUS in the precedence specifier.
In the following example taken from the
exercise #3 of the Laboratory #3 we can see how to use
precedence rules and also the usage of the keyword “prec” in order to override the default precedence rule.
from myLexer import * import ply.yacc as yacc class MyParser: # CONSTRUCTOR def __init__(self): print("Parser called") self.parser = yacc.yacc(module=self) # DESTRUCTOR def __del__(self): print('Parser destructor called.') tokens = MyLexer.tokens precedence = ( ('left', 'LOGOP'), ('left', 'PLUS', 'MINUS'), ('left', 'DIVIDE', 'TIMES', 'MOD'), ('left', 'INCR', 'DECR'), ('nonassoc', 'IFX'), ) # GRAMMAR START ... def p_if_stmt(self,p): ''' if_stmt : IF RBOPEN cond RBCLOSED stmt %prec IFX | IF RBOPEN cond RBCLOSED stmt ELSE stmt '''
To compile correctly your programs you have to create the lexer and parser objects (when they are used, if they don't everything is inside the main function so you don't have to instantiate anything); for example:
# create objects MY LEXER and MY PARSER myLex = MyLexer() myPars = MyParser(myLex)
In the example above after creating it, the lexer is passed in the parser constructor (this is important to let the parser see through the lexer tokens).
To build the lexer, the function
lex.lex() is used.
This function uses Python reflection (or introspection) to read the regular expression rules out of the calling context and build the lexer.
# create object MY LEXER inside the main function myLex = MyLexer() # build the lexer by calling the lex.lex() method inside the myLexer class def __init__(self): print('Lexer constructor called.') self.lexer = lex.lex(module=self) # initialize all parameters self.file = open('output.html', 'w')
Once the lexer has been built, two methods can be used to control the lexer.
lexer.input(data) # Reset the lexer and store a new input string. lexer.token() # Return the next token. Returns a special LexToken instance # on success or None if the end of the input text has been reached. Sometimes # is used indirectly in the for cycle "for token in lexer:"
The following example shows how the main function looks like using only the lexer:
import ply.lex as lex import ply.yacc as yacc import sys from myLexer import * # create object MY LEXER myLex = MyLexer() # reading INPUT FILE myFile = open(sys.argv) lexer = myLex.lexer with myFile as fp: for line in fp: try: lexer.input(line) for token in lexer: pass except EOFError: break
To build the parser, call the
# create object MY PARSER inside the main function myLex = MyLexer() myPars = MyParser() # build the parser by calling the yacc.yacc() method inside the myParser class def __init__(self): print("Parser called") self.parser = yacc.yacc(module=self)
This function looks at the module and attempts to construct all of the LR
parsing tables for the grammar you have specified. The first time
is invoked, you will get a message such as this:
Generating LALR tables calc >
Since table construction is relatively expensive (especially for large grammars),
the resulting parsing table is written to a file called
In addition, a debugging file called
parser.out is created.
On subsequent executions, yacc will reload the table from
it has detected a change in the underlying grammar (in which case the tables
and parsetab.py file are regenerated). Both of these files are written to the
same directory as the module in which the parser is specified. The name of the
parsetab module can be changed using the tabmodule keyword argument to
Once the parser has been built you can use the parse method in order to control the parser.
The following example shows how the main function looks like using both lexer and parser:
import ply.lex as lex import ply.yacc as yacc import sys from myLexer import * from myParser import * # create objects MY LEXER and MY PARSER myLex = MyLexer() myPars = MyParser() lex = myLex.lexer parser = myPars.parser # reading INPUT FILE myFile = open(sys.argv) parser.parse(myFile.read())