(Soon to be) the fastest pure-Python PEG parser I could muster
Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure Python—and the most usable. It's based on parsing expression grammars (PEGs), which means you feed it a simplified sort of EBNF notation. Parsimonious was designed to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB of RAM to do one page, but it's applicable to all sorts of languages.
:Code: https://github.com/erikrose/parsimonious/ :Issues: https://github.com/erikrose/parsimonious/issues :License: MIT License (MIT) :Package: https://pypi.org/project/parsimonious/
To install Parsimonious, run::
$ pip install parsimonious
Here's how to build a simple grammar:
.. code:: python
>>> from parsimonious.grammar import Grammar
>>> grammar = Grammar(
... """
... bold_text = bold_open text bold_close
... text = ~"[A-Z 0-9]*"i
... bold_open = "(("
... bold_close = "))"
... """)
You can have forward references and even right recursion; it's all taken care of by the grammar compiler. The first rule is taken to be the default start symbol, but you can override that.
Next, let's parse something and get an abstract syntax tree:
.. code:: python
>>> print(grammar.parse('((bold stuff))'))
<Node called "bold_text" matching "((bold stuff))">
<Node called "bold_open" matching "((">
<RegexNode called "text" matching "bold stuff">
<Node called "bold_close" matching "))">
You'd typically then use a nodes.NodeVisitor
subclass (see below) to walk
the tree and do something useful with it.
Another example would be to implement a parser for .ini
-files. Consider the following:
.. code:: python
grammar = Grammar(
r"""
expr = (entry / emptyline)*
entry = section pair*
section = lpar word rpar ws
pair = key equal value ws?
key = word+
value = (word / quoted)+
word = ~r"[-\w]+"
quoted = ~'"[^\"]+"'
equal = ws? "=" ws?
lpar = "["
rpar = "]"
ws = ~"\s*"
emptyline = ws+
"""
)
We could now implement a subclass of NodeVisitor
like so:
.. code:: python
class IniVisitor(NodeVisitor):
def visit_expr(self, node, visited_children):
""" Returns the overall output. """
output = {}
for child in visited_children:
output.update(child[0])
return output
def visit_entry(self, node, visited_children):
""" Makes a dict of the section (as key) and the key/value pairs. """
key, values = visited_children
return {key: dict(values)}
def visit_section(self, node, visited_children):
""" Gets the section name. """
_, section, *_ = visited_children
return section.text
def visit_pair(self, node, visited_children):
""" Gets each key/value pair, returns a tuple. """
key, _, value, *_ = node.children
return key.text, value.text
def generic_visit(self, node, visited_children):
""" The generic visit method. """
return visited_children or node
And call it like that:
.. code:: python
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
data = """[section]
somekey = somevalue
someotherkey=someothervalue
[anothersection]
key123 = "what the heck?"
key456="yet another one here"
"""
tree = grammar.parse(data)
iv = IniVisitor()
output = iv.visit(tree)
print(output)
This would yield
.. code:: python
{'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '"what the heck?"', 'key456': '"yet another one here"'}}
repr
methods of expressions, grammars,
and nodes are clear and helpful as well. The Grammar
ones are
even round-trippable!PEG parsers don't draw a distinction between lexing and parsing; everything is done at once. As a result, there is no lookahead limit, as there is with, for instance, Yacc. And, due to both of these properties, PEG grammars are easier to write: they're basically just a more practical dialect of EBNF. With caching, they take O(grammar size * text length) memory (though I plan to do better), but they run in O(text length) time.
PEGs can describe a superset of LL(k) languages, any deterministic LR(k)
language, and many others—including some that aren't context-free
(http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what
would be ambiguous languages if described in canonical EBNF. They do this by
trading the |
alternation operator for the /
operator, which works the
same except that it makes priority explicit: a / b / c
first tries matching
a
. If that fails, it tries b
, and, failing that, moves on to c
.
Thus, ambiguity is resolved by always yielding the first successful recognition.
Grammars are defined by a series of rules. The syntax should be familiar to anyone who uses regexes or reads programming language manuals. An example will serve best:
.. code:: python
my_grammar = Grammar(r"""
styled_text = bold_text / italic_text
bold_text = "((" text "))"
italic_text = "''" text "''"
text = ~"[A-Z 0-9]*"i
""")
You can wrap a rule across multiple lines if you like; the syntax is very forgiving.
==================== ========================================================
"some literal"
Used to quote literals. Backslash escaping and Python
conventions for "raw" and Unicode strings help support
fiddly characters.
b"some literal"
A bytes literal. Using bytes literals and regular
expressions allows your grammar to parse binary files.
Note that all literals and regular expressions must be
of the same type within a grammar. In grammars that
process bytestrings, you should make the grammar string
an r"""string"""
so that byte literals like \xff
work correctly.
[space] Sequences are made out of space- or tab-delimited
things. a b c
matches spots where those 3
terms appear in that order.
a / b / c
Alternatives. The first to succeed of a / b / c
wins.
thing?
An optional expression. This is greedy, always consuming
thing
if it exists.
&thing
A lookahead assertion. Ensures thing
matches at the
current position but does not consume it.
!thing
A negative lookahead assertion. Matches if thing
isn't found here. Doesn't consume any text.
things*
Zero or more things. This is greedy, always consuming as
many repetitions as it can.
things+
One or more things. This is greedy, always consuming as
many repetitions as it can.
~r"regex"ilmsuxa
Regexes have ~
in front and are quoted like
literals. Any flags_ (asilmx
) follow the end quotes
as single chars. Regexes are good for representing
character classes ([a-z0-9]
) and optimizing for
speed. The downside is that they won't be able to take
advantage of our fancy debugging, once we get that
working. Ultimately, I'd like to deprecate explicit
regexes and instead have Parsimonious dynamically build
them out of simpler primitives. Parsimonious uses the
regex_ library instead of the built-in re module.
~br"regex"
A bytes regex; required if your grammar parses
bytestrings.
(things)
Parentheses are used for grouping, like in every other
language.
thing{n}
Exactly n
repetitions of thing
.
thing{n,m}
Between n
and m
repititions (inclusive.)
thing{,m}
At most m
repetitions of thing
.
thing{n,}
At least n
repetitions of thing
.
==================== ========================================================
.. _flags: https://docs.python.org/3/howto/regex.html#compilation .. _regex: https://github.com/mrabarnett/mrab-regex
If you need a ~"[a-z0-9]"i
at two points in your grammar, don't type it
twice. Make it a rule of its own, and reference it from wherever you need it.
You'll get the most out of the caching this way, since cache lookups are by
expression object identity (for speed).
Even if you have an expression that's very simple, not repeating it will save RAM, as there can, at worst, be a cached int for every char in the text you're parsing. In the future, we may identify repeated subexpressions automatically and factor them up while building the grammar.
How much should you shove into one regex, versus how much should you break them up to not repeat yourself? That's a fine balance and worthy of benchmarking. More stuff jammed into a regex will execute faster, because it doesn't have to run any Python between pieces, but a broken-up one will give better cache performance if the individual pieces are re-used elsewhere. If the pieces of a regex aren't used anywhere else, by all means keep the whole thing together.
Bring your ?
and *
quantifiers up to the highest level you
can. Otherwise, lower-level patterns could succeed but be empty and put a bunch
of useless nodes in your tree that didn't really match anything.
A parse tree has a node for each expression matched, even if it matched a
zero-length string, like "thing"?
might.
The NodeVisitor
class provides an inversion-of-control framework for
walking a tree and returning a new construct (tree, string, or whatever) based
on it. For now, have a look at its docstrings for more detail. There's also a
good example in grammar.RuleVisitor
. Notice how we take advantage of nodes'
iterability by using tuple unpacks in the formal parameter lists:
.. code:: python
def visit_or_term(self, or_term, (slash, _, term)):
...
For reference, here is the production the above unpacks::
or_term = "/" _ term
When something goes wrong in your visitor, you get a nice error like this::
[normal traceback here...]
VisitationException: 'Node' object has no attribute 'foo'
Parse tree:
<Node called "rules" matching "number = ~"[0-9]+""> <-- *** We were here. ***
<Node matching "number = ~"[0-9]+"">
<Node called "rule" matching "number = ~"[0-9]+"">
<Node matching "">
<Node called "label" matching "number">
<Node matching " ">
<Node called "_" matching " ">
<Node matching "=">
<Node matching " ">
<Node called "_" matching " ">
<Node called "rhs" matching "~"[0-9]+"">
<Node called "term" matching "~"[0-9]+"">
<Node called "atom" matching "~"[0-9]+"">
<Node called "regex" matching "~"[0-9]+"">
<Node matching "~">
<Node called "literal" matching ""[0-9]+"">
<Node matching "">
<Node matching "">
<Node called "eol" matching "
">
<Node matching "">
The parse tree is tacked onto the exception, and the node whose visitor method raised the error is pointed out.
Some have asked why we don't process the tree as we go, SAX-style. There are two main reasons:
It wouldn't work. With a PEG parser, no parsing decision is final until the whole text is parsed. If we had to change a decision, we'd have to backtrack and redo the SAX-style interpretation as well, which would involve reconstituting part of the AST and quite possibly scuttling whatever you were doing with the streaming output. (Note that some bursty SAX-style processing may be possible in the future if we use cuts.)
It interferes with the ability to derive multiple representations from the AST: for example, turning wiki markup into first HTML and then text.
(Next release)
0.10.0
{min,max}
repetition expressions (righthandabacus)*
and +
for token grammars (lucaswiman).. warning::
This release makes backward-incompatible changes:
* Fix precedence of string literal modifiers ``u/r/b``.
This will break grammars with no spaces between a
reference and a string literal. (lucaswiman)
0.9.0
Grammar.__repr__()
now correctly escapes backslashes (ingolemo)0.8.1
print
in the benchmark tests so we work
cleanly as a dependency on Python 3. (Edward Betts)0.8.0
__repr__
more like the
original input. (Lucas Wiman)six
at 1.9.0 to ensure we have python_2_unicode_compatible
.
(Sam Raker)0.7.0
Grammar.__repr__
which fails to work on Python 3 since the
string_escape codec is gone in Python 3. (Lucas Wiman)0.6.2
0.6.1
0.6 .. warning::
This release makes backward-incompatible changes:
* The ``default_rule`` arg to Grammar's constructor has been replaced
with a method, ``some_grammar.default('rule_name')``, which returns a
new grammar just like the old except with its default rule changed.
This is to free up the constructor kwargs for custom rules.
* ``UndefinedLabel`` is no longer a subclass of ``VisitationError``. This
matters only in the unlikely case that you were catching
``VisitationError`` exceptions and expecting to thus also catch
``UndefinedLabel``.
__getitem__
on Grammar at parse time.@rule
decorator, allowing grammars to be constructed out of
notations on NodeVisitor
methods. This saves looking back and forth
between the visitor and the grammar when there is only one visitor per
grammar.parse()
and match()
convenience methods to NodeVisitor
.
This makes the common case of parsing a string and applying exactly one
visitor to the AST shorter and simpler.unwrapped_exceptions
attribute to NodeVisitor
, letting you
name certain exceptions which propagate out of visitors without being
wrapped by VisitationError
exceptions.__init__
, making your imports
shorter.0.5 .. warning::
This release makes some backward-incompatible changes. See below.
None
,
parse()
and match()
raise ParseError
if they don't succeed.
This makes more sense, since you'd rarely attempt to parse something and
not care if it succeeds. It was too easy before to forget to check for a
None
result. ParseError
gives you a human-readable unicode
representation as well as some attributes that let you construct your own
custom presentation.ParseError
rather than BadGrammar
if it can't parse your rules.parse()
now takes an optional pos
argument, like match()
._str__()
method of UndefinedLabel
return the right type.0.4
import *
for parsimonious.expressions
.0.3
!
("not") operator, and parentheses in grammar
definition syntax.&
operator to a prefix operator to conform to the original
PEG syntax. The version in Parsing Techniques was infix, and that's what I
used as a reference. However, the unary version is more convenient, as it
lets you spell AB & A
as simply A &B
.print
statements out of the benchmark tests.__repr__
.0.2
match()
public and able to initialize a new cache. Add
match()
callthrough method to Grammar
.BadGrammar
exception (rather than crashing) when there are
mistakes in a grammar definition.NodeVisitor.lift_child
convenience method.VisitationException
to VisitationError
for consistency with
the standard Python exception hierarchy.repr
and str
values for grammars and expressions. Now they
both look like rule syntax. Grammars are even round-trippable! This fixes a
unicode encoding error when printing nodes that had parsed unicode text.0.1
Thanks to Wiki Loves Monuments Panama for showing their support with a generous gift.