PEG parsing

How can we implement a parser for our little grammar? Parsing means to recognize a language according to its grammar and if a given input can be successfully recognized we know that it conforms to the grammar. However, just recognizing is not enough, because we also want to do something. So, while parsing, we will also build up a data structure that encodes our description.

We will start with a couple of parsers of terminal symbols and combine them together for the full grammar. Each Parser takes an input string and returns a tuple that signals success and the remaining unparsed string.

Parser :: String -> (success::Boolean, unparsed::String)

Here are a few very simple parsers:

  • literal ('…' with the literal instead of …) to parse a specific sequence of characters
  • anyChar (.) to consume exactly one character
  • digit ([0-9]) to consume a decimal digit
  • space ()1 to recognize a blank space character
  • tab to recognize a tab character
  • ws for generic whitespace characters
  • empty (ε) which always succeeds and returns the input string unchanged
  • fail ()2 which always fails and returns the input string unchanged

The simplest parser recognizes terminal symbols, we can generate such a parser with the following function:

def literal(literal, unparsed):
	# Does unparsed start with the given literal?
	if unparsed[0:len(literal)] == literal:
		# Success! Return the remaining string
		return (True, unparsed[len(literal):])
	else:
		# Failure! Return the original string
		return (False, unparsed)

In addition we have a parser that always succeed without consuming input, and one that always fails:

def empty(unparsed): 
	return (True, unparsed)

def fail(unparsed):  
	return (False, unparsed)

Now it is of course nice to be able to parse an input string, but at the end we only know if we have been successful or not. To be more useful, we want to return a data structure that is derived from the input. To be able to do that we need to construct this structure during the parsing process from the underlying elements. This adds an additional field to our return tuple containing the construct from the parsed parts.

Builder a b :: a -> b
Parser a :: (Builder a) -> String -> (success::Boolean, parsed::a, unparsed::String)

It will be helpful to have a function that leaves the underlying element unchanged.

def id_(x): return x

In the case the parse is unsuccessful, we return $None$

def literal(literal, builder=id_, unparsed): # String -> String -> (Boolean, a, String)
	# Does unparsed start with the given literal?
	if unparsed[0:len(literal)] == literal:
		# Success! Return the remaining string
		return (True, builder(literal), unparsed[len(literal):])
	else:
		# Failure! Return the original string
		return (False, None, unparsed)
def empty(): return lambda ast, unparsed: (True, ast, unparsed)
def fail():  return lambda ast, unparsed: (False, ast, unparsed)```

Combinators

To build up a grammar we have to implement the different elements. We use combinators here. A combinator takes one or more parsers and transform them into a new one.

Combinator :: Parser -> Parser -> Parser
Combinator :: [Parser] -> Parser

The combinators we need are

  • opt (?) for an optional element,
  • zeroOrMore (*) for zero or more elements
  • oneOrMore (+) for one or more elements
  • andThen for one element following another (concatenation)
  • orElse (‘/’) for one element or if that fails another one
  • allOf a generalization of andThen
  • anyOf a generalization of orElse

(I did not implement andThen and orElse as these are equivalent to allOf and anyOf with exactly two parsers in the parser list.)

For example, to recognize any day of the week in the grammar description we write:

day = 'Mo' / 'Tu' / 'We' / 'Th' / 'Fr' / 'Sa' / 'Su'

which becomes in code:

day = anyOf([literal('Mo'), 
             literal('Tu'), 
             literal('We'), 
             literal('Th'), 
             literal('Fr'), 
             literal('Sa'), 
             literal('Su')])

However if you want to really implement this you will find that you do need a lot plumbing. Instead we make the parser return a parsing function that can be applied to the input code:

def apply(parser, unparsed): # :: Parser -> String -> (success::Boolean, parsed::a, unparsed::String)
	return parser(unparsed)
def pLiteral(literal, builder, unparsed): # String -> String -> (Boolean, a, String)
	# Does unparsed start with the given literal?
	if unparsed[0:len(literal)] == literal:
		# Success! Return the remaining string
		return (True, builder(literal), unparsed[len(literal):])
	else:
		# Failure! Return the original string
		return (False, None, unparsed)

def lit(literal, builder=id_): return lambda unparsed: pLiteral(literal, builder, unparsed)
def p_anyOf(parsers, builder, org_unparsed):
	# go through list
	unparsed = org_unparsed
	for parser in parsers:
		(success, parsed, unparsed) = parser(unparsed)
		if success:
			return (True, builder(parsed), unparsed)
		else: # backtrack
			unparsed = org_unparsed
	# no parser matched
	return (False, None, org_unparsed)

def anyOf(parsers, builder=id_): 
	return lambda unparsed: p_anyOf(parsers, builder, unparsed)

day = anyOf([lit('Mo'), lit('Tu'), …])

apply(day, 'Mo') # returns (True, 'Mo', '')
apply(day, 'Mu') # returns (False, None, 'Mu')
  1. ‘Bottom square bracket’ U+23B5 

  2. ‘Up tack’ U+22A5