← Home

lexical - a library for parsing streams in Go with parser combinators

Why write a parser anyway?

A few years ago, I ended up writing an XML parser for the .NET platform.

That sounds like a massive waste of time because .NET already has a great set of XML tools, XmlDocument, XmlStreamReader, and XDocument. They all have a critical drawback though - they require the XML to be valid.

At the the time, I was working for a translation company, creating tools to enable the translation of languages - receiving tens of thousands of documents every day (zips, MS Office files, HTML, XML, CSV, JSON, .po, resx, basically anything).

The process started by extracting text and images which needed to be translated from the source format, running through a process, then putting the translated text back into the source format, sometimes with other modifications, like setting whether flows of text should read right-to-left for Arabic and Hebrew text, or changing text encodings where a customer is expecting Shift-JIS encoded Japanese text rather than Unicode.

From this, I learned that real world XML and HTML is XML-like, but it’s missing closing tags, attribute quotes, namespace declarations, has undefined entities in it, contains invalid byte sequences, doesn’t follow whitespace rules, is in a different text encoding than the XML declaration states and doesn’t follow the specification in ways I couldn’t have imagined.

Regardless of the conformance to the specifications, we needed to return translations which were as close as possible to the source document, i.e. when you compared the documents, pretty much only the text should be different since all of those invalid bits turned out to be important.

The invalid byte sequences turn out to be field delimiters in another system or controlled the scrolling speed of a digital display, the HTML elements had to be uppercase or the embedded software which was displaying it would fail.

So, to do this, I needed to find a set of tools that would handle invalid XML.

First I tried all of the HTML and XML parsers in each programming language, but none of them could do what I needed.

This meant biting the bullet and writing something to handle it.

I tried modifying the Mono project’s XmlReader class XmlReader.cs [0] but it wouldn’t easily handle the cases I had like parsing HTML inside XML CDATA sections, or the invalid byte sequences.

I tried out ANTLR [1], which generates code from a grammar definition, but it was quite complicated. Once you’ve built a grammar definition and generated code from that, you’ve still got to convert the syntax tree into something higher level, e.g. an XML Element, or XML Attribute.

At the time I tried out ANTLR there wasn’t a Go target, but there is now. There’s an example of a JSON parser written using it over at [2]

During all this research into parsers, I read a few blog posts and papers on monadic parser combinators, and they looked like a good solution. Despite their complex-sounding name, it’s a basic idea of making ‘bigger’ parsers by combining ‘smaller’ ones.

I found Sprache [3], which is great and was able to build everything I needed over a few weeks. The style of building up small components allowed me to build suites of unit tests which would test each case very simply.

For me, the testability was the major reason to use the approach. It’s easy to write a parser for each element in your grammar, test it individually, and then test the combinations.

Using Go

Since I’m using Go to do a lot of work now, I wanted a similar tool to Sprache [4] for a few bits and pieces I’m working on so I built one that was as close to Sprache as I could without it not looking like Go any more.

I’ve put a lot of effort into unit testing, benchmarking and reducing the number of memory allocations and was also checking out Rob Pike’s talk on Lexical Scanning in Go for inspiration as I went along: [5]

The code is over at [6]

There’s a few examples at [7] including a toy XML parser, but this is a simple example to read through:

package main

import (
	"fmt"
	"reflect"
	"time"

	"github.com/a-h/lexical/input"
	"github.com/a-h/lexical/parse"
)

func main() {
	ip := input.NewFromString("2001-02-03")
	result := date(ip)
	if !result.Success {
		fmt.Println("failed to parse date")
		return
	}
	actual, ok := result.Item.(time.Time)
	if !ok {
		fmt.Printf("expected to receive a time.Time value back, but got %v", reflect.TypeOf(result.Item).Name())
		return
	}

	const hour, minute, sec, nsec = 0, 0, 0, 0
	expected := time.Date(2001, time.February, 3, hour, minute, sec, nsec, time.UTC)
	if !actual.Equal(expected) {
		fmt.Printf("expected %v, but got %v\n", expected, actual)
	} else {
		fmt.Println("parsed 2001-02-03 into the correct date")
	}
}

// parse.All parses a sequence of other parsers and returns success if all of them match.
// In this case, we take a year, month and day parser, separated by parsers which check for a hyhpen.
// The results of each are passed to the dateConverter function which combines them into a time.
var date = parse.All(dateConverter, year, parse.Rune('-'), month, parse.Rune('-'), day)

// parse.Many parses between min and max (4 and 4 in this case) of the parser. Since we're using parse.ZeroToNine,
// This will parse 0000-9999.
const yearDigitsMin, yearDigitsMax = 4, 4
var year = parse.Many(parse.WithIntegerCombiner, yearDigitsMin, yearDigitsMax, parse.ZeroToNine)

// Since there's only 12 months in the year, the first month digit can _only_ be 0 or 1.
var month = parse.All(parse.WithIntegerCombiner, parse.RuneIn("01"), parse.ZeroToNine)

// Since there's only 31 days in a month at most, the first day digit can _only_ be 0, 1, 2 or 3.
var day = parse.All(parse.WithIntegerCombiner, parse.RuneIn("0123"), parse.ZeroToNine)

// The date converter receives all of the parsed items from other parses and combines them into an output type.
// In this case, it gets the year, month and day integers from the year, month and day parsers.
// It's job is to take those and make a time.Time type from it.
func dateConverter(items []interface{}) (item interface{}, value bool) {
	yi, mi, di := items[0], items[2], items[4]
	y, ok := yi.(int)
	if !ok {
		return 0, false
	}
	m, ok := mi.(int)
	if !ok {
		return 0, false
	}
	d, ok := di.(int)
	if !ok {
		return 0, false
	}
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC), true
}

I’ve tried to make the parsing code easy to read. As always, pull requests welcome. Hopefully the code will be useful to someone else.