Why write a parser anyway?
A few years ago, I ended up writing an XML parser for the .NET platform.
That sounds like a massive waste of time, because .NET already has a great set of XML tools, such as XDocument. They all have a critical drawback though: they require the XML to be valid.
At the time, I was working for a translation company, building tools to enable the translation of languages. We received tens of thousands of documents every day (zips, MS Office files, HTML, XML, CSV, JSON, .po, resx, basically anything).
The process started by extracting the text and images that needed to be translated from the source format and running them through a translation process. The translated text was then put back into the source format, sometimes with other modifications, such as setting whether flows of text should read right-to-left for Arabic and Hebrew, or changing text encodings where a customer expected Shift-JIS encoded Japanese text rather than Unicode.
From this, I learned that real-world XML and HTML is only XML-like: it's missing closing tags, attribute quotes and namespace declarations; it has undefined entities in it; it contains invalid byte sequences; it doesn't follow whitespace rules; it's in a different text encoding than the XML declaration states; and it breaks the specification in ways I couldn't have imagined.
Regardless of conformance to the specifications, we needed to return translations that were as close as possible to the source document, i.e. when you compared the documents, pretty much only the text should be different, because all of those invalid bits turned out to be important. The invalid byte sequences turned out to be field delimiters in another system, or controlled the scrolling speed of a digital display; the HTML elements had to be uppercase or the embedded software displaying them would fail.
So, to do this, I needed to find a set of tools that would handle invalid XML.
First I tried the HTML and XML parsers available across programming languages, but none of them could do what I needed.
This meant biting the bullet and writing something to handle it.
I tried modifying the Mono project's XmlReader class (XmlReader.cs), but it couldn't easily handle the cases I had, like parsing HTML inside XML CDATA sections, or the invalid byte sequences.
I tried out ANTLR, which generates parser code from a grammar definition, but it was quite complicated. Once you've built a grammar definition and generated code from it, you've still got to convert the syntax tree into something higher level, e.g. an XML element or an XML attribute.
At the time I tried out ANTLR there wasn’t a Go target, but there is now. There’s an example of a JSON parser written using it over at https://github.com/shirou/antlr-grammars-v4-go/tree/master/json
During all this research into parsers, I read a few blog posts and papers on monadic parser combinators, and they looked like a good solution. Despite the complex-sounding name, the idea is simple: make 'bigger' parsers by combining 'smaller' ones.
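That core idea can be sketched in a few lines of Go. The `Result`, `Parser`, `Rune` and `Then` names below are invented for illustration, not any particular library's API:

```go
package main

import "fmt"

// Result carries what a parser matched, what input remains, and whether
// it succeeded. All names here are illustrative, not a real library's API.
type Result struct {
	Value     string
	Remaining string
	OK        bool
}

// A Parser is just a function from input to a Result.
type Parser func(input string) Result

// Rune matches a single expected character (ASCII-only, for brevity).
func Rune(r byte) Parser {
	return func(input string) Result {
		if len(input) > 0 && input[0] == r {
			return Result{Value: string(r), Remaining: input[1:], OK: true}
		}
		return Result{Remaining: input}
	}
}

// Then combines two smaller parsers into a bigger one that matches them
// in sequence: this is the whole "combinator" trick.
func Then(a, b Parser) Parser {
	return func(input string) Result {
		ra := a(input)
		if !ra.OK {
			return Result{Remaining: input}
		}
		rb := b(ra.Remaining)
		if !rb.OK {
			return Result{Remaining: input}
		}
		return Result{Value: ra.Value + rb.Value, Remaining: rb.Remaining, OK: true}
	}
}

func main() {
	ab := Then(Rune('a'), Rune('b'))
	fmt.Println(ab("abc")) // matches "ab", leaving "c"
	fmt.Println(ab("xbc")) // fails, input unconsumed
}
```

Everything else (choice, repetition, mapping results to higher-level values) follows the same pattern of wrapping parsers in parsers.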
I found Sprache, which is great, and I was able to build everything I needed over a few weeks. The style of building up small components allowed me to build suites of unit tests which tested each case very simply.
For me, the testability was the major reason to use the approach. It’s easy to write a parser for each element in your grammar, test it individually, and then test the combinations.
Since I'm using Go to do a lot of work now, I wanted a tool similar to Sprache for a few bits and pieces I'm working on, so I built one that stays as close to Sprache as possible while still looking like Go.
I've put a lot of effort into unit testing, benchmarking and reducing the number of memory allocations, and I also checked out Rob Pike's talk on Lexical Scanning in Go for inspiration as I went along: https://www.youtube.com/watch?v=HxaD_trXwRE
The code is over at https://github.com/a-h/lexical
There are a few examples at https://github.com/a-h/lexical/tree/master/examples, including a toy XML parser, but here's a simple example to read through:
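Here's a short sketch in that combinator style: a parser for a comma-separated list of digits. The `Parser` type and the `Digits`, `Literal` and `SepBy1` names are invented for this sketch rather than being the lexical package's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// Parser returns the matched text, the rest of the input, and whether it
// succeeded. Names here are illustrative, not the lexical package's API.
type Parser func(in string) (match, rest string, ok bool)

// Digits matches one or more ASCII digits.
func Digits(in string) (string, string, bool) {
	i := 0
	for i < len(in) && in[i] >= '0' && in[i] <= '9' {
		i++
	}
	return in[:i], in[i:], i > 0
}

// Literal matches an exact string.
func Literal(s string) Parser {
	return func(in string) (string, string, bool) {
		if strings.HasPrefix(in, s) {
			return s, in[len(s):], true
		}
		return "", in, false
	}
}

// SepBy1 matches one or more items separated by sep, returning the items.
func SepBy1(item, sep Parser) func(in string) ([]string, string, bool) {
	return func(in string) ([]string, string, bool) {
		first, rest, ok := item(in)
		if !ok {
			return nil, in, false
		}
		items := []string{first}
		for {
			_, afterSep, ok := sep(rest)
			if !ok {
				break
			}
			next, afterItem, ok := item(afterSep)
			if !ok {
				break
			}
			items = append(items, next)
			rest = afterItem
		}
		return items, rest, true
	}
}

func main() {
	csv := SepBy1(Digits, Literal(","))
	items, rest, ok := csv("10,20,30!")
	fmt.Println(items, rest, ok) // [10 20 30] ! true
}
```

Note how `SepBy1` knows nothing about digits or commas; it's built entirely from the two smaller parsers passed in, which is what makes each piece independently testable.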
I’ve tried to make the parsing code easy to read. As always, pull requests welcome. Hopefully the code will be useful to someone else.