Recently, I’ve been writing parsers to handle documents that look like XML but can’t be parsed by standard libraries.

For example, none of these can be parsed by .NET’s XDocument or XmlDocument classes:

<key><value>Where's the end tag for value?</key>
<p>Invalid <b><i>nesting</b></i></p>
<p>The brand of this machine is &brand; but where's the DTD which references this named entity?

Handling invalid documents is a common occurrence in the localisation industry. Our skill lies in processing them and returning them just as damaged as they arrived, since the customer often relies on the spacing, invalid characters or other oddities for their system to work correctly. A run through HTMLTidy or SgmlReader to produce balanced tags is therefore not good enough.

One thing that surprised me during this work was that the HTML5 specification explicitly allows “missing” start and end tags; these are called optional tags.  The idea is that the tags might not be in the HTML document, but the corresponding element is always present in the document object model.  I thought that the XHTML rules were easier to understand (if you open it, you close it), but people just aren’t writing HTML that way, so we might as well face reality head-on.

In the past, I’ve used a mixture of procedural code and Regular Expressions to transform input documents into an intermediate valid form, but I’m now skipping that step and using custom parsers to go direct to XLIFF.

I’ve even written an XPath-like expression engine, so that localisable areas of text can be defined within invalid documents.

Using Sprache [github] really helped with writing both the document and XPath parsers, and it’s a library I can see myself coming back to in many places where I would previously have used Regular Expressions.  The library also lends itself really well to unit testing.  Each token parser is a parser in its own right, so you can test the parsing of an HTML attribute independently of the parsing of an HTML element. That is how I was able to achieve 100% code coverage.
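
As a rough sketch of what testing one token parser independently of another can look like: the HtmlParsers class, its field names and its deliberately simplified attribute grammar below are made up for illustration and are not the parsers described in this post. Only the Sprache calls (Parse.Letter, Parse.Char, Parse.CharExcept, AtLeastOnce, Many, Text) come from the library.

```csharp
using System.Collections.Generic;
using Sprache;

static class HtmlParsers
{
    // Hypothetical, simplified grammar: letters, '=', then a double-quoted value.
    public static readonly Parser<string> AttributeName =
        Parse.Letter.AtLeastOnce().Text();

    // Everything between a pair of double quotes (the quotes themselves are dropped).
    public static readonly Parser<string> QuotedValue =
        from open in Parse.Char('"')
        from value in Parse.CharExcept('"').Many().Text()
        from close in Parse.Char('"')
        select value;

    // Built from the two smaller parsers, each of which can be tested on its own.
    public static readonly Parser<KeyValuePair<string, string>> Attribute =
        from name in AttributeName
        from eq in Parse.Char('=')
        from value in QuotedValue
        select new KeyValuePair<string, string>(name, value);
}
```

A unit test can then call HtmlParsers.QuotedValue.Parse("\"en-GB\"") without involving the attribute parser at all, and use HtmlParsers.Attribute.TryParse("lang=en-GB").WasSuccessful to check that an unquoted value is rejected by this particular sketch.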

Here’s an example of how to parse a Star Date:

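// AtLeastOnce() rather than Many(), so an empty year or day fails to parse
// instead of reaching int.Parse("") and throwing a FormatException.
static readonly Parser<DateTime> StarTrek2009_StarDate = 
  from year in Parse.Digit.AtLeastOnce().Text()
  from delimiter in Parse.Char('.')
  from dayOfYear in Parse.Digit.AtLeastOnce().Text()
  select new DateTime(int.Parse(year), 1, 1).AddDays(int.Parse(dayOfYear) - 1);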

Parsing 2259.55 with StarTrek2009_StarDate.Parse("2259.55") gives us a date of 24th February 2259 (the 55th day of 2259).

A regex equivalent is below, but the parser approach allows you to chain multiple parsers together in order of preference to convert a document into a stream of tokens.  With a regex approach, you’d have to write that sequencing logic yourself, and it’s a major source of bugs.

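// Requires: using System.Text.RegularExpressions;
private DateTime StarTrek2009_StarDate_Regex(string starDate)
{
	var regex = new Regex(@"(\d+)\.(\d{1,3})");
	var match = regex.Match(starDate);

	// Without this check, a non-matching input gives empty groups and int.Parse throws.
	if (!match.Success)
		throw new FormatException("Not a star date: " + starDate);

	var year = int.Parse(match.Groups[1].Value);
	var day = int.Parse(match.Groups[2].Value);
	
	return new DateTime(year, 1, 1).AddDays(day - 1);
}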
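
To make the "chain multiple parsers together in order of preference" point more concrete, here is a minimal sketch using Sprache’s Or combinator. The Tokenizer class, the three token kinds and the labelled-string output are all assumptions made for this example; a real document parser would return proper token types.

```csharp
using System.Collections.Generic;
using Sprache;

static class Tokenizer
{
    // Hypothetical token parsers; each returns a labelled string so results are easy to inspect.
    static readonly Parser<string> StarDate =
        from year in Parse.Digit.AtLeastOnce().Text()
        from dot in Parse.Char('.')
        from day in Parse.Digit.AtLeastOnce().Text()
        select "stardate:" + year + "." + day;

    static readonly Parser<string> Number =
        Parse.Digit.AtLeastOnce().Text().Select(n => "number:" + n);

    static readonly Parser<string> Word =
        Parse.Letter.AtLeastOnce().Text().Select(w => "word:" + w);

    // Or() tries the alternatives left to right, so the most specific parser goes first.
    // Token() skips whitespace on either side of each match.
    static readonly Parser<string> AnyToken =
        StarDate.Or(Number).Or(Word).Token();

    // Many() repeats the choice; End() insists that the whole input was consumed.
    public static readonly Parser<IEnumerable<string>> Tokens =
        AnyToken.Many().End();
}
```

With this sketch, Tokenizer.Tokens.Parse("Stardate 2259.55 warp 9") yields word:Stardate, stardate:2259.55, word:warp, number:9. The trailing 9 first fails as a star date (there is no '.'), and Or then falls back to the plain number parser, which is the bookkeeping you would otherwise write by hand around a set of regexes.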

The Sprache examples include a toy XML parser, and a calculator.

A tool which can be used in the same way is ANTLR.  It has the upside of being cross-platform, but at the cost of learning the language used to write its grammars.  You can see an example of how grammars are constructed in ANTLR in DogeSharp [github], specifically this grammar, which converts from DogeSharp to C# syntax: https://github.com/returnString/DogeSharp/blob/master/DogeSharp/DogeSharp.g4

much programming, so parsing
