Monthly Archives: December 2015

Parsing Ill-formed Text

An eventful week back in the office after a two-week trip visiting Redmond, WA, London and Athens. Whilst many industries seem to slowly wind down to Christmas, it always seems to me a race to finish up many initiatives. I guess the good side of this is that we have a truly relaxing break and a clean slate in the new year. The trip was very productive and positive with the prospect of new business and a successful end to the first year of the FREME Project.

One activity that occupied my mind whilst travelling so much was how to build a parser for segments of text that contain combinations of ill-formed (unbalanced) HTML tags; well-known variable placeholders like “%s”; and proprietary inline tags for, well, just about anything. Once the text is parsed, the goal is to serialize it as XLIFF 1.2 <trans-unit> elements.

Yes, I know there’s Okapi and HTML Agility Pack, but these content chunks come to us from a customer’s proprietary localization format not supported by either of these frameworks. There is also ANTLR, but I’m always up for a challenge.

My solution is not revolutionary but it’s robust and only took me the combined flight times to write. It uses a recursive strategy, the Composite and Visitor design patterns, shallow parsing based on three regular expressions and a handy little library.

Using regular expressions for parsing quickly litters your code with accessors for the regex named matches:
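
A minimal sketch of the kind of accessor code this leads to. The pattern and the group names openingTag and tagName are hypothetical, chosen only for illustration; they are not the expressions used in the actual parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical pattern: matches an opening tag such as <b> and captures
    // the whole tag as "openingTag" and its element name as "tagName".
    private static readonly Regex OpeningTag =
        new Regex(@"(?<openingTag><(?<tagName>[A-Za-z][A-Za-z0-9]*)[^<>]*>)");

    public static string Describe(string input)
    {
        Match match = OpeningTag.Match(input);
        if (!match.Success)
        {
            return "no match";
        }

        // Every piece of information is reached through Groups["..."],
        // which is what quickly clutters the calling code.
        string tag = match.Groups["openingTag"].Value;
        string name = match.Groups["tagName"].Value;
        int index = match.Groups["openingTag"].Index;
        int length = match.Groups["openingTag"].Length;
        return $"{name}:{tag}@{index}+{length}";
    }
}
```

Calling NamedGroupDemo.Describe("Click <b>Restart</b>") reports the first opening tag together with its position and length in the input.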


Once you start using these within Substring() calls in order to pull out non-matched segments of the input string, it gets nasty to continue without some helpers:

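// Substring takes (startIndex, length), so the length has to be computed too.
input.Substring(firstMatch.Groups["openingTag"].Index +
    firstMatch.Groups["openingTag"].Length,
    secondMatch.Groups["openingTag"].Index -
    (firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length));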

The obvious solution was to create a MatchCollectionHelper class, carefully crafted from composed methods, starting with:

mch.TextBeforeMatch(int matchNumber);
mch.TextAfterMatch(int matchNumber);

through to:

mch.TextSegment(int i);
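
One way such a helper might be composed is sketched below. Only the three method names come from this post; the constructor, the MatchCount property, the private helpers and the exact meaning of each segment index are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch: method names follow the post, semantics are assumed.
public sealed class MatchCollectionHelper
{
    private readonly string input;
    private readonly MatchCollection matches;

    public MatchCollectionHelper(string input, Regex pattern)
    {
        this.input = input;
        this.matches = pattern.Matches(input);
    }

    public int MatchCount => matches.Count;

    // Index just past the end of match n; 0 when n is before the first match.
    private int EndOf(int n) => n < 0 ? 0 : matches[n].Index + matches[n].Length;

    // Index where match n starts; end of input when n is past the last match.
    private int StartOf(int n) => n >= matches.Count ? input.Length : matches[n].Index;

    // Unmatched text between the previous match (or start of input) and this match.
    public string TextBeforeMatch(int matchNumber) => TextSegment(matchNumber);

    // Unmatched text between this match and the next match (or end of input).
    public string TextAfterMatch(int matchNumber) => TextSegment(matchNumber + 1);

    // Segment i is the unmatched text immediately before match i;
    // segment MatchCount is the text after the last match.
    public string TextSegment(int i)
    {
        if (i < 0 || i > matches.Count)
        {
            throw new ArgumentOutOfRangeException(nameof(i));
        }

        int start = EndOf(i - 1);
        return input.Substring(start, StartOf(i) - start);
    }
}
```

With this shape, all the index arithmetic lives in one place and the calling code only ever asks for "the text before match n" or "the text after match n".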

The helper library then paved the way to brief, expressive code for the recursive parser. The parser uses one interface with three methods and a property:

interface IFragment
{
    void Parse();
    StringBuilder Print();
    bool HasChildren { get; }
    void Accept(IVisitor visitor);
}

Tags that are paired (<a …>…</a>) are capable of having children, are parsed from the outside in and have a FragmentType of PAIRED. Standalone tags have a FragmentType of STANDALONE and remaining text is FragmentType.TEXT. STANDALONE and TEXT are leaf nodes in the composition.
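
The outside-in idea can be sketched roughly as follows. This is a simplified illustration, not the actual parser: it uses a single hypothetical regular expression for HTML-like tags rather than three, a plain Fragment class in place of IFragment, and it treats any tag without a balancing partner as STANDALONE. Placeholders and proprietary inline tags are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public enum FragmentType { PAIRED, STANDALONE, TEXT }

// Simplified stand-in for the composite node (assumption, not the post's IFragment).
public sealed class Fragment
{
    public FragmentType Type;
    public string Text;   // tag name for PAIRED, raw text otherwise
    public List<Fragment> Children = new List<Fragment>();
    public bool HasChildren => Children.Count > 0;
}

public static class OutsideInParser
{
    // Hypothetical pattern: a simple opening or closing HTML-like tag.
    private static readonly Regex Tag =
        new Regex(@"<(?<close>/)?(?<name>[A-Za-z][A-Za-z0-9]*)[^<>]*>");

    public static List<Fragment> Parse(string input)
    {
        var result = new List<Fragment>();
        int pos = 0;
        while (pos < input.Length)
        {
            Match open = Tag.Match(input, pos);
            if (!open.Success) { AddText(result, input.Substring(pos)); break; }
            AddText(result, input.Substring(pos, open.Index - pos));

            // A stray closing tag can never start a pair.
            Match close = open.Groups["close"].Success ? null : FindClose(input, open);
            if (close == null)
            {
                // Unbalanced tag: keep it as a leaf node.
                result.Add(new Fragment { Type = FragmentType.STANDALONE, Text = open.Value });
                pos = open.Index + open.Length;
            }
            else
            {
                // Outermost pair first, then recurse into the text between the tags.
                var paired = new Fragment { Type = FragmentType.PAIRED, Text = open.Groups["name"].Value };
                int innerStart = open.Index + open.Length;
                paired.Children.AddRange(Parse(input.Substring(innerStart, close.Index - innerStart)));
                result.Add(paired);
                pos = close.Index + close.Length;
            }
        }
        return result;
    }

    // Finds the closing tag that balances `open`, counting nested tags of the same name.
    private static Match FindClose(string input, Match open)
    {
        string name = open.Groups["name"].Value;
        int depth = 1;
        for (Match m = Tag.Match(input, open.Index + open.Length); m.Success; m = m.NextMatch())
        {
            if (!string.Equals(m.Groups["name"].Value, name, StringComparison.OrdinalIgnoreCase)) continue;
            depth += m.Groups["close"].Success ? -1 : 1;
            if (depth == 0) return m;
        }
        return null;
    }

    private static void AddText(List<Fragment> list, string text)
    {
        if (text.Length > 0) list.Add(new Fragment { Type = FragmentType.TEXT, Text = text });
    }
}
```

Parsing "Click <b>Restart</b> now<br>" with this sketch yields a TEXT node, a PAIRED node for b with one TEXT child, another TEXT node, and a STANDALONE leaf for the unclosed <br>.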

You can then build Visitors to serialize the input as whatever you need. For example, the input below is followed by its XLIFF 1.2 serialization:

"<p>Click the <b>Restart</b> button in order to <i>re-initialize</i> the system.</p><p>You should not use the <<emergency_procedure>> if a <b>Restart</b> is needed more than %d times.</p>"

<trans-unit id="..">
    <bpt id="1">&lt;p&gt;</bpt>Click the <bpt id="3">&lt;b&gt;</bpt>Restart<ept id="3">&lt;/b&gt;</ept> button in order to <bpt id="4">&lt;i&gt;</bpt>re-initialize<ept id="4">&lt;/i&gt;</ept> the system.<ept id="1">&lt;/p&gt;</ept><bpt id="2">&lt;p&gt;</bpt>You should not use the <ph id="5" ctype="standalone">&lt;&lt;emergency_procedure&gt;&gt;</ph> if a <bpt id="6">&lt;b&gt;</bpt>Restart<ept id="6">&lt;/b&gt;</ept> is needed more than <ph id="7" ctype="">%d</ph> times.<ept id="2">&lt;/p&gt;</ept>
</trans-unit>
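
A rough sketch of how a Visitor over the composite might emit this kind of inline markup. The class names, the per-type Visit methods and the Escape helper are all assumptions for illustration. Ids here are assigned depth-first in visiting order, which differs from the numbering in the output above, and the ctype attribute is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public interface IVisitor
{
    void VisitText(TextFragment fragment);
    void VisitStandalone(StandaloneFragment fragment);
    void VisitPaired(PairedFragment fragment);
}

// Reduced to the one member this sketch needs.
public interface IFragment { void Accept(IVisitor visitor); }

public sealed class TextFragment : IFragment
{
    public readonly string Text;
    public TextFragment(string text) { Text = text; }
    public void Accept(IVisitor visitor) => visitor.VisitText(this);
}

public sealed class StandaloneFragment : IFragment
{
    public readonly string Raw;
    public StandaloneFragment(string raw) { Raw = raw; }
    public void Accept(IVisitor visitor) => visitor.VisitStandalone(this);
}

public sealed class PairedFragment : IFragment
{
    public readonly string OpenTag;
    public readonly string CloseTag;
    public readonly List<IFragment> Children = new List<IFragment>();

    public PairedFragment(string openTag, string closeTag, params IFragment[] children)
    {
        OpenTag = openTag;
        CloseTag = closeTag;
        Children.AddRange(children);
    }

    public void Accept(IVisitor visitor) => visitor.VisitPaired(this);
}

// Emits XLIFF 1.2 style inline elements: bpt/ept for pairs, ph for standalone items.
public sealed class XliffInlineVisitor : IVisitor
{
    private readonly StringBuilder output = new StringBuilder();
    private int nextId = 1;

    public string Result => output.ToString();

    // Escapes the characters that would otherwise be read as markup.
    private static string Escape(string s) =>
        s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");

    public void VisitText(TextFragment fragment) => output.Append(Escape(fragment.Text));

    public void VisitStandalone(StandaloneFragment fragment) =>
        output.Append($"<ph id=\"{nextId++}\">{Escape(fragment.Raw)}</ph>");

    public void VisitPaired(PairedFragment fragment)
    {
        int id = nextId++;   // bpt and its matching ept share one id
        output.Append($"<bpt id=\"{id}\">{Escape(fragment.OpenTag)}</bpt>");
        foreach (IFragment child in fragment.Children) child.Accept(this);
        output.Append($"<ept id=\"{id}\">{Escape(fragment.CloseTag)}</ept>");
    }
}
```

Because the serialization lives entirely in the visitor, a different output format only needs another IVisitor implementation; the fragment classes stay untouched.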

Happy Holidays and look out in the New Year for an Ocelot 2.0 announcement!