An eventful week back in the office after a two-week trip visiting Redmond, WA, London and Athens. Whilst many industries seem to slowly wind down to Christmas, to me it always seems a race to finish up many initiatives. I guess the good side of this is that we have a truly relaxing break and a clean slate in the new year. The trip was very productive and positive with the prospect of new business and a successful end to the first year of the FREME Project.
One activity that occupied my mind whilst travelling so much was how to build a parser for segments of text that contain combinations of ill-formed (unbalanced) HTML tags; well-known variable placeholders like “%s”; and proprietary inline tags for, well, just about anything. Having parsed the text, the aim was then to serialize it as XLIFF 1.2 <trans-unit /> elements.
Yes, I know there’s Okapi and HTML Agility Pack but these content chunks come to us from a customer’s proprietary localization format not supported by either of these frameworks. There is also Antlr but I’m always up for a challenge.
My solution is not revolutionary but it’s robust and only took me the combined flight times to write. It uses a recursive strategy, the Composite and Visitor design patterns, shallow parsing based on three regular expressions and a handy little library.
Using regular expressions for parsing quickly litters your code with accessors for the regex named matches:
int index = firstMatch.Groups["openingTag"].Index; int length = firstMatch.Groups["openingTag"].Length;
Once you start using these within Substring( ) methods in order to pull out non-matched segments of the input string it gets nasty to continue without some helpers:
input.Substring(firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length, secondMatch.Groups["openingTag"].Index - (firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length));
The obvious solution was to create a MatchCollectionHelper class, with some carefully crafted composed methods starting with:
mch.TextBeforeMatch(int matchNumber); mch.TextAfterMatch(int matchNumber);
through to:
mch.TextSegment(int i).
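A minimal sketch of what such a helper could look like. The method names come from the post, but the constructor, the `Count` property and the exact meaning of each method (which stretch of unmatched text it returns) are assumptions made for illustration, not the actual library:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps the input string and its regex matches so that
// callers never have to do Index/Length arithmetic themselves.
public class MatchCollectionHelper
{
    private readonly string input;
    private readonly MatchCollection matches;

    public MatchCollectionHelper(string input, Regex regex)
    {
        this.input = input;
        this.matches = regex.Matches(input);
    }

    public int Count => matches.Count;

    // Offset just past the end of match n.
    private int EndOf(int n) => matches[n].Index + matches[n].Length;

    // Text between the previous match (or the start of the input) and match n.
    public string TextBeforeMatch(int matchNumber)
    {
        int start = matchNumber == 0 ? 0 : EndOf(matchNumber - 1);
        return input.Substring(start, matches[matchNumber].Index - start);
    }

    // Text between match n and the next match (or the end of the input).
    public string TextAfterMatch(int matchNumber)
    {
        int start = EndOf(matchNumber);
        int end = matchNumber == matches.Count - 1
            ? input.Length
            : matches[matchNumber + 1].Index;
        return input.Substring(start, end - start);
    }

    // i-th unmatched segment: segment 0 precedes the first match,
    // segment i (i > 0) follows match i - 1.
    public string TextSegment(int i)
    {
        if (matches.Count == 0) return input;
        return i == 0 ? TextBeforeMatch(0) : TextAfterMatch(i - 1);
    }
}
```

With the input `"Click <b>Restart</b> now"` and a simple tag regex such as `</?\w+>`, `TextSegment(0)` would be `"Click "`, `TextSegment(1)` would be `"Restart"` and `TextSegment(2)` would be `" now"`.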
The helper library then paved the way to brief, expressive code for the recursive Parser. The parser uses one interface with three methods and a property:
interface IFragment { void Parse(); StringBuilder Print(); bool HasChildren { get; } void Accept(IVisitor visitor); }
Tags that are paired (<a …>…</a>) are capable of having children, are parsed outside-in and have a FragmentType of PAIRED. Standalone tags have a FragmentType of STANDALONE and remaining text is FragmentType.TEXT. STANDALONE and TEXT are leaf nodes in the composition.
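To make the composition concrete, here is a rough sketch of the tree shape described above. It is deliberately simplified: a single hypothetical `Fragment` class stands in for the `IFragment` implementations, parsing and the visitor hook are left out, and the member names `Content`, `ClosingTag`, `Children` and `Add` are assumptions for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public enum FragmentType { PAIRED, STANDALONE, TEXT }

// Hypothetical composite node: PAIRED fragments hold children,
// STANDALONE and TEXT fragments are leaves.
public class Fragment
{
    public FragmentType Type { get; }
    public string Content { get; }      // text, standalone tag, or opening tag
    public string ClosingTag { get; }   // empty unless Type is PAIRED
    public List<Fragment> Children { get; } = new List<Fragment>();

    public bool HasChildren => Children.Count > 0;

    public Fragment(FragmentType type, string content, string closingTag = "")
    {
        Type = type;
        Content = content;
        ClosingTag = closingTag;
    }

    // Only paired tags may contain other fragments.
    public Fragment Add(Fragment child)
    {
        if (Type != FragmentType.PAIRED)
            throw new InvalidOperationException("Only PAIRED fragments can have children.");
        Children.Add(child);
        return this;
    }

    // Recursively rebuilds the original markup: opening tag, children, closing tag.
    public StringBuilder Print()
    {
        var sb = new StringBuilder(Content);
        foreach (var child in Children)
            sb.Append(child.Print().ToString());
        return sb.Append(ClosingTag);
    }
}
```

Building a `<p>` fragment that contains text, a nested `<b>` fragment and more text, then calling `Print()`, should round-trip back to the original markup, which is a handy sanity check for any parser that produces such a tree.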
You can then build Visitors to serialize the input as whatever you need:
"<p>Click the <b>Restart</b> button in order to <i>re-initialize</i> the system.</p><p>You should not use the <<emergency_procedure>> if a <b>Restart</b> is needed more than %d times.</p>" <trans-unit id=".."> <source> <bpt id="1"><p></bpt>Click the <bpt id="3"><b></bpt>Restart<ept id="3"></b></ept> button in order to <bpt id="4"><i><bpt>re-initialize<ept id="4"></i></ept> the system.<ept id="1"></p></ept><bpt id="2"><p></bpt>You should not use the <ph id="5" ctype="standalone"><&;t;emergency_procedure>></ph> if a <bpt id="6"><b></bpt>Restart<ept id="6"></b></ept> is needed more than <ph id="7" ctype="">%d</ph> times.<ept id="2"></p></ept> </source> </trans-unit>
Happy Holidays and look out in the New Year for an Ocelot 2.0 announcement!