Tag Archives: C#

Parsing Ill-formed Text

An eventful week back in the office after a two-week trip visiting Redmond, WA, London and Athens. Whilst many industries seem to slowly wind down towards Christmas, for us it always seems to be a race to finish up many initiatives. I guess the good side of this is that we have a truly relaxing break and a clean slate in the new year. The trip was very productive and positive, with the prospect of new business and a successful end to the first year of the FREME Project.

One activity that occupied my mind whilst travelling so much was how to build a parser for segments of text that contain combinations of ill-formed (un-balanced) HTML tags; well-known variable placeholders like “%s”; and proprietary inline tags for, well, just about anything. Having parsed the text, the next step is to serialize it as XLIFF 1.2 <trans-unit /> elements.

Yes, I know there’s Okapi and HTML Agility Pack, but these content chunks come to us from a customer’s proprietary localization format which is not supported by either of these frameworks. There is also ANTLR, but I’m always up for a challenge.

My solution is not revolutionary but it’s robust and only took me the combined flight times to write. It uses a recursive strategy, the Composite and Visitor design patterns, shallow parsing based on three regular expressions and a handy little library.

Using regular expressions for parsing quickly litters your code with accessors for the regex named matches:

 firstMatch.Groups["openingTag"].Index;
 firstMatch.Groups["openingTag"].Length;

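Once you start using these within Substring() calls in order to pull out non-matched segments of the input string, it gets nasty to continue without some helpers. Substring() takes a start index and a length (not an end index), so the offset arithmetic piles up quickly:

input.Substring(firstMatch.Groups["openingTag"].Index +
    firstMatch.Groups["openingTag"].Length,
    secondMatch.Groups["openingTag"].Index -
    (firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length));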

The obvious solution to this was to create a MatchCollectionHelper class, with some carefully crafted composed methods starting with:

mch.TextBeforeMatch(int matchNumber);
mch.TextAfterMatch(int matchNumber);

through to:

mch.TextSegment(int i).
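
A minimal sketch of what such a helper might look like is given here. It is not the actual library: the constructor, the Count property and the exact meaning given to each method (in particular that TextSegment(i) returns the unmatched text before match i, with one extra segment for the tail) are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch of a MatchCollectionHelper; method semantics are assumed.
public sealed class MatchCollectionHelper
{
    private readonly string input;
    private readonly MatchCollection matches;

    public MatchCollectionHelper(string input, Regex pattern)
    {
        this.input = input;
        this.matches = pattern.Matches(input);
    }

    public int Count => matches.Count;

    // Offset just past the last character of the given match.
    private int EndOf(int matchNumber) =>
        matches[matchNumber].Index + matches[matchNumber].Length;

    // Unmatched text between the previous match (or the start of the input)
    // and the start of the given match.
    public string TextBeforeMatch(int matchNumber)
    {
        int start = matchNumber == 0 ? 0 : EndOf(matchNumber - 1);
        return input.Substring(start, matches[matchNumber].Index - start);
    }

    // Unmatched text between the end of the given match and the next match
    // (or the end of the input).
    public string TextAfterMatch(int matchNumber)
    {
        int start = EndOf(matchNumber);
        int end = matchNumber + 1 < matches.Count
            ? matches[matchNumber + 1].Index
            : input.Length;
        return input.Substring(start, end - start);
    }

    // Segment i is the unmatched text before match i; segment Count is the tail.
    public string TextSegment(int i)
    {
        if (matches.Count == 0) return input;
        return i < matches.Count
            ? TextBeforeMatch(i)
            : TextAfterMatch(matches.Count - 1);
    }
}
```

With the input "Click <b>Restart</b> now" and the (assumed) pattern @"</?\w+>", TextBeforeMatch(0) gives "Click ", TextAfterMatch(0) gives "Restart" and TextSegment(2) gives " now". All of the index arithmetic lives in one place, which is the point of the helper.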

The helper library then paved the way to brief, expressive code for the recursive Parser. The parser uses one interface with three methods and a property:

using System.Text; // StringBuilder

interface IFragment
{
  void Parse();
  StringBuilder Print();
  bool HasChildren { get; }      // the one property
  void Accept(IVisitor visitor); // Visitor pattern entry point
}

Tags that are paired (<a …>…</a>) are capable of having children, are parsed from the outside in and have a FragmentType of PAIRED. Standalone tags have a FragmentType of STANDALONE and the remaining text is FragmentType.TEXT. STANDALONE and TEXT are leaf nodes in the composition.

You can then build Visitors to serialize the input as whatever you need. For example, the input segment below is followed by its XLIFF serialization:

"<p>Click the <b>Restart</b> button in order to <i>re-initialize</i> the system.</p><p>You should not use the <<emergency_procedure>> if a <b>Restart</b> is needed more than %d times.</p>"

<trans-unit id="..">
  <source>
    <bpt id="1">&lt;p&gt;</bpt>Click the <bpt id="3">&lt;b&gt;</bpt>Restart<ept id="3">&lt;/b&gt;</ept> button in order to <bpt id="4">&lt;i&gt;</bpt>re-initialize<ept id="4">&lt;/i&gt;</ept> the system.<ept id="1">&lt;/p&gt;</ept><bpt id="2">&lt;p&gt;</bpt>You should not use the <ph id="5" ctype="standalone">&lt;&lt;emergency_procedure&gt;&gt;</ph> if a <bpt id="6">&lt;b&gt;</bpt>Restart<ept id="6">&lt;/b&gt;</ept> is needed more than <ph id="7" ctype="">%d</ph> times.<ept id="2">&lt;/p&gt;</ept>
  </source>
</trans-unit>
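
To make the Composite and Visitor combination concrete, here is a minimal sketch of a visitor that emits bpt/ept/ph markup from an already-built fragment tree. It is an illustration rather than the parser described above: INode, TextFragment, StandaloneFragment, PairedFragment and XliffVisitor are hypothetical names, the tree is built by hand instead of being parsed, ctype attributes are omitted, and ids are handed out in simple depth-first visiting order, which differs from the numbering in the sample output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, trimmed-down node interface: only the Visitor entry point.
public interface INode { void Accept(IVisitor visitor); }

public interface IVisitor
{
    void Visit(TextFragment text);
    void Visit(StandaloneFragment tag);
    void Visit(PairedFragment tag);
}

// Leaf: plain text.
public sealed class TextFragment : INode
{
    public string Text { get; }
    public TextFragment(string text) { Text = text; }
    public void Accept(IVisitor visitor) => visitor.Visit(this);
}

// Leaf: a standalone tag or placeholder such as "%d".
public sealed class StandaloneFragment : INode
{
    public string Tag { get; }
    public StandaloneFragment(string tag) { Tag = tag; }
    public void Accept(IVisitor visitor) => visitor.Visit(this);
}

// Composite: an opening/closing tag pair with child fragments in between.
public sealed class PairedFragment : INode
{
    public string Open { get; }
    public string Close { get; }
    public List<INode> Children { get; } = new List<INode>();

    public PairedFragment(string open, string close, params INode[] children)
    {
        Open = open;
        Close = close;
        Children.AddRange(children);
    }

    public void Accept(IVisitor visitor) => visitor.Visit(this);
}

// Serializes a fragment tree as XLIFF 1.2 style inline markup.
public sealed class XliffVisitor : IVisitor
{
    private readonly StringBuilder output = new StringBuilder();
    private int nextId = 1; // ids assigned in visiting order (an assumption)

    private static string Escape(string s) =>
        s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");

    public void Visit(TextFragment text) => output.Append(Escape(text.Text));

    public void Visit(StandaloneFragment tag) =>
        output.Append($"<ph id=\"{nextId++}\">{Escape(tag.Tag)}</ph>");

    public void Visit(PairedFragment tag)
    {
        int id = nextId++; // bpt and ept of one pair share the same id
        output.Append($"<bpt id=\"{id}\">{Escape(tag.Open)}</bpt>");
        foreach (INode child in tag.Children) child.Accept(this);
        output.Append($"<ept id=\"{id}\">{Escape(tag.Close)}</ept>");
    }

    public override string ToString() => output.ToString();
}
```

Building a tree for "<p>Press <b>Restart</b> up to %d times.</p>", calling tree.Accept(visitor) and then visitor.ToString() yields one bpt/ept pair for the paragraph wrapping the text, a nested pair for the bold tag and a ph for the placeholder. A different output format only needs another IVisitor implementation; the fragment classes stay untouched.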

Happy Holidays and look out in the New Year for an Ocelot 2.0 announcement!

Using ICU From C#

For some time I have wanted to use International Components for Unicode (ICU), specifically its boundary functions, from within a C# application. There are two .NET projects that I could find which started to address this, icu4net and icu-dotnet, but neither is both complete and active. So I thought I’d have a go at the integration myself.

A good while ago I looked at Uniscribe which is Microsoft’s API for rendering complex scripts. I don’t doubt that it’s powerful as the API looks fairly impenetrable.

ICU has a small .NET application (genicuwrapper.exe) which generates a C# class file which contains a complete set of P/Invoke function signatures for the accessible functions in icuuc\d\d.dll.

genicuwrapper uses the Cygwin C++ pre-processor (cpp), so the first task is to install Cygwin. Cygwin does have a nice (once you’re used to it) setup program. You can run it from the web and have it connect directly to, and install directly from, mirror sites. When you get to the Select Packages page of the installer you need to make sure that “gcc-core” and “gcc-g++” are checked: type “gcc” in the search textbox, expand the “devel” category and then click the circular arrows icon to set them to selected for installation. After installation was complete I then had to add the bin sub-folder of the Cygwin installation folder to my system path.

My first attempt using Windows 2003 Server ended in failure because of a “NTVDM encountered a hard error”. The cure for this – other than requesting a hot fix from Microsoft – was to concatenate all of the ICU header files into a single header file, copy it to my cygwin home directory and execute cpp from the bash shell.

Having got this far, the next hurdle was working out how to call the externalized C functions from within an unsafe block, compile with the appropriate switches for calling into unmanaged code, and convert between C structures and string types and their CLR equivalents. The approach is also clunky because you are not working with objects. All tedious, and requiring much reading.

Having done some further reading I decided to go back to the method employed in icu4net: write a C++/CLI wrapper class and then use the wrapper objects and methods from C#. This still requires some understanding of type marshaling between the unmanaged code and the wrapper, but I eventually found several good resources for that: Managed C++ Wrapper For Unmanaged Code, How To: Marshal ANSI Strings Using C++ Interop, Using C++ Interop (Implicit PInvoke), and Chapter 8 of Pro .NET Performance.

What does all this effort allow me to do? Given a plain text string and a locale I can split the string into constituent “words” according to the Unicode Text Segmentation rules. It even works for Chinese, Japanese, Korean and Thai.

我和伊朗伊斯蘭共和國建立了一種牢不可破的關係,因為伊朗真的是我的第二故鄉。

is segmented (using *) as

我*和*伊朗*伊斯蘭*共和國*建立*了*一種*牢不可破*的*關係*,*因為*伊朗*真的是*我的*第二*故鄉*。
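
An ICU break iterator reports boundaries as offsets into the UTF-16 text, so producing the starred form above is a small post-processing step. The sketch below shows only that step and makes no call into ICU: SegmentPrinter and MarkBoundaries are hypothetical names, and the boundary offsets are assumed to have been collected from the wrapper already.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SegmentPrinter
{
    // Inserts a marker at every interior boundary offset.
    // boundaries: ascending UTF-16 offsets; 0 and text.Length are ignored.
    public static string MarkBoundaries(string text, IEnumerable<int> boundaries, char marker = '*')
    {
        var sb = new StringBuilder();
        int previous = 0;
        foreach (int boundary in boundaries)
        {
            // Skip the start and end of the text and any repeated offset.
            if (boundary <= previous || boundary >= text.Length) continue;
            sb.Append(text, previous, boundary - previous).Append(marker);
            previous = boundary;
        }
        sb.Append(text, previous, text.Length - previous); // final segment
        return sb.ToString();
    }
}
```

For example, SegmentPrinter.MarkBoundaries("我和伊朗", new[] { 0, 1, 2, 4 }) returns "我*和*伊朗".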