The Devil is in the Detail

I really enjoyed attending Unicode 41 this week. Following changes to my role some years back and the fact that the conference is always held on the West Coast of the US, I hadn’t been in a while but I will definitely put it back on my conference agenda. It was great bumping into customers and old friends and seeing the new generation of researchers and engineers address what is essentially the challenge of worldwide communications.

The conference kicked off with a very interesting, entertaining and thought-provoking keynote entitled “Can We Escape Alphabetic Order”, given by Thomas S. Mullaney.

The remainder of the conference sessions I attended covered: predictive models used by Google in their Android keyboards; dynamic translation resource bundles developed by Uber for their mobile apps; enhancements to ICU (International Components for Unicode); Netflix’s approaches to bi-directional and vertical subtitles and captions; JavaScript libraries for internationalization; support for Emoji in Unicode; and NLP techniques for identifying fraudulent names across many languages.

It is quite incredible the degree to which companies are enabling and adapting their products in order to have them accepted in target regions. I’m not talking about translations and number formats here: it’s about supporting all writing directions, accurate and detailed rendering of complex scripts and perfect fluency in generated messages that involve levels of plurality, gender, formality and style. And the open and collaborative nature of the efforts to document this information in the form of the Common Locale Data Repository is commendable.

Using ICU From C#

For some time I have wanted to use International Components for Unicode (ICU), specifically its boundary functions, from within a C# application. I could find two .NET projects that started to address this, icu4net and icu-dotnet, but neither is both complete and active. So I thought I’d have a go at the integration myself.

A good while ago I looked at Uniscribe, which is Microsoft’s API for rendering complex scripts. I don’t doubt that it’s powerful, as the API looks fairly impenetrable.

ICU has a small .NET application (genicuwrapper.exe) that generates a C# class file containing a complete set of P/Invoke function signatures for the accessible functions in icuucNN.dll (where NN is the two-digit ICU version number).

genicuwrapper uses the Cygwin C pre-processor (cpp), so the first task is to install Cygwin. Cygwin has a nice (once you’re used to it) setup program: you can run it from the web and have it connect directly to, and install directly from, mirror sites. When you get to the Select Packages page of the installer, make sure that “gcc-core” and “gcc-g++” are checked: type “gcc” in the search textbox, expand the “devel” category and click the circular arrows icon next to each package to mark it for installation. After installation was complete I had to add the bin sub-folder of the Cygwin installation folder to my system path.

My first attempt, on Windows Server 2003, ended in failure with an “NTVDM encountered a hard error” message. The cure for this – other than requesting a hot fix from Microsoft – was to concatenate all of the ICU header files into a single header file, copy it to my Cygwin home directory and execute cpp from the bash shell.

Having got this far, the next hurdle was working out how to call the externalized C functions from within an unsafe block, compile with the appropriate switches for calling into unmanaged code, and convert between C structures and string types and their CLR equivalents. It is also clunky in that you are not working with objects. All tedious and requiring of much reading.

Having done some further reading, I decided to go back to the method employed in icu4net: write a C++/CLI wrapper class and then use the wrapper objects and methods from C#. This still requires some understanding of type marshaling between the unmanaged code and the wrapper, but I eventually found several good resources for that: Managed C++ Wrapper For Unmanaged Code, How To: Marshal ANSI Strings Using C++ Interop, Using C++ Interop (Implicit PInvoke), and Chapter 8 of Pro .NET Performance.

What does all this effort allow me to do? Given a plain text string and a locale I can split the string into constituent “words” according to the Unicode Text Segmentation rules. It even works for Chinese, Japanese, Korean and Thai.

我和伊朗伊斯蘭共和國建立了一種牢不可破的關係,因為伊朗真的是我的第二故鄉。

is segmented (using * to mark each boundary) as

我*和*伊朗*伊斯蘭*共和國*建立*了*一種*牢不可破*的*關係*,*因為*伊朗*真的是*我的*第二*故鄉*。