For some time I have wanted to use International Components for Unicode (ICU), specifically its boundary functions, from within a C# application. I could find two .NET projects that have started to address this, icu4net and icu-dotnet, but neither is both complete and active. So I thought I’d have a go at the integration myself.
A good while ago I looked at Uniscribe, which is Microsoft’s API for rendering complex scripts. I don’t doubt that it’s powerful, as the API looks fairly impenetrable.
ICU has a small .NET application (genicuwrapper.exe) which generates a C# class file containing a complete set of P/Invoke function signatures for the accessible functions in icuuc\d\d.dll (where \d\d stands for the two digits of the ICU version number in the DLL name).
genicuwrapper uses the Cygwin C++ pre-processor (cpp), so the first task is to install Cygwin. Cygwin has a nice (once you’re used to it) setup program. You can run it from the web and have it connect directly to, and install directly from, mirror sites. When you get to the Select Packages page of the installer you need to make sure that “gcc-core” and “gcc-g++” are checked: type “gcc” in the search textbox, expand the “devel” category and then click the circular arrows icon next to each package to mark it for installation. Once installation was complete I had to add the bin sub-folder of the Cygwin installation folder to my system path.
My first attempt, using Windows 2003 Server, ended in failure because of an “NTVDM encountered a hard error” message. The cure for this, other than requesting a hot fix from Microsoft, was to concatenate all of the ICU header files into a single header file, copy it to my Cygwin home directory and execute cpp from the bash shell.
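The workaround can be sketched roughly as follows. This is an illustration in Python rather than the original bash session: the function names, the sorted concatenation order and the separator comments are all assumptions, and because the post does not show which flags or include paths genicuwrapper passes to cpp, any such options are left to the caller.

```python
import subprocess
from pathlib import Path


def concatenate_headers(header_dir, combined_path):
    """Write every .h file found under header_dir into one combined file.

    Headers are appended in sorted path order (an assumption; the post
    does not say what order was used). Returns the list of headers used.
    """
    combined = Path(combined_path)
    # Skip the output file itself in case it lives inside header_dir.
    headers = sorted(
        p for p in Path(header_dir).rglob("*.h")
        if p.resolve() != combined.resolve()
    )
    with combined.open("w", encoding="utf-8") as out:
        for header in headers:
            out.write(f"/* ---- {header.name} ---- */\n")
            out.write(header.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return headers


def preprocess(combined_path, output_path, extra_args=()):
    """Run the C pre-processor over the combined header.

    Requires cpp on the PATH (for example from a Cygwin shell).
    extra_args is a hypothetical hook for any options you need.
    """
    subprocess.run(
        ["cpp", *extra_args, str(combined_path), "-o", str(output_path)],
        check=True,
    )
```

A typical use would be to call concatenate_headers on the folder holding the ICU headers and then preprocess on the resulting file, feeding the pre-processed output to whatever step consumes it next.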
Having got this far, the next hurdle was working out how to call the exported C functions from within an unsafe block, compile with the appropriate switches for calling into unmanaged code, and convert between C structures and string types and their CLR equivalents. This approach is also clunky in that you are not working with objects. All of it is tedious and requires a lot of reading.
Having done some further reading, I decided to go back to the method employed in icu4net: write a C++/CLI wrapper class and then use the wrapper objects and methods from C#. This still requires some understanding of type marshaling between the unmanaged code and the wrapper, but I eventually found several good resources for that: Managed C++ Wrapper For Unmanaged Code, How To: Marshal ANSI Strings Using C++ Interop, Using C++ Interop (Implicit PInvoke), and Chapter 8 of Pro .NET Performance.
What does all this effort allow me to do? Given a plain text string and a locale, I can split the string into its constituent “words” according to the Unicode Text Segmentation rules (UAX #29). It even works for Chinese, Japanese, Korean and Thai.
is segmented (using *) as