Exploring JLIFF

I have published a web application where you can submit XLIFF 2.x files and get back a JLIFF serialization as text.

JLIFF is a new XLIFF serialization format currently in working draft with the OASIS XLIFF Object Model and Other Serializations (XLIFF OMOS) Technical Committee.

The application uses my open source JliffGraphTools library.

I am working on a conversion of XLIFF 1.2 to JLIFF but, as the content model is structurally different, it’s tricky.
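
For illustration, a minimal JLIFF document has roughly the following shape (hand-written here; the property names approximate the current working draft and may change as the TC's work progresses):

{
  "jliff": "2.1",
  "srcLang": "en-US",
  "trgLang": "it-IT",
  "files": [ {
    "id": "f1",
    "kind": "file",
    "subfiles": [ {
      "id": "u1",
      "kind": "unit",
      "subunits": [ {
        "kind": "segment",
        "source": [ { "kind": "text", "text": "Hello World!" } ],
        "target": [ { "kind": "text", "text": "Ciao Mondo!" } ]
      } ]
    } ]
  } ]
}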

I was careful to implement it in a way that means no data is persisted. I don’t even collect any metadata about what is submitted. That way people can feel confident about the privacy of their data.

What did it all mean?

I gave two presentations at the SEMANTiCS 2016 conference in Leipzig last week. Both were related to the H2020 FREME Project, in which I have been a participant. The first was on the e-Internationalization service, to which we have contributed significantly. The second (containing contributions from Felix Sasaki) was on the use of standards (de facto and ratified) within the FREME e-services in general and our business case implementation in particular.

This was my third attendance at the conference and it once again contained interesting and inspiring presentations, ideas and use cases around linked data.

I sometimes return from these types of conferences, full of innovation and enthusiasm for applying new ideas, to the day-to-day operations of work, and become discouraged by the inertia around change and the race to the bottom on price. It is almost impossible to innovate in such an atmosphere. We have looked at the application of machine learning, text classification and various natural language processing algorithms and, whilst people may acknowledge that the ideas are good, no-one wants to pilot or evaluate them, let alone pay for them.

Anyhow, I remain inspired by the fields of NLP, Linked Data, Deep Learning and semantic networks, and maybe my day will come.

Inline Markup

Even deciding how to title this post was not straightforward. In many situations what I am writing about is not even visible to content consumers. The topic is what can be referred to as placeholders, inline markup, inline tags, format specifiers, placeables, replacement variables…

I’ve spent a lot of time with them recently. They’re a good idea with a solid use case. They generally follow internationalization guidelines by separating region- and culture-dependent information, typically provided by an operating system or platform, from the textual content which is passed to translators for adaptation. They can also be used to represent an item of structure or unalterable content which you simply want to protect from accidental modification during localization.

Examples are:

  • “Dear %s, your calendar appointment for %s will start in %d minutes.” (resource file)
  • “The all new [product-name] is amazing! [branded-tag-line-format text="Ultra Super Gizmo Thingydoodle"] Get it now!” (proprietary content format)
  • “<p>The Stranglers single <a title="Golden Brown" href="https://en.wikipedia.org/wiki/Golden_Brown">Golden Brown</a> was amongst their best.</p>” (HTML)
  • “<source>All I want is some text where a <bpt id="1">&lt;strong&gt;</bpt>word<ept id="1">&lt;/strong&gt;</ept> is in bold.</source>” (XLIFF 1.2)
  • “Select File->Edit->Search & Replace” (potential confusion)

They can be challenging because there are so many ways in which to represent them (sometimes depending upon the host content type), and they can be complex: the placeholder may itself contain content which should be translated by a human. They have to be considered during authoring, human and machine translation, file conversion and quality assurance.

Support for handling these within well-known applications and file formats is comprehensive and stable. Outside of these, however, you are on your own and will struggle without a good grasp of a document object model and an appreciation of parsing and regular expressions.
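
As a trivial illustration (my own sketch, not taken from any particular framework), a regular expression can locate printf-style placeholders so that a tool can protect them:

using System;
using System.Text.RegularExpressions;

// Sketch only: find printf-style placeholders such as %s and %d so they
// can be protected (e.g. converted to inline tags) before translation.
var placeholder = new Regex(@"%[sd]");
var source = "Dear %s, your calendar appointment for %s will start in %d minutes.";

foreach (Match m in placeholder.Matches(source))
{
    Console.WriteLine($"Protect '{m.Value}' at index {m.Index}");
}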

One item of great news is that it looks like inline markup will be supported within the Change Tracking module of the upcoming XLIFF 2.1.

As always the devil can be in the detail.

Ocelot 2.1

We released Ocelot 2.1 today at http://bit.ly/296H0J4. It includes bidirectional language support, an awesome new fully configurable ITS LQI grid with intuitive hot keys, saving of notes in XLIFF 1.2, and search and replace.

Machine Translation Pipeline Meets Business Trends

This week we will carry out final integration and deployment tests on our distributed pipeline for large-scale and continuous translation scenarios that heavily leverage the power of machine translation.

We built this platform in response to the demands and trends being reported by industry experts like Common Sense Advisory and Kantan.

The platform features several configurable services that can be switched on as required. These include:

  • automated source pre-editing prior to passing to a choice of custom machine translation engines;
  • integrated pre-MT translation memory leverage;
  • automated post-edit of raw machine translation prior to human post-edit;
  • in-process, low-friction capture of actionable feedback on MT output from humans;
  • automated post-processing of human post-edit;
  • automated capture of edit distance data for BI and reporting.
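
To make the chaining idea concrete, here is a minimal sketch (invented types and names, not our production code) of how such switchable stages can compose:

using System.Collections.Generic;

// Hypothetical sketch: each configurable service is a stage that transforms
// a segment; services are "switched on" by adding them to the pipeline.
interface IPipelineStage
{
    Segment Process(Segment segment);
}

record Segment(string Source, string Target);

class Pipeline
{
    private readonly List<IPipelineStage> stages = new();

    public Pipeline Add(IPipelineStage stage)
    {
        stages.Add(stage);
        return this;
    }

    public Segment Run(Segment segment)
    {
        // Each enabled service transforms the segment in turn.
        foreach (var stage in stages)
            segment = stage.Process(segment);
        return segment;
    }
}

A source pre-edit stage, a TM leverage stage and an automated post-edit stage would then each implement IPipelineStage and be added in the required order.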

The only components still to be integrated, during May, are the text analysis and text classification algorithms which will give us the ability to do automated quality assurance of every single segment. Yes, every segment – no spot-checking or limited-scope audits.

The platform is distributed and utilises industry-standard formats including XLIFF and ITS, making it wholly scalable and extensible. Of note is that it delivers on all six of the trends recently publicised by Kantan. Thanks to Olga O’Laoghaire, who made significant contributions to the post-editing components, and to Ferenc Dobi, lead architect and developer.

I’m very excited to see the fruition of this project. It doesn’t just represent the ability for us to generate millions of words of translated content; it delivers a controlled environment in which we can apply state-of-the-art techniques that are highly optimised at every stage, measurable, and designed to target the goal of fully automated usable translation (FAUT).

Positive Thoughts for Blue Monday

Just before Christmas I joined the OASIS XLIFF Object Model and Other Serializations Technical Committee. I think it reflects the maturity and degree of adoption of XLIFF that this TC has been convened. It’s another opportunity to work with some technical thought leaders of the localization industry.

On Wednesday 13th I attended the launch of the ADAPT Centre in the Science Gallery at Trinity College. ADAPT is the evolution of the Centre for Next Generation Localisation (CNGL) CSET into a national Research Centre. Vistatec was a founding industrial partner of CNGL back in 2007 and I’m happy to continue our collaboration on topics such as machine translation, natural language processing, analytics, digital media and personalization. Unexpectedly but happily, I was interviewed by RTE News and appeared on national television.

Like millions of people, I am saddened at the passing of David Bowie and Alan Rickman. Kudos to Bowie for releasing Blackstar and bequeathing such unique and thought-provoking art. The positive angle? The lesson to live and appreciate life to the full.

To a large extent my development plans for Q1 were approved. This includes extending the deployment of SkyNet to other accounts within Vistatec.

On January 11th we released Ocelot 2.0.

My Italian is coming along. Slowly but surely. We have a number of Italian bistros and ristoranti in the town where I live, so I have every opportunity to try it out.

On the coding front I’ve been looking at OAuth2, OpenID and ASP.NET MVC 6. I continue to be impressed by Microsoft’s transformation from a “not invented here” company to one that both embraces and significantly contributes to open source.

Onward and Upward!

Ocelot 2.0

I am pleased and excited to announce the release of Ocelot 2.0 available as source code and binaries. Special thanks go to Kevin Lew, Marta Borriello and Chase Tingley who were the main engineers of this release.

The new features are:

  1. Support for XLIFF 2.0 Core;
  2. Translation Memory and Concordance lookup;
  3. Save XLIFF document as TMX;
  4. Language Quality Issue Grid;
  5. Set the fonts used for source and target content;
  6. Support for notes in XLIFF 2.0;
  7. Serialization of Original Target in XLIFF 2.0 using the Change Tracking Module.

XLIFF 2.0

Ocelot now supports XLIFF 2.0 documents. It still supports XLIFF 1.2 documents and auto-detects the XLIFF version of the document being opened.

Translation Memory and Concordance lookup

A new window above the main editing grid now displays two tabs for Translation Memory lookup matches and Concordance Search results. If it is not visible, clicking on the splitter bar just below the font selection controls under the menu bar should reveal it.

Ocelot works with Translation Memory Exchange (TMX) files. The View->Configure TM menu option opens the TM Configuration dialog where you can specify which translation memories you want to use (in a fallback sequence), the penalty (if any) to be applied to each TM, the maximum number of results to display, and the minimum match threshold which matches must satisfy in order to be shown.

We have also added the ability to save the document as TMX.
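
For readers unfamiliar with the format, a TMX file has roughly this shape (a hand-written, abridged illustration, not Ocelot’s exact output):

<tmx version="1.4">
  <header creationtool="Ocelot" creationtoolversion="2.0" segtype="sentence"
          o-tmf="unknown" adminlang="en-US" srclang="en-US" datatype="plaintext" />
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Hello World!</seg></tuv>
      <tuv xml:lang="it-IT"><seg>Ciao Mondo!</seg></tuv>
    </tu>
  </body>
</tmx>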

Language Quality Issue Grid

Adding Internationalization Tag Set 2.0 Language Quality Issue metadata was cumbersome, even using the “Quick Add” mechanism of release 1 that could be configured in the rule.properties file. The LQI Grid reduces this to a one-click or one-hot-key operation (excluding any comments you want to add).

The grid is customizable graphically allowing a matrix of issue severities (columns) and a user defined selection of types (rows) to be configured along with hot keys for every combination. Clicking any cell in the grid or typing its associated hotkey sequence will add the required Localization Quality Issue metadata. For example, clicking the cell at the intersection of a “Major” error severity column and “style” error category row will add an <its:locQualityIssue locQualityIssueSeverity="..." locQualityIssueType="style" /> entry to the relevant segment.

Source and Target Fonts

Just below the menu bar are two combo boxes which allow you to set the font family and size used for the source and target content.

XLIFF 2.0 Notes

On the View menu, the Configure Columns option allows you to display a Notes column. Text entered into cells in this column will be serialized as XLIFF <note /> elements.
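
For example (abridged and hand-written here for illustration), a note on a unit serializes like this:

<unit id="1">
  <notes>
    <note>Keep the product name in English.</note>
  </notes>
  <segment>
    <source>...</source>
  </segment>
</unit>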

Serialization of Original Target

Ocelot now captures and serializes the original target text, if modified, as a tracked change using the XLIFF 2.0 Change Tracking module. One limitation, which we hope to address as part of XLIFF 2.1, is that only the text (and no inline markup) is saved.
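
The shape of that metadata is roughly as follows (an abridged, hand-written illustration; see the Change Tracking module specification for the definitive form):

<unit id="1">
  <ctr:changeTrack xmlns:ctr="urn:oasis:names:tc:xliff:changetracking:2.0">
    <ctr:revisions appliesTo="target">
      <ctr:revision>
        <ctr:item property="content">The original target text before editing</ctr:item>
      </ctr:revision>
    </ctr:revisions>
  </ctr:changeTrack>
  <segment>
    <source>...</source>
    <target>...</target>
  </segment>
</unit>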

I hope that these enhancements are useful.

Parsing Ill-formed Text

An eventful week back in the office after a two-week trip visiting Redmond, WA, London and Athens. Whilst many industries seem to wind down slowly towards Christmas, for me it always seems to be a race to finish up many initiatives. I guess the good side of this is that we get a truly relaxing break and a clean slate in the new year. The trip was very productive and positive, with the prospect of new business and a successful end to the first year of the FREME Project.

One activity that occupied my mind whilst travelling so much was how to build a parser for segments of text that contain combinations of ill-formed (unbalanced) HTML tags; well-known variable placeholders like “%s”; and proprietary inline tags for, well, just about anything. Having parsed the text, the goal was then to serialize it as XLIFF 1.2 <trans-unit /> elements.

Yes, I know there’s Okapi and HTML Agility Pack, but these content chunks come to us in a customer’s proprietary localization format not supported by either of those frameworks. There is also Antlr, but I’m always up for a challenge.

My solution is not revolutionary but it’s robust and only took me the combined flight times to write. It uses a recursive strategy, the Composite and Visitor design patterns, shallow parsing based on three regular expressions, and a handy little helper library.

Using regular expressions for parsing quickly litters your code with accessors for the regex named groups:

 firstMatch.Groups["openingTag"].Index;
 firstMatch.Groups["openingTag"].Length;

Once you start using these within Substring() calls in order to pull out non-matched segments of the input string, it gets nasty to continue without some helpers:

input.Substring(firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length,
    secondMatch.Groups["openingTag"].Index -
    (firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length));

The obvious solution to this was to create a MatchCollectionHelper class, with some careful crafting of composed methods starting with:

mch.TextBeforeMatch(int matchNumber);
mch.TextAfterMatch(int matchNumber);

through to:

mch.TextSegment(int i).
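
The method names above are from my library; the implementation below is a minimal sketch of what such a helper can look like:

using System.Text.RegularExpressions;

// Minimal sketch: wraps the input string and its MatchCollection and
// exposes the unmatched text segments that lie between the matches.
class MatchCollectionHelper
{
    private readonly string input;
    private readonly MatchCollection matches;

    public MatchCollectionHelper(string input, MatchCollection matches)
    {
        this.input = input;
        this.matches = matches;
    }

    // Unmatched text between the end of the previous match and the start of match i.
    public string TextBeforeMatch(int i)
    {
        int start = i > 0 ? matches[i - 1].Index + matches[i - 1].Length : 0;
        return input.Substring(start, matches[i].Index - start);
    }

    // Unmatched text between the end of match i and the start of the next match.
    public string TextAfterMatch(int i)
    {
        int start = matches[i].Index + matches[i].Length;
        int end = i + 1 < matches.Count ? matches[i + 1].Index : input.Length;
        return input.Substring(start, end - start);
    }

    // Unmatched segment i, counting the text before the first match as segment 0.
    public string TextSegment(int i) =>
        i == 0 ? TextBeforeMatch(0) : TextAfterMatch(i - 1);
}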

The helper library then paved the way to brief, expressive code for the recursive parser. The parser uses one interface with three methods and a property:

interface IFragment
{
  void Parse();
  StringBuilder Print();
  bool HasChildren { get; }
  void Accept(IVisitor visitor);
}

Tags that are paired (<a …>…</a>) are capable of having children, are parsed outside-in and have a FragmentType of PAIRED. Standalone tags have a FragmentType of STANDALONE and remaining text is FragmentType.TEXT. STANDALONE and TEXT are leaf nodes in the composition.

You can then build Visitors to serialize the input as whatever you need:

"<p>Click the <b>Restart</b> button in order to <i>re-initialize</i> the system.</p><p>You should not use the <<emergency_procedure>> if a <b>Restart</b> is needed more than %d times.</p>"

<trans-unit id="..">
  <source>
    <bpt id="1">&lt;p&gt;</bpt>Click the <bpt id="3">&lt;b&gt;</bpt>Restart<ept id="3">&lt;/b&gt;</ept> button in order to <bpt id="4">&lt;i&gt;</bpt>re-initialize<ept id="4">&lt;/i&gt;</ept> the system.<ept id="1">&lt;/p&gt;</ept><bpt id="2">&lt;p&gt;</bpt>You should not use the <ph id="5" ctype="standalone">&lt;&lt;emergency_procedure&gt;&gt;</ph> if a <bpt id="6">&lt;b&gt;</bpt>Restart<ept id="6">&lt;/b&gt;</ept> is needed more than <ph id="7" ctype="">%d</ph> times.<ept id="2">&lt;/p&gt;</ept>
  </source>
</trans-unit>

Happy Holidays and look out in the New Year for an Ocelot 2.0 announcement!

Slavic Love Story at EDF

On the 16th and 17th I attended the European Data Forum in Luxembourg. I was there to talk about and demonstrate our use case for the content enrichment services of the FREME project. Our stand was well visited and I think the project has made great progress and is well positioned coming up to the end of its first year. Though all project members were tweeting about our presence at the conference, our best publicity came from one of the consortium’s technology partners: Tilde. At one of the European Commission’s bureaucratic centers, with a suit density of 90%, and much talk of data and analytics, Tatjana Gornostaja pulled a master stroke and presented her Love Story.

Last week we submitted a proposed amendment to the Change Tracking module of XLIFF 2.0. The amendment would mean that change tracking <item /> elements would support all of the inline markup that <source /> and <target /> do. Hopefully it will be accepted and make it into XLIFF 2.1.

I am making slow but steady progress with learning Italian. Having a second natural language (I am fluent in four programming languages) has been a long-held ambition, and I love the stories circulating about the associated benefits, such as a lower risk of dementia and better recovery from heart attack. This is yet another aspect of my life made possible by technology. No physical attendance at classes necessary.

Work and domestic activity levels seem set to continue at the current high rates until Christmas. This will make the holiday a well-earned one, though I say it myself. Ciao.

Public Defrag

I’m using this post to both keep my blog live and also organise my own thoughts on everything that’s been going on over the last six weeks.

We deployed the distributed components of our XTM integration to production and have pushed a number of projects through it. We delivered this project through a series of backlog-driven sprints. It wasn’t without its challenges: requirements changes, unforeseen constraints and aggressive timelines. Some elements of test-driven development were employed (and very useful), as was domain-driven design.

On Wednesday I was delighted to receive a tweet from @kevin_irl announcing the open source publication of their .NET XLIFF 2.0 Object Model on GitHub. Coincidentally, Wednesday was also the day that my copy of Bryan Schnabel’s “A Practical Guide to XLIFF 2.0” arrived from Amazon. One of my developers, Marta Borriello, is currently working on a prototype, inside Ocelot, of the XLIFF Change Tracking module with support for inline markup, in the hope that this will be part of XLIFF 2.1.

Machine Learning is one of my primary interests but sadly a tertiary focus after the “day job” and family (don’t you listen to their denials). Hot on the heels of machine learning are programming paradigms and languages. So, with a peak in travel, I decided to combine both: I downloaded “Machine Learning Projects for .NET Developers” by Mathias Brandewinder and four F# courses from Pluralsight, and got myself up to speed on functional programming. This turned out to be a really valuable exercise because I came to understand that functional programming gives you much more than immutable data and functions as first-class objects. There are sequences, partial application and composition, to name a few. One day I plan to re-implement Review Sentinel’s core algorithms in F#, but don’t hold your breath for a post about that.

I’m just back from a FREME Project hackathon in Leipzig. Two days of enthusiasm-fuelled fun creating and querying linked data with guys at the forefront of this exciting technology. We hacked topic/domain detection and identification using the services of the FREME platform.

FREME Team: Francisc Tar, Felix Sasaki, me, Jan Nehring, Martin Brümmer, Milan Dojchinovski and Papoutsis Georgios.

Today I attended an ADAPT Governance Board meeting. ADAPT is the evolution of the CNGL CSET into a Research Centre. I think ADAPT has a great research agenda and numerous world-class domain thought leaders. I’m looking forward to working with the ADAPT team during 2016 to push a few envelopes. My engagement with academia and research bodies at both national and European level over the last 10 years has been of great tangible and intangible value (no, I don’t just mean drinking buddies). I have to thank Peter Reynolds (who will never let me forget it), Fred Hollowood (who will have an opinion about it) and Dave Lewis (who will be typically English and modest about it) for helping me overcome the initial inertia, and Felix Sasaki, who has made the bulk of it demanding, rewarding and enjoyable.

There, that’s better. Neuron activity stabilized and synapses clear.