Tag Archives: XLIFF

Parsing, Recursion and Observer Pattern

I have worked for a while now with two serializations of the XLIFF Object Model: XLIFF and JLIFF (which is still in draft). I have had occasion to write out each as the result of parsing some proprietary content format, both to facilitate easy interoperability within our tool chain and to round-trip one serialization with the other.

Whilst both are hierarchical formats, they require different strategies when parsing them recursively.

With XLIFF (XML) each opening element has all of its attributes available immediately. This means you can construct an object graph as you go: instantiate the object, set all of its attributes and make any decisions based on them, and add the object to a stack so that you can keep track of where you are in the object model. This all works nicely with the Observer pattern: you can subscribe to events which fire upon each new element no matter how nested.

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="">
  <file id="f1">
    <group id="g1" type="cms:field">
      <unit id="u1" type="cms:slug">
        <segment id="s1">
          <source/>
          <target/>
        </segment>
      </unit>
    </group>
  </file>
</xliff>
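
As a rough sketch of that observer-style read (plain C#, not the actual JliffGraphTools code), a streaming reader can publish an event for every start element, handing subscribers the element name and its attributes the moment the element opens:

using System;
using System.Collections.Generic;
using System.Xml;

// Sketch only: a streaming XLIFF reader that publishes an event per start
// element. Because XML exposes every attribute as soon as an element opens,
// subscribers can build their own object graph while the document is read.
public class XliffElementReader
{
    public event Action<string, IReadOnlyDictionary<string, string>> ElementStarted;

    public void Read(string path)
    {
        using (var reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element) continue;

                var attributes = new Dictionary<string, string>();
                for (int i = 0; i < reader.AttributeCount; i++)
                {
                    reader.MoveToAttribute(i);
                    attributes[reader.Name] = reader.Value;
                }
                reader.MoveToElement();

                ElementStarted?.Invoke(reader.LocalName, attributes);
            }
        }
    }
}

// Usage: subscribers react to each element, however deeply it is nested.
// var r = new XliffElementReader();
// r.ElementStarted += (name, attrs) => Console.WriteLine(name);
// r.Read("sample.xlf");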

With JLIFF (JSON), assuming you’re doing a depth-first token read, you have to wait until all of an object’s nested children have been read before you can access all of the properties of the parent. Thus you have to build an object graph first and only then traverse it, using the Observer pattern in an efficient way to build another representation.

{
  "jliff": "2.1",
  "srcLang": "en-US",
  "trgLang": "fr-FR",
  "files": [
    {
      "id": "f1",
      "kind": "file",
      "subfiles": [
        {
          "canResegment": "no",
          "id": "u2",
          "kind": "unit",
          "locQualityIssues": {
            "items": []
          },
          "notes": [],
          "subunits": [
            {
              "canResegment": "no",
              "id": "s2",
              "kind": "segment",
              "source": [],
              "target": []
            }
          ]
        }
      ]
    }
  ]
}
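
A minimal sketch of that two-pass approach (the shape is assumed, not the JliffGraphTools internals): parse the whole JSON into an in-memory document first, then walk it recursively and only publish an object once all of its properties are available.

using System;
using System.Text.Json;

// Sketch only: the JSON is parsed into a complete document object model first;
// events are published from a second, recursive traversal, because a forward
// token read cannot hand over a complete parent before its children are read.
public class JliffWalker
{
    public event Action<string, JsonElement> ObjectVisited;

    public void Walk(string json)
    {
        using (var doc = JsonDocument.Parse(json))
        {
            Visit("$", doc.RootElement);
        }
    }

    private void Visit(string path, JsonElement element)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                // Every property of this object is now available at once.
                ObjectVisited?.Invoke(path, element);
                foreach (var property in element.EnumerateObject())
                    Visit(path + "." + property.Name, property.Value);
                break;
            case JsonValueKind.Array:
            {
                int index = 0;
                foreach (var item in element.EnumerateArray())
                    Visit(path + "[" + index++ + "]", item);
                break;
            }
        }
    }
}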

Differences are also apparent when dealing with items which require nesting to convey their semantics. This classically happens in localization when trying to represent rich text (text with formatting).

XLIFF handles this nicely when serialized.

<source>Vacation homes in <sc id="fmt1" disp="underline" type="fmt" subType="xlf:u" dataRef=""/>Orlando<ec dataRef=""/></source>

Whilst JLIFF is somewhat fragmented.

"source": [
{
"text": "Vacation homes in "
},
{
"id": "mrk1",
"kind": "sm",
"type": "term"
},
{
"text": "Orlando"
},
{
"kind": "em",
"startRef": {
"token": "mrk1"
}
}
]

Content Interoperability

I am working on a project which is very familiar in the localization industry: moving content from the Content Management System (CMS) in which it is authored to a Translation Management System (TMS) in which it will be localized and then moved back to the CMS for publication.

These seemingly straightforward scenarios often require far more effort than they appear to warrant. As the developer working on the interoperability you often have to have:

  • Knowledge of the CMS API and content model. (The content model being the representation which the article has inside the CMS and when exported.)
  • Knowledge of the TMS API and the content formats that it is capable of parsing/filtering.

In this project the CMS is built on top of a “document database” and stores and exports content in JSON format.

One of the complexities is that rich text (text which includes formatting such as text emphasis – bold, italic – and embedded metadata such as hyperlinks and pointers to images) causes sentences to become fragmented when exported.

For example, the text:

“For more information refer to our User’s Guide or Community Forum.”

Becomes:

{
  "content": [
    {
      "value": "For more information refer to our ",
      "nodeType": "text"
    },
    {
      "data": { "uri": "https://ficticious.com/help" },
      "content": [
        {
          "value": "User's Guide",
          "nodeType": "text"
        }
      ],
      "nodeType": "hyperlink"
    },
    {
      "value": " or Community Forum.",
      "nodeType": "text"
    }
  ],
  "nodeType": "document"
}

If I simply let the TMS parse the JSON I know it will present the rich text sentence as three segments rather than one and it will be frustrating for translators to relocate the hyperlink within the overall sentence. Ironically, JLIFF suffers from the same problem.

What I need is a structured format that has the flexibility to enable me to express the sentence as a single string but also have the high fidelity to convert back without information loss. Luckily the industry has the XML Localization Interchange File Format (XLIFF).
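
To illustrate what I mean (a sketch written against the JSON shape above, not production code), the fragmented content array can be collapsed into a single XLIFF 2.0 <source>, with the hyperlink protected as a paired inline code so the sentence reaches translators as one segment:

using System.Linq;
using System.Text.Json;
using System.Xml.Linq;

// Sketch only: collapse a rich-text "content" array (nodeType/value/data as in
// the example above) into one XLIFF 2.0 <source> element.
public static class RichTextFlattener
{
    static readonly XNamespace Xlf = "urn:oasis:names:tc:xliff:document:2.0";

    public static XElement ToSource(JsonElement document)
    {
        var source = new XElement(Xlf + "source");
        int nextId = 1;

        foreach (var node in document.GetProperty("content").EnumerateArray())
        {
            string nodeType = node.GetProperty("nodeType").GetString();
            if (nodeType == "text")
            {
                source.Add(node.GetProperty("value").GetString());
            }
            else if (nodeType == "hyperlink")
            {
                // Wrap the link text in a paired inline code; in a fuller
                // version the uri would be preserved (e.g. via originalData)
                // so the link can be rebuilt on the way back to the CMS.
                string text = string.Concat(node.GetProperty("content")
                    .EnumerateArray()
                    .Select(c => c.GetProperty("value").GetString()));
                source.Add(new XElement(Xlf + "pc",
                    new XAttribute("id", (nextId++).ToString()),
                    new XAttribute("type", "link"),
                    text));
            }
        }
        return source;
    }
}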

I had three choices for programming the conversion, all of which are open source.

I wanted to exercise my own code a bit so I went with the third option.

JliffGraphTools contains a Jliff builder class and Xliff12 and Xliff20 filter classes (for XLIFF 1.2 and 2.0 respectively). These event-based classes allow a publish/subscribe interaction where elements in the XLIFF cause subscribing methods in the Jliff builder to be executed and thus create a JliffDocument.

I decided to use this pattern for the conversion of the above CMS’ JSON content model to XLIFF.
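
The wiring looks roughly like the sketch below (class and event names here are illustrative rather than the actual JliffGraphTools API): the filter publishes events as it parses, and the builder subscribes to them and assembles the target model.

using System;
using System.Collections.Generic;

// Sketch only: a filter publishes events as it parses the input; a builder
// subscribes to those events and assembles its own representation.
public class ContentFilter
{
    public event Action<string> SegmentStarted;   // a new translatable unit was found
    public event Action<string> SourceTextRead;   // text belonging to the current unit

    public void OnSegment(string id) => SegmentStarted?.Invoke(id);
    public void OnText(string text) => SourceTextRead?.Invoke(text);
}

public class SegmentListBuilder
{
    public List<(string Id, string Source)> Segments { get; } =
        new List<(string Id, string Source)>();

    public void Subscribe(ContentFilter filter)
    {
        filter.SegmentStarted += id => Segments.Add((id, string.Empty));
        filter.SourceTextRead += text =>
        {
            // Append the text to the most recently started segment.
            var last = Segments[Segments.Count - 1];
            Segments[Segments.Count - 1] = (last.Id, last.Source + text);
        };
    }
}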

It turns out that this approach wasn’t as straightforward as anticipated, but I’ll have to document that in another post.

Exploring JLIFF

I have published a web application where you can submit XLIFF 2.x files and get back a JLIFF serialization as text.

JLIFF is a new XLIFF serialization format currently in working draft with the OASIS XLIFF Object Model and Other Serializations Technical Committee.

The application uses my open source JliffGraphTools library.

I am working on a conversion of XLIFF 1.2 to JLIFF but as the content model is structurally different it’s tricky.

I was careful to implement it in a way that means no data is persisted. I don’t even collect any metadata about what is submitted. That way people can feel confident about the privacy of their data.

What did it all mean?

I gave two presentations at the SEMANTiCS 2016 conference in Leipzig last week. Both were related to the H2020 FREME Project that I have been a participant of. The first was on the e-Internationalization service which we have contributed to significantly. The second (containing contributions from Felix Sasaki) was on the use of standards (de-facto and ratified) within the FREME e-services in general and our business case implementation in particular.

This was my third attendance at the conference and it once again contained interesting and inspiring presentations, ideas and use cases around linked data.

I sometimes return from these types of conferences, full of innovation and enthusiasm for applying new ideas, to the day-to-day operations of work and become discouraged by the inertia for change and the race to the bottom in terms of price. It is almost impossible to innovate in such an atmosphere. We have looked at the application of machine learning, text classification and various natural language processing algorithms and whilst people may acknowledge that the ideas are good, no-one wants to pilot or evaluate them let alone pay for them.

Anyhow, I remain inspired by the fields of NLP, Linked Data, Deep Learning and Semantic Networks, and maybe my day will come.

Inline Markup

Even trying to decide how to title this post was not immediately obvious. In many situations what I am writing about is not even visible to the content consumers. The topic is what can be referred to as placeholders, inline markup, inline tags, format specifiers, placeables, replacement variables…

I’ve spent a lot of time with them recently. They’re a good idea with a solid Use Case. They generally follow internationalization guidelines by separating region- and culture-dependent information, typically provided by an operating system or platform, from the textual content which is generally passed to translators for adaptation. They can also be used to represent some item of structure or unalterable content which you just want to protect from accidental modification during localization.

Examples are:

  • “Dear %s, your calendar appointment for %s will start in %d minutes.” (resource file)
  • “The all new [product-name] is amazing! [branded-tag-line-format text="Ultra Super Gizmo Thingydoodle"] Get it now!” (proprietary content format)
  • “<p>The Stranglers single <a title="Golden Brown" href="https://en.wikipedia.org/wiki/Golden_Brown">Golden Brown</a> was amongst their best.</p>” (HTML)
  • “<source>All I want is some text where a <bpt id="1">&lt;strong&gt;</bpt>word<ept id="1">&lt;/strong&gt;</ept> is in bold.</source>” (XLIFF 1.2)
  • “Select File->Edit->Search & Replace” (potential confusion)

They can be challenging because there are so many ways in which to represent them (sometimes depending upon the host content type), and they can be complex: the placeholder may itself contain content which should be translated by a human. They have to be considered during authoring, human and machine translation, file conversion and quality assurance.

Support for handling these within well-known applications and file formats is comprehensive and stable. However, outside of these you can be on your own and struggle without a good grasp of a document object model, and an appreciation of parsing and regular expressions.

One item of great news is that it looks like inline markup will be supported within the Change Tracking module of the upcoming XLIFF 2.1.

As always the devil can be in the detail.

Ocelot 2.1

We released Ocelot 2.1 today at http://bit.ly/296H0J4. It includes bidirectional language support, an awesome new, totally configurable ITS LQI grid with intuitive hot keys, saving of Notes in XLIFF 1.2, and search and replace.

Machine Translation Pipeline Meets Business Trends

This week we will carry out final integration and deployment tests on our distributed pipeline for large scale and continuous translation scenarios that heavily leverage the power of machine translation.

We have built this platform as we recognised the demands and trends that are being reported by industry experts like Common Sense Advisory and Kantan.

The platform features several configurable services that can be switched on as required. These include:

  • automated source pre-editing prior to passing to a choice of custom machine translation engines;
  • integrated pre-MT translation memory leverage;
  • automated post-edit of raw machine translation prior to human post-edit;
  • in-process, low-friction capture of actionable feedback on MT output from humans;
  • automated post-processing of human post-edit;
  • automated capture of edit distance data for BI and reporting.

The only component still missing, which will be integrated during May, is the text analysis and text classification algorithms which will give us the ability to do automated quality assurance of every single segment. Yes, everything – no spot-checking or limited scope audits.

The platform is distributed and utilises industry standard formats including XLIFF and ITS. Thus it is wholly scalable and extensible. Of note is that this platform delivers upon all six of the trends recently publicised by Kantan. Thanks to Olga O’Laoghaire who made significant contributions to the post-editing components and Ferenc Dobi, lead Architect and developer.

I’m very excited to see the fruition of this project. It doesn’t just represent the ability for us to generate millions of words of translated content, it delivers a controlled environment in which we can apply state-of-the-art techniques that are highly optimised at every stage, measurable and designed to target the goal of fully automated usable translation (FAUT).

Positive Thoughts for Blue Monday

Just before Christmas I joined the OASIS XLIFF Object Model and Other Serializations Technical Committee. I think it reflects the maturity and degree of adoption of XLIFF that this TC has been convened. It’s another opportunity to work with some technical thought leaders of the localization industry.

On Wednesday 13th I attended the launch of the ADAPT Centre in the Science Gallery at Trinity College. ADAPT is the evolution of the Centre for Next Generation Localization CSET into a national Research Center. Vistatec was a founding industrial partner of CNGL back in 2007 and I’m happy to continue our collaboration on topics such as machine translation, natural language processing, analytics, digital media and personalization. Unexpectedly but happily I was interviewed by RTE News and appeared on national television.

Like millions of people, I am saddened at the passing of David Bowie and Alan Rickman. Kudos to Bowie for releasing Blackstar and bequeathing such unique and thought-provoking art. The positive angle? The lesson to live and appreciate life to the full.

To a large extent my development plans for Q1 were approved. This includes extending the deployment of SkyNet to other accounts within Vistatec.

On January 11th we released Ocelot 2.0.

My Italian is coming along. Slowly but surely. We have a number of Italian bistros and ristoranti in the town where I live so I have every opportunity to try it out.

On the coding front I’ve been looking at OAuth2, OpenID and ASP.NET MVC 6. I continue to be impressed by Microsoft’s transformation from an “invented here” company to one that both embraces and significantly contributes to open source.

Onward and Upward!

Ocelot 2.0

I am pleased and excited to announce the release of Ocelot 2.0 available as source code and binaries. Special thanks go to Kevin Lew, Marta Borriello and Chase Tingley who were the main engineers of this release.

The new features are:

  1. Support for XLIFF 2.0 Core;
  2. Translation Memory and Concordance lookup;
  3. Save XLIFF document as TMX;
  4. Language Quality Issue Grid;
  5. Set the fonts used for source and target content;
  6. Support for notes in XLIFF 2.0;
  7. Serialization of Original Target in XLIFF 2.0 using the Change Tracking Module.

XLIFF 2.0

Ocelot now supports XLIFF 2.0 documents. It still supports XLIFF 1.2 documents and auto-detects the XLIFF version of the document being opened.

Translation Memory and Concordance lookup

A new window above the main editing grid now displays two tabs for Translation Memory lookup matches and Concordance Search results. If it is not visible then clicking on the splitter bar just below the font selection controls under the menu bar should reveal them.

Ocelot works with Translation Memory Exchange (TMX) files. The View->Configure TM menu option opens the TM Configuration dialog where you can specify which translation memories you want to use (in a fallback sequence), the penalty if any to be applied to the TM, the maximum number of results to display, and the minimum match threshold which matches must satisfy in order to be shown.

We have also added the ability to save the document as TMX.

Language Quality Issue Grid

Adding Internationalization Tag Set 2.0 Language Quality Issue metadata, even using the “Quick Add” mechanism of release 1 that could be configured in the rule.properties file, has been cumbersome. The LQI Grid reduces this to a one-click or one-hot key operation (excluding any comments you want to add).

The grid is customizable graphically allowing a matrix of issue severities (columns) and a user defined selection of types (rows) to be configured along with hot keys for every combination. Clicking any cell in the grid or typing its associated hotkey sequence will add the required Localization Quality Issue metadata. For example, clicking the cell at the intersection of a “Major” error severity column and “style” error category row will add an <its:locQualityIssue locQualityIssueSeverity="..." locQualityIssueType="style" /> entry to the relevant segment.

Source and Target Fonts

Just below the menu bar are two comboboxes which allow you to set the font family and size to be used for the source and target content.

XLIFF 2.0 Notes

On the View menu, the Configure Columns option allows you to display a Notes column. Text entered into cells in this column will be serialized as XLIFF <note /> elements.

Serialization of Original Target

Ocelot now captures and serializes the original target text, if modified, as a tracked change using the XLIFF 2.0 Change Tracking Module. One limitation here, which we hope to address as part of XLIFF 2.1, is that only the text (and no inline markup) is saved.

I hope that these enhancements are useful.

Parsing Ill-formed Text

An eventful week back in the office after a two week trip visiting Redmond, WA, London and Athens. Whilst many industries seem to slowly wind down to Christmas, it always seems to me a race to finish up many initiatives. I guess the good side of this is that we have a truly relaxing break and a clean slate in the new year. The trip was very productive and positive with the prospect of new business and a successful end to the first year of the FREME Project.

One activity that occupied my mind whilst travelling so much was how to build a parser for segments of text that contain combinations of ill-formed (unbalanced) HTML tags; well-known variable placeholders like “%s”; and proprietary inline tags for, well, just about anything. Having parsed the text, the goal is then to serialize it as XLIFF 1.2 <trans-unit /> elements.

Yes, I know there’s Okapi and HTML Agility Pack but these content chunks come to us from a customer’s proprietary localization format not supported by either of these frameworks. There is also Antlr but I’m always up for a challenge.

My solution is not revolutionary but it’s robust and only took me the combined flight times to write. It uses a recursive strategy, the Composite and Visitor design patterns, shallow parsing based on three regular expressions and a handy little library.

Using regular expressions for parsing quickly litters your code with accessors for the regex named groups:

firstMatch.Groups["openingTag"].Index;
firstMatch.Groups["openingTag"].Length;

Once you start using these within Substring( ) methods in order to pull out non-matched segments of the input string it gets nasty to continue without some helpers:

input.Substring(firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length,
    secondMatch.Groups["openingTag"].Index
        - (firstMatch.Groups["openingTag"].Index + firstMatch.Groups["openingTag"].Length));

The obvious solution to this was to create a MatchCollectionHelper class. With some careful crafting of composed methods starting with:

mch.TextBeforeMatch(int matchNumber);
mch.TextAfterMatch(int matchNumber);

through to:

mch.TextSegment(int i).
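
A rough sketch of what a helper along these lines might look like (the real class composes more on top of this, but the idea is to keep the Index/Length arithmetic in one place):

using System.Text.RegularExpressions;

// Sketch only: wraps the input string and its MatchCollection so callers never
// juggle Index/Length arithmetic themselves.
public class MatchCollectionHelper
{
    private readonly string input;
    private readonly MatchCollection matches;

    public MatchCollectionHelper(string input, MatchCollection matches)
    {
        this.input = input;
        this.matches = matches;
    }

    // Text between the previous match (or the start of the string) and match n.
    public string TextBeforeMatch(int matchNumber)
    {
        int start = matchNumber > 0
            ? matches[matchNumber - 1].Index + matches[matchNumber - 1].Length
            : 0;
        return input.Substring(start, matches[matchNumber].Index - start);
    }

    // Text between match n and the next match (or the end of the string).
    public string TextAfterMatch(int matchNumber)
    {
        int start = matches[matchNumber].Index + matches[matchNumber].Length;
        int end = matchNumber + 1 < matches.Count ? matches[matchNumber + 1].Index : input.Length;
        return input.Substring(start, end - start);
    }

    // The i-th run of unmatched text: segment 0 precedes the first match and
    // the final segment follows the last one.
    public string TextSegment(int i) =>
        i < matches.Count ? TextBeforeMatch(i) : TextAfterMatch(matches.Count - 1);
}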

The helper library then paved the way to brief, expressive code for the recursive parser. The parser uses one interface with three methods and a property:

interface IFragment
{
  void Parse();
  StringBuilder Print();
  bool HasChildren { get; }
  void Accept(IVisitor visitor);
}

Tags that are paired (<a …>…</a>) are capable of having children, are parsed outside in and have a FragmentType of PAIRED. Standalone tags have a FragmentType of STANDALONE and remaining text is FragmentType.TEXT. STANDALONE and TEXT are leaf nodes in the composition.
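
As a sketch of that composition (type names beyond those mentioned above are my own, built on the IFragment interface shown earlier): PAIRED fragments hold children, while TEXT (and, not shown here, STANDALONE) fragments are leaves. Accept simply dispatches to the visitor, which decides how to traverse.

using System.Collections.Generic;
using System.Text;

public enum FragmentType { PAIRED, STANDALONE, TEXT }

public interface IVisitor
{
    void Visit(TextFragment fragment);
    void Visit(PairedFragment fragment);
}

// Leaf node: plain text between tags.
public class TextFragment : IFragment
{
    private readonly string text;
    public TextFragment(string text) { this.text = text; }

    public FragmentType Type => FragmentType.TEXT;
    public bool HasChildren => false;
    public void Parse() { /* leaf: nothing further to split */ }
    public StringBuilder Print() => new StringBuilder(text);
    public void Accept(IVisitor visitor) => visitor.Visit(this);
}

// Composite node: an opening/closing tag pair parsed outside in.
public class PairedFragment : IFragment
{
    public string OpeningTag { get; set; }
    public string ClosingTag { get; set; }
    public List<IFragment> Children { get; } = new List<IFragment>();

    public FragmentType Type => FragmentType.PAIRED;
    public bool HasChildren => Children.Count > 0;

    public void Parse()
    {
        // Outside in: re-parse the text between the tags and add whatever
        // fragments are found as children (omitted in this sketch).
    }

    public StringBuilder Print()
    {
        var sb = new StringBuilder(OpeningTag);
        foreach (var child in Children) sb.Append(child.Print().ToString());
        return sb.Append(ClosingTag);
    }

    public void Accept(IVisitor visitor) => visitor.Visit(this);
}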

You can then build Visitors to serialize the input as whatever you need:

"<p>Click the <b>Restart</b> button in order to <i>re-initialize</i> the system.</p><p>You should not use the <<emergency_procedure>> if a <b>Restart</b> is needed more than %d times.</p>"

<trans-unit id="..">
  <source>
    <bpt id="1">&lt;p&gt;</bpt>Click the <bpt id="3">&lt;b&gt;</bpt>Restart<ept id="3">&lt;/b&gt;</ept> button in order to <bpt id="4">&lt;i&gt;<bpt>re-initialize<ept id="4">&lt;/i&gt;</ept> the system.<ept id="1">&lt;/p&gt;</ept><bpt id="2">&lt;p&gt;</bpt>You should not use the <ph id="5" ctype="standalone">&lt;&;t;emergency_procedure&gt;&gt;</ph> if a <bpt id="6">&lt;b&gt;</bpt>Restart<ept id="6">&lt;/b&gt;</ept> is needed more than <ph id="7" ctype="">%d</ph> times.<ept id="2">&lt;/p&gt;</ept>
 </source>
</trans-unit>
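
For example, one possible visitor (again a sketch rather than the production serializer) turns paired fragments into <bpt>/<ept> pairs with escaped originals and passes text through, driving the recursion itself so that each closing code lands after the fragment's children:

using System.Security;
using System.Text;

// Sketch only: serialize fragments as XLIFF 1.2 inline markup. A STANDALONE
// fragment would map to a <ph> element in the same way.
public class Xliff12Visitor : IVisitor
{
    private readonly StringBuilder output = new StringBuilder();
    private int nextId = 1;

    public void Visit(TextFragment fragment) =>
        output.Append(fragment.Print().ToString());

    public void Visit(PairedFragment fragment)
    {
        int id = nextId++;
        output.Append("<bpt id=\"" + id + "\">")
              .Append(SecurityElement.Escape(fragment.OpeningTag))
              .Append("</bpt>");
        foreach (var child in fragment.Children) child.Accept(this);
        output.Append("<ept id=\"" + id + "\">")
              .Append(SecurityElement.Escape(fragment.ClosingTag))
              .Append("</ept>");
    }

    public override string ToString() => output.ToString();
}

// Usage, once the root fragments have been parsed:
// var visitor = new Xliff12Visitor();
// foreach (var fragment in rootFragments) fragment.Accept(visitor);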

Happy Holidays and look out in the New Year for an Ocelot 2.0 announcement!