
Parsing, Recursion and Observer Pattern

I have worked for a while now with two serializations of the XLIFF Object Model: XLIFF and JLIFF (which is still in draft). I have had occasion to write out each as the result of parsing some proprietary content format in order to facilitate easy interoperability within our tool chain, and round-tripping one serialization with the other.

Whilst both are hierarchical formats, parsing them recursively requires different strategies.

With XLIFF (XML) each opening element has all of its attributes available immediately. This means you can construct an object graph as you go: instantiate the object, set all of its attributes and make any decisions based on them, and add the object to a stack so that you can keep track of where you are in the object model. This all works nicely with the Observer pattern: you can subscribe to events which fire upon each new element no matter how nested.

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="" srcLang="">
  <file id="f1">
    <group id="g1" type="cms:field">
      <unit id="u1" type="cms:slug">
        <segment id="s1">
          <source/>
          <target/>
        </segment>
      </unit>
    </group>
  </file>
</xliff>
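As a sketch of that strategy (the event and handler names are my own illustration, not any particular library's API), an event-publishing read over XLIFF might look like this:

using System;
using System.Collections.Generic;
using System.Xml;

public class XliffReaderSketch
{
    // Fires as soon as an element opens; all of its attributes
    // are already readable at that point.
    public event EventHandler<XmlReader> ElementOpened;

    public void Parse(string path)
    {
        // Tracks where we are in the object model.
        var ancestry = new Stack<string>();

        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // Empty elements (e.g. <target/>) raise no EndElement,
                    // so don't push them onto the stack.
                    if (!reader.IsEmptyElement) ancestry.Push(reader.LocalName);
                    ElementOpened?.Invoke(this, reader);
                }
                else if (reader.NodeType == XmlNodeType.EndElement)
                {
                    ancestry.Pop();
                }
            }
        }
    }
}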

With JLIFF (JSON), assuming you're doing a depth-first token read, you have to read all of the properties of nested objects before you can access all of the properties of their parents. Thus you have to build a complete object graph before you can traverse it and use the Observer pattern efficiently to build another representation.

{
  "jliff": "2.1",
  "srcLang": "en-US",
  "trgLang": "fr-FR",
  "files": [
    {
      "id": "f1",
      "kind": "file",
      "subfiles": [
        {
          "canResegment": "no",
          "id": "u2",
          "kind": "unit",
          "locQualityIssues": {
            "items": []
          },
          "notes": [],
          "subunits": [
            {
              "canResegment": "no",
              "id": "s2",
              "kind": "segment",
              "source": [],
              "target": []
            }
          ]
        }
      ]
    }
  ]
}
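A minimal sketch of that two-pass approach, assuming Newtonsoft.Json (the event name is illustrative):

using System;
using Newtonsoft.Json.Linq;

public class JliffWalkerSketch
{
    // Fires during the second pass, once the whole graph is in memory.
    public event EventHandler<JObject> NodeVisited;

    public void Parse(string json)
    {
        // Pass 1: build the complete object graph so that parent
        // properties appearing after nested children are available.
        JObject root = JObject.Parse(json);

        // Pass 2: traverse the graph and publish events to subscribers.
        Visit(root);
    }

    private void Visit(JToken token)
    {
        if (token is JObject obj)
        {
            NodeVisited?.Invoke(this, obj);
            foreach (JProperty property in obj.Properties()) Visit(property.Value);
        }
        else if (token is JArray array)
        {
            foreach (JToken item in array) Visit(item);
        }
    }
}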

Differences are also apparent when dealing with items which require nesting to convey their semantics. In localization this classically happens when trying to represent rich text (text with formatting).

XLIFF handles this nicely when serialized.

<source>Vacation homes in <sc id="fmt1" disp="underline" type="fmt" subType="xlf:u" dataRef=""/>Orlando<ec dataRef=""/></source>

Whilst JLIFF is somewhat fragmented.

"source": [
{
"text": "Vacation homes in "
},
{
"id": "mrk1",
"kind": "sm",
"type": "term"
},
{
"text": "Orlando"
},
{
"kind": "em",
"startRef": {
"token": "mrk1"
}
}
]

Content Interoperability

I am working on a project which is very familiar in the localization industry: moving content from the Content Management System (CMS) in which it is authored to a Translation Management System (TMS) in which it will be localized and then moved back to the CMS for publication.

These seemingly straightforward scenarios often require far more effort than seems warranted. As the developer working on the interoperability you often have to have:

  • Knowledge of the CMS API and content model (the content model being the representation which the article has inside the CMS and when exported).
  • Knowledge of the TMS API and the content formats that it is capable of parsing/filtering.

In this project the CMS is built on top of a “document database” and stores and exports content in JSON format.

One of the complexities is that rich text (text which includes formatting such as text emphasis – bold, italic – and embedded metadata such as hyperlinks and pointers to images) causes sentences to become fragmented when exported.

For example, the text:

“For more information refer to our User’s Guide or Community Forum.”

Becomes:

{
  "content": [
    {
      "value": "For more information refer to our ",
      "nodeType": "text"
    },
    {
      "data": { "uri": "https://ficticious.com/help" },
      "content": [
        {
          "value": "User's Guide",
          "nodeType": "text"
        }
      ],
      "nodeType": "hyperlink"
    },
    {
      "value": " or Community Forum.",
      "nodeType": "text"
    }
  ],
  "nodeType": "document"
}

If I simply let the TMS parse the JSON I know it will present the rich text sentence as three segments rather than one and it will be frustrating for translators to relocate the hyperlink within the overall sentence. Ironically, JLIFF suffers from the same problem.

What I need is a structured format that has the flexibility to enable me to express the sentence as a single string but also have the high fidelity to convert back without information loss. Luckily the industry has the XML Localization Interchange File Format (XLIFF).
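To make the goal concrete, here is a rough sketch of the collapse I am after, assuming Newtonsoft.Json (real code would handle every nodeType and XML-escape the text):

using System.Text;
using Newtonsoft.Json.Linq;

public static class CmsToXliffSketch
{
    // Collapses the CMS "content" array into a single XLIFF source,
    // replacing the hyperlink node with a paired <pc> inline code so
    // the sentence reaches the translator as one segment.
    public static string ToXliffSource(JObject document)
    {
        var source = new StringBuilder();
        var codeId = 1;

        foreach (JObject node in document["content"])
        {
            switch ((string)node["nodeType"])
            {
                case "text":
                    source.Append((string)node["value"]);
                    break;
                case "hyperlink":
                    source.Append($"<pc id=\"c{codeId++}\" type=\"link\">");
                    source.Append((string)node["content"][0]["value"]);
                    source.Append("</pc>");
                    break;
            }
        }

        return $"<source>{source}</source>";
    }
}

Run against the JSON above, this yields a single segment: <source>For more information refer to our <pc id="c1" type="link">User's Guide</pc> or Community Forum.</source>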

I had three choices for programming the conversion, all of which are open source. I wanted to exercise my own code a bit, so I went with the third option: my own JliffGraphTools library.

JliffGraphTools contains a Jliff builder class and Xliff12 and Xliff20 filter classes (for XLIFF 1.2 and 2.0 respectively). These event-based classes allow a publish/subscribe interaction where elements in the XLIFF cause subscribing methods in the Jliff builder to be executed, thus creating a JliffDocument.
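The wiring looks roughly like this (the event and method names are illustrative rather than JliffGraphTools' exact API):

// Illustrative wiring; the real JliffGraphTools API may differ.
var builder = new JliffBuilder();
var filter = new Xliff20Filter();

// The builder subscribes to the filter's element events...
filter.UnitRead += builder.OnUnit;
filter.SegmentRead += builder.OnSegment;

// ...so parsing the XLIFF publishes events which assemble the JliffDocument.
filter.Parse("input.xlf");
JliffDocument jliff = builder.Document;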

I decided to use this pattern for the conversion of the above CMS’ JSON content model to XLIFF.


It turns out that this approach wasn't as straightforward as anticipated, but I'll have to document that in another post.

Exploring JLIFF

I have published a web application where you can submit XLIFF 2.x files and get back a JLIFF serialization as text.

JLIFF is a new XLIFF serialization format currently in working draft with the OASIS XLIFF Object Model and Other Serializations (XLIFF OMOS) Technical Committee.

The application uses my open source JliffGraphTools library.

I am working on a conversion of XLIFF 1.2 to JLIFF but, as the content model is structurally different, it's tricky.

I was careful to implement it in a way that means no data is persisted. I don't even collect any metadata about what is submitted. That way people can feel confident about the privacy of their data.

Serverless Machine Translation

It is well known that you can produce relatively good quality machine translations by doing the following:

  • Carry out some pre-processing on the source language.
    Such as removing text which serves no purpose in the translation (say, imperial measurements in content destined for Europe); re-ordering some lengthy sentences; marking the boundaries of embedded tags, etc.
  • Use custom domain-trained machine translation engines.
    This is possible with several machine translation providers. If you have a body of good-quality bilingual and monolingual corpora relevant to your subject matter then you can train and build engines which produce higher-quality output than a general-domain public engine.
  • Post-process the raw machine translation output to correct recurrent errors.
    This improves overall fluency, replaces specific terminology, etc.

We decided to implement this in a fully automated Azure Functions pipeline.

NOTE: Some MT providers have this capability built into their services but we wanted the centralized flexibility to control the pre- and post-editing rules and to be able to mix and match which MT providers we get the translations from.

The pipeline consists of three functions: preedit, translate and postedit. The JSON payload used for inter-function communication is JLIFF, an open object-graph serialization format specification being designed by an OASIS Technical Committee.

NOTE: JLIFF is still in the design phase but I'm impatient and it seemed like a good way to test the current snapshot of the format.
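As a sketch of one stage, assuming the HTTP-triggered C# programming model (the PreEditRules helper is a placeholder, and our real bindings differ in detail):

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class PreEditFunction
{
    // translate and postedit follow the same shape: a JLIFF payload
    // in, a transformed JLIFF payload out.
    [FunctionName("preedit")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        string jliff = await new StreamReader(req.Body).ReadToEndAsync();

        // Placeholder for the pre-editing rules: remove text which serves
        // no purpose in the translation, re-order lengthy sentences,
        // mark the boundaries of embedded tags, etc.
        string edited = PreEditRules.Apply(jliff);

        return new OkObjectResult(edited);
    }
}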

The whole thing is easily re-configured and re-deployed, and has all the advantages of an Azure consumption plan.

We can see that this pipeline would be a good candidate for Durable Functions, so once we have time we'll take a look at those.

Serializing and Deserializing JLIFF

I’ve been having all kinds of fun saving text (JSON) representations of translation units (pairs of source and target language strings), sending them from one cloud based service to another and then rebuilding the in-memory object representations from the text representation.

I know that any software engineer will be yawning about now because libraries for doing this kind of thing have existed for a long time. However, it’s been fun for me, partly because I’m doing it inside the new Azure Functions service, and because some of the objects have abstract relationships (interfaces and sub-classes) which introduce subtleties that took a lot of research to get working.

It relates to the work of the OASIS OMOS TC whose evolving schema for what has been dubbed JLIFF can be seen on GitHub.

The two parts of the object graph requiring special handling are the array containing the Segments and Ignorables (which implement the ISubUnit interface in my implementation), and the array containing the text and inline markup elements of the Source and Target containers (which implement the IElement interface and subclass AbstractElement in my implementation).

When deserializing the components of these arrays, each needs a class which derives from Newtonsoft.Json.JsonConverter.

namespace JliffModel
{
    using System;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Linq;

    public class ISubUnitConverter : JsonConverter
    {
        public override bool CanConvert(Type objectType)
        {
            // Only handle the ISubUnit interface itself; concrete types
            // are left to the default serialization machinery.
            return objectType.Name.Equals("ISubUnit");
        }

        public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
        {
            var jobject = JObject.Load(reader);

            // Resolve the concrete type from the discriminator property,
            // then let the serializer populate its members.
            object resolvedType = null;

            if (jobject["type"].Value<string>().Equals("segment")) resolvedType = new Segment();
            if (jobject["type"].Value<string>().Equals("ignorable")) resolvedType = new Ignorable();

            serializer.Populate(jobject.CreateReader(), resolvedType);

            return resolvedType;
        }

        public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
        {
            writer.WriteValue(value.ToString());
        }
    }
}

Then instances of the classes derived from JsonConverter are passed into the DeserializeObject method.

    Fragment modelin = JsonConvert.DeserializeObject<Fragment>(output,
        new ISubUnitConverter(),
        new IElementConverter());