It is well known that you can produce relatively good-quality machine translations by doing the following:
- Carry out some pre-processing on the source language.
For example: remove text which serves no purpose in the translation (say, imperial measurements in content destined for Europe); re-order some lengthy sentences; mark the boundaries of embedded tags; and so on.
- Use custom domain-trained machine translation engines.
This is possible with several machine translation providers. If you have a sufficient quantity of good-quality bilingual and monolingual corpora relevant to your subject matter, you can train and build engines which will produce higher-quality output than a general-purpose public engine.
- Post-process the raw machine translation output to correct recurrent errors.
For example: improving overall fluency, replacing specific terminology, and so on.
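To make the post-editing step concrete, here is a minimal sketch of a rule-based corrector, assuming simple regex find-and-replace rules; the `PostEditRule` and `PostEditor` names are hypothetical illustrations, not part of any provider's API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule type: a regex pattern and its replacement text.
public class PostEditRule
{
    public string Pattern { get; set; }
    public string Replacement { get; set; }
}

public static class PostEditor
{
    // Apply each rule in order to the raw MT output.
    public static string Apply(string mtOutput, IEnumerable<PostEditRule> rules)
    {
        foreach (var rule in rules)
            mtOutput = Regex.Replace(mtOutput, rule.Pattern, rule.Replacement);
        return mtOutput;
    }
}
```

Real rule sets would typically be stored as configuration so that linguists can maintain them without redeploying code.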
We decided to implement this in a fully automated Azure Functions pipeline.
NOTE: Some MT providers have this capability built into their services, but we wanted the flexibility of centralized control over the pre- and post-editing rules, and the ability to mix and match which MT providers we get the translations from.
The pipeline consists of three functions: preedit, translate and postedit. The JSON payload used for inter-function communication is JLIFF, an open object-graph serialization format specification being designed by an OASIS Technical Committee.
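For illustration, a JLIFF payload carrying a single translated segment might look roughly like the following; the property names reflect my reading of a draft snapshot of the schema and should not be taken as normative.

```json
{
  "jliff": "2.0",
  "srcLang": "en-US",
  "trgLang": "it-IT",
  "subunits": [
    {
      "type": "segment",
      "source": [ { "text": "Hello world" } ],
      "target": [ { "text": "Ciao mondo" } ]
    }
  ]
}
```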
NOTE: JLIFF is still in its design phase, but I'm impatient and it seemed like a good way to test the current snapshot of the format.
The whole thing is easily re-configured and re-deployed, and has all the advantages of an Azure consumption plan.
This pipeline looks like a good candidate for Durable Functions, so once we have time we'll take a look at those.
I've been having all kinds of fun saving text (JSON) representations of translation units (pairs of source and target language strings), sending them from one cloud-based service to another, and then rebuilding the in-memory object representations from the text.
I know that any software engineer will be yawning about now because libraries for doing this kind of thing have existed for a long time. However, it's been fun for me, partly because I'm doing it inside the new Azure Functions service, and partly because some of the objects have abstract relationships (interfaces and sub-classes) which introduce subtleties that took a lot of research to get working.
It relates to the work of the OASIS OMOS TC, whose evolving schema for what has been dubbed JLIFF can be seen on GitHub.
The two parts of the object graph requiring the special handling are the array containing the Segment and Ignorable objects (which implement the ISubUnit interface in my implementation), and the array containing the text and inline markup elements of the Source and Target containers (which implement the IElement interface and subclass AbstractElement in my implementation).
To deserialize the components of these arrays, each needs a converter class which derives from Newtonsoft.Json.JsonConverter:
```csharp
public class ISubUnitConverter : JsonConverter
{
    public override bool CanConvert(Type objectType) => objectType.Name.Equals("ISubUnit");

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
    {
        // Use the "type" discriminator in the JSON to pick the concrete class
        JObject jobject = JObject.Load(reader);
        object resolvedType = null;
        if (jobject["type"].Value<string>().Equals("segment")) resolvedType = new Segment();
        if (jobject["type"].Value<string>().Equals("ignorable")) resolvedType = new Ignorable();
        serializer.Populate(jobject.CreateReader(), resolvedType);
        return resolvedType;
    }

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer) =>
        serializer.Serialize(writer, value);
}
```
Instances of the classes derived from JsonConverter are then passed into the DeserializeObject method:
```csharp
Fragment modelin = JsonConvert.DeserializeObject<Fragment>(output,
    new ISubUnitConverter(), new IElementConverter());
```
Just before Christmas I joined the OASIS XLIFF Object Model and Other Serializations Technical Committee. I think it reflects the maturity and degree of adoption of XLIFF that this TC has been convened. It’s another opportunity to work with some technical thought leaders of the localization industry.
On Wednesday 13th I attended the launch of the ADAPT Centre in the Science Gallery at Trinity College. ADAPT is the evolution of the Centre for Next Generation Localization (CNGL) CSET into a national research centre. Vistatec was a founding industrial partner of CNGL back in 2007 and I'm happy to continue our collaboration on topics such as machine translation, natural language processing, analytics, digital media and personalization. Unexpectedly but happily, I was interviewed by RTE News and appeared on national television.
Like millions of people, I am saddened at the passing of David Bowie and Alan Rickman. Kudos to Bowie for releasing Blackstar and bequeathing such unique and thought-provoking art. The positive angle? A lesson to live and appreciate life to the full.
To a large extent my development plans for Q1 were approved. This includes extending the deployment of SkyNet to other accounts within Vistatec.
On January 11th we released Ocelot 2.0.
My Italian is coming along, slowly but surely. We have a number of Italian bistros and ristoranti in the town where I live, so I have every opportunity to try it out.
On the coding front I’ve been looking at OAuth2, OpenID and ASP.NET MVC 6. I continue to be impressed by Microsoft’s transformation from an “invented here” company to one that both embraces and significantly contributes to open source.
Onward and Upward!