
Serverless Machine Translation

It is well known that you can produce relatively good quality machine translations by doing the following:

  • Carry out some pre-processing on the source text.
    For example, removing text which serves no purpose in the translation (say, imperial measurements in content destined for Europe), re-ordering some lengthy sentences, marking the boundaries of embedded tags, etc.
  • Use custom domain trained machine translation engines.
    This is possible with several machine translation providers. If you have a sufficient volume of good-quality bilingual and monolingual corpora relevant to your subject matter, you can train and build engines which produce higher-quality output than a generic, publicly available engine.
  • Post-process the raw machine translation output to correct recurrent errors.
    For example, to improve overall fluency, replace specific terminology, etc. (a sketch of such rules follows this list).
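
To give a flavour of the first and third steps, here is a minimal sketch of how such rules might be applied. The rule tables and patterns are made up for illustration; in practice they would live in configuration:

    import re

    # Hypothetical rule tables: each entry is a (pattern, replacement) pair.
    # Pre-edit rules run on the source text before it is sent for translation;
    # post-edit rules run on the raw MT output.
    PRE_EDIT_RULES = [
        (re.compile(r"\s*\(\d+(?:\.\d+)?\s*(?:in|ft|lb)\.?\)"), ""),  # drop imperial measurements
    ]
    POST_EDIT_RULES = [
        (re.compile(r"\bmotor vehicle\b"), "car"),  # enforce preferred terminology
    ]

    def apply_rules(text, rules):
        """Apply each (pattern, replacement) rule to the text in order."""
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
        return text

    print(apply_rules("The box is 30 cm (12 in.) wide.", PRE_EDIT_RULES))
    # -> "The box is 30 cm wide."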

We decided to implement this as a fully automated Azure Functions pipeline.

NOTE: Some MT providers have this capability built into their services, but we wanted the centralized flexibility to control the pre- and post-editing rules ourselves and to mix and match which MT providers we get the translations from.

The pipeline consists of three functions: preedit, translate and postedit. The JSON payload used for inter-function communication is JLIFF, an open object-graph serialization format being designed by an OASIS Technical Committee.
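
To give a rough idea of the shape of one stage, here is a hedged sketch of the postedit function as an HTTP-triggered Azure Function in Python. The payload keys ("segments", "target") and the single rule shown are simplified stand-ins for illustration; a real JLIFF document is a much richer graph:

    import json
    import re
    import azure.functions as func

    # Hypothetical post-edit rules: (pattern, replacement) pairs applied to
    # the raw MT output. Real rules would be loaded from configuration.
    POST_EDIT_RULES = [
        (re.compile(r"\s+([,.;:])"), r"\1"),  # tidy stray spaces before punctuation
    ]

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # The body is the payload produced by the translate function.
        payload = req.get_json()
        for segment in payload.get("segments", []):
            for pattern, replacement in POST_EDIT_RULES:
                segment["target"] = pattern.sub(replacement, segment["target"])
        return func.HttpResponse(json.dumps(payload), mimetype="application/json")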

NOTE: JLIFF is still in the design phase but I’m impatient and it seemed like a good way to test the current snapshot of the format.

The whole thing is easily re-configured and re-deployed, and has all the advantages of an Azure consumption plan.

This pipeline also looks like a good candidate for Durable Functions, so once we have time we’ll take a look at those.

Fast and Loose

About a year ago we started to think about the cloud and how it could help us. Should we put our relational databases in the cloud and stop having to worry about their size? Should we move our network- and compute-heavy processes to the cloud, freeing up internal compute and network bandwidth? Could it simply make us more agile, and less sensitive to capital expenditure, in responding to compute and storage requirements?

We went back and forth on these questions for a while because, in hindsight, we didn’t understand the change of mindset required. To me it was like moving from procedural, interpreted languages to compiled, object-oriented ones.

We finally “got it” when we faced the challenge of efficiently producing results for an unpredictable number of new, concurrent, compute-heavy tasks.

After optimizing the algorithms, our initial perception was that all we needed to do was throw more processors at the problem. How wrong we were. Once computation was parallel, the bottleneck shifted from compute to data storage, memory and retrieval: seriously, I/O can slow things down considerably. And after finding a solution to that, the question became how to scale out rather than up: more processors rather than bigger processors.

Eventually we came to understand that a true cloud architecture makes use of many patterns or paradigms: a fault-tolerant service bus (message queue), several compute instances, NoSQL data storage, web service endpoints, and thinking of every operation as asynchronous.
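
As a concrete illustration of the queue pattern, here is a minimal sketch using Azure Storage Queues. The connection-string setting, queue name, message shape and process function are all placeholders:

    import os
    from azure.storage.queue import QueueClient

    def process(body):
        print("processing", body)  # stand-in for the real compute-heavy work

    # Producer: enqueue a task instead of calling a worker directly, so the two
    # sides scale independently and a crashed worker never loses the task.
    queue = QueueClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"],  # placeholder setting name
        queue_name="compute-tasks",
    )
    queue.send_message('{"taskId": "42", "action": "translate"}')

    # Consumer: any number of compute instances can drain the same queue; a
    # message is deleted only after it has been processed successfully.
    for message in queue.receive_messages():
        process(message.content)
        queue.delete_message(message)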

What you end up with is a loosely coupled but highly fault-tolerant, highly scalable, flexible configuration with well-separated concerns.

We are close to deploying this new platform and I’m very excited about it. We have no single points of failure and extensibility points throughout: a very robust and scalable infrastructure.

We’ve prototyped bits of this on both AWS and Azure and are confident that deployment on either is workable.

No doubt we’ll hit limitations or problems at some point but right now the return on investment looks unbeatable.