It is well known that you can produce relatively good-quality machine translations by doing the following:
- Carry out some processing on the source language.
For example, remove text that serves no purpose in the translation (say, imperial measurements in content destined for Europe), re-order lengthy sentences, or mark the boundaries of embedded tags.
- Use custom, domain-trained machine translation engines.
This is possible with several machine translation providers. If you have enough good-quality bilingual and monolingual corpora relevant to your subject matter, you can train and build engines that produce higher-quality output than a general-purpose public engine.
- Post-process the raw machine translation output to correct recurrent errors.
For example, improve overall fluency or replace specific terminology.
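To make the pre- and post-processing steps concrete, here is a minimal Python sketch. It is not our production rule set: the regular expression, the terminology table and the function names are all assumptions made up for illustration.

```python
import re
from typing import Mapping

# Hypothetical pre-edit rule: drop parenthesised imperial measurements
# such as "(12 in)" or "(3.5 lb)" before the text is sent for translation.
IMPERIAL = re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:in|ft|lb|oz|mi)\.?\s*\)")

# Hypothetical post-edit terminology table: term produced by the engine -> preferred term.
TERMINOLOGY = {"Rechner": "Computer"}


def preedit(source: str) -> str:
    """Remove text that serves no purpose in the translation."""
    return IMPERIAL.sub("", source)


def postedit(target: str, terms: Mapping[str, str] = TERMINOLOGY) -> str:
    """Replace whole-word occurrences of unwanted terms in raw MT output."""
    for wrong, right in terms.items():
        target = re.sub(rf"\b{re.escape(wrong)}\b", right, target)
    return target
```

With these rules, `preedit("The cable is 30 cm (12 in) long.")` yields `"The cable is 30 cm long."`, and `postedit("Der Rechner startet.")` yields `"Der Computer startet."`. Real rules would be more numerous and language-pair specific.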
We decided to implement this in a fully automated Azure Functions pipeline.
NOTE: Some MT providers have this capability built into their services, but we wanted the centralized flexibility to control the pre- and post-editing rules and to mix and match the MT providers we get translations from.
The pipeline consists of three functions: preedit, translate and postedit. The JSON payload used for inter-function communication is Jliff. Jliff is an open object graph serialization format specification being designed by an OASIS Technical Committee.
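The shape of the three-stage flow can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the payload is a deliberately simplified stand-in (a `units` list with `id`, `source` and `target` keys) and not the actual Jliff schema, the functions are ordinary Python functions rather than Azure Functions bindings, and `fake_provider` is a hypothetical placeholder for a real MT provider call.

```python
import json
from typing import Callable, Dict


def preedit(payload: str) -> str:
    """Stage 1: clean up each source segment (placeholder rule: normalise whitespace)."""
    doc = json.loads(payload)
    for unit in doc["units"]:
        unit["source"] = " ".join(unit["source"].split())
    return json.dumps(doc)


def translate(payload: str, provider: Callable[[str], str]) -> str:
    """Stage 2: fill in targets using whichever provider callable is passed in."""
    doc = json.loads(payload)
    for unit in doc["units"]:
        unit["target"] = provider(unit["source"])
    return json.dumps(doc)


def postedit(payload: str, terms: Dict[str, str]) -> str:
    """Stage 3: apply terminology replacements to the raw MT output."""
    doc = json.loads(payload)
    for unit in doc["units"]:
        for wrong, right in terms.items():
            unit["target"] = unit["target"].replace(wrong, right)
    return json.dumps(doc)


def run_pipeline(payload: str, provider: Callable[[str], str], terms: Dict[str, str]) -> str:
    """Chain the three stages; each one receives and returns the same JSON payload."""
    return postedit(translate(preedit(payload), provider), terms)


def fake_provider(text: str) -> str:
    """Hypothetical stand-in for an MT provider: just upper-cases the text."""
    return text.upper()
```

Because every stage takes and returns the same JSON document, swapping the provider or changing the editing rules does not affect the other stages, which is the flexibility we were after.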
NOTE: Jliff is still in the design phase, but I’m impatient and it seemed like a good way to test the current snapshot of the format.
The whole thing is easily re-configured and re-deployed, and has all the advantages of an Azure consumption plan, such as pay-per-execution billing and automatic scaling.
We can see that this pipeline would be a good candidate for Durable Functions, so once we have time we’ll take a look at those.