We released Ocelot 2.1 today at http://bit.ly/296H0J4. It includes bidirectional language support, an awesome new fully configurable ITS LQI grid with intuitive hot keys, saving of notes in XLIFF 1.2, and search and replace.
This week we will carry out final integration and deployment tests on our distributed pipeline for large-scale, continuous translation scenarios that heavily leverage machine translation.
The platform features several configurable services that can be switched on as required. These include:
- automated source pre-editing prior to passing to a choice of custom machine translation engines;
- integrated pre-MT translation memory leverage;
- automated post-edit of raw machine translation prior to human post-edit;
- in-process, low-friction capture of actionable feedback on MT output from humans;
- automated post-processing of human post-edit;
- automated capture of edit distance data for BI and reporting.
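The edit-distance capture in the final item can be sketched in a few lines. The following is a minimal illustration of the idea, a character-level Levenshtein distance between raw MT output and its human post-edit, not the platform's actual implementation; the sample strings and the normalisation are invented for the example:

```python
# Minimal sketch of edit-distance capture between raw MT output and its
# human post-edit. Illustration only, not the platform's actual code.

def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

raw_mt = "The cat sat in the mat"
post_edit = "The cat sat on the mat"
distance = levenshtein(raw_mt, post_edit)

# Normalised effort score suitable for BI reporting:
# 0.0 = segment untouched, 1.0 = segment completely rewritten.
effort = distance / max(len(raw_mt), len(post_edit))
```

A per-segment score like this, aggregated across jobs, is the kind of data that feeds BI dashboards and post-editing productivity reports.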
The only components still missing are the text analysis and text classification algorithms, which will be integrated during May and will give us the ability to perform automated quality assurance on every single segment. Yes, every segment – no spot-checking or limited-scope audits.
The platform is distributed and utilises industry-standard formats, including XLIFF and ITS, making it wholly scalable and extensible. Of note is that this platform delivers on all six of the trends recently publicised by Kantan. Thanks to Olga O’Laoghaire, who made significant contributions to the post-editing components, and to Ferenc Dobi, lead architect and developer.
I’m very excited to see this project come to fruition. It doesn’t just represent the ability for us to generate millions of words of translated content; it delivers a controlled environment in which we can apply state-of-the-art techniques that are highly optimised at every stage, measurable, and designed to target the goal of fully automated usable translation (FAUT).
I have been working on improving the physical integration of Review Sentinel with SDL WorldServer using Microsoft Azure Blob Storage and SendGrid notifications.
Programming to the Enterprise Service Bus architectural model always makes me smile: loosely coupled applications collaborating to provide a distributed, scalable, fault-tolerant workflow.
The diagram below shows the overall architecture. The small bit of impedance is the lack of support in SDL WorldServer for the ITS 2.0 Localisation Quality metadata category, which Review Sentinel uses to serialise its conformance scores within XLIFF.
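For illustration, ITS 2.0 Localisation Quality metadata can be serialised within an XLIFF 1.2 trans-unit roughly as follows. This is a hand-written sketch: the namespaces and attribute names follow the ITS 2.0 specification, but the scores, issue details and text are invented, and this is not necessarily how Review Sentinel itself serialises its output:

```xml
<!-- Sketch: ITS 2.0 Localisation Quality metadata inside an XLIFF 1.2
     trans-unit. Scores, issue details and text are invented examples. -->
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:its="http://www.w3.org/2005/11/its">
  <file source-language="en" target-language="de"
        datatype="plaintext" original="example.txt">
    <body>
      <trans-unit id="1" its:locQualityRatingScore="78.5">
        <source>Press the red button.</source>
        <target>
          <mrk mtype="x-its"
               its:locQualityIssueType="terminology"
               its:locQualityIssueSeverity="60"
               its:locQualityIssueComment="Non-approved term used">Drücken Sie den roten Knopf.</mrk>
        </target>
      </trans-unit>
    </body>
  </file>
</xliff>
```

Because the metadata rides inside the XLIFF itself, any ITS-aware tool downstream can act on the scores without a side channel – which is exactly what WorldServer's lack of ITS 2.0 support makes awkward.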
For the last three days I have been at META-FORUM 2013 in Berlin.
Many of my Multilingual Web – Language Technologies partners and I presented our project’s final deliverables. My presentation was well received, though it didn’t generate as much coffee-break interaction as I’d hoped. Nevertheless, one of those conversations is worth following up on. [Update 2013-10-23: My presentation can be viewed on YouTube.]
Overall MLW-LT received great feedback from our Project Manager within the EU. I attribute our success to our lead coordinator, Felix Sasaki of DFKI, and the unique mix of project partners. I would certainly want to work with them again.
The conference was well attended – 260 I think was the figure. This was probably because Kimmo Rossi of DG CNECT was presenting on Horizon 2020 and CEF.
One of the highlights of the conference was the keynote by Daniel Marcu, Chief Science Officer of SDL, whose talk was entertaining, engaging, honest and thought-provoking. I wish I had recorded it on my iPad, but that is distracting; hopefully it’ll be available on the conference web site eventually.
It was great to meet Sebastian Hellmann of University of Leipzig in person. Sebastian is lead editor of the NLP Interchange Format 2.0 specification.
Good trip though I didn’t get to see much of Berlin this time around.
On 27th May we finalised release 1.0 of Reviewer’s Workbench (RW). RW represents the culmination of several strands of research and development that I have been involved with over the last couple of years.
In 2011 I set up Digital Linguistics to sell Review Sentinel, the world’s first text-analytics-based Language Quality Assurance technology. I first publicly presented Review Sentinel at the TAUS User Conference held in Seattle in October 2012.
In January 2012 I became a member of the Multilingual Web – Language Technologies Working Group. Funded by the EU, this W3C Working Group is responsible for defining and publishing the ITS 2.0 standard. ITS 2.0 is now at the Last Call stage of the W3C process.
I can safely assert that Reviewer’s Workbench is the world’s first editor to utilise text analytics and other metadata that can be encoded with ITS 2.0 – such as machine translation confidence scores, term disambiguation information and translation-process-related provenance – to bring new levels of performance to the tasks of linguistic review and post-editing. What’s more, Reviewer’s Workbench is completely interoperable with industry standards like XLIFF and toolsets such as the Okapi Framework.
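In HTML5, the kinds of ITS 2.0 metadata mentioned above are carried in “its-”-prefixed attributes. The snippet below is a hand-written sketch of what such annotations look like; the attribute names follow the ITS 2.0 specification, but the score, names and URLs are invented for illustration:

```html
<!-- Sketch of ITS 2.0 metadata an ITS-aware editor could consume.
     The confidence score, person, and tool URL are invented examples. -->
<p its-annotators-ref="mt-confidence|http://example.com/mt-engine"
   its-mt-confidence="0.78"
   its-person="A. Reviewer"
   its-tool="http://example.com/post-edit-tool">
  A machine-translated sentence awaiting review.
</p>
```

An editor that reads these attributes can, for example, sort or colour segments by MT confidence so reviewers spend their time where it matters most.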
Reviewer’s Workbench allows you to personalise how all of this important, contextual data is visualised, informing and directing post-editing and linguistic review effort.
This is just the beginning. Feature set planning for release 2.0 is already very advanced and includes more state-of-the-art facilities. Stay tuned!
I presented on the Internationalization Tag Set 2.0 and gave a demonstration of Reviewer’s Workbench at yesterday’s GALA “Innovations in Language Technology” pre-Think Latin America event. It seemed to go well: I couldn’t spot anyone sleeping.
Highlights of the various presentations
Vincent Wade, CNGL – Research at CNGL
Prof. Vincent Wade, Director of CNGL, set the stage for the afternoon by talking about the challenges of volume, variety and velocity and the arrival of Intelligent Content, followed by an overview of the research activities at the Centre.
Steve Gotz, CNGL – Invention versus Innovation
Steve Gotz talked knowledgeably (as he always does) about the differences between invention and innovation. Seemingly our industry has been guilty of only doing incremental innovation rather than disruptive invention. Luckily CNGL can help with the latter.
Tony O’Dowd, Kantan – Machine Translation and Quality
Tony talked about the dichotomy of machine translation quality metrics used by system developers versus the measurements that are more of interest to those downstream from the raw MT output: Post-Editors, Project Managers, etc. He proposed an interesting way of bridging this divide.
Reinhard Schäler, Rosetta Foundation – Collaborative Translation and Non-market Localization Models
Reinhard talked about the great work that is being done by volunteer translators and how this highly collaborative model could influence the future of the industry in the medium to long term. He also covered the Open Source Solas localization platform which is the backbone of the Rosetta production environment and includes a component called “Solas Match”: a dating application for “connecting translators to content”.
Between presentations there were some stimulating and interesting discussions around the impact that disruptive technologies could have on the industry, the challenges of carrying out innovation in the industry, the future of Language Service Providers, and non-market localization.
This type of conversation probably doesn’t happen often enough in the industry, particularly between service providers, possibly because we are all concerned about differentiating our offerings. However, as Arle Lommel pointed out to me, if those differentiating factors can be assimilated by someone else within the space of an afternoon, they probably weren’t much of a differentiator!
Global Intelligent Content
As Chief Technology Officer of VistaTEC, I was fortunate to be one of the founding Industrial Partners of the Science Foundation Ireland funded Centre for Next Generation Localisation (CNGL). CNGL has just received support for a further term with the overall research theme of “Global Intelligent Content”. I therefore thought it appropriate that my first post should actively demonstrate and support this vision.
So, what’s so “intelligent” about this post?
If you have any basic understanding of HTML you’ll know that the page you’re reading is composed of mark-up tags (elements) such as <h1>. The mark-up allows your browser to display the page such that it is easy to comprehend (i.e. headings, paragraphs, bold, italic, etc.) and also interact with (i.e. hyperlinks to other related web documents). You may also know that it can contain “keywords” or “tags”: individual words or phrases which indicate to search engines what the subject matter of this post is. This post certainly does contain all of these.
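Put together, the mark-up described above amounts to something like the following. This is a simplified, hand-written sketch of a page’s structure, not this page’s actual source; the titles and keywords are illustrative:

```html
<!-- Simplified sketch of an HTML page with headings, a hyperlink,
     and keyword metadata. Values are illustrative. -->
<html>
  <head>
    <title>Global Intelligent Content</title>
    <meta name="keywords" content="localisation, intelligent content, CNGL">
  </head>
  <body>
    <h1>Global Intelligent Content</h1>
    <p>Paragraph text with a
       <a href="http://www.cngl.ie/">hyperlink to a related document</a>.</p>
  </body>
</html>
```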
The page also includes a lot of “metadata”. This metadata conforms to two standards, each of which is set to transform the way in which multilingual intelligent content is produced, published, discovered and consumed.
Resource Description Framework in Attributes
In layman’s terms RDFa is a way of embedding sense and definition into a document in such a way that non-human agents (machines and computer programs) can read and “understand” the content. RDFa is one mechanism for building the Multilingual Semantic Web.
If you right-click this page in your browser and choose “View Source” you’ll see that it contains attributes (things which allow generic HTML tags to have more unique characteristics) such as typeof. These allow web robots to understand those parts of the content that I have decorated at a much more fundamental level. For example, that I created the page, the vocabulary that I have used to describe people, organisations and concepts within the document, and details about them. This data can form the basis of wider inferences regarding personal and content relationships.
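For illustration, RDFa decoration of the kind described above looks roughly like this. This is a hand-written sketch using the schema.org vocabulary; the name and organisation shown are placeholders, not taken from this page’s actual source:

```html
<!-- Sketch of RDFa annotation using the schema.org vocabulary.
     The person and organisation names are illustrative placeholders. -->
<p vocab="http://schema.org/" typeof="Person">
  Written by <span property="name">A. Author</span>,
  <span property="jobTitle">Chief Technology Officer</span> of
  <span property="affiliation" typeof="Organization">
    <span property="name">Example Corp</span>
  </span>.
</p>
```

A crawler that understands RDFa can extract machine-readable triples from this (the author’s name, job title and affiliation) without any natural-language processing.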
Internationalization Tag Set 2.0
ITS 2.0 is a brand new W3C standard, developed by the Multilingual Web – Language Technologies Working Group – funded by the European Commission and part of the W3C Internationalization Activity. Its goal is to define categories of metadata relating to the production and publishing of multilingual web content.
To exemplify this, the overview of ITS 2.0 below was translated from German to English using the Microsoft Bing machine translation engine. Viewing the source of this page and searching for “its-” will locate the ITS Localization Quality metadata with which I annotated the translations so as to capture my review of the target English.
“The goal of MultilingualWeb LT (multilingual Web – language technologies) it is to demonstrate how such metadata encoded, safely passed on and used in various processes such as Lokalisierungsworkflows can be and frameworks such as Okapi, machine translation, or CMS systems like Drupal.
Instead of a theoretical and institutional approach to standardization, LT-Web aims to develop implementations, which concretely demonstrates the value of metadata with real systems and users. The resulting conventions and results are documented and published as a W3C standard, including the necessary documentation, data and test suite, as the W3C standardization process requires it.”
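The “its-” annotations mentioned above look roughly like this in HTML5. This is a hand-written sketch using the ITS 2.0 Localization Quality Issue attributes; the issue type, severity and comment are my own illustrative values, not the actual annotations in this page’s source:

```html
<!-- Sketch: an ITS 2.0 Localization Quality Issue recorded against a
     reviewed MT segment. Issue details are illustrative examples. -->
<span its-loc-quality-issue-type="untranslated"
      its-loc-quality-issue-severity="70"
      its-loc-quality-issue-comment="'Lokalisierungsworkflows' left in German">
  ...such as Lokalisierungsworkflows can be...
</span>
```

Because the review verdicts are encoded in the page itself, any ITS-aware process downstream – a retranslation workflow, a quality dashboard – can consume them directly.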
I’m very excited about Global Intelligent Content. This post is a very small and personal contribution to the vision but hopefully it illustrates in a simple way what it is about and some of its possibilities.