Variety is the Spice of Life

January has been wonderfully varied on the work front. Most days have brought new learning.

In thinking about my R&D Manifesto I decided it was time to revisit Neural Networks and Semantics. I’m not averse to learning new development languages and environments, but when you want to evaluate ideas quickly you tend towards the familiar. For this reason Encog looks interesting.

The Centre for Global Intelligent Content, which VistaTEC have been industry partners of since its inception, received funding for a further two and a half years in October of last year. As a consequence there have been numerous meetings. It’s really exciting to see how the centre has evolved and honed its process for bringing innovation to the market. A key element of this process is the d.lab under the direction of Steve Gotz. In my view Steve has been one of the notable personalities in CNGL. He has a great, broad knowledge of the technology, innovation and start-up landscapes, and excellent business acumen.

Two interesting pieces of technology were shown to centre members recently. The first, named MTMPrime, is a component which can assess translation memory matches alongside machine translation output in real time and, based on confidences, recommend which one to use. The second is a machine translation incremental learning component which can profile a document and suggest the most efficient path to translating it, given the algorithm’s analysis of the incremental benefit that would be realized from translating segments in a particular order. Basically it works out the bang-for-buck for translating segments.
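To make the first idea a little more concrete, here is a minimal sketch of what "recommend by confidence" could look like. It is purely my own illustration and not how MTMPrime is implemented: the `Candidate` type, the 0 to 1 confidence scale and the tie-breaking rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A proposed translation for one segment (hypothetical structure)."""
    source: str        # "TM" for a translation memory match, "MT" for machine translation
    text: str          # the proposed target text
    confidence: float  # assumed score between 0.0 and 1.0


def recommend(tm: Candidate, mt: Candidate) -> Candidate:
    """Return the candidate with the higher confidence.

    Ties go to the translation memory match, on the assumption that
    previously reviewed human translations are the safer default.
    """
    return mt if mt.confidence > tm.confidence else tm


if __name__ == "__main__":
    tm_match = Candidate("TM", "Open the file menu.", 0.72)
    mt_output = Candidate("MT", "Open the File menu.", 0.88)
    print(recommend(tm_match, mt_output).source)
```

A real component would of course need comparable confidence scores from both sources, which is the hard part.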

In discussing semantics and disambiguation, Steve pointed me at Open Calais. This is a service which, like Enrycher, parses content and automatically adds semantic metadata for the named entities and subjects that it “recognizes”. The picture below shows the result of passing this post through the Open Calais Viewer.


We’ve had some very interesting customer inquiries too. Too early to talk about them but I hope that we get more requests for these types of engagements and services. If any come to fruition I’ll blog about them later.

Finally, we did some small updates to Ocelot:

  • New configuration file for plug-ins,
  • Native launch experience for Windows and Mac, and
  • Native hot-key experiences within Windows and Mac interfaces.

Long may this variety continue.

GALA Innovations in Language Technology

I presented on the Internationalization Tag Set 2.0 and gave a demonstration of Reviewer’s Workbench at yesterday’s GALA “Innovations in Language Technology” pre-Think Latin America event. It seemed to go well: I couldn’t spot anyone sleeping.

Highlights of the various presentations

Vincent Wade, CNGL – Research at CNGL

Prof. Vincent Wade, Director of CNGL, set the stage for the afternoon by talking about the challenges of volume, variety and velocity and the arrival of Intelligent Content, followed by an overview of the research activities at the Centre.

Steve Gotz talked knowledgeably (as he always does) about the differences between invention and innovation. Seemingly our industry has been guilty of only doing incremental innovation rather than disruptive invention. Luckily CNGL can help with the latter.

Tony O’Dowd, Kantan – Machine Translation and Quality

Tony talked about the dichotomy of machine translation quality metrics used by system developers versus the measurements that are more of interest to those downstream from the raw MT output: Post-Editors, Project Managers, etc. He proposed an interesting way of bridging this divide.

Reinhard Schäler, Rosetta Foundation – Collaborative Translation and Non-market Localization Models

Reinhard talked about the great work that is being done by volunteer translators and how this highly collaborative model could influence the future of the industry in the medium to long term. He also covered the Open Source Solas localization platform which is the backbone of the Rosetta production environment and includes a component called “Solas Match”: a dating application for “connecting translators to content”.


Between presentations there were some stimulating discussions around the impact that disruptive technologies could have on the industry, the challenges of carrying out innovation in the industry, the future of Language Service Providers and non-market localization.

This type of conversation probably doesn’t happen often enough in the industry, particularly between the service providers, possibly because we are all concerned about differentiating our offerings. However, as Arle Lommel pointed out to me, if a differentiating factor can be assimilated by someone else within the space of an afternoon, it probably wasn’t much of a differentiator!

A Personal Contribution to Global Intelligent Content

Global Intelligent Content

As Chief Technology Officer of VistaTEC, I was fortunate to be one of the founding Industrial Partners of the Science Foundation Ireland funded Centre for Next Generation Localisation (CNGL). CNGL has just received support for a further term with the overall research theme of “Global Intelligent Content”. I therefore thought it appropriate that my first post should actively demonstrate and support this vision.

So, what’s so “intelligent” about this post?

If you have a basic understanding of HTML you’ll know that the page you’re reading is composed of mark-up tags (elements) such as <p>, <span> and <h1>. The mark-up allows your browser to display the page such that it is easy to comprehend (e.g. headings, paragraphs, bold, italic) and also to interact with (e.g. hyperlinks to other related web documents). You may also know that it can contain “keywords” or “tags”: individual words or phrases which indicate to search engines what the subject matter of this post is. The post certainly does contain all of these.

The page also includes a lot of “metadata”. This metadata conforms to two standards, each of which is set to transform the way in which multilingual intelligent content is produced, published, discovered and consumed.

Resource Description Framework in Attributes

In layman’s terms RDFa is a way of embedding sense and definition into a document in such a way that non-human agents (machines and computer programs) can read and “understand” the content. RDFa is one mechanism for building the Multilingual Semantic Web.

If you right-click this page in your browser and choose “View Source” you’ll see that it contains attributes (things which allow generic HTML tags to have more specific characteristics) such as property and typeof. These allow web robots to understand those parts of the content that I have decorated at a much more fundamental level. For example, they can tell that I created the page, which vocabulary I have used to describe people, organisations and concepts within the document, and details about them. This data can form the basis of wider inferences regarding personal and content relationships.
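For the curious, here is a small sketch of how a program might pick out those attributes. The HTML fragment is a made-up example rather than the actual source of this page, and the schema.org vocabulary and the names in it are assumptions chosen for illustration. It uses only Python’s standard library, and it simply lists the attributes; a real RDFa processor would go on to build proper subject, predicate and object statements from them.

```python
from html.parser import HTMLParser

# Hypothetical RDFa-annotated fragment; vocabulary and values are illustrative only.
RDFA_SAMPLE = """
<div vocab="http://schema.org/" typeof="BlogPosting">
  <h1 property="headline">A Personal Contribution</h1>
  <span property="author" typeof="Person">
    <span property="name">Example Author</span>
  </span>
</div>
"""


class RDFaAttributeCollector(HTMLParser):
    """Collects (tag, attribute, value) tuples for a few RDFa attributes."""

    RDFA_ATTRIBUTES = ("vocab", "typeof", "property")

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name in self.RDFA_ATTRIBUTES:
                self.found.append((tag, name, value))


def collect_rdfa(html_text):
    """Return every RDFa attribute found in html_text, in document order."""
    collector = RDFaAttributeCollector()
    collector.feed(html_text)
    return collector.found


if __name__ == "__main__":
    for tag, name, value in collect_rdfa(RDFA_SAMPLE):
        print(f"<{tag}> {name}={value}")
```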

Internationalization Tag Set 2.0

ITS 2.0 is a brand new W3C standard being developed by the MultilingualWeb-LT (Language Technologies) Working Group, which is funded through the European Commission and is part of the W3C Internationalization Activity. Its goal is to define categories of metadata relating to the production and publishing of multilingual web content.

To exemplify this, the overview of ITS 2.0 below was translated from German to English using the Microsoft Bing machine translation engine. Viewing the source of this page and searching for “its-” will locate ITS Localization Quality metadata that I annotated the translations with so as to capture my review of the target English.

“The goal of MultilingualWeb LT (multilingual Web – language technologies) it is to demonstrate how such metadata encoded, safely passed on and used in various processes such as Lokalisierungsworkflows can be and frameworks such as Okapi, machine translation, or CMS systems like Drupal.
Instead of a theoretical and institutional approach to standardization, LT-Web aims to develop implementations, which concretely demonstrates the value of metadata with real systems and users. The resulting conventions and results are documented and published as a W3C standard, including the necessary documentation, data and test suite, as the W3C standardization process requires it.”
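If you would rather not search the source by hand, a sketch along the following lines could pull out the “its-” attributes for you. The annotated fragment is hypothetical: I have marked up the untranslated German word from the passage above purely as an example, and the comment and severity value are assumptions rather than the annotations actually embedded in this page. The attribute names themselves are the HTML5 forms of the ITS 2.0 Localization Quality Issue data category.

```python
from html.parser import HTMLParser

# Hypothetical annotation of one phrase from the machine-translated text above.
ITS_SAMPLE = (
    '<p>used in various processes such as '
    '<span its-loc-quality-issue-type="untranslated" '
    'its-loc-quality-issue-comment="German term left in the English target" '
    'its-loc-quality-issue-severity="50">Lokalisierungsworkflows</span> can be</p>'
)


class ItsAttributeCollector(HTMLParser):
    """Collects, per element, every attribute whose name starts with 'its-'."""

    def __init__(self):
        super().__init__()
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        its_attrs = {name: value for name, value in attrs if name.startswith("its-")}
        if its_attrs:
            self.annotations.append(its_attrs)


def collect_its(html_text):
    """Return one dictionary of ITS attributes per annotated element."""
    collector = ItsAttributeCollector()
    collector.feed(html_text)
    return collector.annotations


if __name__ == "__main__":
    for annotation in collect_its(ITS_SAMPLE):
        for name, value in annotation.items():
            print(f"{name}: {value}")
```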


I’m very excited about Global Intelligent Content. This post is a very small and personal contribution to the vision but hopefully it illustrates in a simple way what it is about and some of its possibilities.