

The web site for our new European Commission funded Horizon 2020 project went live on 2015-03-27. I’m very excited about this project. It encompasses many important current topics: big linguistic linked data, the Semantic Web, NLP technologies, linguistic linked data interoperability, and intelligent, enriched content.

My goals for the project include new features for our open-source editor, Ocelot. The planned features will integrate it further with other linguistic technologies and standards, not least the Semantic Web and the Linguistic Linked Data clouds themselves.

Having missed the project kick-off in Berlin in February, I’m looking forward to meeting all of the world-class academic and industry partners.


My crazy idea for NIF

I was recently invited to join a LIDER call to talk about my use case ideas for NIF. Here’s what the idea is:


We often have to provide translations for content which is not literal but metaphorical or idiomatic. For example, “My destination is only a ‘hop-and-a-skip’ from my home.” might be translated as “Mein Ziel ist nur ein ‘Katzen-sprung’ von meinem Zuhause.”.

Describing the relationship in NIF

I suggest that you could model this translation as:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix vt: <http://www.vistatec.com/rdf> .

<http://example.com/exampledoc.html#char=0,55>
    a nif:Context , nif:RFC5147String ;
    nif:beginIndex "0" ;
    nif:endIndex "55" ;
    nif:isString "My destination is only a 'hop-and-a-skip' from my home." .

<http://example.com/exampledoc.html#char=26,40>
    a nif:RFC5147String ;
    nif:beginIndex "26" ;
    nif:endIndex "40" ;
    nif:referenceContext <http://example.com/exampledoc.html#char=0,55> ;
    itsrdf:hasLocQualityIssue [
        a itsrdf:LocQualityIssue ;
        itsrdf:locQualityIssueType "uncategorized"
    ] .

<http://example.com/exampledoc-de.html#char=0,57>
    a nif:Context , nif:RFC5147String ;
    nif:beginIndex "0" ;
    nif:endIndex "57" ;
    nif:isString "Mein Ziel ist nur ein 'Katzen-sprung' von meinem Zuhause." .

<http://example.com/exampledoc-de.html#char=23,36>
    a nif:RFC5147String ;
    nif:beginIndex "23" ;
    nif:endIndex "36" ;
    nif:referenceContext <http://example.com/exampledoc-de.html#char=0,57> ;
    itsrdf:hasLocQualityIssue [
        a itsrdf:LocQualityIssue ;
        itsrdf:locQualityIssueType "uncategorized"
    ] .

<http://example.com/exampledoc.html#char=26,40>
    vt:translatedAs <http://example.com/exampledoc-de.html#char=23,36> .

Surely then this model could be extended to give a comprehensive representation of such non-literal translations in a way that NLP tools could consume.
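As a first sketch of such an extension, the translation link could be reified so that its non-literal nature is explicit and machine-readable. Every vt: term below other than vt:translatedAs is invented here for illustration and is not taken from any published vocabulary:

```turtle
@prefix vt: <http://www.vistatec.com/rdf> .

# Hypothetical extension: vt:translation, vt:NonLiteralTranslation,
# vt:sourceExpressionType, vt:targetExpressionType and vt:strategy
# are all made up for this sketch.
<http://example.com/exampledoc.html#char=26,40>
    vt:translatedAs <http://example.com/exampledoc-de.html#char=23,36> ;
    vt:translation [
        a vt:NonLiteralTranslation ;
        vt:sourceExpressionType "idiom" ;
        vt:targetExpressionType "idiom" ;
        vt:strategy "substitute an equivalent target-language idiom"
    ] .
```

An NLP tool consuming this could then distinguish idiom-for-idiom substitutions from literal translations when mining parallel corpora.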


Linking Data in the Mediterranean

I am just back from the European Data Forum and the Linked Data for Language Technologies Workshop. The co-location of these two events meant that many of the leaders in linked data and the digital representation of linguistic and knowledge concepts were in one place.

The presentation that stood out for me at the EDF was the second-day keynote by Ralf-Peter Schaefer of TomTom. Seeing how they use their nine trillion (and counting) data points to find patterns in, and make predictions for, traffic conditions was very interesting.

The LD4LT Workshop was very productive despite the virtually non-existent free Wi-Fi, and totally non-existent pay-as-you-go Wi-Fi, for the whole duration of the two events. It definitely brought home to me the importance of connectivity these days. I had convinced myself to leave my MiFi at home; I won’t do so in future.

Presentations I took note of during LD4LT were about webLyzard and Rozeta.

I presented my three industry challenges and my hopes for how linguistically motivated linked data might help me solve them.

I used the travel time to learn about the AngularJS project. Pretty impressive stuff. I particularly liked the way the framework handles the binding and referencing of data within the rendered HTML UI without explicitly needing to use data-* attributes to store object IDs.
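A minimal sketch of what impressed me, assuming AngularJS 1.x (the version and CDN URL below are illustrative): the ng-model directive and the {{ }} expression bind directly to scope data, with no data-* attribute bookkeeping.

```html
<!-- Illustrative AngularJS 1.x two-way binding: typing in the input
     immediately updates the paragraph, with no data-* attributes. -->
<!DOCTYPE html>
<html ng-app>
<head>
  <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.2.16/angular.min.js"></script>
</head>
<body ng-init="greeting = 'Hello'">
  <input type="text" ng-model="greeting">
  <p>{{ greeting }}, Semantic Web!</p>
</body>
</html>
```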

I was hoping to be able to document the trials I have been doing with DITA and XLIFF+ITS using the XLIFF-DITA Roundtrip toolkit, but I got stuck at the last hurdle; figuring out the issue would require a deeper understanding of the toolkit.

We finished our assessment of supporting XLIFF 2.0 in Ocelot. It currently looks as though we will introduce a layer of abstract data model façade objects between the Okapi XLIFF filter classes and Ocelot.

I must dedicate some time to my H2020 proposal.

I compensated for my lack of my regular 2 kilometre daily walk with a day-long urban hike around Athens, taking in the Acropolis (naturally), its museum (I’d like to point out that Thomas Bruce, 7th Earl of Elgin, was Scottish) and the chapel of St. George at the summit of Lycabettus Hill, which offers truly stunning 360-degree views over Athens as far as the Aegean Sea. Unfortunately I didn’t have time to visit Kastella Hill near Piraeus.

A Personal Contribution to Global Intelligent Content

Global Intelligent Content

As Chief Technology Officer of VistaTEC, I was fortunate to be one of the founding Industrial Partners of the Science Foundation Ireland funded Centre for Next Generation Localisation (CNGL). CNGL has just received support for a further term with the overall research theme of “Global Intelligent Content”. I therefore thought it appropriate that my first post should actively demonstrate and support this vision.

So, what’s so “intelligent” about this post?

If you have a basic understanding of HTML you’ll know that the page you’re reading is composed of mark-up tags (elements) such as <p>, <span> and <h1>. The mark-up allows your browser to display the page such that it is easy to comprehend (headings, paragraphs, bold, italic, etc.) and to interact with (hyperlinks to other related web documents). You may also know that it can contain “keywords” or “tags”: individual words or phrases which indicate to search engines what the subject matter of the post is. This post certainly does contain all of these.

The page also includes a lot of “metadata”. This metadata conforms to two standards, each of which is set to transform the way in which multilingual intelligent content is produced, published, discovered and consumed.

Resource Description Framework in Attributes

In layman’s terms RDFa is a way of embedding sense and definition into a document in such a way that non-human agents (machines and computer programs) can read and “understand” the content. RDFa is one mechanism for building the Multilingual Semantic Web.

If you right-click this page in your browser and choose “View Source” you’ll see that it contains attributes (which give generic HTML tags more specific characteristics) such as property and typeof. These allow web robots to understand, at a much more fundamental level, the parts of the content that I have decorated: for example, that I created the page, the vocabulary that I have used to describe the people, organisations and concepts within the document, and details about them. This data can form the basis of wider inferences regarding personal and content relationships.
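RDFa markup along these lines (illustrative only, not the exact markup of this page) is enough for a parser to extract a statement such as “the author of this post works for VistaTEC”:

```html
<!-- Illustrative RDFa: vocab, typeof and property attributes turn
     ordinary HTML into machine-readable statements. -->
<div vocab="http://schema.org/" typeof="BlogPosting">
  <span property="author" typeof="Person">
    The <span property="jobTitle">Chief Technology Officer</span> of
    <span property="worksFor" typeof="Organization">
      <span property="name">VistaTEC</span>
    </span>
  </span>
</div>
```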

Internationalization Tag Set 2.0

ITS 2.0 is a brand new W3C standard, developed by the MultilingualWeb-LT (Multilingual Web Language Technologies) Working Group, part of the W3C Internationalization Activity, with funding from the European Commission. Its goal is to define categories of metadata relating to the production and publishing of multilingual web content.

To exemplify this, the overview of ITS 2.0 below was translated from German to English using the Microsoft Bing machine translation engine. Viewing the source of this page and searching for “its-” will locate the ITS Localization Quality metadata with which I annotated the translation, so as to capture my review of the English target.
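In HTML5, ITS 2.0 Localization Quality Issue metadata takes the form of its-* attributes on an element wrapping the reviewed text; the type, comment and severity values in this sketch are made up for the example:

```html
<!-- Illustrative ITS 2.0 Localization Quality Issue annotation:
     a reviewer's finding recorded inline on the MT output. -->
<span its-loc-quality-issue-type="grammar"
      its-loc-quality-issue-comment="MT output: verb placement needs review"
      its-loc-quality-issue-severity="60">
  how such metadata encoded, safely passed on and used
</span>
```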

“The goal of MultilingualWeb LT (multilingual Web – language technologies) it is to demonstrate how such metadata encoded, safely passed on and used in various processes such as Lokalisierungsworkflows can be and frameworks such as Okapi, machine translation, or CMS systems like Drupal.
Instead of a theoretical and institutional approach to standardization, LT-Web aims to develop implementations, which concretely demonstrates the value of metadata with real systems and users. The resulting conventions and results are documented and published as a W3C standard, including the necessary documentation, data and test suite, as the W3C standardization process requires it.”


I’m very excited about Global Intelligent Content. This post is a very small and personal contribution to the vision but hopefully it illustrates in a simple way what it is about and some of its possibilities.