Parsing, Recursion and the Observer Pattern

I have worked for a while now with two serializations of the XLIFF Object Model: XLIFF and JLIFF (which is still in draft). I have had occasion to write out each as the result of parsing some proprietary content format in order to facilitate easy interoperability within our tool chain, and round-tripping one serialization with the other.

Whilst both are hierarchical formats, parsing them recursively requires a different strategy for each.

With XLIFF (XML), each opening element has all of its attributes available immediately. This means you can construct an object graph as you go: instantiate the object, set all of its attributes, make any decisions based on them, and push the object onto a stack so that you can keep track of where you are in the object model. This all works nicely with the Observer pattern: you can subscribe to events which fire upon each new element, however deeply it is nested.

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="" srcLang="">
<file id="f1">
<group id="g1" type="cms:field">
<unit id="u1" type="cms:slug">
<segment id="s1">
<source/>
<target/>
</segment>
</unit>
</group>
</file>
</xliff>
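
As a minimal sketch of that single-pass approach, here is how it might look in Python with the standard library's xml.sax module. The ObservableHandler class, its subscribe method and the path string are hypothetical names chosen for illustration; they are not part of any XLIFF library.

```python
import xml.sax

XLIFF = b"""<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="" srcLang="">
<file id="f1">
<group id="g1" type="cms:field">
<unit id="u1" type="cms:slug">
<segment id="s1"><source/><target/></segment>
</unit>
</group>
</file>
</xliff>"""


class ObservableHandler(xml.sax.ContentHandler):
    """Notifies subscribers on every opening element; a stack tracks nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open elements, outermost first
        self.subscribers = []  # callbacks taking (path, node)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def startElement(self, name, attrs):
        # All attributes are already available when the element opens.
        node = {"name": name, "attrs": dict(attrs.items())}
        self.stack.append(node)
        path = "/".join(n["name"] for n in self.stack)
        for callback in self.subscribers:
            callback(path, node)

    def endElement(self, name):
        self.stack.pop()


seen = []
handler = ObservableHandler()
handler.subscribe(lambda path, node: seen.append((path, node["attrs"].get("id"))))
xml.sax.parseString(XLIFF, handler)
```

After the parse, seen holds one entry per element in document order, for example ("xliff/file/group/unit", "u1"), and the stack is empty again. No second pass over the document is needed.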

With JLIFF (JSON), assuming you're doing a depth-first token read, you cannot count on having all of a parent's properties until you have read through its nested objects, because JSON does not guarantee that a parent's own properties appear before its children. Thus you have to build an object graph first, and only then traverse it again and use the Observer pattern in an efficient way to build another representation.

{
"jliff": "2.1",
"srcLang": "en-US",
"trgLang": "fr-FR",
"files": [
{
"id": "f1",
"kind": "file",
"subfiles": [
{
"canResegment": "no",
"id": "u2",
"kind": "unit",
"locQualityIssues": {
"items": []
},
"notes": [],
"subunits": [
{
"canResegment": "no",
"id": "s2",
"kind": "segment",
"source": [],
"target": []
}
]
}
]
}
]
}
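
A minimal sketch of that two-pass approach in Python, using only the standard json module. The walk function, the CHILD_KEYS tuple and the notify callback are hypothetical names for illustration. The sample deliberately places "id" and "kind" after the nested arrays, which is valid JSON and is exactly the case a single-pass token reader struggles with.

```python
import json

JLIFF = """
{
  "jliff": "2.1",
  "srcLang": "en-US",
  "trgLang": "fr-FR",
  "files": [
    {
      "subfiles": [
        {
          "subunits": [
            {"source": [], "target": [], "kind": "segment", "id": "s2"}
          ],
          "kind": "unit",
          "id": "u2"
        }
      ],
      "kind": "file",
      "id": "f1"
    }
  ]
}
"""

# Keys that hold child objects; limited to those in the sample above.
CHILD_KEYS = ("files", "subfiles", "subunits")


def walk(node, notify, depth=0):
    """Second pass: visit each object, parents first, and notify the observer."""
    notify(node, depth)
    for key in CHILD_KEYS:
        for child in node.get(key, []):
            walk(child, notify, depth + 1)


# First pass: build the whole object graph so every property is available.
doc = json.loads(JLIFF)

events = []
walk(doc, lambda node, depth: events.append((depth, node.get("kind", "jliff"), node.get("id"))))
```

By the time the observer is notified about the file or the unit, its "id" and "kind" are known regardless of where they appeared in the text.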

Differences are also apparent when dealing with items which require nesting to convey their semantics. This classically happens in localization with trying to represent rich text (text with formatting).

XLIFF handles this nicely when serialized.

<source>Vacation homes in <sc id="fmt1" disp="underline" type="fmt" subType="xlf:u" dataRef=""/>Orlando<ec dataRef=""/></source>

JLIFF, by contrast, is somewhat fragmented: the content is split into a flat array of text and marker objects.

"source": [
{
"text": "Vacation homes in "
},
{
"id": "mrk1",
"kind": "sm",
"type": "term"
},
{
"text": "Orlando"
},
{
"kind": "em",
"startRef": {
"token": "mrk1"
}
}
]
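
If you need the nesting back, one option is to pair each "sm" with its "em" using a stack. The following is only a sketch under assumptions: nest_markers and the "span" and "children" names are hypothetical, the "startRef" shape is taken from the sample above, and markers are assumed to be properly nested (overlapping markers would need more care).

```python
def nest_markers(items):
    """Turn a flat list of text and sm/em objects into nested spans."""
    root = {"children": []}
    stack = [root]  # innermost open span is last
    for item in items:
        kind = item.get("kind")
        if kind == "sm":
            # Opening marker: start a new span inside the current one.
            span = {"kind": "span", "id": item["id"],
                    "type": item.get("type"), "children": []}
            stack[-1]["children"].append(span)
            stack.append(span)
        elif kind == "em":
            # Closing marker: it must match the innermost open span.
            ref = item["startRef"]["token"]
            if len(stack) == 1 or stack[-1].get("id") != ref:
                raise ValueError(f"unmatched end marker for {ref!r}")
            stack.pop()
        elif "text" in item:
            stack[-1]["children"].append(item["text"])
        else:
            # Anything else is kept untouched.
            stack[-1]["children"].append(item)
    if len(stack) != 1:
        raise ValueError("unclosed start marker")
    return root["children"]


source = [
    {"text": "Vacation homes in "},
    {"id": "mrk1", "kind": "sm", "type": "term"},
    {"text": "Orlando"},
    {"kind": "em", "startRef": {"token": "mrk1"}},
]
nested = nest_markers(source)
```

Here nested contains the leading text followed by a single span with id "mrk1" whose only child is "Orlando".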
