The Use of XML in Localization
April 1, 2003

A joke among developers goes like this: a few of them meet one Saturday morning for a round of golf, and one is complaining about some trouble he’s in. A colleague from another company says, “Don’t worry. I know what to do. In our company, when there’s a problem, someone somewhere always says, ‘Let’s use XML,’ and a miracle occurs!”

There is often an element of truth in such jokes. This little story is funny because we all expect so much from XML that somewhere along the line it has, to a certain degree, become a witchcraft accessory — a talisman against evil, a remedy for all our pains, hopes and desires. An increasing number of companies insist on using XML in all their projects, although they are not always sure why.
How has a simple markup language become so popular, and where do the true advantages in using XML lie? Hundreds of books have been written on the subject, and there is so much XML activity in the computer world that an article of this length can only skim the surface. We will thus concentrate on a single area in which XML is indeed becoming a major painkiller: localization.

To explain the benefits of XML in localization, it is worth looking back down the road traveled so far. We first met markup languages more than twenty years ago. The mother of them all was Standard Generalized Markup Language (SGML), a meta-language soon to be followed by a great intuitive language called Standard Digital Markup Language (SDML). Some people associate SGML with IBM and SDML with DEC. While both companies participated extensively in the development of these languages, made them popular and used them as a priority for their own needs, many other organizations were involved in defining and spreading these standards.

As we look at the markup languages of the 1980s, we see that one thing is true for them all: at that time they were used exclusively for documentation. Building software or client-server applications using SGML was unthinkable. The languages fashionable among developers were Pascal and C or the object-oriented C++. SGML and SDML applications included browsers, parsers and utilities. Consisting of pure text, they could run in a heterogeneous proprietary environment on any PC, mini or mainframe. They were, however, resource intensive, especially when the content was voluminous or when even basic windowing features were used. And their extreme complexity required in-depth programming knowledge if you wanted to use them extensively.

This combination of factors probably explains why markup languages went out of fashion. Much lighter proprietary PC editors soon replaced them, and WordPerfect, Microsoft Word, Interleaf, FrameMaker and PageMaker promptly entered the battlefield. IBM still uses SGML for its own end-user documentation, and both SDML and SGML have always been popular in large corporations and administrations in sectors such as aerospace, automotive and telecommunications, not to mention banks and the US tax administration.

Light but “closed” proprietary PC editors gave the emerging localization industry a terrific shove. Seemingly easy to use, the early versions of these applications were in reality tricky, unstable and unpredictable when porting files from one version to another. You needed an expert team to deliver the turnkey document set in 12 languages on time. They did, however, allow for a decentralized organization of the production chain, and at least all actors involved were working in the same PC environment.

The real jolt for localization came from the profusion of small companies developing and marketing clever utilities and tool sets that finally evolved into translation memory tools and CAT program suites. In the resulting boom, most of the big localization players still on the market took off. But the enormous success of PC editors that produced content exclusively in proprietary formats was one of the causes that forced us to abandon hope of getting machine translation off the ground one day.
WordPerfect, WinWord and FrameMaker quickly became the most popular editing tools, and we were still struggling with incompatible RTF versions when a new storm appeared on the horizon. The Internet had brought HTML to the forefront. Until Office 97, you needed third-party converters and a lot of tweaking to get your content displayed on the Web.

However, after the thriving success of comet companies that cashed in on HTML converters, big players such as Adobe, Corel and Microsoft soon adapted to demand by providing built-in conversion facilities for their editors. These converters produced — and still produce — a proprietary form of markup language. The reason seems obvious: besides displaying content in a browser, “proprietary” HTML has other important objectives. Ideally, it should render the initial application’s WYSIWYG quality without too much loss, should be backward convertible and should be portable to most of the other applications in the program suite. Some cynics, however, venture another reason: to make sure that the HTML and, more recently, XML files produced by these program suites will not work properly in their competitors’ environments.

Converting between competing editors, between different applications within the same suite or even between different versions of the same application has always meant a lot of reformatting, especially if you want the same look and feel for printed documentation, WinHelp and the Web. More than just a technical challenge, this also represents a commercial issue: why would anyone want to wait and pay a lot of money to get the PDF, HLP and HTML versions of localized text when only a few words change between the different output formats?

Localization tools restrict their users to a limited number of file formats, often don’t even support all the features inherent in these formats and store proprietary formatting information in the translation memory. This is the main reason why leveraging a memory built from Microsoft RTF (for example, WinHelp) files is disappointing when translating standard HTML files, or the other way around.

HTML went on to pave the way for XML. To understand the growing importance of markup languages, it is worth considering the contradictory effects the Internet produced on localization. On the one hand, since everyone quickly acknowledged that English is not yet the world language shared by all educated people connecting to the Web, the Internet brought with it a growing need for localization. On the other hand, long before the economic slowdown, large corporations started to realize that they were allocating an ever-increasing share of their budgets to localization, while in other domains they “got more for less.”

When companies began putting content on the Internet, they had to start the content creation and localization processes more or less from scratch. All proprietary formats, including those of the most popular editors, have limited conversion capacities due to their conceptual legacy. Without even thinking about Flash screens or multimedia animations, converting even the simplest content is never straightforward. If you produce millions of words of content every year, some of it output into different formats (manuals, support, on-line, training, marketing and so on), you may want to try to find the most efficient way to manage the content your writers produce.

Under the pressure of growing demand, reduced resources and the restricted budgets induced by an economic slowdown in traditionally multilingual content-producing fields, advances in technology finally saw the light of day. Highly sophisticated content management systems (CMS) appeared on the market along with other innovations. Initial versions were not really localization-enabled, but most of the 300 major products available today take localization into account to some extent. However, these systems either do not work with proprietary files or become far less efficient when they do, and XML is increasingly considered to be the ideal format for storing content designed for a large variety of uses and outputs.

A group of SGML specialists working around Jon Bosak developed XML in 1996. They started with SGML, reduced the complexity of the meta-language, discarded everything except the structure and truly important features, and came up with XML. While SGML is fully able to configure a set of features for markup languages, XML only has a limited, pre-defined set of these SGML features. The XML group nevertheless saw the new language’s potential. In his paper on the future of the Web, published in 1997, Bosak was already insisting on the advantages of using XML as an application for data representation, especially when combined with Java. Thus, right from the beginning, XML was presented as the universal solution for all Web-related content, covering everything from document standard to programming language. Moreover, since XML is not tied to the Web, it has quickly become a solution for content creation, localization and management at large. Easy to understand and use, it has been widely adopted by writers and developers, although some started implementing XML or XML-derived products without properly understanding one fundamental point: XML is an open standard. This means it is both open and a standard.

The power of XML

All this talk of “inventing your own tags” was still pretty foggy, and the first XML files we got for localization caused lots of problems. They were not correctly encoded; they were full of self-made tags; the DTD and XSL were either not provided, not defined or did not correspond to the files; quotes and brackets were forgotten or used in the text; and in any case, our localization tools simply did not support XML. We really discovered the power of XML the hard way.

Unlike HTML, XML offers a simple way to check a file before you try to use it for any purpose. Since XML is a standardized, hierarchically structured language, you can use a parser to check its integrity. Parsers, however, come in all shapes and sizes. Some (called nonvalidating) merely check that the basic rules are respected: tags do not overlap, every opened tag is closed, attribute values are enclosed in quotes and characters with a special meaning in XML are represented as entities. Validating parsers do the same thing, but additionally check the document against its DTD: whether the element tags are legal, whether the attribute names make sense, whether every element nested inside another element belongs there and so on.
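The basic check described above can be tried with the nonvalidating parser that ships with Python. This is only a minimal sketch: the sample snippets and the helper name `is_well_formed` are made up for illustration. Checking a file against its DTD would require a validating parser, which the Python standard library does not include (third-party libraries such as lxml offer one).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a nonvalidating parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical snippets, each illustrating one of the basic rules.
SAMPLES = {
    "correct": '<para id="p1">Save &amp; close</para>',
    "overlapping tags": "<para><b>Save <i>and</b> close</i></para>",
    "unclosed tag": "<para>Save and close",
    "unquoted attribute": "<para id=p1>Save and close</para>",
    "bare ampersand": "<para>Save & close</para>",
}

for label, snippet in SAMPLES.items():
    verdict = "well-formed" if is_well_formed(snippet) else "rejected"
    print(f"{label}: {verdict}")
```

Running such a check on every incoming file before it reaches the translators would have caught most of the problems listed earlier (forgotten quotes, stray brackets, unclosed tags).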

Now that more and more editors and localization tools support XML, we discover that XML, if properly used, makes life much easier for localization, and it answers several of our long-time requests. XML separates content from formatting. The formatting is defined in a separate style sheet (XSL). Since the content can be processed, stored and managed separately, localization becomes far more efficient. XML is always based on the same character set, ISO/IEC 10646 (Unicode), and represents characters that the chosen file encoding does not support with a simple escape mechanism, the numeric character reference. A standardized encoding declaration in the header of the file and the standardized xml:lang language identification attribute make multilingual content easy to handle.
XML represents the contextual meaning of the data, and XML tags supply valuable, usable information. Not only is this beneficial for the translator, but we can also easily build utilities (many are already freely available on the Web because XML is an open standard) and use temporarily added tags to manipulate, process and manage the content and put it back in the right place when done. Temporary tags are removed properly at the end of the process.
If you want to change or extend the language, you simply change the DTD. The parser then accepts the newly defined rules. If your system needs to interoperate with other systems, you may choose to adopt a standard DTD such as XML/EDI so that other systems will automatically understand your vocabulary and vice versa. You can even use XML to write little “bridges” for programs that do not otherwise communicate with each other.
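The points above (content kept apart from markup, a standard language attribute, an escape mechanism for unsupported characters) can be illustrated with a small Python sketch. It is an illustration under assumptions, not a description of any particular localization tool: the `manual`, `title` and `step` elements, the `type` attribute and the translation table are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical source document; element and attribute names are made up.
SOURCE = (
    '<manual xml:lang="en">'
    "<title>Coffee maker</title>"
    '<step type="warning">Unplug the device.</step>'
    "</manual>"
)

# Hypothetical translations, standing in for a translator or a memory.
TRANSLATIONS = {
    "Coffee maker": "Cafetière",
    "Unplug the device.": "Débranchez l'appareil.",
}


def localize(source_xml, translations, target_lang):
    """Replace element text only; tags and attributes stay untouched."""
    root = ET.fromstring(source_xml)
    for element in root.iter():
        text = (element.text or "").strip()
        if text in translations:
            element.text = translations[text]
    root.set(XML_LANG, target_lang)  # record the language of the content
    return root


localized = localize(SOURCE, TRANSLATIONS, "fr")

# Serializing to plain ASCII forces the escape mechanism: characters the
# encoding cannot express are written as numeric character references.
as_ascii = ET.tostring(localized, encoding="us-ascii").decode("ascii")
print(as_ascii)
```

The accented letters come out as references such as `&#233;`, which any XML parser turns back into the original characters, while the `type="warning"` attribute passes through the process unchanged.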

Since we only localize the content and don’t touch either the XML tags, the DTD or the XSL, our client can store the localized version together with the sources — and split sources and targets into smaller parts as needed — before outputting it to any requested format just by changing the XSLT. XSLT is immensely powerful because it can be used not only to format or add structure to a document (such as using a CSS with HTML), but also to completely rearrange the input elements for a particular purpose or output format. Using exactly the same XML file or file set, you can output, for example, to PDF, HTML or WAP and display the content on a mainframe client screen, on your PDA or on a TV set. A programmer gets the job done in a few hours, thus avoiding the need to spend time and money retranslating or reformatting the entire text several times. With XPath, XSLT, XPointer and XLink operating on the abstract logical structure of XML rather than on the surface syntax, there are virtually no limits. XML thus becomes a powerful, portable programming language that can operate cross platform in heterogeneous environments.
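To make the single-source idea concrete, here is a minimal Python sketch in which two small functions stand in for two style sheets. It deliberately does not use XSLT (the Python standard library has no XSLT processor), and the document structure and function names are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical localized source; element names are made up for the example.
SOURCE = """<manual xml:lang="en">
  <title>Coffee maker</title>
  <step>Fill the tank.</step>
  <step>Press the start button.</step>
</manual>"""


def to_html(root):
    """Render the document as a heading plus an ordered list."""
    items = "".join(f"<li>{escape(s.text)}</li>" for s in root.findall("step"))
    return f"<h1>{escape(root.findtext('title'))}</h1><ol>{items}</ol>"


def to_plain_text(root):
    """Render the same document for a text-only display."""
    lines = [root.findtext("title").upper()]
    for number, step in enumerate(root.findall("step"), start=1):
        lines.append(f"{number}. {step.text}")
    return "\n".join(lines)


root = ET.fromstring(SOURCE)
html_output = to_html(root)
text_output = to_plain_text(root)
print(html_output)
print(text_output)
```

Neither output function touches the translated text itself, which is why adding a further output format does not send anything back to the translators.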

If you visit the Cover Pages Web site hosted by OASIS, you will find an impressive list of current open XML initiatives and possible or existing XML applications. Founded in 1993 under the name SGML Open, OASIS is particularly active in producing e-business standards. OASIS’ work, however, reaches well beyond the Web to produce open standards for markup languages in diverse domains, including Web services, business transactions, publishing and programming — with a special eye on interoperability within and between otherwise noncommunicating systems.

It is only natural that OASIS is also interested in XML-based localization standards such as Translation Memory eXchange (TMX), Terminology dataBase eXchange (TBX), the Open Lexicon Interchange Format (OLIF) and the XML Localization Interchange File Format (XLIFF), which are certainly the most important standards the localization industry ever produced. They extensively cover such major areas as translation memory, terminology, machine translation and file exchange. OSCAR, the special LISA interest group officially responsible for the definition of TMX and TBX, wants TMX to provide standard methods for exchanging translation memory data among tools and/or translation vendors with little or no loss of critical data during the process. With TMX, files are well-formed XML documents, and translation tools can process them without explicit reference to the TMX DTD. A “valid” TMX file must conform to the standard TMX DTD, however. If you encounter a suspicious TMX file, you can always check it against the TMX DTD using a validating XML parser. TMX files can be multilingual, although they become difficult to maintain.
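As an illustration of how a tool can read such a file without referring to the DTD, the following Python sketch parses a small multilingual sample. The sample is hand-written and modelled on the TMX 1.4 layout as far as I know it (`tu`, `tuv` and `seg` elements, language given in `xml:lang`); the tool name in the header and the function name `read_units` are invented.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample in the style of TMX 1.4; header values are made up.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
      <tuv xml:lang="de"><seg>Speichern Sie die Datei.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return one {language: segment text} mapping per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_units(TMX_SAMPLE)
print(units)
```

This parser only checks that the file is well formed. A file that parses here could still be invalid with respect to the TMX DTD, which is the check a validating parser adds.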

OSCAR’s TBX is an open XML-based standard for terminological data. Once software packages that include terminological databases can import and export TBX files, this will greatly facilitate the flow and consistent use of terminological information inside major organizations or between these organizations and their external partners. OSCAR is convinced that through TBX, terminology will become much more accessible and more easily integrated into existing terminological databases.

OLIF will be a standard for natural language processing (NLP) systems, such as machine translation, designed “to provide coverage of a wide and detailed range of linguistic features.” I speak of OLIF in the future tense and give only vague definitions because, compared to the other three localization standards, OLIF seems to be less advanced, in spite of participation in the project by the EU and the world’s biggest software companies, tool manufacturers and service providers. OLIF represents the comeback of markup languages in machine translation and illustrates the importance of open standards for the development of NLP systems. A major issue for NLP systems is the total lack of interoperability. The handful of players who currently share this market seem to be gridlocked in early capitalistic considerations of “having it all” and show little interest in agreeing on a common standard.

We can anticipate, however, that a growing demand for cost-effective content localization will produce more pressure to adapt, implement and develop the standard.

Probably the most exciting localization standard is XLIFF. In 2001, a group of specialists from the localization industry embarked on defining a file format that would make it possible to exchange localizable data from virtually any source format. XLIFF is based on OpenTag, borrowing some of its tags as well as a few ideas from TMX. Although bilingual and not multilingual, it takes full advantage of certain XML features to provide, for example, project-related information, pretranslation, alternative translations, history, versioning, binary objects, glossaries, references and so on.

XLIFF stores text extracted from virtually any object (documents, graphics, PDF or software-type files) and carries the data through the different steps of the localization process. In a certain way, it is process-oriented. A multitude of optional attributes designed for the localization process means that we can adapt XLIFF files to our internal procedures and make sure the data is safely extracted, processed, managed and reinjected. Although XLIFF, as a fixed XML vocabulary, restricts the possibilities of tag creation and usage that XML otherwise allows, it offers better interoperability in exchange. Even if no initiatives have been taken in this direction, XLIFF may become an alternative to the less dynamic OLIF for NLP systems.
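The extract, translate and reinject cycle can be sketched in a few lines of Python. The sample below is a simplified, hand-written fragment in the style of XLIFF 1.x (`file`, `trans-unit`, `source` and `target` elements) with no namespace declared, although later versions of the format declare one. The file name, unit identifiers and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment in the style of XLIFF 1.x; all values are made up.
XLIFF_SAMPLE = """<xliff version="1.0">
  <file original="menu.properties" source-language="en"
        target-language="fr" datatype="plaintext">
    <body>
      <trans-unit id="open"><source>Open</source><target>Ouvrir</target></trans-unit>
      <trans-unit id="quit"><source>Quit</source></trans-unit>
    </body>
  </file>
</xliff>"""


def pending_units(root):
    """Return the ids of the units that have no target yet."""
    return [u.get("id") for u in root.iter("trans-unit")
            if u.find("target") is None]


def reinject(root, translations):
    """Add a target element to every pending unit found in translations."""
    for unit in root.iter("trans-unit"):
        unit_id = unit.get("id")
        if unit.find("target") is None and unit_id in translations:
            ET.SubElement(unit, "target").text = translations[unit_id]


root = ET.fromstring(XLIFF_SAMPLE)
before = pending_units(root)
reinject(root, {"quit": "Quitter"})  # hypothetical translator output
after = pending_units(root)
print(before, after)
```

Because the source text, the target text and the unit identifier travel together in one file, each step of the process can pick up where the previous one stopped.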

While conflicting interests have always been a sore spot for industrial evolution, they nevertheless serve to drive it forward. Once an opportunity arises, you can bet that someone will seize it. XML’s obvious vitality is further proof that open standards stimulate the industry, and it’s not just a buzzword: a lot of XML business is already going on. For localization specialists, it seems obvious that XML will make a profound change in the way we work. After ten years without major advances in our industry, XML seems set to change the course of things to come.

Hans-Günther Höser, Former Managing Director of WH&P - After a first publication in Multilingual Computing & Technology, this article has been republished recently in the 2004 edition of the Annual Localization Reader by the Localization Research Centre (LRC).