Thursday 16 May 2013

True art selects and paraphrases, but seldom gives a verbatim translation

In my professional life I am sometimes required to provide technical support to one of our salesmen. I find this an interesting change in pace though sometimes challenging.

Occasionally I fail to clearly convey the solution we are trying to sell because of my tendency to focus on detail the customer probably does not need to understand but I think is the interesting part of the problem.

Conversely sometimes the sales people gloss over important technology choices which have a deeper impact on the overall solution. I was recently in such a situation where as part of a larger project the subject of internationalisation (you can see why it gets abbreviated to i18n) was raised.

I had little direct personal experience with handling this within a project workflow so could not give any guidance but the salesman recommended the Transifex service as he had seen it used before, indicated integration was simple and we moved onto the next topic.

Unfortunately previous experience tells me that sometime in the near future someone is going to ask me hard technical questions about i18n and possibly how to integrate Transifex into their workflow (or at least give a good estimate on the work required).

Learning

Being an engineer I have few coping strategies available for situations when I do not know how something works. The approach I best know how to employ is to give myself a practical crash course and write up what I learned...so I did.

I proceeded to do all the usual things you do when approaching something unfamiliar (wikipedia, google, colleagues etc.) and got a basic understanding of internationalisation and localisation and how they fit together.

This enabled me to understand that the Transifex workflow proposed only covered the translation part of the problem and that, as Aldrich observed in my title quote, there is an awful lot more to translation than I suspected.

Platforms

My research indicated that there are numerous translation platforms available for both open source and commercial projects and Transifex is one of many solutions.

Although the specific platform used was Transifex most of these observations apply to all these other platforms. The main lesson though is that all platforms are special snowflakes and once a project invests effort and time into one platform it will result in the dreaded lock in. The effort to move to another platform afterwards is at least as great as the initial implementation.

It became apparent to me that all of these services, regardless of their type, boil down to a very simple data structure. They appear to be a trivial table of Key:Language:Value wrapped in a selection of tools to perform format conversions and interfaces to manipulate the data.

There may be facilities to attach additional metadata to the table such as groupings for specific sets of keys (often referred to as resources) or translator hints to provide context but the fundamental operation is common.

The pseudo workflow is:
  • Import a set of keys
  • Provide a resource grouping for the keys.
  • Import any existing translations for these keys.
  • Use the services platform to provide additional translations
  • Export the resources in the desired languages.
The first three steps are almost always performed together by the uploading of a resource file containing an initial set of translations in the "default" language and  due to the world being the way it is this is almost always english (some services are so poorly tested with other defaults they fail if this is not the case!)

The platforms I looked at generally follow this pattern with a greater or lesser degree of freedom in what the keys are, how the groupings into resources are made and the languages that can be used. The most common issue with these platforms (especially open source ones) is that the input convertors will only accept a very limited number of formats and often restricted to just GNU gettext PO files. This means that to use those platforms a project would have to be able to convert any internal resources into gettext translatable format. 

The prevalence of the PO format pushes assumptions into almost every platform I examined, mainly that a resource is for a single language translation and that the Key (msgid in gettext terms) is the untranslated default language string in the C locale.

The Transifex service does at least allow for the Key values to be arbitrary although the resources are separated by language.

Even assuming a project uses gettext PO files and UTF-8 character encoding (and please can we kill every other character encoding and move the whole world to UTF-8) the tools to integrate the import/export into the project must be written.

A project must decide some pretty important policies, including:
  • Will they use a single service to provide all their translations.
  • Will they allow updates to the files in their revision control system and how those will be integrated.
  • Will there be a verification step and if so who and how will that be performed. Especially important is the question of a reviewer understanding the translated language being integrated and how that is controlled.
  • Will the project be paying for translations
  • Will the project allow machine translations, if not can they be used as an initial hint (sometimes useful if the translators are weak in the "default" language
These are project policy decisions and, as I discovered, just as difficult to answer as the technical challenges.

Armed with my basic understanding it was time to move on and see how the transifex platform could be integrated into a real project workflow.

Implementing

Proof of concept

My first exercise was to take a trivial command line tool, use xgettext to generate a PO file and add the relevant libintl calls to produce gettext internationalised tool.

A transifex project was created and the english po file uploaded as the initial resource. A french language was added and the online editor used to provide translations for some strings. The PO resource file for french was exported and the tool executed with LANGUAGE=fr and the french translation seen.

This proved the trivial workflow was straightforward to implement. It also provided insight into the need to automate the process as the manual website operation would soon become exceptionally tedious and error prone.

Something more useful

To get a better understanding of a real world workflow I needed a project that:
  • Already internationalised but had limited language localisation 
  • Did not directly use gettext 
  • Had a code base I understood
  • Could be modified reasonably easily.
  • Might find the result useful rather than it being a purely academic exercise.
I selected the NetSurf web browser as it best fit this list.

Internally NetSurf keeps all the translated messages in a simple associative array this is serialised to an equally straightforward file named FatMessages. The file is UTF-8 encoded with keys separated from values by a colon. The Key is constrained to be ASCII characters with no colons and is structured as language.toolkit.identifier and is unique on identifier part alone.

This file is processed at build time into a simple identifier:value dictionary for each language and toolkit.

Transifex can import several resource formats similar to this, after experimenting with YAML and Android Resource format I immediately discovered a problem, the services import and export routines were somewhat buggy.

These routines coped ok with simple use cases but having more complex characters such as angle brackets and quotation marks in the translated strings would completely defeat the escaping mechanisms employed by both these formats (through entity escaping in android resource format XML is problematic anyway)

Finally the Java property file format was used (with UTF-8 encoding) which while having bugs in the import and export escaping these could at least be worked around. The existing tool that was used to process the FatMessages file was rewritten to cope with generating different output formats and a second tool to merge the java property format resources.

To create these two tools I enlisted the assistance of my colleague Vivek Dasmohapatra as his Perl language skills exceeded my own. He eventually managed to overcome the format translation issues and produce correct input and output.

I used the Transifex platforms free open source product, created a new project and configured it for free machine translation from the Microsoft service, all of which is pretty clearly documented by Transifex.

Once this was done the messages file was split up tinto resources for the supported languages and uploaded to the transifex system.

I manually marked all the uploaded translations as "verified" and then added a few machine translations to a couple of languages. I also created spanish as a new language and machine translated most of the keys.

The resources for each language were then downloaded and merged and the resulting FatMessages file checked for differences and verified only the changes I expected appeared.

I quickly determined that manually downloading the language resources every time was not going to work with any form of automation, so I wrote a perl script to retrieve the resources automatically (might be useful for other projects too).

Once these tools were written and integrated into the build system I could finally make an evaluation as to how successful this exercise had been.

Conclusions

The main things I learned from this investigation were:

  • Internationalisation has a number of complex areas
  • Localisation to a specific locale is more than a mechanical process.
  • The majority of platforms and services are oriented around textural language translation
  • There is a concentration on the gettext mode of operation in many platforms
  • Integration to any of these platforms requires both workflow and technical changes.
  • At best tools to integrate existing resources into the selected platform need to be created
  • Many project will require format conversion tools, necessitating additional developer time to create.
  • The social issues within an open source project may require compromise on the workflow.
  • The external platform may offer little benefits beyond a pretty user interface.
  • External platforms introduce an external dependency unless the project is prepared and able to run its own platform instance.
  • Real people are still required to do the translations and verify them.
Overall I think the final observation has to be that integrating translation services is not a straightforward operation and each project has unique challenges and requirements which reduce the existing platforms to compromise solutions.