Localization/Workflows/PO Summit

From KDE TechBase
Revision as of 14:06, 29 November 2008 by Ilic (talk | contribs) (Customization: compendium and vivification.)


Translating in Summit
On Localization   Workflows
Prerequisites   Language Coordinator
Related Articles   Pology
External Reading   n/a

Software Branches and Translation

The obvious approach to releasing software is for developers to focus on one central body of code, a single "branch", fixing bugs and adding new features to it, and from time to time taking a "snapshot" of the branch and packaging it as the next release. One way to make the release process more robust is for development to proceed in two parallel branches. From the "stable" branch the actual releases are made, mostly with bugs fixed and only very important features added. The other, "unstable" branch is used to develop new features and redesign existing ones, and at some point becomes the next stable branch. Depending on the project, releases may be made from the unstable branch as well, in parallel with stable releases, so that eager users can help test the new features.

In KDE, all three release models may be present at any given moment. Core KDE modules, which are together labeled as "the KDE" and released in unison, follow the two-branch model, with a stable and a trunk branch, and releases from the stable branch only. KDE extragear applications may do the same, but they are not required to; they may instead make releases from the trunk branch only, or use both stable and trunk branches and make releases from both. This presents translators with the workflow shown in the following figure.

Figure: Classic translation by branches

The KDE repository automation (aka "Scripty") prepares template POs and merges them with language POs. From the point of view of language teams, real people who perform global actions (e.g. moving POs around for all languages when an application moves from one module to another) can also be grouped here. However, the translation team is still presented with two branches of POs to translate.

In general, this means that, just like programmers, translators at times have to work on both branches, and propagate fixes from one branch to the other ("backport" from trunk to stable, or "forwardport" from stable to trunk). This is prone to confusion in coordination, and demands extra effort and attention when updating translations. Keeping up with two branches is easier in core KDE modules, as they are released from the stable branch only and on a single schedule, but can be more taxing with extragear applications.

The exact amount of effort spent on parallel translation, porting of fixes, and coordination depends on the translation team. For example, a well-staffed and well-coordinated team, with established style and terminology guides and custom automation to support them, may be able to keep all POs in both branches fully translated at all times with ease. Or, a team may have worked out well-defined schedules, such that any given member of the team at one point switches from translating one branch to the other, without looking back, thus effectively having everyone always working on one branch (even if not the same one).

If, however, you as the team coordinator have said and thought "Should do that in trunk too, bugger", "Didn't we fix that already?", "No, you should take it from stable", "It was released from WHAT branch?", more often than you would have liked, the following presents one possibility for sidestepping such issues.

Translating in Summit

For a PO catalog which exists in both the stable and the trunk branch, a new same-named catalog can be made by gathering all the unique messages from the branch PO files, i.e. the summit of branch POs. Assuming that the branch POs are not all that different, since they are two versions of the same catalog, the summit PO should not have many more messages than either of them. Translators work on the summit PO at all times, and its messages are periodically scattered back to the original branch POs. Thus there is no parallel translation, branch switching, or porting of fixes. The summit workflow is presented in the following figure.

Figure: Translation in summit

In the summit mode, repository automation gathers summit PO templates from branch templates, and stores them separately from real branches, at trunk/l10n-support/templates/summit/. The language team also has a collection of summit POs, at trunk/l10n-support/LANG/summit/, which is the sole location where translation happens. As before, repository automation handles merging of templates in branches, but merging of summit POs is done by the team coordinator; the team coordinator can opt to manually merge branch POs as well (more on the why and how of this later). From time to time, the team coordinator fills out branch POs by scattering from the summit. Not to worry, each of these special actions required of the team coordinator is done with a single command.
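As a rough illustration of the gather and scatter steps described above, the following Python sketch models a catalog as a dict keyed by (msgctxt, msgid). This is not how posummit.py is implemented; the function names gather and scatter, the data layout, and the sample translation are assumptions made for this example only.

```python
# Toy model of summit gather/scatter; NOT the posummit.py implementation.
# A catalog is a dict mapping (msgctxt, msgid) -> msgstr, in file order.

def gather(branches):
    """Build a summit from {branch_name: catalog}.

    Returns {key: {"msgstr": str, "branches": [names]}}. The first
    non-empty translation found for a key is kept.
    """
    summit = {}
    for name, catalog in branches.items():
        for key, msgstr in catalog.items():
            entry = summit.setdefault(key, {"msgstr": "", "branches": []})
            entry["branches"].append(name)
            if not entry["msgstr"]:
                entry["msgstr"] = msgstr
    return summit


def scatter(summit, catalog):
    """Return a copy of a branch catalog filled with summit translations.

    Messages missing from the summit keep their current translation.
    """
    filled = {}
    for key, msgstr in catalog.items():
        filled[key] = summit[key]["msgstr"] if key in summit else msgstr
    return filled


KEEP_OPEN = "&Keep this window open after transfer is complete"
trunk = {
    ("The destination url of a job", "Destination:"): "",
    (None, KEEP_OPEN): "",
}
stable = {
    (None, "Destination:"): "Ziel:",  # made-up sample translation
    (None, KEEP_OPEN): "",
}

summit = gather({"trunk": trunk, "stable": stable})
# Translators edit only the summit...
summit[("The destination url of a job", "Destination:")]["msgstr"] = "Ziel:"
# ...and the coordinator scatters back to each branch.
new_trunk = scatter(summit, trunk)
new_stable = scatter(summit, stable)
print(len(summit), "summit messages")
```

Keying on (msgctxt, msgid) is what makes the context-only difference between the two "Destination:" messages produce two separate summit entries in this model.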

While it is in principle obvious that two branch POs can be made into one summit PO with the union of their messages, there are some important details that should be handled the right way by the summitting system:

  • What if a PO changes modules in one branch, so that it no longer belongs to the same module in both branches (e.g. application is moved from one to another module in trunk)?
  • What if a PO changes its name in one branch, but not in the other (e.g. application is renamed in trunk)?
  • What if a PO in one branch is split into two POs in another (e.g. extracting a library out of monolithic application in trunk)?
  • Where to place messages unique to one branch in the summit PO? The original file context of a message, which messages precede it and which follow it, should be kept as much as possible.
  • If source references are used to achieve good ordering of messages in summit PO, what to do if some source file paths change in one branch (e.g. application gets restructured in trunk)?
  • How to handle messages with different plurality across branches (since messages are identified only by their msgctxt and msgid fields, and not by msgid_plural)?

The present summitting system handles all of these details. This also means that teams working in the summit need not take care of the first three issues above, which affect manual branch handling too.

Summit POs are normal, fully valid POs in their own right. A message in a summit PO differs from its branch counterpart only in being equipped with another comment, #. +> ..., showing in which branches the message exists:

#. +> trunk
#: kdeui/jobs/kwidgetjobtracker.cpp:469
msgctxt "The destination url of a job"
msgid "Destination:"
msgstr ""

#. +> stable
#: kdeui/jobs/kwidgetjobtracker.cpp:469
msgid "Destination:"
msgstr ""

#. +> trunk stable
#: kdeui/jobs/kwidgetjobtracker.cpp:517
msgid "&Keep this window open after transfer is complete"
msgstr ""

The first message above thus exists in trunk only, the second in stable only, and the third in both branches. The source reference always points to the source file in the first listed branch. Any extracted comments (#.) other than the branch list are also taken from the first listed branch.

Note that the first two messages above differ only by context; the context was added in trunk, but not in stable, in order not to break the message freeze. However, due to careful ordering of messages in summit POs, these two messages appear together, allowing the translator to immediately make a correction in the stable branch too, if the new context in trunk shows it to be necessary.
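Since the branch list is an ordinary extracted comment, it can be read with simple text processing. The sketch below assumes that the comment has the form #. +> BRANCH ... as described above; the helper name branches_of is made up for this example and is not part of Pology.

```python
import re

# Matches the extracted comment that lists branches, e.g. "#. +> trunk stable".
# Whitespace around "+>" is treated leniently.
BRANCH_COMMENT = re.compile(r"^#\.\s*\+>\s*(.*)$", re.MULTILINE)


def branches_of(entry_text):
    """Return the list of branch names found in one PO entry's text."""
    match = BRANCH_COMMENT.search(entry_text)
    return match.group(1).split() if match else []


entry = '''#. +> trunk stable
#: kdeui/jobs/kwidgetjobtracker.cpp:517
msgid "&Keep this window open after transfer is complete"
msgstr ""
'''

print(branches_of(entry))
```

An entry without the comment, as found in an ordinary branch PO, yields an empty list.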

Setting Up and Daily Operation

Before initializing the language summit, the team coordinator has to have all the necessary paths checked out from the KDE repository, and structured on the local machine exactly as in the repository. If the path to the root of the KDE repository on the local machine is $KDEREPO, and the language code is LANG, then the structure should be as follows, with leaf directories checked out in full:

$KDEREPO/

   trunk/
       l10n-kde4/
           scripts/
           templates/
           LANG/
       l10n-support/
           scripts/
           pology/
           templates/
           LANG/
   branches/
       stable/
           l10n-kde4/
               scripts/
               templates/
               LANG/

Summit operations are performed using the script posummit.py, which is part of Pology, residing in trunk/l10n-support/pology/. Therefore the first thing to do is to set up Pology, which amounts only to setting the proper path:

$ export PATH=$KDEREPO/trunk/l10n-support/pology/scripts:$PATH

To initialize the summit, by gathering from existing translation in branches, the team coordinator executes:

$ cd $KDEREPO/trunk/l10n-support
$ posummit.py scripts/messages.summit LANG gather --create --force

Depending on the amount of translation, the initial gathering will be completed after some minutes, and the language summit will be located under $KDEREPO/trunk/l10n-support/LANG/summit/messages/. This is the only time the coordinator performs the gather operation on language POs; otherwise gathering is done daily by repository automation, and only on templates. Then, the created language summit should be merged with the current summit templates:

$ posummit.py scripts/messages.summit LANG merge

Merging the summit is something that the coordinator does periodically, as often as desired. For example, it can be done daily, or with increasing frequency as the last day for translation before the next release approaches.

After the first merging, language summit is ready for active translation. The coordinator should now commit $KDEREPO/trunk/l10n-support/LANG/, and, importantly, notify team members to stop working on branch POs and focus exclusively on summit POs.

To scatter the summit, i.e. fill out POs in stable and trunk branch from the summit POs, the coordinator periodically executes:

$ posummit.py scripts/messages.summit LANG scatter

As with merging, there is no fixed schedule when scattering should be done. Of course, it must necessarily be done before the next release is tagged, and in between it is useful to scatter for runtime testing, or to have translation statistics by branches on l10n.kde.org up to date.

Periodic scattering and merging of the complete summit are basically all that a language team coordinator needs to do specifically to operate the summit. Also, since l10n-support/scripts/ and l10n-support/pology/ contain scripts and settings critical for proper functioning of summit operations, and may be tweaked at any time, they should always be updated from the repository together with PO files and templates (in fact, it is best to always update at once the whole tree as outlined above).

Note
For documentation POs, summit setup and operation are the same, only replacing every messages with docmessages in the command lines above. The user interface and documentation summits are fully independent, so for a trial period it is reasonable to work with the interface summit only, and engage the documentation summit once the trial has been deemed successful.


Operation Targets

Sometimes it is advantageous to merge or scatter just a single catalog, a single module, a single branch, or any combination thereof. To this end, scatter and merge operations accept any number of operation targets after the operation keyword, specified as one of CATALOG, BRANCH:CATALOG, MODULE/, BRANCH:MODULE/, and BRANCH:. For example, to scatter just to Dolphin's PO in stable branch, in order to test translation at runtime, one would execute:

$ cd $KDEREPO/trunk/l10n-support
$ posummit.py scripts/messages.summit LANG scatter stable:dolphin

(note no .po ending on catalog name). Or, to scatter to every PO in kdeplasma-addons module in stable branch:

$ cd $KDEREPO/trunk/l10n-support
$ posummit.py scripts/messages.summit LANG scatter stable:kdeplasma-addons/

(the trailing slash is mandatory, or else posummit.py would think that kdeplasma-addons is a catalog name). Finally, to scatter to all catalogs in the stable branch (with the trailing colon for the same reason as earlier):

$ cd $KDEREPO/trunk/l10n-support
$ posummit.py scripts/messages.summit LANG scatter stable:
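To make the five target forms concrete, here is a small Python sketch of how such a target string could be taken apart. It only illustrates the syntax rules stated above (colon after a branch, trailing slash after a module); the function parse_target is hypothetical and is not taken from posummit.py.

```python
def parse_target(target):
    """Split an operation target into (branch, module, catalog).

    Parts that are not given are returned as None.
    """
    branch, sep, rest = target.partition(":")
    if not sep:
        # No colon: the whole string is a catalog or module name.
        branch, rest = None, target
    module = catalog = None
    if rest.endswith("/"):
        module = rest[:-1]   # trailing slash marks a module
    elif rest:
        catalog = rest       # otherwise it is a catalog name
    return branch, module, catalog


for t in ["dolphin", "stable:dolphin", "kdeplasma-addons/",
          "stable:kdeplasma-addons/", "stable:"]:
    print(t, "->", parse_target(t))
```

Without the trailing slash, kdeplasma-addons falls into the catalog branch of this function, which mirrors the reason the slash is mandatory.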

The command line arguments of posummit.py operations are intentionally arranged such that all the arguments fixed for one language are placed first. If operation targets are used frequently, then it is convenient to place the immutable part of the command line under a shell alias, with absolute path to the summit setup file:

$ alias posummit-kde-LANG="posummit.py $KDEREPO/trunk/l10n-support/scripts/messages.summit LANG"
$ cd ANYWHERE
$ posummit-kde-LANG scatter stable:dolphin
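The same idea can be expressed as a small Python wrapper, for coordinators who prefer to drive summit operations from a script. This is only a sketch: the helper names posummit_command and run_posummit and the example repository path are made up here, and the only thing assumed about posummit.py is the argument order shown in the command lines above.

```python
import os
import subprocess


def posummit_command(kderepo, lang, operation, *targets):
    """Build the argument list for one posummit.py invocation."""
    setup = os.path.join(kderepo, "trunk", "l10n-support",
                         "scripts", "messages.summit")
    return ["posummit.py", setup, lang, operation, *targets]


def run_posummit(kderepo, lang, operation, *targets):
    """Run posummit.py (must be on PATH); raises if it exits with an error."""
    subprocess.run(posummit_command(kderepo, lang, operation, *targets),
                   check=True)


# Only print the command here; call run_posummit() to actually execute it.
print(" ".join(posummit_command("/path/to/kde", "LANG", "scatter",
                                "stable:dolphin")))
```

Because the setup file is given by absolute path, the wrapper works from any directory, just like the alias.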

Summit Customization

The file scripts/messages.summit (i.e. $KDEREPO/trunk/l10n-support/scripts/messages.summit), given as the first argument to posummit.py, contains the general summit setup for all languages. This file is set to include a per-language customization file, if one exists at $KDEREPO/trunk/l10n-support/LANG/summit/messages.extras.summit, which contains additions and overrides to the general setup as desired by the language team. Like the general setup file, the customization file is Python source. It uses an external object named S to define summit settings, and can, of course, contain any helper Python code (the file is executed only once, at the beginning of a posummit.py run). It has the following general layout:

# -*- coding: UTF-8 -*-
# kate: syntax Python;
# This file is included by scripts/messages.summit
# for language-specific additions/overrides.
# ...
# ... operations with object S and other Python code ...
# ...

Object S provides both data attributes and methods to configure different aspects of summit operation. For example, while the general setup specifies that text fields in summit catalogs should be unwrapped and split on tags, this can be overridden using the summit_unwrap and summit_split_tags attributes:

S.summit_unwrap = False
S.summit_split_tags = False

Or, to get the path to a file relative to the summit customization file itself (rather than to the current working directory, where posummit.py was executed), the method resolve_path_rooted can be used:

relpath = S.resolve_path_rooted("../whatever.txt")
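To show the mechanics of a setup file that is plain Python operating on an externally supplied object, here is a self-contained sketch. The class SummitSetup is a hypothetical stand-in written for this example, not Pology's real S object, and the way it resolves paths (joining against the directory of the setup file and returning an absolute path) is an assumption based only on the description above.

```python
import os


class SummitSetup:
    """Minimal stand-in for the S object (hypothetical, for illustration)."""

    def __init__(self, setup_path):
        self._root = os.path.dirname(os.path.abspath(setup_path))
        self.summit_unwrap = True
        self.summit_split_tags = True

    def resolve_path_rooted(self, path):
        # Interpret the path relative to the setup file, not to the cwd.
        return os.path.normpath(os.path.join(self._root, path))


S = SummitSetup("/repo/trunk/l10n-support/LANG/summit/messages.extras.summit")

# Contents of an imaginary customization file.
customization = """
S.summit_unwrap = False
relpath = S.resolve_path_rooted("../whatever.txt")
"""

namespace = {"S": S}
exec(customization, namespace)  # executed once, with S as a global name
print(namespace["relpath"])
```

Attributes that the customization code does not touch, such as summit_split_tags here, keep the values from the general setup.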

The following sections will present some typical customization possibilities.

Fully Local Merging

When scattering from the summit, sometimes there will be reports of "messages missing in the summit". This happens because of the time lag between Scripty merging branch POs and gathering summit templates, and the team coordinator merging the language summit, so that some messages in branch POs are not always present in the summit. This condition is benign, and such warnings will start to disappear as the message freeze approaches, but it can be annoying. For this reason, the team coordinator can stop Scripty from merging branch POs, and have the posummit.py ... merge command alone merge not only summit POs, but stable and trunk POs as well, such that the summit and branches are always in perfect sync.

First, to stop Scripty from merging branch POs, a file named no-auto-merge (with arbitrary content) should be committed to the roots of respective trees, e.g.:

$ touch $KDEREPO/trunk/l10n-kde4/LANG/messages/no-auto-merge
$ touch $KDEREPO/branches/stable/l10n-kde4/LANG/messages/no-auto-merge

Then, to make the posummit.py ... merge command merge everything, the following lines should be added to the summit customization file:

# Set local merging for all branches.
for branch in S.branches:
    branch["merge_locally"] = True

Once local merging of all branches is set, the coordinator can also use operation targets for selective merging, e.g. to merge only stable branch:

$ cd $KDEREPO/trunk/l10n-support
$ posummit.py scripts/messages.summit LANG merge stable:

Merging with Compendium

By default, msgmerge takes into account only the catalog itself when filling out near-match translations for new messages introduced by merging, i.e. when producing fuzzy messages. However, an arbitrary PO file can be given to msgmerge as another possible source of earlier translations, through the --compendium option. For maximum effect, this PO file is usually constructed as the collection of messages from all PO files project-wide, and hence called the compendium.

A compendium can be used in summit merge operations simply by setting the compendium_on_merge attribute in the summit customization file. If the compendium is located at $KDEREPO/trunk/l10n-support/LANG/summit/messages-compendium.po, then:

# Compendium to use when merging summit catalogs.
S.compendium_on_merge = S.resolve_path_rooted("messages-compendium.po")

Note that, if not located as above, the compendium should not be placed within .../LANG/summit/messages/, as then it would be considered a summit catalog itself. In case fully local merging is engaged, only summit POs will be merged with the compendium; it makes no sense for branch POs, since they are not being directly translated.

See the "Creating Compendia" section of the Gettext manual for details on how to create a compendium out of the current body of summit translations. Once you establish the precise commands to create the compendium (whether to collect fuzzies too, whether to include the old compendium as one of the sources for the new one, etc.), you would periodically refresh and commit the updated compendium.

Creation of Summit Catalogs on Merging

Translation of a summit catalog normally starts the same way as it did for branch catalogs, by copying it over from the template and initializing the header. This can be done manually, but it can also be quite automatic when using the project manager sometimes provided by dedicated PO editors. However, relying on the editor's project manager can be disadvantageous at times. For example, aside from the PO editor, other tools, frequently command-line ones, may be used to process the body of translation, and these tools might need to consider non-started templates as if they were empty POs (e.g. for statistics). Team members who do not have a full local checkout, or have no project set up in the editor, may need to jump between language and template directories when looking for files to translate.

Therefore, summit setup can be customized to automatically create ("vivify") summit catalogs for every new summit template, so that there is never the need (for a tool or human) to specifically treat templates as empty catalogs. This is done by adding the following lines into customization file:

# Create empty summit catalog for every new summit template.
S.vivify_on_merge = True
S.vivify_w_translator = "Noone Noonian <[email protected]>"
S.vivify_w_langteam = "Neverneese <[email protected]>"
S.vivify_w_plurals = "nplurals=2; plural=n != 1;"

# Minimum translation state to create branch catalog on scatter.
S.scatter_min_completeness = 0.9

The vivify_w_* attributes set the necessary data to initialize headers of newly created summit catalogs. Aside from those listed above, there is also the vivify_w_charset attribute, which is by default "UTF-8", and very probably should not be changed.

The scatter_min_completeness attribute does not relate to vivification of summit catalogs as such, but should invariably be set in this context. When scattering from the summit, normally the branch catalog is automatically created if there is a corresponding summit catalog. When vivification is engaged, this would result in creation of empty branch catalogs too, which is not desired for two reasons. Firstly, empty branch catalogs would become part of the released language pack, for which there is no practical reason. Secondly, KDE's i18n system regards an application with an installed catalog as translated, so messages coming from basic system catalogs (e.g. kdelibs4) would show through, resulting in a mostly untranslated user interface with specks of translation -- many users consider this rather ungainly. Therefore, scatter_min_completeness sets how complete the translation of a currently non-existing branch catalog should be after scatter (0.0 empty, 1.0 fully complete) for the branch catalog to actually be created.
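The threshold logic can be pictured with the following Python sketch. It assumes, purely for illustration, that completeness is the fraction of messages with a non-empty translation; how posummit.py actually measures completeness (for example by messages or by words) is not stated here, and both function names are made up.

```python
def completeness(msgstrs):
    """Fraction of messages with a non-empty translation (0.0 to 1.0)."""
    if not msgstrs:
        return 0.0
    return sum(1 for s in msgstrs if s) / len(msgstrs)


def should_create_branch_catalog(msgstrs, min_completeness=0.9):
    """Decide whether a not-yet-existing branch catalog gets created."""
    return completeness(msgstrs) >= min_completeness


empty = [""] * 10                      # freshly vivified, nothing translated
nearly_done = ["translated"] * 19 + [""]   # 19 of 20 messages translated

print(should_create_branch_catalog(empty))
print(should_create_branch_catalog(nearly_done))
```

With the 0.9 setting from the customization example, a freshly vivified catalog stays out of the branch, while one with 19 of 20 messages translated is created.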

If the compendium has been set for merging, every vivified summit catalog will also be merged against the compendium, to fill out as many messages as possible with approximate translations.

Scatter Hooks

((TODO. Checks, modifications...))

Merge Hooks

((TODO. Special header fields...))

Disadvantages to Summit and Remedies

Although hopefully outweighed by the advantages, working in the summit is not without its disadvantages. These should be weighed when deciding whether to try out the summit workflow.

Obviously, while summit operations are made to be quite automatic, some extra aptitude is asked of the team coordinator. Reasonable shell handling, understanding of version control operations, feeling the pulse of repository automation, are all prerequisites, and some scripting ability advantageous.

After the summit is put in operation, any changes made manually in branch POs will not propagate to the summit, and will soon be lost to scattering -- summit translations override everything in branches. This means that the whole team must work in the summit; it is not possible for some members to use the summit while others do not.

A summit PO file will necessarily have more messages than either of the branch files. For example, in the KDE 4.0/4.1 and 4.1/4.2 cycles, summit POs of core KDE modules had on average less than 5% more words than their stable counterparts. However, that percentage is only an upper limit, never actually reached, on the workload wasted due to trunk messages coming and going, since more and more trunk messages find their way into the next KDE feature release as it approaches.

Another, more pressing issue with the increased size of summit POs is the following scenario: a stable release is around the corner, and the team has no time to update summit POs fully, but could update only the stable messages in them. E.g. there are 1000 incomplete (untranslated and fuzzy) messages, out of which only 100 are from the stable branch. A clever dedicated PO editor could allow jumping only through incomplete messages that also satisfy a general search criterion, which in this case would be that a comment matches the #\.\+>.*stable regular expression. On the other hand, with some external help, it is enough if the PO editor can merely search through comments. Then, the posieve.py script (ready to use next to posummit.py) can equip incomplete stable messages with the incomplete flag (as in a #, ..., incomplete comment), and this flag can be searched for in the PO editor:

$ posieve.py tag-incomplete -sbranch:stable PATH_TO_PO_FILES_OR_DIRS

The incomplete tag need not be manually removed when the message is updated. It will automatically disappear on the next merge, as it is not among the flags known to Gettext.
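The selection rule applied by the tagging step (incomplete, and present in the given branch) can be sketched in Python as follows. This is a toy model over plain dicts and not the posieve.py implementation; the function name, the message layout, and the sample messages other than "Destination:" are all made up for the example.

```python
# Toy model of tagging; not the posieve.py implementation.
def tag_incomplete(messages, branch):
    """Flag untranslated or fuzzy messages that belong to the given branch.

    Returns the number of messages carrying the flag afterwards.
    """
    count = 0
    for msg in messages:
        incomplete = not msg["msgstr"] or "fuzzy" in msg["flags"]
        if incomplete and branch in msg["branches"]:
            if "incomplete" not in msg["flags"]:
                msg["flags"].append("incomplete")
            count += 1
    return count


messages = [
    {"msgid": "Destination:", "msgstr": "", "flags": [],
     "branches": ["stable"]},
    {"msgid": "Source:", "msgstr": "", "flags": [],
     "branches": ["trunk"]},
    {"msgid": "Speed:", "msgstr": "(old translation)", "flags": ["fuzzy"],
     "branches": ["trunk", "stable"]},
    {"msgid": "Open", "msgstr": "(translation)", "flags": [],
     "branches": ["trunk", "stable"]},
]

print(tag_incomplete(messages, "stable"))
```

The untranslated trunk-only message is left alone, which is the point of the exercise: only work relevant to the upcoming stable release gets flagged.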

There is also the organizational issue of starting to use the summit and, if it does not help as expected, ceasing to use it. Team members have to be reminded not to send in branch POs at the start, and then to be sent back to branch POs if the summit is disbanded. On the plus side, disbanding the summit is technically simple: just remove l10n-support/LANG/summit from the repository, possibly also the no-auto-merge files if fully local merging was set up, and that is it.

Another Way to Improve Branch Handling

If the summit seems a lot to digest, or is simply overkill for the team's needs, but some improvement to manual handling of branches would still be welcome, KDE's dedicated PO editor Lokalize offers a branch sync mode. It works as follows.

In the Lokalize project definition, the local paths of trunk and stable PO roots are set in the Translation directory: and Branch directory: fields. Then, when a trunk PO file is opened, if it has a stable counterpart with the same name and location as in trunk, this stable PO is opened as well. For each trunk message in the main editing pane, if such a message exists in the stable PO too, the stable message is shown in the Secondary Sync pane; changes to the translation of the trunk message are reflected in the stable message, and the stable PO file is saved when the trunk file is saved.

Furthermore, any team member can personally choose to work like this; there is no need to change the workflow of the language team as a whole. When sending modifications to the coordinator, team members who rely on this feature of Lokalize simply send both the trunk and the stable POs that got modified.