Localization/Concepts/PO Odyssey

From KDE TechBase
The PO Format
On Localization   Concepts
Prerequisites   Text Encoding
Related Articles   XML Markup, Gettext Tools
External Reading   Gettext Manual

Getting Translations Into Place

Before going into technical details of the PO (or any other) format, it is useful to examine the conceptual ways in which the text can flow from the author, through the translator, and to the user. Let us call this chain the translation pipeline, and consider the following example of it:

  • the author prepares a text document, say, in OpenOffice Writer;
  • the translator gets that document and translates it, also in Writer, replacing the original text with the translation paragraph by paragraph;
  • the user gets to read the translated document, in PDF, as output by Writer.

Clean and simple, everyone is happy, right? Wrong (you guessed it). This pipeline is, incidentally, the one which most people seem to imagine before they get involved in localization for real. But before explaining why it is wrong, and therefore not at all used in free software translation, let us cover more hypothetical ground.

The previous example was about static translation, such as of a text document or an HTML page. While the example pipeline was not appropriate, even with the proper pipeline the output for the end user must be a static translated document, such as a PDF file or another HTML page. How does this map onto a translated user interface in an application, which has live code running in the background? For starters, we can be unimaginative and follow the static route: the programmer may keep all the user interface text strings in a text file, which gets built into the application executable files when the installation package is built. Following the same poor pipeline, the translator may translate that text file, replacing string after string, after which another installation package is built, this time a localized one. So, as in the case of PDF files, where in the end there must be one per language, there would also be one application package for each language.

However, the bulk of a PDF file is the text itself, so there is almost no needless duplication of language-independent content. By contrast, in an average application, like a file manager or a web browser, the translatable text is a minuscule part of the total size of the installation package. Thus, having one package per language would make for an enormous waste of digital space. If we imagine a typical operating system distribution these days, this would pretty much prevent default installation disks from carrying anything but original English packages. Suffice it to say, static translation of applications is hardly an option for free software, which makes international reach one of its primary goals.

A more clever way of having localized applications is for them to draw translations at runtime. Returning to the previously mentioned file with user interface strings that the programmer had prepared (usually in English): instead of replacing it with a translated version, translated files of the same structure, one for each language, are now put alongside each other. The application is programmed to select strings from one of these files while running, based on the user's language settings. We will call this dynamic translation.
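As a rough illustration of dynamic translation, here is a minimal Python sketch. The in-memory tables stand in for the per-language string files; the keys, the function name ui_string, and the language code "sr" are assumptions made for this example only, and the strings are borrowed from examples used later in this article.

```python
# Hypothetical in-memory stand-ins for the per-language string files.
STRING_FILES = {
    "en": {"clusters": "Globular Clusters", "nebulae": "Gaseous Nebulae"},
    "sr": {"clusters": "Globularna jata", "nebulae": "Gasne magline"},
}

def ui_string(key, language):
    """Select a string at runtime from the table matching the user's language."""
    # Fall back to the English table when the language has no file at all.
    strings = STRING_FILES.get(language, STRING_FILES["en"])
    return strings[key]
```

The application code only ever asks for a key; which table answers is decided while running, so one package can serve every language.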

Now we come back to the translation pipeline. As explained, regardless of whether the translation is static (PDF files, HTML pages) or dynamic (application interfaces), in the end there is a file full of English text to be translated. Why is it then wrong to just open it up in OO Writer (or KWord, Abiword, etc.) and translate it by replacing the text paragraph by paragraph? There are two issues that make this approach infeasible:

Varying formats. While a pure text document, bound for PDF presentation, may be "just text", application user interface strings will need some extra data for the application to be able to pick them at runtime. Also, interface strings will contain various special substrings, with constraints on what may be done to them in translation. These aspects tend to differ between application frameworks (KDE, Gnome, etc.), which raises the question of validating the translated text--a faulty translation may break the application's behavior.

Maintenance. Software, be it applications themselves or their documentation, evolves over time according to users' needs. This means that the text also changes: new interface strings and documentation paragraphs are added, some removed, and some modified. If OO Writer were used for translation, when the time comes to update the translation, how would the translator know that, say, a new paragraph got inserted between paragraphs 128 and 129, and that paragraphs 42 and 86 have had one sentence modified?

To handle these issues, free software, by and large, went the following way: one translation file format has organically evolved, and many independent but complementary tools have been built to translate, validate, maintain, and convert this format to and from various target formats. Thus, whether translating user interface (in various application frameworks), documentation, man pages, release notes, or web content, the translator can do it efficiently by getting to know the translation pipeline built around this one file format.

Enter The PO Format

The PO format has been developed as the translation file format of the Gettext translation system, which is used today by a large part of free software. Given the introductory considerations on translation pipelines, it is useful to explain what exactly is meant by used. There are three distinct uses of the PO format:

  • Intermediate static translations. Static text data, such as software documentation, is converted from its source format to PO format, translated, and converted back into the original format. Out of that the final documents for user consumption, such as PDF files or HTML pages, are built.
  • Intermediate dynamic translations. Some software keeps user interface strings in their own custom format, as is the case with e.g. Mozilla and OpenOffice. Such custom formats are converted into PO for translation, then converted back for runtime consumption by the respective applications.
  • Native dynamic translations. Finally, many applications use PO format as the native format for their user interface strings, so that no conversion is necessary. These include KDE and Gnome desktop environments, GNU tools, etc. To be usable at runtime, translated PO files are only compiled into binary MO files.

This distinction should be kept in mind, as while the PO format is one, the text exposed by it for translation will have embedded elements which are tightly coupled with the source of what is translated. For example, user interface strings will frequently contain format directives, while documentation strings may be written with HTML-like markup (examples provided later in the text). This means that the translator should be aware, in general, of what is being translated through a particular PO file.

The development of the PO format has been, and is, driven solely by the needs of its users, as in time these needs become well formulated and generalizable; hence the earlier remark of "organically evolved". Thanks to this, features of the PO format other than the very basic can be gradually introduced as necessary, and stay out of the way when they are not. The format is quite compact, human-readable and editable without special-purpose tools (though, of course, these come in handy). These aspects benefit the learning curve, everyday usage, and explanatory texts such as this one.

Although translators will frequently prefer to work on PO format files using dedicated PO editors, which purport to hide "technical details" such as the underlying file format, they should nevertheless understand the PO format very well. This is because the PO format is not a mere vessel of text to be translated; in light of the way it has been developed, it also reflects important concepts in the translation pipeline. Or, to put it more concretely, the translator should know how a given dedicated PO editor exposes all the bits of information provided by the PO format.

Format Basics

The PO format is a plain text format, written in files with .po extension. A PO file contains a number of messages, partly independent text segments to be translated, which have been grouped into one file according to some logical division of what is being translated. For example, a standalone application will frequently have all its user interface messages in one PO file, and all documentation messages in another; or, user interface may be split into several PO files by major application modules, documentation split by chapters, etc. PO files are also called message catalogs.

Without further ado, here is an excerpt from the middle of a PO file, showing three of the most basic messages, untranslated:

#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr ""
⁠
#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr ""
⁠
#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr ""

Each message contains the keyword msgid, which is followed by the text in English, wrapped in double quotes. The keyword msgstr marks the string which is supposed to be the translation of the English one, also double-quoted. Thus, after you have gone through the PO file and added translations, these messages would read:

#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"
⁠
#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr "Gasne magline"
⁠
#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr "Planetarne magline"

Not terribly complicated, is it?
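To make the structure concrete, the following is a minimal Python sketch of reading such basic messages. It is an illustration only, not how Gettext tools are implemented: it assumes every msgid and msgstr fits on a single line and contains no escaped characters, and the name parse_simple_po is made up for this example.

```python
import re

PO_TEXT = '''\
#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr ""
'''

def parse_simple_po(text):
    """Return a list of (msgid, msgstr) pairs from single-line messages."""
    messages = []
    current_id = None
    for line in text.splitlines():
        match = re.match(r'^(msgid|msgstr) "(.*)"$', line)
        if not match:
            continue  # comments and blank lines are skipped in this sketch
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            # A msgstr closes the message opened by the preceding msgid.
            messages.append((current_id, value))
            current_id = None
    return messages

messages = parse_simple_po(PO_TEXT)
```

An empty msgstr simply shows up as an empty string, which is how an untranslated message is recognized.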

As usual with text formats, immediately something must be said about the encoding of a PO file: while you could use encodings other than UTF-8 if no non-ASCII letters are used in the original text, you really should use UTF-8 (in KDE this is even mandatory). The encoding is also specified within the PO file, and by default it is UTF-8; if you want to use another encoding, aside from writing out the file in it, you must specify it in the PO header.

Leaving some messages in the PO file untranslated is technically not a problem. For every untranslated message, consumers of PO files (applications, format converters) will show the English original to the user, so that not all information is lost. Of course, you should strive to have the PO files under your maintenance completely translated, so that users are not faced with a mix of translated and English text.
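The fallback to the English original can be sketched in a few lines of Python. The dictionary here is a hypothetical stand-in for a loaded catalog, and translate is a name chosen for this example; real consumers differ in detail.

```python
def translate(catalog, msgid):
    """Return the translation, or the original text when it is missing or empty."""
    # An empty msgstr and an absent message are treated alike in this sketch.
    return catalog.get(msgid) or msgid

catalog = {"Globular Clusters": "Globularna jata", "Gaseous Nebulae": ""}
```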

Source References

Each message above also contains the source reference comment, which is the line starting with #:. It tells from which source code file of the application (or source document of any kind), and the exact line in it, the message has been extracted into the PO file. This piece of data may look strange at first -- of what use is it to translators, to merit inclusion in the PO file? Since PO format has been developed for localizing free software, the source reference enables you to actually look up the message in the source file, when you need more context to translate a certain message. This does not require that you be a programmer too, as source code is sometimes readable enough to be able to reason about message context without real understanding of the code. For example, in some languages the text in title position is usually written in noun form, and it may not be apparent from the PO file alone if the message:

#: addcatdialog.cpp:45
msgid "Import Catalog"
msgstr ""

is of that kind. Then, by following the source reference, you see this statement in the file addcatdialog.cpp, line 45:

setCaption( i18n( "Import Catalog" ) );

The setCaption bit here is probably a dead giveaway that the message is used in a title position. Some dedicated PO editors provide very quick and comfortable source reference lookups at the press of a single shortcut, which makes this approach to context resolution that much more viable.

String Wrapping

When a message is long or contains some logical line-breaks, its original and translation strings may be wrapped in the PO file (usually with boundary at column 80), such as this:

#: indimenu.cpp:96
msgid ""
"No INDI devices currently running. To run devices, please select devices "
"from the Device Manager in the devices menu."
msgstr ""

This wrapping is entirely irrelevant in the environment where the message is used, be it in application user interface, documentation, or elsewhere. PO processing tools produce wrapping mostly as a convenience to translators who edit PO files with plain text editors. This means that you are free to wrap the translation (msgstr string) in the same way, differently, or not at all--the result will be the same. Just remember to enclose each wrapped line in double quotes, the same as with msgid. For example, this translation of the previous message:

#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr ""
"Nema INDI uređaja (...)"
"(...) u meniju uređaja."

would be completely equivalent to this one:

#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr "Nema INDI uređaja (...) u meniju uređaja."

Dedicated PO editors may not even show wrapping to the user, or may wrap on their own, independently of the underlying PO file. Curiously though, most of them seem to follow the original wrapping, at least by default. At any rate, if you would like to have all strings unwrapped, including msgid ones, or vice versa, there are command line tools to achieve this.
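Why wrapping does not matter can be shown with a small Python sketch: the quoted pieces are simply concatenated, so a wrapped and an unwrapped string yield the same text. The helper name join_po_string is an assumption of this example, and escape sequences are deliberately not handled.

```python
def join_po_string(lines):
    """Concatenate the quoted pieces of a (possibly wrapped) PO string."""
    pieces = []
    for line in lines:
        start = line.index('"')   # first double quote on the line
        end = line.rindex('"')    # last double quote on the line
        pieces.append(line[start + 1:end])
    return "".join(pieces)

wrapped = [
    'msgid ""',
    '"No INDI devices currently running. To run devices, please select devices "',
    '"from the Device Manager in the devices menu."',
]
unwrapped = [
    'msgid "No INDI devices currently running. To run devices, please select '
    'devices from the Device Manager in the devices menu."',
]
same_text = join_po_string(wrapped) == join_po_string(unwrapped)
```

Note that the space at the end of the first wrapped piece is part of the text; the line break itself contributes nothing.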

Uniqueness of Messages

A message in the PO file is uniquely identified by its msgid string (this is not entirely true, as will be explained later, but let us consider it approximately true for the moment). This means that, in the course of evolution of the source which is translated, a message may change some of its elements or the position within the PO file, but as long as it has the same msgid, it is the same message. Those non-identifying elements may be the translation, source reference comments, etc., and by the position we mean either raw line numbers, or relative ordering among other messages.

The first consequence of this fact is that the only reliable way to "report" a message is to state its msgid string in full, even if the person to whom you are reporting has access to its PO file. (You may want to point to a message when consulting with fellow translators, or when reporting a typo or another problem in the original text to the authors.) Newcomer translators are sometimes not briefed about this, and then they at first report the line number of the message, or its ordinal number in the range of all messages, without giving the msgid. Line numbers cannot work, for example, because of the line wrapping described previously, which may differ from one translator to another. Ordinals do not work because your PO file may be slightly older or newer than that of the other person, and the ordinals may have changed in the meantime.

The second consequence is that there cannot be two messages with the same msgid in the same PO file (again, not exactly true, see later). If the same text has been used two or more times in the source, then in the PO file it will appear as a single message, with its source reference comment (#:) listing all appearances. For example, the source reference of this message:

#: colorscheme.cpp:79 skycomponents/equator.cpp:31
msgid "Equator"
msgstr ""

shows that it is used at two places in the application source code. This feature of the PO format prevents needless duplication of work, by allowing you to go through any duplicate text in the source only once in the translation. This optimization can sometimes be a double-edged sword, though there is an elegant solution for the problem that can arise, as we will see shortly.
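The collapsing of repeated text into a single message can be sketched as follows. This is only an illustration of the idea, not the actual extraction tool; merge_occurrences and format_message are hypothetical names.

```python
def merge_occurrences(occurrences):
    """Collapse (source_ref, text) pairs into one entry per distinct text."""
    merged = {}
    for source_ref, text in occurrences:
        # The text is the key, so repeated text only adds a source reference.
        merged.setdefault(text, []).append(source_ref)
    return merged

def format_message(text, source_refs):
    """Render one untranslated message with all its source references."""
    return '#: {}\nmsgid "{}"\nmsgstr ""\n'.format(" ".join(source_refs), text)

occurrences = [
    ("colorscheme.cpp:79", "Equator"),
    ("skycomponents/equator.cpp:31", "Equator"),
    ("finddialog.cpp:38", "Globular Clusters"),
]
merged = merge_occurrences(occurrences)
equator_entry = format_message("Equator", merged["Equator"])
```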

The third consequence, so to speak, though more of a remark for clarity, is that you should never modify the msgid field. Not only would doing so serve no purpose, but if the msgid gets modified, a consumer of the translated PO file will not see the message as translated, since it looks for the message by matching the msgid field.

Message Context

Depending on the target language, sometimes it may be hard to translate a message well if treated in isolation, without any additional context. Naive translation may break style guidelines, or worse, misinterpret the meaning of the original text. To avoid this, there are several ways in which you can infer the context in which the message is used.

One way we have already seen: looking into the source file of the message, as pointed to by the source reference comment. But this way can be tedious. Not only may the source code look menacing to a translator untrained in programming, but also, while generally available, it is usually not very comfortable to keep all that source code lying around just for the sake of context lookups. This is a well understood difficulty, so friendlier context pointers have been devised.

One simple way to keep track of the context is to, when translating a given message, keep in sight several messages before and after it. As a trivial example, the following four messages:

#: locationdialog.cpp:228
msgid "Really override original data for this city?"
msgstr ""
⁠
#: locationdialog.cpp:229
msgid "Override Existing Data?"
msgstr ""
⁠
#: locationdialog.cpp:229
msgid "Override Data"
msgstr ""
⁠
#: locationdialog.cpp:229
msgid "Do Not Override"
msgstr ""

are pretty obviously a question in some kind of a message dialog, title of that dialog, and the two answer buttons, so that you know exactly how the messages are related. Aside from the pure meaning, such conclusions may be further supported by the English user interface conventions (title word case for dialog titles, but also for push buttons), and the source reference comments (here they show all four messages to be in two adjacent lines of the same file). As time passes, you will start to pick up patterns of this kind which are typical for the source environment, and be more confident in your estimates.

Up to now, all the context gathering rested on the shoulders of the translator. However, when authors of the original text, for example application programmers, are themselves well-aware of the translation issues, they can explicitly provide some context for translators. This is particularly warranted when a message is quite strange, puts technical limitations on the translation, is used in a specific way, and the like.

Extracted Comments

One place where messages store explicit context provided by the authors is within extracted comments, those which start with #. (hash and period). For example, the message:

#. i18n: A classical test phrase, with all letters of the English alphabet.
#. Replace it with a sample text in your language, such that it is
#. representative of language's writing system.
#: kdeui/fonts/kfontchooser.cpp:382
msgid "The Quick Brown Fox Jumps Over The Lazy Dog"
msgstr ""

has an extracted comment which tells you to avoid translating the English phrase for what it is, but to instead put there a phrase with the said properties in your language.

This kind of context usually begins with an agreed-upon keyword, which in the above case is i18n: (short for 'internationalization'), typical for KDE, but in principle depends on the source environment. In many other environments (e.g. Gnome) this keyword is the more direct TRANSLATORS:, which is the default for the Gettext translation system (under which the PO format is maintained).

Extracted comments can sometimes be provided not by a human author, but by a tool used to create or process PO files. For example, when markup-text documents are translated, such as HTML, or Docbook for documentation, the extracted comment frequently states the tag which wraps the text in the original document:

#. Tag: title
#: skycoords.docbook:73
msgid "The Horizontal Coordinate System"
msgstr ""

In the above example, by the #. Tag: title comment you are informed that the message is a title, and you can adjust the translation accordingly.

Another example where processing tools may provide extracted comments is when the PO file is created in a slightly roundabout way, such that source references in some messages do not really point to the source file, but to a temporary file which existed only during the creation of the PO file. To patch up a bit, the extracted comment may then state the true source:

#. i18n: file: tools/observinglist.ui:263
#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)
#: rc.cpp:5865
msgid "Point telescope at highlighted object"
msgstr ""

Here the rc.cpp:5865 is the dummy temporary source, whereas the true source file is given as file: tools/observinglist.ui:263. (The automatically extracted ectx: ... comment may look a bit code-cryptic, but you can still easily guess from it that this message is a tooltip for a push button.)

Disambiguating Contexts

Consider the following two messages from an application user interface:

#. i18n: First letter in 'Scope'
#: tools/observinglist.cpp:700
msgid "S"
msgstr ""
⁠
#. i18n: South
#: skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""

At first sight, you could say that it was nice of the programmer to add explicit context (#. i18n: ... lines), informing that the 'S' of the first message is short for 'Scope', and the 'S' of the second message short for 'South', so that translators know that they should use the letters corresponding to these words in their languages. But, can you spot the problem? The problem is that these messages cannot be part of a valid PO file, since, as said earlier, all messages have unique msgid strings. Instead, in a real PO file, these two messages would be collapsed into one:

#. i18n: First letter in 'Scope'
#. i18n: South
#: tools/observinglist.cpp:700 skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""

Both contexts are still there, translators are still well informed, but it is now required that the words 'Scope' and 'South' also begin with the same letter in the target language--an extremely unlikely proposal.

In these situations, the programmer can give messages a different type of context, called disambiguating context. These contexts are no longer presented as extracted comments, but through a full-fledged keyword string, the msgctxt:

#: tools/observinglist.cpp:700
msgctxt "First letter in 'Scope'"
msgid "S"
msgstr ""
⁠
#: skycomponents/horizoncomponent.cpp:429
msgctxt "South"
msgid "S"
msgstr ""

This is now a valid PO file, and you can translate each 'S' properly. By this we update the earlier approximation that messages must be unique by msgid strings: they must in fact be unique by the combination of msgctxt and msgid strings. If msgctxt string is missing, as it usually is, you can think of it as being present but empty.

A rather frequent case where disambiguating contexts are needed is when the original text is a single English adjective, used at several places in the source:

#: utils/kateautoindent.cpp:78 utils/katestyletreewidget.cpp:132
msgid "Normal"
msgstr ""

Many languages need to match an adjective form to the noun to which it refers by gender, so if the 'Normal' above refers both to indentation mode and text style, it is almost certainly necessary to provide disambiguating contexts:

#: utils/katestyletreewidget.cpp:132
msgctxt "Text style"
msgid "Normal"
msgstr "običan"
⁠
#: utils/kateautoindent.cpp:78
msgctxt "Autoindent mode"
msgid "Normal"
msgstr "obično"
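The refined uniqueness rule, that a message is identified by the combination of msgctxt and msgid, can be sketched in Python with a dictionary keyed by that pair. The catalog below is a hypothetical in-memory stand-in, and lookup is a name chosen for this example.

```python
def lookup(catalog, msgid, msgctxt=""):
    """Find a message by (msgctxt, msgid); a missing context counts as empty."""
    # Untranslated or unknown messages fall back to the original text.
    return catalog.get((msgctxt, msgid)) or msgid

catalog = {
    ("Text style", "Normal"): "običan",
    ("Autoindent mode", "Normal"): "obično",
    ("", "Equator"): "",  # no context, not yet translated
}
```

The same msgid now leads to two different translations, depending on the context supplied by the caller.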

You can, however, imagine that programmers in general cannot know when a certain phrase, the same in English when used in two contexts, needs different translations in some other language. This means that you, the translator, should ask them to add a disambiguating context when you determine that you need one. Free software programmers, for their part, are usually aware of this latent need, and readily reachable, so you should be able to get the request through with little communication overhead. Some common modes of such communication are briefly mentioned towards the end of this article.

As of this writing, the msgctxt keyword is a relatively fresh addition to the PO format. But the need for disambiguating contexts was observed much earlier, and different translation environments have historically used different custom solutions to provide them. Such older PO files are still to be found in good numbers, so it makes sense to present a few examples of the custom contexts. Since messages indeed had to be unique by msgid alone before the msgctxt keyword was introduced, the context had to become part of the msgid itself, embedded in it with some special syntax. If we take the first message from the previous example, here is how it would look in a KDE3 PO file:

#: utils/katestyletreewidget.cpp:132
msgid ""
"_⁠: Text style\n"
"Normal"
msgstr "običan"

The disambiguating context has been embedded at the beginning of the msgid, wrapped in _: ...\n (the msgid string itself is shown broken into two lines, as PO tools wrap strings at \n regardless of their length; more on this special character sequence later). In Gnome, the same message would look something like this:

#: utils/gatestyletreewidget.c:132
msgid "Text style|Normal"
msgstr "običan"

Here the context is again at the beginning of msgid, but is separated from the real text only by the pipe character, |.
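How a tool might separate such embedded contexts from the real text is sketched below in Python. This is a simplified assumption of the logic, not the code of any actual framework; in particular, a pipe character that is part of ordinary text would be misread here, and split_legacy_context is a made-up name.

```python
def split_legacy_context(msgid):
    """Return (context, text) for the two legacy embedded-context styles."""
    # KDE3 style: "_: <context>\n<text>"
    if msgid.startswith("_: ") and "\n" in msgid:
        context, text = msgid[3:].split("\n", 1)
        return context, text
    # Pipe style: "<context>|<text>"
    if "|" in msgid:
        context, text = msgid.split("|", 1)
        return context, text
    # No embedded context at all.
    return "", msgid
```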

Translator Comments

Sometimes you will need to translate a message without explicit context in a non-obvious way, after having determined that such a translation is needed by looking into the source, or by seeing the message live in the user interface at runtime. This may present a difficulty when the message is revisited, say, by a proof-reader in quality assurance, or by another translator some months later if the message got modified--either of them may conclude that the translation is wrong and mess it up, or at the very least waste time on querying why the translation is the way it is.

Conversely, sometimes you may be unsure if your translation is exactly right, e.g. if you have correctly guessed the context, or whether you have used correct terminology. In that case you can, of course, consult with fellow translators, but this can break your "flow" of translation. It is frequently better if such communication is delayed to the moment when the translation of the PO file is otherwise complete.

For these situations, you can write down your own reminders, doubts, inferred contexts, etc. in another type of comment, the translator comment. These comments start simply with # (hash and space), followed by any text whatsoever, and as with other comments, there may be any number of them. A hypothetical example:

# Wikipedia says that ‘etrurski’ is our name for this script.
#: viewpart/UnicodeBlocks.h:151
msgid "Old Italic"
msgstr "etrurski"

In real use, a translator comment like the one above would probably be written in the target language, as there is no reason for it to be in English. This is not to say that translator comments should never be in English; there may be situations when that would be advantageous--common sense applies.

Keep in mind that translator comments are the only type of comment that all well-behaved PO processing tools are guaranteed to preserve. For example, if you would write this kind of information as an extracted comment (#.), it would very soon perish, in one of the standard maintenance procedures. So stick to adding any personal remarks into translator comments, and nowhere else.

Constructive Substrings

Original text in a message frequently contains substrings which are not visible to the end user, but are instead used by the content producer (application, HTML engine) to construct the final visible text. Translators should reproduce such substrings in the translation as well, most of the time exactly as they are in the original, but sometimes also with a tweak or two.

For better or worse, constructive substrings tend to be tightly linked to the source environment of the text, for example the particular programming language in which the application is written, or the particular markup language for static content like documentation. To produce high-quality translations, you will benefit from having basic understanding of the constructive substrings possible in the source environment, of their function and behavior. (The prerequisite to this, as mentioned earlier, is that you are aware of what is the source of the text in the PO file.)

Format Directives

When a file manager shows a message like Really delete file tmp10.txt? or Open with KWrite, the 'tmp10.txt' and 'KWrite' parts certainly had to be added to the rest of the message at runtime. In such cases, the original text as seen by the translator will contain format directives, substrings which an application will replace with appropriate argument to construct the message as shown to the user. For example:

#: skycomponents/constellationlines.cpp:106
#, kde-format
msgid "No star named %1 found."
msgstr "Nema zvezde po imenu %1."

The format directive in this message is %1; the application will substitute it at runtime with the argument provided (probably) by the user as the name to search for. Format directives of the type %<number> are typical of KDE applications. A new type of comment has appeared as well, the flags comment. This comment begins with #,, followed by the comma-separated list of keywords, or flags, which clarify the state or the type of the message. In this example the flag is kde-format, confirming that any format directives in the message are of KDE type.
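The comment types met so far differ only in the character that follows the hash, which a short Python sketch can make explicit. The function names are assumptions of this example, and only the four comment types described in this article are recognized.

```python
COMMENT_KINDS = {
    "#.": "extracted comment",
    "#:": "source reference",
    "#,": "flags",
}

def classify_comment(line):
    """Name the type of a PO comment line, or return None if it is not one."""
    prefix = line[:2]
    if prefix in COMMENT_KINDS:
        return COMMENT_KINDS[prefix]
    if line == "#" or line.startswith("# "):
        return "translator comment"
    return None

def parse_flags(line):
    """Split a flags comment into its comma-separated keywords."""
    return [flag.strip() for flag in line[2:].split(",") if flag.strip()]
```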

Format directives differ across source environments, but are usually easy to recognize. The message above, if found in a Gnome application, would look like:

#: skycomponents/constellationlines.cpp:106
#, c-format
msgid "No star named %s found."
msgstr "Nema zvezde po imenu %s."

The format directive changed to %s, and the format flag to c-format. This is the format used by most applications written in C, and many written in C++. (In C format, the %s directive is for substituting string arguments, and another frequent directive is %d for integers; but there are many more. There may also be some numbers and punctuation between the percent sign and the letter, e.g. %03d.)

For one more example, to illustrate the diversity of format directives, had the application been written in Python, the message could look like:

#: skycomponents/constellationlines.cpp:106
#, python-format
msgid "No star named %(starname)s found."
msgstr "Nema zvezde po imenu %(starname)s."

Here the format directive is %(starname)s, which states the argument type as in C format (%s), but also its name in parentheses. Hence the python-format flag. You must not change this name, as otherwise the application will not be able to find it and make the substitution--which would probably make the application crash when it tries to use the message.
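What happens when the name is changed can be tried directly in Python. In this small sketch, the star name "Vega" and the renamed directive %(zvezda)s are made up for illustration.

```python
msgstr = "Nema zvezde po imenu %(starname)s."
args = {"starname": "Vega"}  # hypothetical argument supplied by the application

# The directive name matches the argument name, so substitution works.
shown = msgstr % args

# A translation in which the directive name was (wrongly) changed.
broken = "Nema zvezde po imenu %(zvezda)s."
try:
    broken % args
    failed = False
except KeyError:
    # Python cannot find an argument called 'zvezda'.
    failed = True
```

Unless the application guards against it, that KeyError is the kind of failure that ends in a crash.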

You only need to make sure that each directive from the original string is found in the translation, and very rarely to modify the directives themselves. Format flags, such as kde-format, c-format, etc. are there not only as info for translators, but they are also used by tools for checking PO files. For example, if you forget or mistype a directive in the translation, such tools will report it. Dedicated PO editors may warn on the spot, or when saving the file. This provides you with a "safety net", so long as you remember to perform the checks after completing the translation (if the editor does not do it automatically).
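The kind of check such tools perform can be approximated in Python as below. The patterns are deliberately simplified assumptions covering only the KDE and Python directives shown above; real checkers, such as msgfmt with its --check option, are considerably more thorough.

```python
import re

# Simplified directive patterns, keyed by format flag (assumptions of this sketch).
DIRECTIVE_PATTERNS = {
    "kde-format": r"%\d+",
    "python-format": r"%\(\w+\)[a-zA-Z]",
}

def missing_directives(msgid, msgstr, flag):
    """List directives present in the original but absent from the translation."""
    pattern = DIRECTIVE_PATTERNS[flag]
    expected = set(re.findall(pattern, msgid))
    found = set(re.findall(pattern, msgstr))
    return sorted(expected - found)
```

For instance, missing_directives("No star named %1 found.", "Nema zvezde po imenu.", "kde-format") reports %1 as missing, while the correct translation yields an empty list.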

One situation that may require modification of directives is when there are several of them, and they need to be ordered differently in the translation:

#: kxsldbgpart/libxsldbg/xsldbg.cpp:256
#, kde-format
msgid "%1 took %2 ms to complete."
msgstr "Trebalo je %2 ms da se %1 završi."

With KDE format directives, which are numbered, reordering is as simple as shown above. The same holds for the mentioned Python format, where directives are named. But for formats where directives are neither numbered nor named by default, like C format (where they only state the argument type), you can sometimes modify directives to the desired effect:

#: gxsldbgpart/libxsldbg/xsldbg.c:256
#, c-format
msgid "%s took %d ms to complete."
msgstr "Trebalo je %2$d ms da se %1$s završi."

If the directives are numbered or named, and there is more than one same-number or same-name directive, usually any of the duplicates can be dropped in the translation. This may be useful in a longer text, e.g. when in the translation a pronoun can be used instead of repeating the argument:

#: hypothetical.cpp:100
#, kde-format
msgid "%1 is the blah, blah, blah. With %1 you can blah, blah."
msgstr "%1 je bla, bla, bla. Pomoću njega možete bla, bla."

where njega is a pronoun used instead of another %1. Conversely, it is possible to repeat the directive if it better fits where the English original has used a pronoun.

Sometimes the programmer may not use a directive to substitute an argument at runtime, but instead concatenate the full text out of separate messages:

#: hypothetical.cpp:100
msgid "No star named "
msgstr ""

#: hypothetical.cpp:100
msgid " found."
msgstr ""

Presumably, the application will fetch the first message above, append to it the name that was searched for, and then append the second message. This kind of programming is considered one of the basic errors when striving for a translatable application, as it forces translators to "piece the puzzle together", which may not even be possible in every language. This is thankfully rare today, but when it does happen, while you can try to work around it, it is better to contact the authors and have the source code fixed.

Text Markup

Applications sometimes show parts of the text in non-plain text: certain words may be italic or bold, titles in larger font size, lists with bullets, etc. This is frequent, for example, in "What's this" texts and message boxes. Even richer typographic elements of this kind are usually found in documentation and other static content, where the final output should be reading and printing friendly. On the translator's end, such original text will contain markup, where words, phrases, and whole paragraphs may be wrapped with special tags.

The following messages show typical examples of markup in application user interface:

#: rc.cpp:1632 rc.cpp:3283
msgid "<b>Name:</b>"
msgstr ""

#: kgeography.cpp:375
#, kde-format
msgid "<qt>Current map:<br/><b>%1</b></qt>"
msgstr ""

#: rc.cpp:2537 rc.cpp:4188
msgid ""
"<b>Tip</b><br/>Some non-Meade telescopes support a subset of the LX200 "
"command set. Select <tt>LX200 Basic</tt> to control such devices."
msgstr ""

The markup in these messages is XML-like, where tags for visual formatting are specified as <tag>...</tag> wrappings around the visible text segments. For example <b>...</b> tells that the text inside should be shown in boldface, while <tt>...</tt> that a monospace font should be used, and lone <br/> introduces a line break (readers knowing some HTML will instantly recognize these tags).

Another frequent XML-like markup is used in documentation POs, which in KDE (and Gnome, and many other environments) are mostly written in the Docbook XML format:

#. Tag: title
#: blackbody.docbook:13
msgid "<title>Blackbody Radiation</title>"
msgstr ""

#. Tag: para
#: geocoords.docbook:28
msgid ""
"The Equator is obviously an important part of this coordinate system; "
"it represents the <emphasis>zeropoint</emphasis> of the latitude angle, "
"and the halfway point between the poles. The Equator is the "
"<firstterm>Fundamental Plane</firstterm> of the geographic coordinate "
"system. <link linkend='ai-skycoords'>All Spherical</link> Coordinate "
"Systems define such a Fundamental Plane."
msgstr ""

The Docbook tags are named somewhat differently to the HTML-like tags previously seen in application interfaces, stating the meaning of the text that they wrap rather than its visual appearance (so-called semantic markup). But it's all the same for you, except that knowing the meanings of text parts may be beneficial context-wise. Docbook tags will also sometimes provide one or a few attributes following the opening tag, such as <link linkend=...> above (HTML tags may do that too).

When translating markup text, you should, in general, reproduce the same set of tags in the translation, assigning them to appropriate translated segments. Under no circumstances may the tags themselves be translated (e.g. <title> or <emphasis>), since they are processed by the machine to produce the final formatted text. As for tag attributes (linkend='ai-skycoords' in the example above), attribute names are also never translated, but in rare occasions their values in quotes may be (usually when a value is clearly a human-readable text).

However, this is not to say that you should never modify markup. Especially with HTML-like tags, the markup in the original text is not so rarely sloppy (e.g. missing closing tags), and you are free to correct it in the translation. Another example would be in CJK languages, where bold text is hard to read, so CJK translators tend to remove <b> tags in favor of quotes. In general, the more you are familiar with the particular markup, the more you can work past directly copying it from the original text.
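As a rough aid for spotting accidentally dropped or mistyped tags, the tag names in the original and the translation can be compared mechanically. The following Python sketch is only an illustration: the names tag_names and tag_differences are made up for this example, the pattern is a simplification of real XML syntax, and, given the legitimate modifications just described, its output is a hint to look closer rather than a list of errors.

```python
import re
from collections import Counter

# Matches <tag>, </tag>, <tag attr='...'> and <tag/>; captures the tag name.
TAG_RE = re.compile(r"</?([A-Za-z][\w.-]*)[^<>]*?/?>")


def tag_names(text):
    """Count tag names, so that <b> and </b> both count towards 'b'."""
    return Counter(match.group(1) for match in TAG_RE.finditer(text))


def tag_differences(msgid, msgstr):
    """Return (missing, extra) tag names in msgstr relative to msgid."""
    original, translated = tag_names(msgid), tag_names(msgstr)
    missing = sorted((original - translated).elements())
    extra = sorted((translated - original).elements())
    return missing, extra
```

For the kgeography message above, a translation that keeps <qt> and <br/> but loses the <b>...</b> pair would be reported with 'b' listed twice as missing. Note that anything in angle brackets which merely looks like a tag is counted as well.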

In application interface POs, quite frequently there are parts in original text that may look somewhat like XML-like markup, for example:

#: utils/katecmds.cpp:180
#, kde-format
msgid "Missing argument. Usage: %1 <value>"
msgstr ""

The <value> here is not markup, but is shown verbatim to the user. It is a placeholder, an indicator to the user that a real argument should go in its place. Many languages tend to translate placeholders for this reason, and there is no technical issue with that. You should only exercise caution not to misjudge a tag for a placeholder (after a little experience with the particular markup, the difference is usually obvious).

There are also non-XML-like markups that tend to pop up for translation. One could be wiki markup, such as that of this very article:

#: poformat.txt:191
msgid "=== Extracted Comments ==="
msgstr ""

#: poformat.txt:193
msgid ""
"One place where messages store explicit context provided by the "
"authors is within ''extracted comments'', those which (...)"
msgstr ""

where ===...=== is the approximate equivalent of HTML's <h3>...</h3>, while ''...'' is the counterpart of <i>...</i>. Another markup type is the source language for man pages, troff:

# type: Plain text
#: ../../doc/man/wesnoth.6:55
msgid ""
"compresses a savefile (B<infile>)  that is in text WML format into "
"binary WML format (B<outfile>)."
msgstr ""

where B<...> is the equivalent of HTML's <b>...</b>.

When you are faced with a new kind of markup, which you have never worked with before, you should definitely at least skim through a tutorial or two about it. For XML-like markups used in KDE, there is a standalone article covering them from the point of view of translators.

Escape Sequences

There are a few special characters which cannot appear verbatim in the msgid or msgstr fields. For one, consider the plain double quote ("): since it is used to delimit field strings, a raw double quote inside the text would terminate the string prematurely, and invalidate the message syntax. Such characters are therefore written as escape sequences, a combination of the backslash and another character, which is interpreted into an appropriate single character when showing the text to users. The plain double quote is written as \":

#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "\"Lice\" na Marsu"

Another frequent escaped character is the newline, presented as \n:

#: kstarsinit.cpp:699
msgid ""
"The initial position is below the horizon.\n"
"Would you like to reset to the default position?"
msgstr ""
"Početni položaj je ispod horizonta.\n"
"Želite li da vratite na podrazumevani?"

Most PO tools unconditionally wrap the text at newlines, ignoring the designated wrap column, even when wrapping has been turned off. This is to increase readability when editing the PO file. If the text is not composed of markup (e.g. not HTML or Docbook), newlines are significant to the user too, so you should carry them over to the translation; for significance of newlines in markup text, see the article on markup. In general, unless you are confident that you can manipulate newlines in a certain way, you should follow the msgid lead.

Another two escape sequences, usually of much lower frequency than the double quote and the newline, are the tabulator \t and the backslash itself \\ (because a single backslash always starts an escape sequence). While other sequences are possible, they are extremely rare.
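The four escape sequences mentioned so far can be interpreted with very little code. The following Python sketch shows the idea; the function name unescape_po is hypothetical, and the sketch intentionally rejects the rarer sequences instead of trying to cover them.

```python
# The four escape sequences discussed above, keyed by the character
# that follows the backslash.
ESCAPES = {'"': '"', "n": "\n", "t": "\t", "\\": "\\"}


def unescape_po(text):
    """Turn the raw content of a msgid/msgstr string into the text shown to users."""
    result = []
    chars = iter(text)
    for char in chars:
        if char != "\\":
            result.append(char)
            continue
        follower = next(chars, None)
        if follower is None:
            raise ValueError("dangling backslash at end of string")
        if follower not in ESCAPES:
            raise ValueError(f"unsupported escape sequence: \\{follower}")
        result.append(ESCAPES[follower])
    return "".join(result)
```

Applied to the raw field content The \"face\" on Mars, the function yields the text with plain double quotes around the word face.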

Going back to double quotes, keep in mind that while the English original usually uses plain ASCII quotes, translations tend to use "fancy" quotes according to the orthography of the language:

#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "„Lice“ na Marsu"

This holds both for double and single quotes. So do check if your language defines any fancy quote pairs, and use them if it does.

Accelerators

In application interfaces, short texts on widgets used to perform an action or open a dialog frequently have one letter in them underlined. This indicates that when the user presses the Alt key and that letter, the corresponding action will be activated. Such letters are called accelerators, and they are selected in the translation usually by preceding them with a special character for that purpose, the accelerator marker:

#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "Zadaj fokus &ručno..."

In KDE the accelerator marker is the ampersand (&). Thus, the accelerator in the message above will be the letter 'M' in the English text, and the letter 'r' in the translation. Accelerator markers tend to differ across environments, e.g. Gnome uses the underscore (_), OpenOffice the tilde (~), etc.

How to choose accelerators in the translation (where to put the accelerator marker) may be tricky, as you can easily get into situations where in the same interface context (e.g. within one menu) two items end up having the same accelerator. This will not do anything too bad, e.g. the application may automatically reassign the conflicting accelerators, or the user may have to press the Alt+accelerator several times to go through all such items. Still, conflicting accelerators are not nice, but there is no way to positively avoid them; you can only try to track the message context in the PO file, and check the running applications. This is not only the problem of translation, as not so rarely the English original itself produces conflicting accelerators!

CJK languages use input methods different to alphabet-type ones (keyboard layouts), so instead of assigning an ideogram as the accelerator, they add a single English letter for that purpose:

#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "フォーカスを手動でセット(&M)..."

This letter is usually picked to be the same as in the original, therefore reducing the possibility of accelerator conflicts to as much as the programmers were able to avoid conflicts themselves.

The accelerator does not have to be positioned at the start of a word, but can be put next to any letter or number. A reasonable order of choices would be: at the start of the most significant word in the message by default, then if it conflicts with another message, at the start of another word, and if it still conflicts, inside one of the words.

Since the accelerator marker is typically not such a rarely used character, it may appear in contexts in which it does not mark an accelerator. For example:

#: kspopupmenu.cpp:203
msgid "Center && Track"
msgstr ""

#. Tag: phrase
#: config.docbook:137
msgid "<phrase>Configure &kstars; Window</phrase>"
msgstr ""

In the first message above, the accelerator marker has been used to escape itself, to produce a verbatim ampersand in the output (similar to escape sequences, where a double backslash represents a verbatim backslash). In the second message, the ampersand is used to insert an XML entity &kstars;, about which you can read more in the article on markup. That the character is not used as an accelerator marker can only be determined from context, but after gaining a little experience, the distinction will almost always be obvious to you.
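The rules from this section (marker before a letter or number, doubled marker as a verbatim character, ampersand possibly starting an XML entity) can be put together in a small Python sketch. The function name find_accelerator is an assumption for this example, and the entity test is a plain guess based on form, so it can misjudge unusual texts; as said above, only context settles such cases.

```python
import re

# Something shaped like an XML entity, e.g. &kstars;
ENTITY_RE = re.compile(r"&[\w.-]+;")


def find_accelerator(text, marker="&"):
    """Return the character marked as accelerator in text, or None."""
    i = 0
    while i < len(text):
        if text[i] != marker:
            i += 1
            continue
        if text.startswith(marker * 2, i):
            i += 2  # doubled marker stands for the verbatim character
            continue
        if marker == "&":
            entity = ENTITY_RE.match(text, i)
            if entity:
                i = entity.end()  # skip what looks like an XML entity
                continue
        follower = text[i + 1:i + 2]
        if follower.isalnum():
            return follower
        i += 1
    return None
```

For other environments the marker can be passed explicitly, e.g. marker="_" for the underscore. Collecting the results for all messages of one menu and looking for duplicates would then point to possible accelerator conflicts.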

Plural Forms

Applications frequently need to report to the user the number of objects in a given context: "10 files found", "Do you really want to delete 5 messages?" etc. Of course, in English such messages should also have singular counterparts, like "1 file found", "...delete 1 message?". This means that two separate English texts are needed in the PO file, one covering the singular, and another the plural case. You could assume that these would then be two messages, like in this hypothetical example:

#: hypothetical.cpp:100
#, kde-format
msgid "Time: %1 second"
msgstr ""

#: hypothetical.cpp:101
#, kde-format
msgid "Time: %1 seconds"
msgstr ""

where the application fetches the first message when the number of objects is 1, and the second message for any other number.

However, while this works for some languages other than English (e.g. Spanish, German, French...), it does not work for all languages. The reason is that, while English needs one text for unity, and another text for any other number, many languages have it more complicated. For example, in some languages the singular form is used for all numbers ending with the digit 1, so the application would be in error to fetch the singular form only for the number exactly 1. Furthermore, in some languages more than two texts are needed, for example three: one for all numbers ending in 1, a second for all numbers ending in 2, 3, 4, and a third for all other numbers.

To handle this diversity, the PO format implements plural messages. The example above in reality looks like this:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] ""
msgstr[1] ""

The English singular form is given by the msgid field, and the plural form by the msgid_plural field. There are now several msgstr fields, with zero-based indices in square brackets, so that you can write as many translations as there are plural forms in your language. By default there will be two msgstr fields, but you may plainly insert the line with the third one (index 2), and so on. Then, the Spanish translation, which has the same plural forms as English, looks like:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Tiempo: %1 segundo"
msgstr[1] "Tiempo: %1 segundos"

while the Polish translation, which needs three plural forms, is:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Czas: %1 sekunda"
msgstr[1] "Czas: %1 sekundy"
msgstr[2] "Czas: %1 sekund"

But, how should the application know which form corresponds to which numbers? The specification for this is written within the PO file itself, in the header (more on PO headers below); it consists of the number of plural forms which every plural message in the given PO file shall have, and a computable logical expression, which for any given number, computes the index of the plural form to be used. This expression is quite cryptic-looking, but you do not have to really understand how it works. Since it is constant for a given language, you can just copy it from any other previously translated PO file in your language, and by looking at plural messages in that other file, you will clearly see which form (by index of msgstr) is used in which situation. Bearing this in mind, just to complete the examples, here is the plural specification for Spanish:

nplurals=2; plural=n != 1;

and for the more complicated Polish plural:

nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);

The nplurals field tells how many forms there are, and plural is the expression which computes the index of the msgstr field for the given number n (if the syntax is familiar to you, that's because you know some C).
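If the C-like syntax is not familiar, the two expressions above can be spelled out in Python as follows. This is only a hand-made transcription for illustration; the function names are hypothetical, and applications evaluate the expression from the PO header themselves.

```python
def plural_index_spanish(n):
    """Transcription of: plural=n != 1"""
    return int(n != 1)


def plural_index_polish(n):
    """Transcription of the Polish expression shown above."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def pick_form(forms, plural_index, n):
    """Select the msgstr[i] for n and substitute the number for %1."""
    return forms[plural_index(n)].replace("%1", str(n))


polish_forms = ["Czas: %1 sekunda", "Czas: %1 sekundy", "Czas: %1 sekund"]
```

With these definitions, pick_form(polish_forms, plural_index_polish, 22) selects the form with index 1, while the numbers 5 and 12 both select index 2.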

Sometimes you will come upon a message, or pair of messages which are just like the first, hypothetical example above -- having a number in it, but not presented as plural message, when you clearly see it should be. In most environments today (e.g. in KDE or Gnome), this simply means that the programmer forgot to use the plural message. Since this is to be considered a bug, you should inform application authors to replace the ordinary with the plural message. In some environments, however, applications are not capable of handling plurals, mostly when PO format is used as intermediate (e.g. for OpenOffice). If that is the case, you can only try to translate the message in a "least bad" way.

At the time when KDE was introducing plural messages, PO format's native support for them was still very new. Thus, similarly to disambiguation contexts, in KDE 3 plural messages were embedded in ordinary messages. Since you may still get to translate a few stray KDE 3 PO files, here is how the previously shown Polish-translated message would look in one:

#: mainwindow.cpp:127
msgid ""
"_n: Time: %n second\n"
"Time: %n seconds"
msgstr ""
"Czas: %n sekunda\n"
"Czas: %n sekundy\n"
"Czas: %n sekund"

The starting _n: in the msgid determines that the message is plural, and plural forms are separated by newlines, in both the original and the translation. Instead of an ordinary numbered placeholder, a special %n placeholder is used for the number.
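To make the embedding concrete, here is a minimal Python sketch that takes such a message apart. The function name split_kde3_plural is made up for this example, and the arguments are assumed to be the field values after escape sequences have been interpreted, so that \n is a real newline.

```python
KDE3_PLURAL_PREFIX = "_n: "


def split_kde3_plural(msgid, msgstr):
    """Split an embedded KDE 3 plural message into lists of forms.

    Returns (original_forms, translated_forms), or None if msgid does not
    start with the plural prefix, i.e. if it is an ordinary message.
    """
    if not msgid.startswith(KDE3_PLURAL_PREFIX):
        return None
    original_forms = msgid[len(KDE3_PLURAL_PREFIX):].split("\n")
    translated_forms = msgstr.split("\n") if msgstr else []
    return original_forms, translated_forms
```

For the message above this gives two original forms and three translated ones, matching the msgid, msgid_plural and msgstr[0] to msgstr[2] fields of the native plural message.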

Omitting The Number

Quite frequently English singular form will omit the number, that is, only the plural form will contain the format directive for the number:

#: modes/typesdialog.cpp:425
#, kde-format
msgid "Are you sure you want to delete this type?"
msgid_plural "Are you sure you want to delete these %1 types?"
msgstr[0] ""
msgstr[1] ""

It depends on the environment whether it is allowed to omit the number like this. For example, in KDE applications (kde-format flag) it is always possible, and so it is in Gnome (c-format), but not in pure Qt (qt-format). In the translation, if the environment supports omission, you can omit or retain the number in singular according to what is better language-wise, and regardless of whether or not it was omitted in the original. More precisely, you can omit the number in any form that is used for exactly one number. Conversely, if all forms are used for more than one number (e.g. the "singular" form is used for all numbers ending in digit 1), you cannot omit the number at all.
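Which forms are "used for exactly one number" follows from the plural expression of the language, and can be found by simply trying numbers out. The Python sketch below does that for a limited range; the function names are hypothetical, the limit of 1000 is an arbitrary assumption, and the second rule is a transcription of an expression of the kind used by languages where the first form covers all numbers ending in 1 except those ending in 11.

```python
def plural_index_polish(n):
    """Transcription of the Polish plural expression shown earlier."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def plural_index_ending_in_one(n):
    """A rule where form 0 is used for 1, 21, 31... (but not 11, 111...)."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def forms_used_for_one_number(plural_index, nplurals, limit=1000):
    """Map each form index that is hit by exactly one number below limit
    to that number; only in these forms may the number be omitted."""
    numbers = {index: [] for index in range(nplurals)}
    for n in range(limit):
        numbers[plural_index(n)].append(n)
    return {index: hits[0] for index, hits in numbers.items() if len(hits) == 1}
```

For the Polish rule the result is {0: 1}, so the number may be omitted in msgstr[0] only. For the second rule the result is empty, meaning that the number has to be kept in all forms.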

On rare occasions a plural message will have no number in either English singular or plural, when the programmer merely wanted to choose between the forms for "one" and "several". This is perfectly valid:

#: kgpg.cpp:498
msgid "Decryption of this file failed:"
msgid_plural "Decryption of these files failed:"
msgstr[0] ""
msgstr[1] ""

In such cases, in translation you should just use the same plural text for all forms but the one which is used for unity (if there is any such).

In old embedded plurals in KDE3 PO files, the %n placeholder can be omitted following the same rules.

Merging With Templates

At one point you will have translated the whole PO file, every message in it, and sent it back to the source where it is used. As time passes by, however, the original text at the source is going to change. Applications will get bug fixes and new features, which will require both new strings in the user interface, and modifications to some existing. Documentation will get new chapters, old chapters expanded, old paragraphs modified to better style. At some point you will want to update your old translation, so that the source is again fully translated into your language.

This is done in the following way. On the one side, there is your last translated version of the PO file. On the other side, there is the latest pristine PO, with non-translated messages corresponding to the current state of the source. Pristine PO files are actually called templates, and have the .pot extension, unlike the .po extension of translated POs. The translated PO file and the template are then merged in a special way, producing a new, partially translated PO for you to work on. The technicalities of merging are not so important at first, as in any established translation project you can just fetch the latest merged PO files; more important is what you can expect to see in a merged PO file.

In general, merged PO files contain four categories of messages. First are those messages which were present in the PO file when you last worked on it, in the sense of having unchanged msgctxt and msgid fields since then. As expected, their translations (msgstr fields) are as you left them, so there is nothing new for you to do about these messages. The second category are entirely new messages, added in the source in the meantime, which you should now translate. New messages won't be added in an arbitrary way, for example simply appended to the end of the PO file. Instead they will be interspersed with translated messages, following the order of appearance of messages in the current source. This allows you to infer contexts by considering the preceding and following messages, same as you did when you were translating the PO from scratch. For example:

#: fitshistogram.cpp:347
msgid "Auto Scale"
msgstr ""

#: fitshistogram.cpp:350
msgid "Linear Scale"
msgstr "linearna skala"

#: fitshistogram.cpp:353
msgid "Logarithmic Scale"
msgstr "logaritamska skala"

The first message is a new one, untranslated, and the other two are old, translated earlier. From these two you can see that the new message is one among a selection of scales (possibly for a diagram axis), and not e.g. a command or option to change the size of something, as in "scale automatically".

Fuzzy Messages

The most interesting, however, is the third category of messages in a merged PO file. These are the old messages which were somewhat modified in the meantime, i.e. one or both of their msgctxt and msgid fields have changed. Or, this can also be a new message, but very similar to one of the old ones. There is actually no way to tell between the two, it is only by similarity to one of the old messages that a modified or new message falls into this category. Either way, such a message is called fuzzy, and looks like this:

#: src/somwidget_impl.cpp:120
#, fuzzy
#| msgid "Elements with boiling point around this temperature:"
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"

The fuzzy flag states that the message is fuzzy. The comment starting with #| is called a previous-field comment, as it contains the previous value of the msgid field, which corresponds to the translation as given by the msgstr. This translation is, however, not valid for the current (non-commented) msgid field. By comparing the previous and current msgid, you can see that the word "boiling" was replaced with "melting", and you can adjust the translation accordingly. Once you have done that, to unfuzzy the message you should remove the fuzzy flag and the previous-field (#|) comments, so that the final updated message is:

#: src/somwidget_impl.cpp:120
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom topljenja u blizini ove temperature:"

The previous-field comments are also a relatively newer addition to the PO format, so that in some translation environments you will not see them in merged POs. The fuzzy message would then be presented only with the fuzzy flag:

#: src/somwidget_impl.cpp:120
#, fuzzy
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"

It may seem that this is no great loss: so long as you are visually comparing texts, instead of comparing the previous (here missing) and current msgid, you might as well compare the current msgid and the old translation given in msgstr, and adjust the translation based on that. However, there are two disadvantages to this. Less importantly, it may not always be easy to spot a difference by comparing the new original and the old translation. For example, only a typo or a missing dot may have been fixed in the original, leaving you to wonder if you are missing something. More importantly, a dedicated PO editor can use the previous and current msgid to highlight differences between them, which makes it that much easier for you to see them. Even if you are working with an ordinary text editor, there are command-line scripts which can embed differences into the previous msgid, again making them easier to spot. And the bigger the message, the more important it is to have automatic highlighting -- think of a long paragraph where only one word has been changed. For these reasons, if the merged PO files you work on do not have previous-field comments, do inquire with the authors if they can enable them (they may simply not know about this possibility, as it is not the default behavior on merging).
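The kind of highlighting meant here can be sketched in a few lines of Python using the standard difflib module. This is not the code of any particular editor or script; the function name word_diff and the {-removed-} {+added+} notation are assumptions chosen for this example.

```python
import difflib


def word_diff(previous, current):
    """Mark word-level differences between the previous and current msgid."""
    old_words, new_words = previous.split(), current.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i1 < i2:  # words present only in the previous msgid
            pieces.append("{-" + " ".join(old_words[i1:i2]) + "-}")
        if j1 < j2:  # words present only in the current msgid
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)
```

For the fuzzy message above, the previous and current msgid give "Elements with {-boiling-} {+melting+} point around this temperature:", so the single changed word stands out even in a much longer text.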

Aside from msgid, the msgctxt field can also feature in the previous-field comment. Whether one or both of the msgctxt and msgid have been changed, both will be given in previous-field comments:

#: kstarsinit.cpp:451
#, fuzzy
#| msgctxt "Constellation Line"
#| msgid "Constell. Line"
msgctxt "Toggle Constellation Lines in the display"
msgid "Const. Lines"
msgstr "Linija sazvežđa"

But in particular, a message will be fuzzied if it previously had no msgctxt and got one after merging, or had one and lost it. In the first case, the previous-field comments will contain only the msgid, although it may be the same as the current one; by this you will know that the change was only the adding of context. In the second case, the previous-field comments will contain both the msgctxt and the msgid fields, while there will be no current msgctxt. Here are the two examples:

#: kstarsinit.cpp:444
#, fuzzy
#| msgid "Solar System"
msgctxt "Toggle Solar System objects in the display"
msgid "Solar System"
msgstr "Sunčev sistem"

#: finddialog.cpp:102
#, fuzzy
#| msgctxt "object name (optional)"
#| msgid "Andromeda Galaxy"
msgid "Andromeda Galaxy"
msgstr "Andromeda, galaksija"

It is important for a message to become fuzzy when only the disambiguating context is added or removed, because this has been done precisely to shed some light on the original text, which may require modification of the translation.

Treatment of Fuzzy Messages

Fuzzy messages are a special category only from translators' viewpoint. Consumers of PO files (applications, etc.) will treat them as ordinary untranslated messages, i.e. they will use the English original instead of the old translation. This is necessary, as there is no telling how inappropriate the old translation may be for the current original. The algorithm that produces fuzzy messages will sometimes turn out rather strange pairings, which to you or to the user may not look similar at all.

That a fuzzy message is treated as untranslated is important to keep in mind. Fresh translators will sometimes manually add the fuzzy flag to a message to mark they are not entirely sure that the translation is proper, not knowing that this will totally exclude the translation from being used. Thus, you should manually add the fuzzy flag only when you are so unsure of the meaning of the message, that you explicitly want to prevent the translation from being used. This is fairly rarely needed. Instead, when you just want to mark the message so that you or someone else can check it later, you should write your doubts in a translator comment.
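From the consumer's side, the rule can be summarized by the following Python sketch. The Message class and the text_for_user function are hypothetical stand-ins made up for this example, not the interface of any real library.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """A bare-bones stand-in for one PO message."""
    msgid: str
    msgstr: str = ""
    flags: set = field(default_factory=set)


def text_for_user(message):
    """Return the text a consumer of the PO file would show."""
    if "fuzzy" in message.flags or not message.msgstr:
        # Fuzzy and untranslated messages are handled alike:
        # the English original is used.
        return message.msgid
    return message.msgstr
```

A message with a complete translation and a manually added fuzzy flag thus ends up shown in English, which is exactly the effect that fresh translators tend to overlook.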

Obsolete Messages

The last, fourth category are obsolete messages. These are the messages which are neither present in the source content any more, nor were judged by the merging algorithm as appropriate to base a fuzzy message on. All obsolete messages are grouped at the end of the merged PO file, and fully commented out by the #~ comment:

#~ msgid "Set the telescope longitude and latitude."
#~ msgstr "Postavi geo. dužinu i širinu teleskopa."

Obsolete messages have no extracted comments or source references, as they are no longer present in the source. Translator comments and flags will be retained, as they don't depend on the presence in the source.

It could be said that obsolete messages are in fact no messages at all, given that they don't exist from the point of consumers of the PO file, and there is nothing for translators to do with them. PO tools in general will ignore them, except to sometimes preserve them if modifying the PO file. Dedicated PO editors will invariably not show obsolete messages to the user, and may provide an option to automatically remove them from the file on saving.

What is then the purpose of obsolete messages? It frequently happens that a section of the source content, e.g. the code around a certain feature of an application, is temporarily removed. Authors sometimes want to improve a section separately, outside of the main content which is being translated, and sometimes a section is even briefly omitted by mistake when there are moves and renames in the source. When this happens, the affected messages will become obsolete in the merged PO; but, when the missing section is put back into the source, the merging algorithm will take obsolete messages into account, and promote them to real messages (either translated or fuzzy) where possible. Thus, some needless translation work may be saved.

What you should do with obsolete messages depends on the tools with which you work on PO files. For example, if you and other translators working on the given PO all use dedicated PO editors with internal storage of all previously encountered translations, the translation memory, there is less need for keeping obsolete messages around, as the editor will be able to fill new messages from the memory; but there are some difficulties, such as the need for translators to share the same memory. In practice, many translators opt to keep obsolete messages around for some time, and periodically (e.g. months apart) remove them from PO files. By this they achieve that accidental removals of source content, which are quickly corrected, rarely bother them, while avoiding accretion of far too much obsolete material.
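To tie the four categories together, here is a heavily simplified Python sketch of merging. It is an assumption-laden illustration and not the algorithm of the real merging tools: messages are reduced to their msgid (no msgctxt, comments or plurals), similarity is judged by the standard difflib module with an arbitrary cutoff, and obsolete messages from earlier merges are not taken into account. The name merge_catalog is made up for this example.

```python
import difflib


def merge_catalog(old, template_msgids, cutoff=0.6):
    """Merge old translations (dict msgid -> msgstr) with current msgids.

    Returns (merged, obsolete): merged is a list of messages in template
    order, obsolete maps no longer used msgids to their translations.
    """
    current = set(template_msgids)
    gone = [msgid for msgid in old if msgid not in current]
    merged, reused = [], set()
    for msgid in template_msgids:
        if msgid in old:
            # First category: unchanged message, translation kept as is.
            merged.append({"msgid": msgid, "msgstr": old[msgid],
                           "state": "translated"})
            continue
        close = difflib.get_close_matches(msgid, gone, n=1, cutoff=cutoff)
        if close:
            # Third category: similar to a vanished message, hence fuzzy.
            reused.add(close[0])
            merged.append({"msgid": msgid, "msgstr": old[close[0]],
                           "state": "fuzzy", "previous_msgid": close[0]})
        else:
            # Second category: entirely new message.
            merged.append({"msgid": msgid, "msgstr": "",
                           "state": "untranslated"})
    # Fourth category: vanished messages not used for any fuzzy match.
    obsolete = {msgid: old[msgid] for msgid in gone if msgid not in reused}
    return merged, obsolete
```

Note how the merged list follows the order of the template, which is why new messages appear interspersed with the translated ones rather than appended at the end.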

Starting a New PO file

In light of the translation maintenance through the merging process, you can think of starting to work on a never-before translated PO file as just the "initial merge": you will have to take the template and rename it to something with the .po extension, and work from there on. What you rename it to depends on the environment, but it is usually one of two things: either the same name as that of the template but with the .po extension (like in KDE), or your language code with the .po extension (like in Gnome). This basically depends on the organization of the particular translation project.

On the other hand, sometimes for each template in the project an empty PO for your language will have been created and put in a proper place in the source tree, so that you can just start translating it when you get to it.

At any rate, when you start working on a PO file from scratch, the first thing you should do is fill out its header.

PO Header

The very first message in each PO file is not a real message, but the header, which records many administrative and technical pieces of information about the PO file. Here is one pristine header, before any translation on the PO file has been done:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR This_file_is_part_of_KDE
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-03 10:09+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <[email protected]>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=INTEGER; plural=EXPRESSION;\n"

The header consists of introductory comments, followed by the empty msgid, and by the msgstr which contains the header fields. The header comments, like those of normal messages, are not entirely free-form, but have some structure to them. The msgstr is divided by newlines (\n) into fields of the form name: value (the name of a piece of information and the information itself). Although the header is pristine, some of the environment-dependent values are typically already supplied, e.g. wherever KDE is mentioned above. The fuzzy flag indicates that the PO file has not been translated yet. All-uppercase text segments are placeholders which you should replace with real values. The header updated to reflect the translation state could look like this:

# Translation of kstars.po into Spanish.
# This file is distributed under the same license as the kdeedu package.
# Pablo de Vicente <[email protected]>, 2005, 2006, 2007, 2008.
# Eloy Cuadra <[email protected]>, 2007, 2008.
msgid ""
msgstr ""
"Project-Id-Version: kstars\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-01 09:37+0200\n"
"PO-Revision-Date: 2008-07-22 18:13+0200\n"
"Last-Translator: Eloy Cuadra <[email protected]>\n"
"Language-Team: Spanish <[email protected]>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"

Even though this particular header has been slightly abridged for clarity, it probably still looks menacing, with a lot of data -- are you supposed to get all that right manually? Not really. If you are using a dedicated PO editor, it will have a nice configuration dialog where you can enter data about yourself, your language, etc., and whenever you save a PO file, the editor will automatically fill out the header. If you are using a plain text editor, there are command line tools which can similarly fill out the header automatically. But even with such aids, it is worth giving a few general directions about header comments and fields.

The first comment usually serves as the title, saying something about what is translated into which language. The second comment says something about licensing. The following comments each list a translator who at one time worked on this particular PO file: name, email address, and years of contribution. After that, any free-form comments may be added. The fuzzy flag has been removed, as the file has been worked on.

The Project-Id-Version header field states the name and possibly the version of what is translated, Report-Msgid-Bugs-To gives the address to write to when you discover problems in the original text, POT-Creation-Date the time when the catalog template was created, PO-Revision-Date the time when the PO file was last edited by a translator, Last-Translator the name and address of the last translator who worked on the file, and Language-Team the name and address of the translation team (if any) which the last translator is part of. The fields MIME-Version, Content-Type, and Content-Transfer-Encoding are pretty much always, and for any language, as given above, so they are not interesting (you could change the encoding to something other than UTF-8, but in this day and age really think thrice before you do that). The final field, Plural-Forms, is where you write the plural specification for your language (as explained in the section on plural forms).
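To make the name: value structure concrete, here is a minimal Python sketch that splits a header msgstr into its fields. It assumes that the msgstr has already been unquoted into a single string with real newlines; the function name parse_header is made up for this example.

```python
def parse_header(msgstr: str) -> dict:
    """Split the header msgstr into a dictionary of its fields."""
    fields = {}
    for line in msgstr.split("\n"):
        if not line.strip():
            continue  # skip the empty piece after the final newline
        # Split on the first colon only, since values may contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = (
    "Project-Id-Version: kstars\n"
    "Report-Msgid-Bugs-To: http://bugs.kde.org\n"
    "PO-Revision-Date: 2008-07-22 18:13+0200\n"
    "Plural-Forms: nplurals=2; plural=n != 1;\n"
)
fields = parse_header(header)
```

Splitting on the first colon only matters for fields like Report-Msgid-Bugs-To and PO-Revision-Date, whose values contain colons themselves.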

Of the presented comments and fields, almost all are set when the PO file is translated for the first time. When you come back to a certain PO file to update the translation, if no one else worked on it in the meantime, you should only update the PO-Revision-Date field. If someone else has worked on it, you will also have to put your data in the Last-Translator field. If you get to work on a PO file for the first time after someone else has already worked on it, you should add yourself to the translator list in the comments. (If you are using a dedicated PO editor, it will perform all these updates for you whenever you save the file.)
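The field updates described above could be sketched in Python as follows. This is only an illustration of the rule, not how any particular editor does it: the function touch_header and the names and addresses are hypothetical, and the translator list in the comments is left out.

```python
from datetime import datetime, timedelta, timezone


def touch_header(fields: dict, me: str, when: datetime) -> dict:
    """Return a copy of the header fields updated after an editing session."""
    updated = dict(fields)
    # The revision date is refreshed on every update, in the format
    # seen in the headers above (e.g. 2008-07-22 18:13+0200).
    updated["PO-Revision-Date"] = when.strftime("%Y-%m-%d %H:%M%z")
    # If someone else worked on the file last, this replaces their data;
    # otherwise the value simply stays the same.
    updated["Last-Translator"] = me
    return updated


old = {
    "PO-Revision-Date": "2008-07-22 18:13+0200",
    "Last-Translator": "Some Translator <someone@example.org>",  # hypothetical
}
me = "Another Translator <another@example.org>"  # hypothetical
now = datetime(2008, 9, 5, 14, 30, tzinfo=timezone(timedelta(hours=2)))
new = touch_header(old, me, now)
```

Note that the datetime passed in must carry time zone information, otherwise the zone offset at the end of PO-Revision-Date would come out empty.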

Note that everything in the header is supposed to be in English, readable by anyone, not just by speakers of your language. Aside from comments being in English, this also means that the name of the language and the language team should be in English, and your own name and the names of other translators in their romanized equivalents. This is because, for example, people who do not speak your language (e.g. application maintainers) may need to contact you or your team about technical problems in the translation. Keep this in mind also when you are setting up your data in a PO editor.

Aside from standard header fields, you may encounter some custom ones, whose names begin with X-. These fields are added by various PO processing tools. One typical custom field is X-Generator, where the dedicated PO editor which you use will write its name and version. Another custom field sometimes seen is X-Accelerator-Marker, which states the character used as the accelerator marker (recognized by some tools e.g. for searching through PO files, when otherwise the accelerator marker could "mask" a word by being in the middle of it). Aside from these more general custom fields, different translation environments may add various environment-specific ones.

Representation in Editors

When you translate PO files using a plain text editor, all the message elements will be displayed in it as we have seen in the examples so far; you can edit them at will, including breaking the very syntax if you are not careful. Most capable text editors nowadays have syntax highlighting for the PO format, albeit with different levels of specificity. On the other hand, dedicated PO editors will provide you with much more automation, but each will have its own ways of presenting and means of editing different elements of a message.

This section will show how PO messages are represented in several widespread editors. Note this should not be understood as a review of PO editors in general, nor that any remarks are there to imply that one editor is better than the other. It merely serves to relate the elements of the PO format to what is seen in each editor.

Each editor is presented with a few remarks and one or more annotated screenshots. Message elements in the screenshot are marked with a black circle and a number in it, corresponding to the following:

  • (1) msgid field (original text)
  • (2) msgstr field (translated text)
  • (3) msgctxt field (disambiguating context)
  • (4) extracted comments (context as comment)
  • (5) source references (source file/line of the message)
  • (6) flags (fuzzy, *-format, etc.)
  • (7) fuzzy state (although among flags, usually gets special attention)
  • (8) previous-fields (msgctxt and msgid)
  • (9) translator comments (those which you add manually)
  • (10) position context (preceding and following messages)

For any message element not seen in the screenshot, a red circle with the corresponding number will be given in the lower right corner.

The following contrived message is used as the exemplar for the screenshots:

# Do we have a better translation for 'froobaz'?
#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.
#: contrivance.cpp:42
#, fuzzy, kde-format
#| msgctxt "control station: alpha"
#| msgid ""
#| "<p>Froobaz \"%1\" asks for attention.</p>\n"
#| "<p>Priority&nbsp;A message follows: <i>%2</i></p>"
msgctxt "control station: alpha"
msgid ""
"<p>Froobaz \"%1\" demands immediate attention.</p>\n"
"<p>Priority&nbsp;A message follows: <i>%2</i></p>"
msgstr ""
"<p>Frubaz „%1“ traži pažnju.</p>\n"
"<p>Poruka prioriteta&nbsp;A sledi: <i>%2</i></p>"

Aside from having all the numbered elements, this message contains various special constructs within the text (markup tags, an entity, placeholders, escaped quotes), which allows you to see the editor's highlighting capabilities within text fields as well. (We didn't choose a plural message, to avoid clutter; plural messages are a small part of all messages, and any dedicated PO editor will present them in a reasonable way, e.g. using tabs in the original and translation fields.)
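How the numbered elements relate to the raw lines of the message can be shown with a short Python sketch, which tells the elements apart by the way each line starts. It is a simplification for illustration only (the names classify and flags_of are made up), and not a full PO parser.

```python
# Two-character comment prefixes must come before the bare "#",
# so that they are matched first.
PREFIXES = [
    ("#.", "extracted comment"),
    ("#:", "source reference"),
    ("#,", "flags"),
    ("#|", "previous field"),
    ("#~", "obsolete message"),
    ("#", "translator comment"),
    ("msgctxt", "msgctxt field"),
    ("msgid", "msgid field"),
    ("msgstr", "msgstr field"),
    ('"', "continued string"),
]


def classify(line: str) -> str:
    """Name the message element that a line of a PO file belongs to."""
    for prefix, kind in PREFIXES:
        if line.startswith(prefix):
            return kind
    return "other"


def flags_of(line: str) -> list:
    """Split a flags line such as '#, fuzzy, kde-format' into flag names."""
    return [flag.strip() for flag in line[2:].split(",")]


message = [
    "# Do we have a better translation for 'froobaz'?",
    "#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.",
    "#: contrivance.cpp:42",
    "#, fuzzy, kde-format",
    '#| msgctxt "control station: alpha"',
    'msgctxt "control station: alpha"',
    'msgid ""',
    'msgstr ""',
]
kinds = [classify(line) for line in message]
is_fuzzy = "fuzzy" in flags_of(message[3])
```

The fuzzy state (7) is not a separate line in the file; it is just the presence of fuzzy among the flags, which is what is_fuzzy checks here.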

Kate

PO message in Kate 3.1.0

KWrite and Kate are KDE's standard pair of text editors, a simpler and a more advanced one, which share the same text editing component. The syntax highlighting for the PO format shown on the screenshot was introduced in version 3.1.0 of Kate (released with KDE 4.1.0), while earlier versions had simpler highlighting. However, the new PO highlighting definition works equally well with versions from 2.4.0 onwards, so you can fetch it if you are using an older Kate.

The embedded differences seen in previous-fields, text segments wrapped in {+...+} and {-...-}, can be produced by piping the PO file through the diff-previous sieve of Pology.

Lokalize

PO message in Lokalize 0.2

Lokalize is the new dedicated PO editor (a general translation application, in fact) for KDE 4, replacing KBabel in that role. The layout on the screenshot is only the default; you can rearrange the display and editing widgets in any way you like.

You can observe how Lokalize uses previous-fields to automatically show differences between the current and the previous original (lower left pane, number 8). In the translation editing pane (right center, numbers 2 and 7), the text of a fuzzy message is shown in italic, making it very easy for you to discern fuzzy from translated messages (though you can enable the more classical LEDs, like in KBabel).

Gtranslator

PO message in Gtranslator 1.1.8

Gtranslator is a dedicated PO editor for Gnome. Versions prior to the current 1.1.8 were not able to open a PO file with msgctxt fields (since these are the newest addition to the PO format). The current stable version will open such files, but it will not display the content of msgctxt to the user (hence the red number 3 in the lower right). This is about to be implemented in the upcoming releases, as msgctxt is starting to get used in Gnome POs themselves.

Poedit

PO message in Poedit 1.4.1

Poedit is a multiplatform dedicated PO editor. It supports translation memories and plural forms. It can open PO files containing msgctxt fields, but does not display them to the user as of version 1.4.1. Source references can be seen by right-clicking on the message in the list.

Poedit's primary visual feature is its compact layout. It can also work in full-screen mode.

Contacting Authors

In the preceding text, we have mentioned several situations when you may want to get in contact with the authors of the content which you are translating. You could report typos and other problems in the original text, request addition of context (especially disambiguating contexts), point out when a plural message is needed, warn of sentences split through several messages, etc.

Obviously, you should contact the authors when you need something changed in the source (from which the PO template is produced and merged with your translated PO) to be able to translate a message properly. However, sometimes even if you can translate a given message just fine, there is still reason to request some modifications. For example, if you have understood the meaning of a difficult message only after you had looked into the code, you may want to tell the authors to add context into the PO file even if you yourself don't need it any more. Or, if there was a bad case of a split sentence which you were able to outmaneuver, you may nevertheless ask the authors to make it proper. The rationale for this is simple: if translators from different languages all help to improve messages at the source, they are efficiently helping each other out. While you handle the improvement of one message, other translators will have done the same for messages which will cross your path at a later point.

It depends on the translation environment which channels of communication are used for localization issues. For small projects you may simply contact the author directly by email; the contact address may even be given in the Report-Msgid-Bugs-To header field of the PO template. Larger projects will have a mailing list dedicated to localization and a bug tracker. In KDE, for example, you can either write directly to the mailing list, or file a bug report against the application which you are translating; in Gnome, filing a bug report seems to be the preferred way. In general, the less sure you are how the message should be improved, and how the change may affect other languages, the more reason to write to the mailing list, where the issue can be discussed with translators from other languages.

Once you get the correction through, bear in mind that it may not appear immediately (i.e. within the week or month) in the source, and hence in your merged PO. This is due to the so-called message freezes, the periods of time prior to the release of the source content (e.g. an application) when only changes of utmost urgency are accepted. Remember that modifying a message will make it fuzzy, which means untranslated for the consumer of the PO file. If a message were changed e.g. two days prior to the release, it would leave a day or less for dozens of language teams to update it. So, while the very next release may not contain the correction, one of those that follow will.