Localization/Concepts/PO Odyssey (pt BR): Difference between revisions

From KDE TechBase
m (Text replace - "<code cpp>" to "<syntaxhighlight lang="cpp">")
No edit summary
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
{{Template:I18n/Language Navigation Bar|Localization/Concepts/PO_Odyssey}}
 


{{LocalizationBrowser|
{{LocalizationBrowser|
Line 57: Line 57:
Without further ado, here is an excerpt from the middle of a PO file, showing three of the most basic messages, untranslated:
Without further ado, here is an excerpt from the middle of a PO file, showing three of the most basic messages, untranslated:


<code po>
<syntaxhighlight lang="text">
#: finddialog.cpp:38
#: finddialog.cpp:38
msgid "Globular Clusters"
msgid "Globular Clusters"
Line 73: Line 73:
Each message contains the keyword <tt>msgid</tt>, which is followed by the text in English, wrapped in double quotes. The keyword <tt>msgstr</tt> marks the string which is supposed to be the translation of the English one, also double-quoted. Thus, after you have gone through the PO file and added translations, these messages would read:
Each message contains the keyword <tt>msgid</tt>, which is followed by the text in English, wrapped in double quotes. The keyword <tt>msgstr</tt> marks the string which is supposed to be the translation of the English one, also double-quoted. Thus, after you have gone through the PO file and added translations, these messages would read:


<code po>
<syntaxhighlight lang="text">
#: finddialog.cpp:38
#: finddialog.cpp:38
msgid "Globular Clusters"
msgid "Globular Clusters"
Line 97: Line 97:
Each message above also contains the ''source reference comment'', which is the line starting with <tt>#:</tt>. It tells from which source code file of the application (or source document of any kind), and the exact line in it, the message has been extracted into the PO file. This piece of data may look strange at first -- of what use is it to translators, to merit inclusion in the PO file? Since PO format has been developed for localizing free software, the source reference enables you to actually look up the message in the source file, when you need ''more context'' to translate a certain message. This does not require that you be a programmer too, as source code is sometimes readable enough to be able to reason about message context without real understanding of the code. For example, in some languages the text in title position is usually written in noun form, and it may not be apparent from the PO file alone if the message:
Each message above also contains the ''source reference comment'', which is the line starting with <tt>#:</tt>. It tells from which source code file of the application (or source document of any kind), and the exact line in it, the message has been extracted into the PO file. This piece of data may look strange at first -- of what use is it to translators, to merit inclusion in the PO file? Since PO format has been developed for localizing free software, the source reference enables you to actually look up the message in the source file, when you need ''more context'' to translate a certain message. This does not require that you be a programmer too, as source code is sometimes readable enough to be able to reason about message context without real understanding of the code. For example, in some languages the text in title position is usually written in noun form, and it may not be apparent from the PO file alone if the message:


<code po>
<syntaxhighlight lang="text">
#: addcatdialog.cpp:45
#: addcatdialog.cpp:45
msgid "Import Catalog"
msgid "Import Catalog"
Line 115: Line 115:
When a message is long or contains some logical line-breaks, its original and translation strings may be wrapped in the PO file (usually with boundary at column 80), such as this:
When a message is long or contains some logical line-breaks, its original and translation strings may be wrapped in the PO file (usually with boundary at column 80), such as this:


<code po>
<syntaxhighlight lang="text">
#: indimenu.cpp:96
#: indimenu.cpp:96
msgid ""
msgid ""
Line 125: Line 125:
This wrapping is entirely irrelevant in the environment where the message is used, be it in the application user interface, documentation, or elsewhere. PO processing tools produce wrapping mostly as a convenience to translators who would edit PO files with plain text editors. This means that you are free to wrap the translation (<tt>msgstr</tt> string) in the same way, differently, or not to wrap it at all--the result will be the same. Just remember to enclose each wrapped line in double quotes, as is done for <tt>msgid</tt>. For example, this translation of the previous message:
This wrapping is entirely irrelevant in the environment where the message is used, be it in the application user interface, documentation, or elsewhere. PO processing tools produce wrapping mostly as a convenience to translators who would edit PO files with plain text editors. This means that you are free to wrap the translation (<tt>msgstr</tt> string) in the same way, differently, or not to wrap it at all--the result will be the same. Just remember to enclose each wrapped line in double quotes, as is done for <tt>msgid</tt>. For example, this translation of the previous message:


<code po>
<syntaxhighlight lang="text">
#: indimenu.cpp:96
#: indimenu.cpp:96
msgid ""
msgid ""
Line 137: Line 137:
would be completely equivalent to this one:
would be completely equivalent to this one:


<code po>
<syntaxhighlight lang="text">
#: indimenu.cpp:96
#: indimenu.cpp:96
msgid ""
msgid ""
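The equivalence of differently wrapped strings can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular PO tool is implemented; the function name <tt>join_wrapped</tt> and the sample strings are hypothetical.

```python
import re

# Matches one double-quoted segment, allowing backslash escapes inside it.
_SEGMENT = re.compile(r'"((?:[^"\\]|\\.)*)"')


def join_wrapped(field_lines):
    """Concatenate the quoted segments that make up one msgid/msgstr field."""
    return "".join(
        match.group(1)
        for line in field_lines
        for match in _SEGMENT.finditer(line)
    )


# Two hypothetical wrappings of the same translation.
wrapped = ['msgstr ""', '"first part, "', '"second part"']
unwrapped = ['msgstr "first part, second part"']

print(join_wrapped(wrapped) == join_wrapped(unwrapped))
```

Only the text inside the quotes is kept, so where the line breaks fall in the PO file has no effect on the resulting string.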
Line 155: Line 155:
The second consequence is that there cannot be two messages with the same <tt>msgid</tt> in the same PO file (again, not exactly true, see later). If the same text has been used two or more times in the source, then in the PO file it will appear as a single message, with its source reference comment (<tt>#:</tt>) listing all appearances. For example, the source reference of this message:
The second consequence is that there cannot be two messages with the same <tt>msgid</tt> in the same PO file (again, not exactly true, see later). If the same text has been used two or more times in the source, then in the PO file it will appear as a single message, with its source reference comment (<tt>#:</tt>) listing all appearances. For example, the source reference of this message:


<code po>
<syntaxhighlight lang="text">
#: colorscheme.cpp:79 skycomponents/equator.cpp:31
#: colorscheme.cpp:79 skycomponents/equator.cpp:31
msgid "Equator"
msgid "Equator"
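To make the structure of the source reference comment concrete, here is a minimal Python sketch that splits such a line into file and line-number pairs. The function name <tt>parse_source_refs</tt> is hypothetical, and the sketch assumes that file paths contain no spaces.

```python
def parse_source_refs(comment_line):
    """Split a '#:' comment into a list of (file, line) pairs."""
    if not comment_line.startswith("#:"):
        raise ValueError("not a source reference comment")
    refs = []
    for token in comment_line[2:].split():
        # Split at the last colon, so the path itself may contain colons.
        path, _, lineno = token.rpartition(":")
        refs.append((path, int(lineno)))
    return refs


print(parse_source_refs("#: colorscheme.cpp:79 skycomponents/equator.cpp:31"))
```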
Line 173: Line 173:
One simple way to keep track of the context, when translating a given message, is to keep several messages before and after it in sight. As a trivial example, the following four messages:
One simple way to keep track of the context, when translating a given message, is to keep several messages before and after it in sight. As a trivial example, the following four messages:


<code po>
<syntaxhighlight lang="text">
#: locationdialog.cpp:228
#: locationdialog.cpp:228
msgid "Really override original data for this city?"
msgid "Really override original data for this city?"
Line 199: Line 199:
One place where messages store explicit context provided by the authors is within ''extracted comments'', those which start with <tt>#.</tt>. For example, the message:
One place where messages store explicit context provided by the authors is within ''extracted comments'', those which start with <tt>#.</tt>. For example, the message:


<code po>
<syntaxhighlight lang="text">
#. i18n: A classical test phrase, with all letters of the English alphabet.
#. i18n: A classical test phrase, with all letters of the English alphabet.
#. Replace it with a sample text in your language, such that it is
#. Replace it with a sample text in your language, such that it is
Line 214: Line 214:
Extracted comments can sometimes be provided not by a human author, but by a tool used to create or process PO files. For example, when markup-text documents are translated, such as HTML, or Docbook for documentation, the extracted comment frequently states the tag which wraps the text in the original document:
Extracted comments can sometimes be provided not by a human author, but by a tool used to create or process PO files. For example, when markup-text documents are translated, such as HTML, or Docbook for documentation, the extracted comment frequently states the tag which wraps the text in the original document:


<code po>
<syntaxhighlight lang="text">
#. Tag: title
#. Tag: title
#: skycoords.docbook:73
#: skycoords.docbook:73
Line 225: Line 225:
Another example where processing tools may provide extracted comments is when the PO file is created in a slightly roundabout way, such that source references in some messages do not really point to the source file, but to a temporary file which existed only during the creation of the PO file. To patch up a bit, the extracted comment may then state the true source:
Another example where processing tools may provide extracted comments is when the PO file is created in a slightly roundabout way, such that source references in some messages do not really point to the source file, but to a temporary file which existed only during the creation of the PO file. To patch up a bit, the extracted comment may then state the true source:


<code po>
<syntaxhighlight lang="text">
#. i18n: file: tools/observinglist.ui:263
#. i18n: file: tools/observinglist.ui:263
#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)
#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)
Line 239: Line 239:
Consider the following two messages from an application user interface:
Consider the following two messages from an application user interface:


<code po>
<syntaxhighlight lang="text">
#. i18n: First letter in 'Scope'
#. i18n: First letter in 'Scope'
#: tools/observinglist.cpp:700
#: tools/observinglist.cpp:700
Line 253: Line 253:
At first sight, you could say that it was nice of the programmer to add explicit context (<tt>#. i18n: ...</tt> lines), informing that the 'S' of the first message is short for 'Scope', and the 'S' of the second message short for 'South', so that translators know that they should use the letters corresponding to these words in their languages. But, can you spot the problem? The problem is that these messages cannot be part of a valid PO file, since, as said earlier, all messages have unique <tt>msgid</tt> strings. Instead, in a real PO file, these two messages would be collapsed into one:
At first sight, you could say that it was nice of the programmer to add explicit context (<tt>#. i18n: ...</tt> lines), informing that the 'S' of the first message is short for 'Scope', and the 'S' of the second message short for 'South', so that translators know that they should use the letters corresponding to these words in their languages. But, can you spot the problem? The problem is that these messages cannot be part of a valid PO file, since, as said earlier, all messages have unique <tt>msgid</tt> strings. Instead, in a real PO file, these two messages would be collapsed into one:


<code po>
<syntaxhighlight lang="text">
#. i18n: First letter in 'Scope'
#. i18n: First letter in 'Scope'
#. i18n: South
#. i18n: South
Line 265: Line 265:
In these situations, the programmer can give messages a different type of context, called ''disambiguating context''. These contexts are no longer presented as extracted comments, but through a full-fledged keyword string, the <tt>msgctxt</tt>:
In these situations, the programmer can give messages a different type of context, called ''disambiguating context''. These contexts are no longer presented as extracted comments, but through a full-fledged keyword string, the <tt>msgctxt</tt>:


<code po>
<syntaxhighlight lang="text">
#: tools/observinglist.cpp:700
#: tools/observinglist.cpp:700
msgctxt "First letter in 'Scope'"
msgctxt "First letter in 'Scope'"
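One way to picture the effect of <tt>msgctxt</tt> is to think of a PO file as a lookup table keyed by the pair of context and original text, rather than by the original text alone. The Python sketch below is only a model of that idea; the names, the second context string and the placeholder translations are assumptions made for illustration.

```python
def add_message(catalog, msgid, msgstr, msgctxt=None):
    """Store a translation under the (msgctxt, msgid) pair."""
    key = (msgctxt, msgid)
    if key in catalog:
        raise KeyError(f"duplicate message: {key!r}")
    catalog[key] = msgstr


catalog = {}
# Same msgid, different contexts: both messages can coexist.
add_message(catalog, "S", "<letter for 'Scope'>", msgctxt="First letter in 'Scope'")
add_message(catalog, "S", "<letter for 'South'>", msgctxt="South")
print(len(catalog))

# Without contexts, the second 'S' would collide with the first.
add_message(catalog, "S", "<some letter>")
try:
    add_message(catalog, "S", "<another letter>")
except KeyError as error:
    print("rejected:", error)
```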
Line 281: Line 281:
A rather frequent case where disambiguating contexts are needed is when the original text is a single English adjective, used in several places in the source:
A rather frequent case where disambiguating contexts are needed is when the original text is a single English adjective, used in several places in the source:


<code po>
<syntaxhighlight lang="text">
#: utils/kateautoindent.cpp:78 utils/katestyletreewidget.cpp:132
#: utils/kateautoindent.cpp:78 utils/katestyletreewidget.cpp:132
msgid "Normal"
msgid "Normal"
Line 289: Line 289:
Many languages need to match an adjective form to the noun to which it refers by gender, so if the 'Normal' above refers both to indentation mode and text style, it is almost certainly necessary to provide disambiguating contexts:
Many languages need to match an adjective form to the noun to which it refers by gender, so if the 'Normal' above refers both to indentation mode and text style, it is almost certainly necessary to provide disambiguating contexts:


<code po>
<syntaxhighlight lang="text">
#: utils/katestyletreewidget.cpp:132
#: utils/katestyletreewidget.cpp:132
msgctxt "Text style"
msgctxt "Text style"
Line 305: Line 305:
As of the moment of this writing, the <tt>msgctxt</tt> keyword is a relatively fresh addition to the PO format. But the need for disambiguating contexts was observed much earlier, and different translation environments have historically used different custom solutions to provide them. Such older PO files are still to be found around in good numbers, so it makes sense to present a few examples of the custom contexts. Since messages had to be unique by <tt>msgid</tt> alone before the <tt>msgctxt</tt> keyword was introduced, the context had to become part of the <tt>msgid</tt> itself, embedded in it with some special syntax. If we take the first message from the previous example, here is what it would look like in a KDE3 PO file:
As of the moment of this writing, the <tt>msgctxt</tt> keyword is a relatively fresh addition to the PO format. But the need for disambiguating contexts was observed much earlier, and different translation environments have historically used different custom solutions to provide them. Such older PO files are still to be found around in good numbers, so it makes sense to present a few examples of the custom contexts. Since messages had to be unique by <tt>msgid</tt> alone before the <tt>msgctxt</tt> keyword was introduced, the context had to become part of the <tt>msgid</tt> itself, embedded in it with some special syntax. If we take the first message from the previous example, here is what it would look like in a KDE3 PO file:


<code po>
<syntaxhighlight lang="text">
#: utils/katestyletreewidget.cpp:132
#: utils/katestyletreewidget.cpp:132
msgid ""
msgid ""
Line 315: Line 315:
The disambiguating context has been embedded at the beginning of the <tt>msgid</tt>, wrapped in <tt>_⁠: ...\n</tt> (the <tt>msgid</tt> string itself is shown broken into two lines, as PO tools wrap strings at <tt>\n</tt> regardless of their length; more on this special character sequence later). In Gnome, the same message would look something like this:
The disambiguating context has been embedded at the beginning of the <tt>msgid</tt>, wrapped in <tt>_⁠: ...\n</tt> (the <tt>msgid</tt> string itself is shown broken into two lines, as PO tools wrap strings at <tt>\n</tt> regardless of their length; more on this special character sequence later). In Gnome, the same message would look something like this:


<code po>
<syntaxhighlight lang="text">
#: utils/gatestyletreewidget.c:132
#: utils/gatestyletreewidget.c:132
msgid "Text style|Normal"
msgid "Text style|Normal"
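The two legacy conventions can be told apart mechanically, as the following Python sketch suggests. It works on the already unescaped <tt>msgid</tt> text (so <tt>\n</tt> is a real newline here), the function name <tt>split_legacy_context</tt> is hypothetical, and the test for the Gnome style is a rough heuristic: an ordinary message that happens to contain a <tt>|</tt> would be misread.

```python
def split_legacy_context(msgid):
    """Return (context, text) for a msgid with an embedded legacy context."""
    # KDE 3 style: "_: <context>\n<text>"
    if msgid.startswith("_: ") and "\n" in msgid:
        context, text = msgid[3:].split("\n", 1)
        return context, text
    # Gnome style: "<context>|<text>"
    if "|" in msgid:
        context, text = msgid.split("|", 1)
        return context, text
    # No embedded context found.
    return None, msgid


print(split_legacy_context("_: Text style\nNormal"))
print(split_legacy_context("Text style|Normal"))
print(split_legacy_context("Normal"))
```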
Line 331: Line 331:
For these situations, you can write down your own reminders, doubts, inferred contexts, etc. in another type of comment, the ''translator comment''. These comments start simply with <tt># </tt> (hash and space), followed by any text whatsoever, and as with other comments, there may be any number of them. A hypothetical example:
For these situations, you can write down your own reminders, doubts, inferred contexts, etc. in another type of comment, the ''translator comment''. These comments start simply with <tt># </tt> (hash and space), followed by any text whatsoever, and as with other comments, there may be any number of them. A hypothetical example:


<code po>
<syntaxhighlight lang="text">
# Wikipedia says that ‘etrurski’ is our name for this script.
# Wikipedia says that ‘etrurski’ is our name for this script.
#: viewpart/UnicodeBlocks.h:151
#: viewpart/UnicodeBlocks.h:151
Line 352: Line 352:
When a file manager shows a message like ''Really delete file tmp10.txt?'' or ''Open with KWrite'', the 'tmp10.txt' and 'KWrite' parts certainly had to be added to the rest of the message at runtime. In such cases, the original text as seen by the translator will contain ''format directives'', substrings which an application will replace with the appropriate arguments to construct the message as shown to the user. For example:
When a file manager shows a message like ''Really delete file tmp10.txt?'' or ''Open with KWrite'', the 'tmp10.txt' and 'KWrite' parts certainly had to be added to the rest of the message at runtime. In such cases, the original text as seen by the translator will contain ''format directives'', substrings which an application will replace with the appropriate arguments to construct the message as shown to the user. For example:


<code po>
<syntaxhighlight lang="text">
#: skycomponents/constellationlines.cpp:106
#: skycomponents/constellationlines.cpp:106
#, kde-format
#, kde-format
Line 363: Line 363:
Format directives differ across source environments, but are usually easy to recognize. The message above, if found in a Gnome application, would look like:
Format directives differ across source environments, but are usually easy to recognize. The message above, if found in a Gnome application, would look like:


<code po>
<syntaxhighlight lang="text">
#: skycomponents/constellationlines.cpp:106
#: skycomponents/constellationlines.cpp:106
#, c-format
#, c-format
Line 374: Line 374:
For one more example, to illustrate the diversity of format directives, if the application had been written in Python the message could look like:
For one more example, to illustrate the diversity of format directives, if the application had been written in Python the message could look like:


<code po>
<syntaxhighlight lang="text">
#: skycomponents/constellationlines.cpp:106
#: skycomponents/constellationlines.cpp:106
#, python-format
#, python-format
Line 387: Line 387:
One situation that may require modification of directives is when there are several of them, and they need to be ordered differently in the translation:
One situation that may require modification of directives is when there are several of them, and they need to be ordered differently in the translation:


<code po>
<syntaxhighlight lang="text">
#: kxsldbgpart/libxsldbg/xsldbg.cpp:256
#: kxsldbgpart/libxsldbg/xsldbg.cpp:256
#, kde-format
#, kde-format
Line 396: Line 396:
With KDE format directives, which are numbered, reordering is as simple as above. The same holds for the mentioned Python format, where directives are named. But for formats where directives are neither numbered nor named by default, like in C format (where they only state argument type), you can sometimes modify directives to the desired effect:
With KDE format directives, which are numbered, reordering is as simple as above. The same holds for the mentioned Python format, where directives are named. But for formats where directives are neither numbered nor named by default, like in C format (where they only state argument type), you can sometimes modify directives to the desired effect:


<code po>
<syntaxhighlight lang="text">
#: gxsldbgpart/libxsldbg/xsldbg.c:256
#: gxsldbgpart/libxsldbg/xsldbg.c:256
#, c-format
#, c-format
Line 405: Line 405:
If the directives are numbered or named, and there is more than one same-number or same-name directive, usually any of the duplicates can be dropped in the translation. This may be useful in a longer text, e.g. when in the translation a pronoun can be used instead of repeating the argument:
If the directives are numbered or named, and there is more than one same-number or same-name directive, usually any of the duplicates can be dropped in the translation. This may be useful in a longer text, e.g. when in the translation a pronoun can be used instead of repeating the argument:


<code po>
<syntaxhighlight lang="text">
#: hypothetical.cpp:100
#: hypothetical.cpp:100
#, kde-format
#, kde-format
Line 416: Line 416:
Sometimes the programmer may not use a directive to substitute an argument at runtime, but instead concatenate the full text out of separate messages:
Sometimes the programmer may not use a directive to substitute an argument at runtime, but instead concatenate the full text out of separate messages:


<code po>
<syntaxhighlight lang="text">
#: hypothetical.cpp:100
#: hypothetical.cpp:100
msgid "No star named "
msgid "No star named "
Line 434: Line 434:
The following messages show typical examples of markup in application user interface:
The following messages show typical examples of markup in application user interface:


<code po>
<syntaxhighlight lang="text">
#: rc.cpp:1632 rc.cpp:3283
#: rc.cpp:1632 rc.cpp:3283
msgid "<b>Name:</b>"
msgid "<b>Name:</b>"
Line 455: Line 455:
Another frequent XML-like markup is used in documentation POs, which are in KDE (and Gnome, and many other environments) mostly written in the DocBook XML format:
Another frequent XML-like markup is used in documentation POs, which are in KDE (and Gnome, and many other environments) mostly written in the DocBook XML format:


<code po>
<syntaxhighlight lang="text">
#. Tag: title
#. Tag: title
#: blackbody.docbook:13
#: blackbody.docbook:13
Line 481: Line 481:
In application interface POs, quite frequently there are parts in original text that may look somewhat like XML-like markup, for example:
In application interface POs, quite frequently there are parts in original text that may look somewhat like XML-like markup, for example:


<code po>
<syntaxhighlight lang="text">
#: utils/katecmds.cpp:180
#: utils/katecmds.cpp:180
#, kde-format
#, kde-format
Line 491: Line 491:
There are also non-XML-like markups that tend to pop up for translation. One could be wiki markup, such as that of this very article:
There are also non-XML-like markups that tend to pop up for translation. One could be wiki markup, such as that of this very article:


<code po>
<syntaxhighlight lang="text">
#: poformat.txt:191
#: poformat.txt:191
msgid "=== Extracted Comments ==="
msgid "=== Extracted Comments ==="
Line 505: Line 505:
where <tt>===...===</tt> is the approximate equivalent of HTML's <tt>&lt;h2&gt;...&lt;/h2&gt;</tt>, while <tt><nowiki>''...''</nowiki></tt> is the counterpart of <tt>&lt;i&gt;...&lt;/i&gt;</tt>. Another markup type is the source language for man pages, ''troff'':
where <tt>===...===</tt> is the approximate equivalent of HTML's <tt>&lt;h2&gt;...&lt;/h2&gt;</tt>, while <tt><nowiki>''...''</nowiki></tt> is the counterpart of <tt>&lt;i&gt;...&lt;/i&gt;</tt>. Another markup type is the source language for man pages, ''troff'':


<code po>
<syntaxhighlight lang="text">
# type: Plain text
# type: Plain text
#: ../../doc/man/wesnoth.6:55
#: ../../doc/man/wesnoth.6:55
Line 522: Line 522:
There are a few special characters which cannot appear verbatim in the <tt>msgid</tt> or <tt>msgstr</tt> fields. For one, consider the plain double quote (<tt>"</tt>): since it is used to delimit field strings, a raw double quote inside the text would terminate the string prematurely, and invalidate the message syntax. Such characters are therefore written as ''escape sequences'', a combination of the backslash and another character, which is interpreted into an appropriate single character when showing the text to users. The plain double quote is written as <tt>\"</tt>:
There are a few special characters which cannot appear verbatim in the <tt>msgid</tt> or <tt>msgstr</tt> fields. For one, consider the plain double quote (<tt>"</tt>): since it is used to delimit field strings, a raw double quote inside the text would terminate the string prematurely, and invalidate the message syntax. Such characters are therefore written as ''escape sequences'', a combination of the backslash and another character, which is interpreted into an appropriate single character when showing the text to users. The plain double quote is written as <tt>\"</tt>:


<code po>
<syntaxhighlight lang="text">
#: kstars_i18n.cpp:3591
#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgid "The \"face\" on Mars"
Line 530: Line 530:
Another frequent escaped character is the newline, presented as <tt>\n</tt>:
Another frequent escaped character is the newline, presented as <tt>\n</tt>:


<code po>
<syntaxhighlight lang="text">
#: kstarsinit.cpp:699
#: kstarsinit.cpp:699
msgid ""
msgid ""
Line 546: Line 546:
Going back to double quotes, keep in mind that while English original usually uses plain ASCII quotes, translations tend to use "fancy" quotes according to the orthography of the language:
Going back to double quotes, keep in mind that while English original usually uses plain ASCII quotes, translations tend to use "fancy" quotes according to the orthography of the language:


<code po>
<syntaxhighlight lang="text">
#: kstars_i18n.cpp:3591
#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgid "The \"face\" on Mars"
Line 558: Line 558:
In application interfaces, short texts on widgets used to perform an action or open a dialog frequently have one letter in them underlined. This indicates that when the user presses the Alt key and that letter, the corresponding action will be activated. Such letters are called ''accelerators'', and they are usually selected in the translation by preceding them with a special character for that purpose, the ''accelerator marker'':
In application interfaces, short texts on widgets used to perform an action or open a dialog frequently have one letter in them underlined. This indicates that when the user presses the Alt key and that letter, the corresponding action will be activated. Such letters are called ''accelerators'', and they are usually selected in the translation by preceding them with a special character for that purpose, the ''accelerator marker'':


<code po>
<syntaxhighlight lang="text">
#: kstarsinit.cpp:163
#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgid "Set Focus &Manually..."
Line 570: Line 570:
CJK languages use input methods different to alphabet-type ones (keyboard layouts), so instead of assigning an ideogram as the accelerator, they add a single English letter for that purpose:
CJK languages use input methods different to alphabet-type ones (keyboard layouts), so instead of assigning an ideogram as the accelerator, they add a single English letter for that purpose:


<code po>
<syntaxhighlight lang="text">
#: kstarsinit.cpp:163
#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgid "Set Focus &Manually..."
Line 582: Line 582:
Since the accelerator marker is typically not a rarely used character, it may appear in contexts in which it does not mark an accelerator. For example:
Since the accelerator marker is typically not a rarely used character, it may appear in contexts in which it does not mark an accelerator. For example:


<code po>
<syntaxhighlight lang="text">
#: kspopupmenu.cpp:203
#: kspopupmenu.cpp:203
msgid "Center && Track"
msgid "Center && Track"
Line 599: Line 599:
Applications frequently need to report to the user the number of objects in a given context: "10 files found", "Do you really want to delete 5 messages?" etc. Of course, in English such messages should also have singular counterparts, like "1 file found", "...delete 1 message?". This means that two separate English texts are needed in the PO file, one covering the singular, and another the plural case. You could assume that these would then be two messages, like in this hypothetical example:
Applications frequently need to report to the user the number of objects in a given context: "10 files found", "Do you really want to delete 5 messages?" etc. Of course, in English such messages should also have singular counterparts, like "1 file found", "...delete 1 message?". This means that two separate English texts are needed in the PO file, one covering the singular, and another the plural case. You could assume that these would then be two messages, like in this hypothetical example:


<code po>
<syntaxhighlight lang="text">
#: hypothetical.cpp:100
#: hypothetical.cpp:100
#, kde-format
#, kde-format
Line 617: Line 617:
To handle this diversity, the PO format implements ''plural messages''. The example above in reality looks like this:
To handle this diversity, the PO format implements ''plural messages''. The example above in reality looks like this:


<code po>
<syntaxhighlight lang="text">
#: mainwindow.cpp:127
#: mainwindow.cpp:127
#, kde-format
#, kde-format
Line 628: Line 628:
The English singular form is given by the <tt>msgid</tt> field, and the plural form by the <tt>msgid_plural</tt> field. There are now several <tt>msgstr</tt> fields, with zero-based indices in square brackets, so that you can write as many translations as there are plural forms in your language. By default there will be two <tt>msgstr</tt> fields, but you may simply insert the line with the third one (index 2), and so on. Then, the Spanish translation, which has the same plural forms as English, looks like:
The English singular form is given by the <tt>msgid</tt> field, and the plural form by the <tt>msgid_plural</tt> field. There are now several <tt>msgstr</tt> fields, with zero-based indices in square brackets, so that you can write as many translations as there are plural forms in your language. By default there will be two <tt>msgstr</tt> fields, but you may simply insert the line with the third one (index 2), and so on. Then, the Spanish translation, which has the same plural forms as English, looks like:


<code po>
<syntaxhighlight lang="text">
#: mainwindow.cpp:127
#: mainwindow.cpp:127
#, kde-format
#, kde-format
Line 639: Line 639:
while the Polish translation, which needs three plural forms, is:
while the Polish translation, which needs three plural forms, is:


<code po>
<syntaxhighlight lang="text">
#: mainwindow.cpp:127
#: mainwindow.cpp:127
#, kde-format
#, kde-format
Line 667: Line 667:
At the time when KDE was introducing plural messages, PO format's native support for them was still very new. Thus, similarly to disambiguating contexts, in KDE 3 plural messages were embedded in the ordinary messages. Since you may still get to translate a few stray KDE3 PO files, here is what the previously shown Polish-translated message would look like in one:
At the time when KDE was introducing plural messages, PO format's native support for them was still very new. Thus, similarly to disambiguating contexts, in KDE 3 plural messages were embedded in the ordinary messages. Since you may still get to translate a few stray KDE3 PO files, here is what the previously shown Polish-translated message would look like in one:


<code po>
<syntaxhighlight lang="text">
#: mainwindow.cpp:127
#: mainwindow.cpp:127
msgid ""
msgid ""
Line 684: Line 684:
Quite frequently the English singular form will omit the number, that is, only the plural form will contain the format directive for the number:
Quite frequently the English singular form will omit the number, that is, only the plural form will contain the format directive for the number:


<code po>
<syntaxhighlight lang="text">
#: modes/typesdialog.cpp:425
#: modes/typesdialog.cpp:425
#, kde-format
#, kde-format
Line 697: Line 697:
On rare occasions a plural message will have no number in either English singular or plural, when the programmer merely wanted to choose between the forms for "one" and "several". This is perfectly valid:
On rare occasions a plural message will have no number in either English singular or plural, when the programmer merely wanted to choose between the forms for "one" and "several". This is perfectly valid:


<code po>
<syntaxhighlight lang="text">
#: kgpg.cpp:498
#: kgpg.cpp:498
msgid "Decryption of this file failed:"
msgid "Decryption of this file failed:"
Line 717: Line 717:
In general, merged PO files contain three categories of messages. First are those messages which were present in the PO file when you last worked on it, in the sense of having unchanged <tt>msgctxt</tt> and <tt>msgid</tt> fields since then. As expected, their translations (<tt>msgstr</tt> fields) are as you left them, so there is nothing new for you to do about these messages. The second category are entirely new messages, added in the source in the meantime, which you should now translate. New messages won't be added in an arbitrary way, for example simply appended to the end of the PO file. Instead they will be interspersed with translated messages, following the order of appearance of messages in the current source. This allows you to infer contexts by considering the preceding and following messages, same as you did when you were translating the PO from scratch. For example:
In general, merged PO files contain three categories of messages. First are those messages which were present in the PO file when you last worked on it, in the sense of having unchanged <tt>msgctxt</tt> and <tt>msgid</tt> fields since then. As expected, their translations (<tt>msgstr</tt> fields) are as you left them, so there is nothing new for you to do about these messages. The second category are entirely new messages, added in the source in the meantime, which you should now translate. New messages won't be added in an arbitrary way, for example simply appended to the end of the PO file. Instead they will be interspersed with translated messages, following the order of appearance of messages in the current source. This allows you to infer contexts by considering the preceding and following messages, same as you did when you were translating the PO from scratch. For example:


<code po>
<syntaxhighlight lang="text">
#: fitshistogram.cpp:347
#: fitshistogram.cpp:347
msgid "Auto Scale"
msgid "Auto Scale"
Line 737: Line 737:
The most interesting, however, is the third category of messages in a merged PO file. These are the old messages which were somewhat modified in the meantime, i.e. one or both of their <tt>msgctxt</tt> and <tt>msgid</tt> fields have changed. Or, this can also be a new message, but very similar to one of the old ones. There is actually no way to tell between the two, it is only by similarity to one of the old messages that a modified or new message falls into this category. Either way, such a message is called ''fuzzy'', and looks like this:
The most interesting, however, is the third category of messages in a merged PO file. These are the old messages which were somewhat modified in the meantime, i.e. one or both of their <tt>msgctxt</tt> and <tt>msgid</tt> fields have changed. Or, this can also be a new message, but very similar to one of the old ones. There is actually no way to tell between the two, it is only by similarity to one of the old messages that a modified or new message falls into this category. Either way, such a message is called ''fuzzy'', and looks like this:


<code po>
<syntaxhighlight lang="text">
#: src/somwidget_impl.cpp:120
#: src/somwidget_impl.cpp:120
#, fuzzy
#, fuzzy
Line 747: Line 747:
The <tt>fuzzy</tt> flag states that the message is fuzzy. The comment starting with <tt>#|</tt> is called ''previous-field comment'', as it contains the previous value of the <tt>msgid</tt> field, which corresponds to the translation as given by the <tt>msgstr</tt>. This translation is, however, not valid for the ''current'' (non-commented) <tt>msgid</tt> field. By comparing the previous and current <tt>msgid</tt>, you can see that the word "boiling" was replaced with "melting", and you can adjust the translation accordingly. Once you did that, to ''unfuzzy'' the message you should remove the <tt>fuzzy</tt> flag and previous field (<tt>#|</tt>) comments, so that the final updated message is:
The <tt>fuzzy</tt> flag states that the message is fuzzy. The comment starting with <tt>#|</tt> is called ''previous-field comment'', as it contains the previous value of the <tt>msgid</tt> field, which corresponds to the translation as given by the <tt>msgstr</tt>. This translation is, however, not valid for the ''current'' (non-commented) <tt>msgid</tt> field. By comparing the previous and current <tt>msgid</tt>, you can see that the word "boiling" was replaced with "melting", and you can adjust the translation accordingly. Once you did that, to ''unfuzzy'' the message you should remove the <tt>fuzzy</tt> flag and previous field (<tt>#|</tt>) comments, so that the final updated message is:


<code po>
<syntaxhighlight lang="text">
#: src/somwidget_impl.cpp:120
#: src/somwidget_impl.cpp:120
msgid "Elements with melting point around this temperature:"
msgid "Elements with melting point around this temperature:"
Line 755: Line 755:
The previous-field comments are also a relatively newer addition to the PO format, so that in some translation environments you will not see them in merged POs. The fuzzy message would then be presented only with the ''fuzzy'' flag:
The previous-field comments are also a relatively newer addition to the PO format, so that in some translation environments you will not see them in merged POs. The fuzzy message would then be presented only with the ''fuzzy'' flag:


<code po>
<syntaxhighlight lang="text">
#: src/somwidget_impl.cpp:120
#: src/somwidget_impl.cpp:120
#, fuzzy
#, fuzzy
Line 766: Line 766:
Aside from <tt>msgid</tt>, the <tt>msgctxt</tt> field can also feature in the previous-field comment. Whether one or both of the <tt>msgctxt</tt> and <tt>msgid</tt> have been changed, both will be given in previous-field comments:
Aside from <tt>msgid</tt>, the <tt>msgctxt</tt> field can also feature in the previous-field comment. Whether one or both of the <tt>msgctxt</tt> and <tt>msgid</tt> have been changed, both will be given in previous-field comments:


<code po>
<syntaxhighlight lang="text">
#: kstarsinit.cpp:451
#: kstarsinit.cpp:451
#, fuzzy
#, fuzzy
Line 778: Line 778:
But in particular, a message will be fuzzied if it previously had no <tt>msgctxt</tt> and got one after merging, or had one and lost it. In the first case, the previous-field comments will contain only the <tt>msgid</tt>, although it may be the same as the current one; by this you will know that the change was only the adding of context. In the second case, the previous-field comments will contain both the <tt>msgctxt</tt> and the <tt>msgid</tt> fields, while there will be no current <tt>msgctxt</tt>. Here are the two examples:
But in particular, a message will be fuzzied if it previously had no <tt>msgctxt</tt> and got one after merging, or had one and lost it. In the first case, the previous-field comments will contain only the <tt>msgid</tt>, although it may be the same as the current one; by this you will know that the change was only the adding of context. In the second case, the previous-field comments will contain both the <tt>msgctxt</tt> and the <tt>msgid</tt> fields, while there will be no current <tt>msgctxt</tt>. Here are the two examples:


<code po>
<syntaxhighlight lang="text">
#: kstarsinit.cpp:444
#: kstarsinit.cpp:444
#, fuzzy
#, fuzzy
Line 814: Line 814:
The very first message in each PO file is not a real message, but the ''header'', which records many administrative and technical pieces of information about the PO file. Here is one pristine header, before any translation on the PO file has been done:
The very first message in each PO file is not a real message, but the ''header'', which records many administrative and technical pieces of information about the PO file. Here is one pristine header, before any translation on the PO file has been done:


<code po>
<syntaxhighlight lang="text">
# SOME DESCRIPTIVE TITLE.
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR This_file_is_part_of_KDE
# Copyright (C) YEAR This_file_is_part_of_KDE
Line 837: Line 837:
The header consists of introductory comments, followed by the empty <tt>msgid</tt>, and by the <tt>msgstr</tt> which contains ''header fields''. The header comments, similar to those of normal messages, are not entirely free form, but have some structure to them. The <tt>msgstr</tt> is divided by newlines (<tt>\n</tt>) into fields of <tt>name: value</tt> form (name of the piece of information and the information itself). Although the header is pristine, some of the environment-dependent values are typically already supplied, e.g. wherever KDE is mentioned above. The <tt>fuzzy</tt> flag tells that the PO file has not been translated earlier. All-uppercase text segments are placeholders which you should replace with real values. The header updated to reflect the translation state could look like this:
The header consists of introductory comments, followed by the empty <tt>msgid</tt>, and by the <tt>msgstr</tt> which contains ''header fields''. The header comments, similar to those of normal messages, are not entirely free form, but have some structure to them. The <tt>msgstr</tt> is divided by newlines (<tt>\n</tt>) into fields of <tt>name: value</tt> form (name of the piece of information and the information itself). Although the header is pristine, some of the environment-dependent values are typically already supplied, e.g. wherever KDE is mentioned above. The <tt>fuzzy</tt> flag tells that the PO file has not been translated earlier. All-uppercase text segments are placeholders which you should replace with real values. The header updated to reflect the translation state could look like this:


<code po>
<syntaxhighlight lang="text">
# Translation of kstars.po into Spanish.
# Translation of kstars.po into Spanish.
# This file is distributed under the same license as the kdeedu package.
# This file is distributed under the same license as the kdeedu package.
Line 900: Line 900:
The following contrived message is used as the exemplar for the screenshots:
The following contrived message is used as the exemplar for the screenshots:


<code po>
<syntaxhighlight lang="text">
# Do we have a better translation for 'froobaz'?
# Do we have a better translation for 'froobaz'?
#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.
#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.

Latest revision as of 16:12, 15 July 2012


The PO Format
On Localization   Concepts
Prerequisites   Text Encoding
Related Articles   XML Markup, Gettext Tools
External Reading   Gettext Manual

Before You Start Translating

Before going into the details of the PO format (or of any other format), it is useful to examine the conceptual paths a text may follow from the author, through the translator, to the reader. Let us call this sequence "the translation pipeline", and consider the following example:

  • the author prepares a text document, using OpenOffice Writer, for example;
  • the translator translates the document, also in Writer, replacing each paragraph of the original text with its translation;
  • the user reads the translated document, as a PDF created by Writer.

Simple, clear, and everyone is happy, right? Wrong! This is the pipeline that people imagine before they actually get involved with localization (l10n). But before explaining why this pipeline is wrong and why it does not work for translating free software, let us go over a few more scenarios.

The previous example was about static translation, such as the translation of a text or of an HTML page. Although the example pipeline was not appropriate, even with the proper translation steps the output for the end user is to be a static translated document, such as a PDF file or another HTML page. How does this relate to a translated user interface in an application? For starters, we could be unimaginative and follow the script for static translation: the programmer keeps all the interface strings in a text file, which is inserted into the application's executables when the installation package is built. Following the same script, the translator translates the text file, replacing string by string, and then creates a new package of the software, this time with the translated version. This way, just as with the PDF files, where in the end there is one file per language, there will also be one package per language.

However, the content of a PDF file is the text itself, so there is almost no duplication of language-independent content. The opposite holds for an application, such as a file manager or a web browser: the text to be translated is a tiny part of the total content of the installation package. Keeping one software package per language would thus be a great waste of digital space. Taking any present-day operating system distribution as an example, this method of translation would make it nearly impossible for a default installation to contain any packages other than the original English ones. Suffice it to say that static translation of applications is hardly ever an option considered for free software, since one of the main purposes of this kind of software is to be used internationally.

A smarter way of having localized applications is for them to draw translations at runtime. Returning to the previously mentioned file with user interface strings that the programmer had prepared (usually in English), instead of replacing it with a translated version, now the translated files of the same structure, one for each language, are put alongside each other. The application is programmed to select strings from one of these files while running, based on the user's language settings. We will call this dynamic translation.
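The idea of dynamic translation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the string lists stand in for the per-language files "of the same structure", the language code `xx` is a placeholder, and `ui_string` is a hypothetical helper, not part of any real framework.

```python
# Hypothetical stand-ins for the per-language string files. They share the
# same structure: position N holds the same user interface string in each.
STRING_FILES = {
    "en": ["Globular Clusters", "Gaseous Nebulae"],
    "xx": ["Globularna jata", "Gasne magline"],  # "xx" is a placeholder code
}


def ui_string(index, language):
    """Pick the string at `index` from the file for `language` at runtime.

    Falls back to the English file when no file exists for the language.
    """
    strings = STRING_FILES.get(language, STRING_FILES["en"])
    return strings[index]
```

A single installed package can then serve every language: `ui_string(0, "xx")` reads from the translated list, while an unknown language code falls back to English.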

Now we come back to the translation pipeline. As explained, regardless of whether the translation is static (PDF files, HTML pages) or dynamic (application interfaces), in the end there is a file full of English text to be translated. Why is it then wrong to just open it up in OO Writer (or KWord, Abiword, etc.) and translate it by replacing paragraph by paragraph? There are two issues that make this approach infeasible:

Varying formats. While a pure text document, bound for PDF presentation, may be "just text", application user interface strings will need some extra data for the application to be able to pick them at runtime. Also, interface strings will contain various special substrings, with constraints on what may be done to them in translation. These aspects tend to differ between application frameworks (KDE, Gnome, etc.), which raises the question of validation of translated text--a faulty translation may break the application behavior.

Maintenance. Software, be it applications themselves or their documentation, evolves through time according to users' needs. This means that the text also changes: new interface strings and documentation paragraphs are added, some removed, and some modified. If OO Writer were used for translation, when the time comes to update the translation, how would the translator know that, say, a new paragraph got inserted between paragraphs 128 and 129, and that paragraphs 42 and 86 have had one sentence modified?

To handle these issues, free software, by and large, went the following way: one translation file format has been organically evolved, and many independent but complementary tools have been built to translate, validate, maintain, and convert this format to and from various target formats. Thus, whether translating user interface (in various application frameworks), documentation, man pages, release notes, or web content, the translator can efficiently do it by getting to know the translation pipeline built around this one file format.

Enter The PO Format

The PO format has been developed as the translation file format of the Gettext translation system, which is used today by a large part of free software. Given the introductory considerations on translation pipelines, it is useful to explain what exactly is meant by "used". There are three distinct uses of the PO format:

  • Intermediate static translations. Static text data, such as software documentation, is converted from its source format to PO format, translated, and converted back into the original format. Out of that the final documents for user consumption, such as PDF files or HTML pages, are built.
  • Intermediate dynamic translations. Some software keeps user interface strings in their own custom format, as is the case with e.g. Mozilla and OpenOffice. Such custom formats are converted into PO for translation, then converted back for runtime consumption by the respective applications.
  • Native dynamic translations. Finally, many applications use PO format as the native format for their user interface strings, so that no conversion is necessary. These include KDE and Gnome desktop environments, GNU tools, etc. To be usable at runtime, translated PO files are only compiled into binary MO files.

This distinction should be kept in mind, as while the PO format is one, the text exposed by it for translation will have embedded elements which are tightly coupled with the source of what is translated. For example, user interface strings will frequently contain format directives, while documentation strings may be written with HTML-like markup (examples provided later in the text). This means that the translator should be aware, in general, of what is being translated through a particular PO file.

The development of the PO format has been, and is, driven solely by the needs of its users, as in time these needs become well formulated and generalizable; hence the earlier remark of "organically evolved". Thanks to this, features of the PO format other than the very basic can be gradually introduced as necessary, and stay out of the way when they are not. The format is quite compact, human-readable and editable without special-purpose tools (though, of course, these come in handy). These aspects benefit the learning curve, everyday usage, and explanatory texts such as this one.

Although translators will frequently prefer to work on PO format files using dedicated PO editors, which purport to hide "technical details" such as the underlying file format, they should nevertheless understand the PO format very well. This is because the PO format is more than a mere vessel of text to be translated, but also, in light of the way it has been developed, reflects important concepts in the translation pipeline. Or, to put it more concretely, the translator should know how a given dedicated PO editor exposes all the bits of information provided by the PO format.

Format Basics

The PO format is a plain text format, written in files with .po extension. A PO file contains a number of messages, partly independent text segments to be translated, which have been grouped into one file according to some logical division of what is being translated. For example, a standalone application will frequently have all its user interface messages in one PO file, and all documentation messages in another; or, user interface may be split into several PO files by major application modules, documentation split by chapters, etc. PO files are also called message catalogs.

Without further ado, here is an excerpt from the middle of a PO file, showing three of the most basic messages, untranslated:

#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr ""

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr ""

#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr ""

Each message contains the keyword msgid, which is followed by the text in English, wrapped in double quotes. The keyword msgstr marks the string which is supposed to be the translation of the English one, also double-quoted. Thus, after you have gone through the PO file and added translations, these messages would read:

#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr "Gasne magline"

#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr "Planetarne magline"

Not terribly complicated, is it?
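To make the structure concrete, here is a minimal sketch of reading such basic messages into a dictionary. It is an illustration under stated assumptions, not a real PO parser: `parse_simple_po` is a hypothetical helper that only handles single-line msgid and msgstr strings and ignores comments, escape sequences, and every other feature of the format.

```python
import re

# Matches lines such as: msgid "Globular Clusters"
ENTRY_LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"$')


def parse_simple_po(text):
    """Map each msgid to its msgstr for single-line messages only."""
    catalog = {}
    current_id = None
    for line in text.splitlines():
        match = ENTRY_LINE.match(line.strip())
        if not match:
            continue  # comments and blank lines are skipped
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            catalog[current_id] = value
            current_id = None
    return catalog


SAMPLE = '''\
#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr ""
'''
```

Calling `parse_simple_po(SAMPLE)` gives a dictionary with one translated and one still untranslated message. For real work, an established PO library or the Gettext tools would be the better choice.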

As usual with text formats, immediately something must be said about the encoding of a PO file: while you could use encodings other than UTF-8 if no non-ASCII letters are used in the original text, you really should use UTF-8 (in KDE this is even mandatory). The encoding is also specified within the PO file, and by default it is UTF-8; if you want to use another encoding, aside from writing out the file in it, you must specify it in the PO header.

Leaving some messages in the PO file untranslated is technically not a problem. For every untranslated message, consumers of PO files (applications, format converters) will show the English original to the user, so that not all information is lost. Of course, you should strive to have the PO files under your maintenance completely translated, so that users are not faced with a mix of translated and English text.
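The fallback behavior for untranslated messages can be sketched as follows. This is a simplified illustration, assuming the catalog is a plain dictionary from msgid to msgstr; `lookup` is a hypothetical helper and not the API of any real consumer.

```python
def lookup(catalog, msgid):
    """Return the translation, or the English original if there is none.

    An empty msgstr and a missing message are treated alike: in both cases
    the user sees the original English text.
    """
    translation = catalog.get(msgid, "")
    return translation if translation else msgid


catalog = {
    "Globular Clusters": "Globularna jata",
    "Gaseous Nebulae": "",  # not translated yet
}
```

Here `lookup(catalog, "Gaseous Nebulae")` yields the English text, which is why a partly translated PO file produces an interface with mixed languages.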

Source References

Each message above also contains the source reference comment, which is the line starting with #:. It tells from which source code file of the application (or source document of any kind), and the exact line in it, the message has been extracted into the PO file. This piece of data may look strange at first -- of what use is it to translators, to merit inclusion in the PO file? Since PO format has been developed for localizing free software, the source reference enables you to actually look up the message in the source file, when you need more context to translate a certain message. This does not require that you be a programmer too, as source code is sometimes readable enough to be able to reason about message context without real understanding of the code. For example, in some languages the text in title position is usually written in noun form, and it may not be apparent from the PO file alone if the message:

#: addcatdialog.cpp:45
msgid "Import Catalog"
msgstr ""

is of that kind. Then, by following the source reference, you see this statement in the file addcatdialog.cpp, line 45:

setCaption( i18n( "Import Catalog" ) );

The setCaption bit here is probably a dead give-away of the message being used in a title position. Some dedicated PO editors provide very quick and comfortable source reference lookups, by pressing a single shortcut, which makes this approach to context resolution that much more viable.

String Wrapping

When a message is long or contains some logical line-breaks, its original and translation strings may be wrapped in the PO file (usually with boundary at column 80), such as this:

#: indimenu.cpp:96
msgid ""
"No INDI devices currently running. To run devices, please select devices "
"from the Device Manager in the devices menu."
msgstr ""

This wrapping is entirely irrelevant in the environment where the message is used, be it in application user interface, documentation, or elsewhere. PO processing tools produce wrapping mostly as a convenience to translators who would edit PO files with plain text editors. This means that you are free to wrap the translation (msgstr string) in the same way, differently, or not to wrap it at all--the result will be the same. Just remember to enclose each wrapped line in double quotes, the same as with msgid. For example, this translation of the previous message:

#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr ""
"Nema INDI uređaja (...)"
"(...) u meniju uređaja."

would be completely equivalent to this one:

#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr "Nema INDI uređaja (...) u meniju uređaja."

Dedicated PO editors may not even show wrapping to the user, or may wrap on their own, independently of the underlying PO file. Curiously though, most of them seem to follow the original wrapping, at least by default. At any rate, if you would like to have all strings unwrapped, including msgid ones, or vice versa, there are command line tools to achieve this.
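Why wrapping makes no difference can be seen from a small sketch of how a consumer might read a wrapped string: the quoted pieces are simply concatenated. The helper `join_wrapped` is hypothetical and leaves escape sequences untouched; it is meant only to illustrate the equivalence described above.

```python
def join_wrapped(lines):
    """Concatenate the double-quoted pieces of a possibly wrapped PO string."""
    pieces = []
    for line in lines:
        line = line.strip()
        if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
            pieces.append(line[1:-1])  # drop the enclosing quotes
    return "".join(pieces)


# The msgid from indimenu.cpp:96, wrapped as in the PO file...
wrapped = [
    '""',
    '"No INDI devices currently running. To run devices, please select devices "',
    '"from the Device Manager in the devices menu."',
]

# ...and the same text written on a single line.
unwrapped = [
    '"No INDI devices currently running. To run devices, please select devices '
    'from the Device Manager in the devices menu."',
]
```

Both `join_wrapped(wrapped)` and `join_wrapped(unwrapped)` produce the same text. Note that the space at the end of the first wrapped piece sits inside the quotes; nothing is added between pieces.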

Uniqueness of Messages

A message in the PO file is uniquely identified by its msgid string (this is not entirely true, as will be explained later, but let us consider it approximately true for the moment). This means that, in the course of evolution of the source which is translated, a message may change some of its elements or the position within the PO file, but as long as it has the same msgid, it is the same message. Those non-identifying elements may be the translation, source reference comments, etc., and by the position we mean either raw line numbers, or relative ordering among other messages.

The first consequence of this fact is that the only reliable way to "report" a message is to state its msgid string in full, even if the person to whom you are reporting has access to its PO file. (You may want to point to a message when consulting with fellow translators, or when reporting a typo or another problem in the original text to the authors.) Newcomer translators are sometimes not briefed about this, and then they at first report the line number of the message, or its ordinal number in the range of all messages, without giving the msgid. Line numbers cannot work, for example, because of the line wrapping described previously, which may differ arbitrarily from one translator to another. Ordinals do not work because your PO file may be slightly older or newer than that of the other person, and the ordinals may have changed in the meantime.

The second consequence is that there cannot be two messages with the same msgid in the same PO file (again, not exactly true, see later). If the same text has been used two or more times in the source, then in the PO file it will appear as a single message, with its source reference comment (#:) listing all appearances. For example, the source reference of this message:

#: colorscheme.cpp:79 skycomponents/equator.cpp:31
msgid "Equator"
msgstr ""

shows that it is used at two places in the application source code. This feature of the PO format prevents needless duplication of work, by allowing you to go through any duplicate text in the source only once in the translation. However, this efficiency optimization can sometimes be a double-edged sword, but with an elegant solution for the problem that can arise, as we will see shortly.
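As an illustration of how the source reference comment lists several appearances, the following sketch splits such a line into pairs of file and line number. `parse_source_refs` is a hypothetical helper, and it assumes that file names contain no spaces.

```python
def parse_source_refs(comment_line):
    """Split a '#:' comment into (file, line number) pairs."""
    if not comment_line.startswith("#:"):
        raise ValueError("not a source reference comment")
    refs = []
    for token in comment_line[2:].split():
        # The line number follows the last colon of each reference.
        path, _, line_number = token.rpartition(":")
        refs.append((path, int(line_number)))
    return refs
```

For the "Equator" message above, `parse_source_refs("#: colorscheme.cpp:79 skycomponents/equator.cpp:31")` returns two pairs, one for each place where the text is used.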

The third consequence, so to speak, though more of a remark for clarity, is: you should never modify the msgid field. Not only would doing so serve no purpose, but if the msgid gets modified, a consumer of the translated PO file will not see the message as translated, since it looks for the message by matching the msgid field.

Message Context

Depending on the target language, sometimes it may be hard to translate a message well if treated in isolation, without any additional context. Naive translation may break style guidelines, or worse, misinterpret the meaning of the original text. To avoid this, there are several ways in which you can infer the context in which the message is used.

One way we have already seen: looking into the source file of the message, as pointed to by the source reference comment. But this way can be tedious. Not only may the source code look menacing to a translator with no programming training, but also, while generally available, it is usually not very comfortable to keep all that source code lying around just for the sake of context lookups. This is a well understood difficulty, so more friendly context-pointers have been devised.

One simple way to keep track of the context is to, when translating a given message, keep in sight several messages before and after it. As a trivial example, the following four messages:

#: locationdialog.cpp:228
msgid "Really override original data for this city?"
msgstr ""

#: locationdialog.cpp:229
msgid "Override Existing Data?"
msgstr ""

#: locationdialog.cpp:229
msgid "Override Data"
msgstr ""

#: locationdialog.cpp:229
msgid "Do Not Override"
msgstr ""

are pretty obviously a question in some kind of a message dialog, title of that dialog, and the two answer buttons, so that you know exactly how the messages are related. Aside from the pure meaning, such conclusions may be further supported by the English user interface conventions (title word case for dialog titles, but also for push buttons), and the source reference comments (here they show all four messages to be in two adjacent lines of the same file). As time passes, you will start to pick up patterns of this kind which are typical for the source environment, and be more confident in your estimates.

Up to now, all the context gathering rested on the shoulders of the translator. However, when authors of the original text, for example application programmers, are themselves well-aware of the translation issues, they can explicitly provide some context for translators. This is particularly warranted when a message is quite strange, puts technical limitations on the translation, is used in a specific way, and the like.

Extracted Comments

One place where messages store explicit context provided by the authors is within extracted comments, those which start with #.. For example, the message:

#. i18n: A classical test phrase, with all letters of the English alphabet.
#. Replace it with a sample text in your language, such that it is
#. representative of language's writing system.
#: kdeui/fonts/kfontchooser.cpp:382
msgid "The Quick Brown Fox Jumps Over The Lazy Dog"
msgstr ""

has an extracted comment which tells you to avoid translating the English phrase for what it is, but to instead put there a phrase with the said properties in your language.

This kind of context usually begins with an agreed-upon keyword, which in the above case is i18n: (short for 'internationalization'), typical for KDE, but in principle depends on the source environment. In many other environments (e.g. Gnome) this keyword is the more direct TRANSLATORS:, which is the default for the Gettext translation system (under which the PO format is maintained).

Extracted comments can sometimes be provided not by a human author, but by a tool used to create or process PO files. For example, when markup-text documents are translated, such as HTML, or Docbook for documentation, the extracted comment frequently states the tag which wraps the text in the original document:

#. Tag: title
#: skycoords.docbook:73
msgid "The Horizontal Coordinate System"
msgstr ""

In the above example, by the #. Tag: title comment you are informed that the message is a title, and you can adjust the translation accordingly.

Another example where processing tools may provide extracted comments is when the PO file is created in a slightly roundabout way, such that source references in some messages do not really point to the source file, but to a temporary file which existed only during the creation of the PO file. To patch up a bit, the extracted comment may then state the true source:

#. i18n: file: tools/observinglist.ui:263
#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)
#: rc.cpp:5865
msgid "Point telescope at highlighted object"
msgstr ""

Here the rc.cpp:5865 is the dummy temporary source, whereas the true source file is given as file: tools/observinglist.ui:263. (The automatically extracted ectx: ... comment may look a bit code-cryptic, but you can still easily guess from it that this message is a tooltip for a push button.)
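A tool could recover the true source from such extracted comments along the following lines. This is only a sketch: the pattern is inferred from the single example above, so the exact comment layout is an assumption, and `true_source` is a hypothetical helper.

```python
import re

# Matches extracted comments such as:
#   #. i18n: file: tools/observinglist.ui:263
FILE_COMMENT = re.compile(r"^#\.\s*i18n:\s*file:\s*(\S+):(\d+)\s*$")


def true_source(comment_lines):
    """Return (file, line number) from an 'i18n: file:' comment, or None."""
    for line in comment_lines:
        match = FILE_COMMENT.match(line)
        if match:
            return match.group(1), int(match.group(2))
    return None


comments = [
    "#. i18n: file: tools/observinglist.ui:263",
    "#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)",
    "#: rc.cpp:5865",
]
```

With the comments of the message above, `true_source(comments)` points to tools/observinglist.ui rather than to the temporary rc.cpp.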

Disambiguating Contexts

Consider the following two messages from an application user interface:

#. i18n: First letter in 'Scope'
#: tools/observinglist.cpp:700
msgid "S"
msgstr ""
⁠
#. i18n: South
#: skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""

At first sight, you could say that it was nice of the programmer to add explicit context (#. i18n: ... lines), informing that the 'S' of the first message is short for 'Scope', and the 'S' of the second message short for 'South', so that translators know that they should use the letters corresponding to these words in their languages. But, can you spot the problem? The problem is that these messages cannot be part of a valid PO file, since, as said earlier, all messages have unique msgid strings. Instead, in a real PO file, these two messages would be collapsed into one:

#. i18n: First letter in 'Scope'
#. i18n: South
#: tools/observinglist.cpp:700 skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""

Both contexts are still there, translators are still well informed, but it is now required that the words 'Scope' and 'South' also begin with the same letter in the target language--an extremely unlikely proposal.

In these situations, the programmer can give messages a different type of context, called disambiguating context. These contexts are no longer presented as extracted comments, but through a full-fledged keyword string, the msgctxt:

#: tools/observinglist.cpp:700
msgctxt "First letter in 'Scope'"
msgid "S"
msgstr ""
⁠
#: skycomponents/horizoncomponent.cpp:429
msgctxt "South"
msgid "S"
msgstr ""

This is now a valid PO file, and you can translate each 'S' properly. By this we update the earlier approximation that messages must be unique by msgid strings: they must in fact be unique by the combination of msgctxt and msgid strings. If msgctxt string is missing, as it usually is, you can think of it as being present but empty.
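
The uniqueness rule can be illustrated with a small sketch. It is not how any particular PO tool stores messages; the message_key function and the catalog dictionary are made up for this illustration, and a missing msgctxt is simply treated as an empty string, following the simplification above.

```python
# Hypothetical in-memory catalog keyed by (msgctxt, msgid) pairs.
# A missing msgctxt is treated as an empty string (a simplification).

def message_key(msgid, msgctxt=""):
    """Return the key under which a message is unique in a PO file."""
    return (msgctxt, msgid)


catalog = {}
catalog[message_key("S", "First letter in 'Scope'")] = ""
catalog[message_key("S", "South")] = ""
catalog[message_key("Globular Clusters")] = ""

# Keyed by msgid alone, both "S" messages would fall into one slot.
by_msgid_only = {msgid for (_msgctxt, msgid) in catalog}

print(len(catalog))        # 3 distinct messages
print(len(by_msgid_only))  # 2, because the two "S" messages collapse
```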

A rather frequent example when disambiguating contexts are needed, is when the original text is a single English adjective, and used at several places in the source:

#: utils/kateautoindent.cpp:78 utils/katestyletreewidget.cpp:132
msgid "Normal"
msgstr ""

Many languages need to match an adjective form to the noun to which it refers by gender, so if the 'Normal' above refers both to indentation mode and text style, it is almost certainly necessary to provide disambiguating contexts:

#: utils/katestyletreewidget.cpp:132
msgctxt "Text style"
msgid "Normal"
msgstr "običan"
⁠
#: utils/kateautoindent.cpp:78
msgctxt "Autoindent mode"
msgid "Normal"
msgstr "obično"

You can, however, imagine that programmers in general cannot know when a certain phrase, the same in English when used in two contexts, needs different translations in some other language. This means that you, the translator, should ask them to add a disambiguating context when you determine that you need one. Free software programmers, on the other hand, are usually aware of this latent need, and readily reachable, so you should be able to get the request through with little communication overhead. Some common modes of such communication are briefly mentioned towards the end of this article.

As of this writing, the msgctxt keyword is a relatively fresh addition to the PO format. But the need for disambiguating contexts was observed much earlier, and different translation environments have historically used different custom solutions to provide them. Such older PO files are still to be found around in good numbers, so it makes sense to present a few examples of the custom contexts. Since messages indeed had to be unique by msgid only before the msgctxt keyword was introduced, the context had to become part of the msgid itself, embedded in it with some special syntax. If we take the first message from the previous example, here is how it would look in a KDE3 PO file:

#: utils/katestyletreewidget.cpp:132
msgid ""
"_⁠: Text style\n"
"Normal"
msgstr "običan"

The disambiguating context has been embedded at the beginning of the msgid, wrapped in _⁠: ...\n (the msgid string itself is shown broken into two lines, as PO tools wrap strings at \n regardless of their length; more on this special character sequence later). In Gnome, the same message would look something like this:

#: utils/gatestyletreewidget.c:132
msgid "Text style|Normal"
msgstr "običan"

Here the context is again at the beginning of msgid, but is separated from the real text only by the pipe character, |.
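
To make the two legacy conventions concrete, here is a minimal sketch of how a script might separate the embedded context from the real text. Both function names are hypothetical, and the pipe-based split is only a heuristic: a message that contains a literal pipe character but no context would be split wrongly.

```python
def split_kde3_context(msgid):
    """Split a KDE 3 style msgid of the form '_: context\\ntext'."""
    prefix = "_: "
    if msgid.startswith(prefix) and "\n" in msgid:
        context, text = msgid[len(prefix):].split("\n", 1)
        return context, text
    return "", msgid


def split_pipe_context(msgid):
    """Split a 'context|text' msgid at the first pipe character."""
    if "|" in msgid:
        context, text = msgid.split("|", 1)
        return context, text
    return "", msgid


print(split_kde3_context("_: Text style\nNormal"))  # ('Text style', 'Normal')
print(split_pipe_context("Text style|Normal"))      # ('Text style', 'Normal')
```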

Translator Comments

Sometimes you will need to translate a message without explicit context in a non-obvious way, after having determined that such translation is needed by looking into the source, or seeing the message live in the user interface at runtime. This may present a difficulty when the message is revisited, say, by a proof-reader in quality assurance, or by another translator after some months if the message got modified--either of them may conclude that the translation is wrong and mess it up, or at the very least waste time on querying why the translation is the way it is.

Conversely, sometimes you may be unsure if your translation is exactly right, e.g. if you have correctly guessed the context, or whether you have used correct terminology. In that case you can, of course, consult with fellow translators, but this can break your "flow" of translation. It is frequently better if such communication is delayed to the moment when the translation of the PO file is otherwise complete.

For these situations, you can write down your own reminders, doubts, inferred contexts, etc. in another type of comment, the translator comment. These comments start simply with # (hash and space), followed by any text whatsoever, and as with other comments, there may be any number of them. A hypothetical example:

# Wikipedia says that ‘etrurski’ is our name for this script.
#: viewpart/UnicodeBlocks.h:151
msgid "Old Italic"
msgstr "etrurski"

In real use, a translator comment like the one above would probably be written in the target language, as there is no reason for it to be in English. This is not to say that translator comments should never be in English; there may be situations when that would be advantageous. Common sense applies.

Keep in mind that translator comments are the only type of comment that all well-behaved PO processing tools are guaranteed to preserve. For example, if you would write this kind of information as an extracted comment (#.), it would very soon perish, in one of the standard maintenance procedures. So stick to adding any personal remarks into translator comments, and nowhere else.
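
The comment types met so far differ only in the character that follows the hash, which a short sketch can make explicit. The classify_comment function is hypothetical and knows only the three types introduced up to this point; anything else is reported as "other".

```python
def classify_comment(line):
    """Classify a PO comment line by its leading characters."""
    if line.startswith("#."):
        return "extracted"
    if line.startswith("#:"):
        return "source reference"
    if line == "#" or line.startswith("# "):
        return "translator"
    return "other"


for line in ("#. i18n: South",
             "#: viewpart/UnicodeBlocks.h:151",
             "# Wikipedia says that 'etrurski' is our name for this script."):
    print(classify_comment(line), "<-", line)
```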

Constructive Substrings

Original text in a message frequently contains substrings which are not visible to the end user, but are instead used by the content producer (application, HTML engine) to construct the final visible text. Translators should reproduce such substrings in the translation as well, most of the time exactly as they are in the original, but sometimes also with a tweak or two.

For better or worse, constructive substrings tend to be tightly linked to the source environment of the text, for example the particular programming language in which the application is written, or the particular markup language for static content like documentation. To produce high-quality translations, you will benefit from having basic understanding of the constructive substrings possible in the source environment, of their function and behavior. (The prerequisite to this, as mentioned earlier, is that you are aware of what is the source of the text in the PO file.)

Format Directives

When a file manager shows a message like Really delete file tmp10.txt? or Open with KWrite, the 'tmp10.txt' and 'KWrite' parts certainly had to be added to the rest of the message at runtime. In such cases, the original text as seen by the translator will contain format directives, substrings which an application will replace with the appropriate arguments to construct the message as shown to the user. For example:

#: skycomponents/constellationlines.cpp:106
#, kde-format
msgid "No star named %1 found."
msgstr "Nema zvezde po imenu %1."

The format directive in this message is %1; the application will substitute it at runtime with the argument provided (probably) by the user as the name to search for. Format directives of the type %<number> are typical of KDE applications. A new type of comment has appeared as well, the flags comment. This comment begins with #,, followed by the comma-separated list of keywords, or flags, which clarify the state or the type of the message. In this example the flag is kde-format, confirming that any format directives in the message are of KDE type.

Format directives differ across source environments, but are usually easy to recognize. The message above, if found in a Gnome application, would look like:

#: skycomponents/constellationlines.cpp:106
#, c-format
msgid "No star named %s found."
msgstr "Nema zvezde po imenu %s."

The format directive changed to %s, and the format flag to c-format. This is the format used by most applications written in C, and many written in C++. (In C format, the %s directive is for substituting string arguments, and another frequent directive is %d for integers; but there are many more. There may also be some numbers and punctuation between the percent sign and the letter, e.g. %03d.)

For one more example, to illustrate the diversity of format directives, if the application had been written in Python, the message could look like:

#: skycomponents/constellationlines.cpp:106
#, python-format
msgid "No star named %(starname)s found."
msgstr "Nema zvezde po imenu %(starname)s."

Here the format directive is %(starname)s, which states the argument type as in C format (%s), but also its name in parentheses. Hence the python-format flag. You must not change this name, as otherwise the application will not be able to find it and make the substitution--which would probably make the application crash when it tries to use the message.

You only need to make sure that each directive from the original string is found in the translation, and very rarely to modify the directives themselves. Format flags, such as kde-format, c-format, etc. are there not only as info for translators, but they are also used by tools for checking PO files. For example, if you forget or mistype a directive in the translation, such tools will report it. Dedicated PO editors may warn on the spot, or when saving the file. This provides you with a "safety net", so long as you remember to perform the checks after completing the translation (if the editor does not do it automatically).
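
The kind of check such tools perform can be sketched roughly as follows. This is a minimal illustration, not how Gettext or any PO editor implements its checks; the function names and the simplified patterns are assumptions, and C format directives are left out because their syntax is considerably richer.

```python
import re

# Simplified directive patterns, keyed by format flag (assumed for this sketch).
DIRECTIVE_PATTERNS = {
    "kde-format": re.compile(r"%\d+"),
    "python-format": re.compile(r"%\(\w+\)[a-zA-Z]"),
}


def directives(text, flag):
    """Return the set of format directives found in text."""
    return set(DIRECTIVE_PATTERNS[flag].findall(text))


def missing_directives(msgid, msgstr, flag):
    """List directives present in msgid but absent from msgstr."""
    return sorted(directives(msgid, flag) - directives(msgstr, flag))


# A correct translation: nothing is reported.
print(missing_directives("No star named %1 found.",
                         "Nema zvezde po imenu %1.", "kde-format"))
# A renamed Python directive: the original name is reported as missing.
print(missing_directives("No star named %(starname)s found.",
                         "Nema zvezde po imenu %(ime)s.", "python-format"))
```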

One situation that may require modification of directives is when there are several of them, and they need to be ordered differently in the translation:

#: kxsldbgpart/libxsldbg/xsldbg.cpp:256
#, kde-format
msgid "%1 took %2 ms to complete."
msgstr "Trebalo je %2 ms da se %1 završi."

With KDE format directives, which are numbered, reordering is as simple as above. The same holds for the mentioned Python format, where directives are named. But for formats where directives are neither numbered nor named by default, like in C format (where they only state argument type), you can sometimes modify directives to the desired effect:

#: gxsldbgpart/libxsldbg/xsldbg.c:256
#, c-format
msgid "%s took %d ms to complete."
msgstr "Trebalo je %2$d ms da se %1$s završi."
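
Why numbered directives can be moved around freely becomes clear from how the substitution might work: each number identifies its argument regardless of where the directive stands in the text. The following is a minimal sketch of that idea, not KDE's actual implementation; the substitute_numbered function and the sample arguments are hypothetical.

```python
import re


def substitute_numbered(template, *args):
    """Replace %1, %2, ... with the corresponding positional argument."""
    def replace(match):
        index = int(match.group(1)) - 1  # %1 refers to the first argument
        return str(args[index])
    return re.sub(r"%(\d+)", replace, template)


original = "%1 took %2 ms to complete."
translation = "Trebalo je %2 ms da se %1 završi."

# The same arguments, in the same order, serve both texts.
print(substitute_numbered(original, "Query", 15))
print(substitute_numbered(translation, "Query", 15))
```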

If the directives are numbered or named, and there is more than one same-number or same-name directive, usually any of the duplicates can be dropped in the translation. This may be useful in a longer text, e.g. when in the translation a pronoun can be used instead of repeating the argument:

#: hypothetical.cpp:100
#, kde-format
msgid "%1 is the blah, blah, blah. With %1 you can blah, blah."
msgstr "%1 je bla, bla, bla. Pomoću njega možete bla, bla."

where njega is a pronoun used instead of another %1. Conversely, it is possible to repeat the directive if it better fits where the English original has used a pronoun.

Sometimes the programmer may not use a directive to substitute an argument at runtime, but instead concatenate the full text out of separate messages:

#: hypothetical.cpp:100
msgid "No star named "
msgstr ""
⁠
#: hypothetical.cpp:100
msgid " found."
msgstr ""

Presumably, the application will fetch the first message above, append to it the name that was searched for, and then append the second message. This kind of programming is considered to be one of the basic errors when striving for a translatable application, as it forces translators to "piece the puzzle together", which may not even be possible in every language. This is thankfully rare today, but when it does happen, while you can try to work around it, it is better that you contact the authors to have the source code fixed.

Text Markup

Applications sometimes show parts of the text in non-plain text: certain words may be italic or bold, titles in larger font size, lists with bullets, etc. This is frequent, for example, in "What's this" texts and message boxes. Even richer typographic elements of this kind are usually found in documentation and other static content, where the final output should be reading and printing friendly. On translator's end, such original text will contain markup, where words, phrases, and whole paragraphs may be wrapped with special tags.

The following messages show typical examples of markup in application user interface:

#: rc.cpp:1632 rc.cpp:3283
msgid "<b>Name:</b>"
msgstr ""
⁠
#: kgeography.cpp:375
#, kde-format
msgid "<qt>Current map:<br/><b>%1</b></qt>"
msgstr ""
⁠
#: rc.cpp:2537 rc.cpp:4188
msgid ""
"<b>Tip</b><br/>Some non-Meade telescopes support a subset of the LX200 "
"command set. Select <tt>LX200 Basic</tt> to control such devices."
msgstr ""

The markup in these messages is XML-like, where tags for visual formatting are specified as <tag>...</tag> wrappings around the visible text segments. For example <b>...</b> tells that the text inside should be shown in boldface, while <tt>...</tt> that a monospace font should be used, and lone <br/> introduces a line break (readers knowing some HTML will instantly recognize these tags).

Another frequent XML-like markup is used in documentation POs, which are in KDE (and Gnome, and many other environments) mostly written in the Docbook XML format:

#. Tag: title
#: blackbody.docbook:13
msgid "<title>Blackbody Radiation</title>"
msgstr ""
⁠
#. Tag: para
#: geocoords.docbook:28
msgid ""
"The Equator is obviously an important part of this coordinate system; "
"it represents the <emphasis>zeropoint</emphasis> of the latitude angle, "
"and the halfway point between the poles. The Equator is the "
"<firstterm>Fundamental Plane</firstterm> of the geographic coordinate "
"system. <link linkend='ai-skycoords'>All Spherical</link> Coordinate "
"Systems define such a Fundamental Plane."
msgstr ""

The Docbook tags are named somewhat differently to the HTML-like tags previously seen in application interfaces, stating the meaning of text that they wrap rather than the visual appearance (so called semantic markup). But it's all the same for you, except that knowing the meanings of text parts may be beneficial context-wise. Docbook tags will also sometimes provide one or a few attributes following the opening tag, such as <link linkend=...> above (HTML tags may do that too).

When translating markup text, you should, in general, reproduce the same set of tags in the translation, assigning them to appropriate translated segments. Under no circumstances may the tags themselves be translated (e.g. <title> or <emphasis>), since they are processed by the machine to produce the final formatted text. As for tag attributes (linkend='ai-skycoords' in the example above), attribute names are also never translated, but on rare occasions their values in quotes may be (usually when a value is clearly a human-readable text).

However, this is not to say that you should never modify markup. Especially with HTML-like tags, not so rarely the markup in the original text gets to be sloppy (missing closing tags), and you are free to correct it in translation. Another example would be in CJK languages, where bold text is hard to read, so CJK translators tend to remove <b> tags in favor of quotes. In general, the more you are familiar with the particular markup, the more you can work past directly copying it from the original text.

In application interface POs, quite frequently there are parts in original text that may look somewhat like XML-like markup, for example:

#: utils/katecmds.cpp:180
#, kde-format
msgid "Missing argument. Usage: %1 <value>"
msgstr ""

The <value> here is not markup, but is shown verbatim to the user. It is a placeholder, an indicator to the user that a real argument should go in its place. Many languages tend to translate placeholders for this reason, and there is no technical issue with that. You should only exercise caution not to mistake a tag for a placeholder (after a little experience with the particular markup, the difference is usually obvious).
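
A simple way to compare the tags of an original and its translation is sketched below. It is only an illustration under stated assumptions: the regular expression and the function names are made up, the pattern is far from a real XML parser, and it cannot tell a tag from a placeholder, so a translated placeholder such as <value> shows up as a difference even though it is harmless. The result is therefore best read as a list of things to look at, not as a list of errors.

```python
import re
from collections import Counter

# Matches opening, closing and self-closing XML-like tags; captures the name.
TAG_RE = re.compile(r"</?([A-Za-z][\w.-]*)[^<>]*?/?>")


def tag_names(text):
    """Count tag names in text (opening and closing tags both count)."""
    return Counter(match.group(1) for match in TAG_RE.finditer(text))


def tag_differences(msgid, msgstr):
    """Report tags missing from, or extra in, the translation."""
    original, translated = tag_names(msgid), tag_names(msgstr)
    return {
        "missing": sorted((original - translated).elements()),
        "extra": sorted((translated - original).elements()),
    }


# The <b>...</b> pair was dropped in this hypothetical translation.
print(tag_differences("<qt>Current map:<br/><b>%1</b></qt>",
                      "<qt>Tekuća karta:<br/>%1</qt>"))
# A translated placeholder is flagged too, although it is not markup.
print(tag_differences("Missing argument. Usage: %1 <value>",
                      "Nedostaje argument. Upotreba: %1 <vrednost>"))
```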

There are also non-XML like markups that tend to pop up for translation. One could be wiki markup, such as of this very article:

#: poformat.txt:191
msgid "=== Extracted Comments ==="
msgstr ""
⁠
#: poformat.txt:193
msgid ""
"One place where messages store explicit context provided by the "
"authors is within ''extracted comments'', those which (...)"
msgstr ""

where ===...=== is the approximate equivalent of HTML's <h2>...</h2>, while ''...'' is the counterpart of <i>...</i>. Another markup type is the source language for man pages, troff:

# type: Plain text
#: ../../doc/man/wesnoth.6:55
msgid ""
"compresses a savefile (B<infile>)  that is in text WML format into "
"binary WML format (B<outfile>)."
msgstr ""

where B<...> is the equivalent of HTML's <b>...</b>.

When you are faced with a new kind of markup, which you have never worked with before, you should definitely at least skim through a tutorial or two about it. For XML-like markups used in KDE, there is a standalone article covering them from the point of view of translators.

Escape Sequences

There are a few special characters which cannot appear verbatim in the msgid or msgstr fields. For one, consider the plain double quote ("): since it is used to delimit field strings, a raw double quote inside the text would terminate the string prematurely, and invalidate the message syntax. Such characters are therefore written as escape sequences, a combination of the backslash and another character, which is interpreted into an appropriate single character when showing the text to users. The plain double quote is written as \":

#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "\"Lice\" na Marsu"

Another frequent escaped character is the newline, presented as \n:

#: kstarsinit.cpp:699
msgid ""
"The initial position is below the horizon.\n"
"Would you like to reset to the default position?"
msgstr ""
"Početni položaj je ispod horizonta.\n"
"Želite li da vratite na podrazumevani?"

Most PO tools unconditionally wrap the text at newlines, ignoring the designated wrap column, even when wrapping has been turned off. This is to increase readability when editing the PO file. If the text is not composed of markup (e.g. not HTML or Docbook), newlines are significant to the user too, so you should carry them over to the translation; for significance of newlines in markup text, see the article on markup. In general, unless you are confident that you can manipulate newlines in a certain way, you should follow the msgid lead.

Another two escape sequences, usually of much lower frequency than the double quote and the newline, are the tabulator \t and the backslash itself \\ (because a single backslash always starts an escape sequence). While other sequences are possible, they are extremely rare.
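
How the four escape sequences mentioned so far turn into single characters can be sketched as follows. The unescape_po function is hypothetical and handles only these four; any other sequence is passed through unchanged, which is a simplification compared to real PO tools.

```python
# The four escape sequences discussed above, keyed by the escaped character.
ESCAPES = {'"': '"', "n": "\n", "t": "\t", "\\": "\\"}


def unescape_po(raw):
    """Convert escape sequences in a raw PO string into real characters."""
    result = []
    chars = iter(raw)
    for ch in chars:
        if ch == "\\":
            following = next(chars, "")
            # Unknown sequences are kept as written.
            result.append(ESCAPES.get(following, "\\" + following))
        else:
            result.append(ch)
    return "".join(result)


print(unescape_po(r'The \"face\" on Mars'))  # The "face" on Mars
print(unescape_po(r"First line.\nSecond line."))
```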

Going back to double quotes, keep in mind that while the English original usually uses plain ASCII quotes, translations tend to use "fancy" quotes according to the orthography of the language:

#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "„Lice“ na Marsu"

This holds both for double and single quotes. So do check if your language defines any fancy quote pairs, and use them if it does.

Accelerators

In application interfaces, short texts on widgets used to perform an action or open a dialog, frequently have one letter in them underlined. This indicates that when the user presses the Alt key and that letter, the corresponding action will be activated. Such letters are called accelerators, and they are selected in the translation usually by preceding them with a special character for that purpose, the accelerator marker:

#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "Zadaj fokus &ručno..."

In KDE the accelerator marker is the ampersand (&). Thus, the accelerator in the message above will be the letter 'M' in the English text, and the letter 'r' in the translation. Accelerator markers tend to differ across environments, e.g. Gnome uses the underscore (_), OpenOffice the tilde (~), etc.

How to choose accelerators in the translation (where to put the accelerator marker) may be tricky, as you can easily get into situations where in the same interface context (e.g. within one menu) two items end up having the same accelerator. This will not do anything too bad, e.g. the application may automatically reassign the conflicting accelerators, or the user may have to press the Alt+accelerator several times to go through all such items. Still, conflicting accelerators are not nice, but there is no way to positively avoid them; you can only try to track the message context in the PO file, and check the running applications. This is not only the problem of translation, as not so rarely the English original itself produces conflicting accelerators!

CJK languages use input methods different to alphabet-type ones (keyboard layouts), so instead of assigning an ideogram as the accelerator, they add a single English letter for that purpose:

#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "フォーカスを手動でセット(&M)..."

This letter is usually picked to be the same as in the original, therefore reducing the possibility of accelerator conflicts to as much as the programmers were able to avoid conflicts themselves.

An accelerator does not have to be positioned at the start of a word, but can be put next to any letter or number. A reasonable order of choices would be: at the start of the most significant word in the message by default, then if it conflicts with another message, at the start of another word, and if it still conflicts, inside one of the words.

Since the accelerator marker is typically a fairly common character, it may appear in contexts in which it does not mark an accelerator. For example:

#: kspopupmenu.cpp:203
msgid "Center && Track"
msgstr ""
⁠
#. Tag: phrase
#: config.docbook:137
msgid "<phrase>Configure &kstars; Window</phrase>"
msgstr ""

In the first message above, the accelerator marker has been used to escape itself, to produce a verbatim ampersand in output (similar to escape sequences, where a double backslash represents a verbatim backslash). In the second message, the ampersand is used to insert an XML entity &kstars;, about which you can read more in the article on markup. That the character is not used as an accelerator marker can only be determined from context, but after gaining a little experience, the distinction will almost always be obvious to you.
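
For plain user interface labels, finding the accelerator and spotting conflicts within one group of items can be sketched as below. Both functions are hypothetical, the grouping of labels into one menu is assumed to be known, and XML entities such as &kstars; are not recognized, so the sketch should not be applied to documentation messages.

```python
from collections import defaultdict


def find_accelerator(label, marker="&"):
    """Return the lowercased accelerator character of a label, or None."""
    i = 0
    while i < len(label) - 1:
        if label[i] == marker:
            if label[i + 1] == marker:
                i += 2  # a doubled marker is a literal character
                continue
            return label[i + 1].lower()
        i += 1
    return None


def accelerator_conflicts(labels):
    """Group labels that share the same accelerator."""
    by_key = defaultdict(list)
    for label in labels:
        key = find_accelerator(label)
        if key is not None:
            by_key[key].append(label)
    return {key: group for key, group in by_key.items() if len(group) > 1}


print(find_accelerator("Zadaj fokus &ručno..."))  # r
print(find_accelerator("Center && Track"))        # None
print(accelerator_conflicts(["&Open", "&Options", "E&xit"]))
```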

Plural Forms

Applications frequently need to report to the user the number of objects in a given context: "10 files found", "Do you really want to delete 5 messages?" etc. Of course, in English such messages should also have singular counterparts, like "1 file found", "...delete 1 message?". This means that two separate English texts are needed in the PO file, one covering the singular, and another the plural case. You could assume that these would then be two messages, like in this hypothetical example:

#: hypothetical.cpp:100
#, kde-format
msgid "Time: %1 second"
msgstr ""
⁠
#: hypothetical.cpp:101
#, kde-format
msgid "Time: %1 seconds"
msgstr ""

where the application fetches the first message when the number of objects is 1, and the second message for any other number.

However, while this works for some languages other than English (e.g. Spanish, German, French...), it does not work for all languages. The reason is that, while English needs one text for unity, and another text for any other number, many languages have it more complicated. For example, in some languages the singular form is used for all numbers ending with the digit 1, so the application would be in error to fetch the singular form only for the number exactly 1. Furthermore, in some languages more than two texts are needed, for example three: one for all numbers ending in 1, a second for all numbers ending in 2, 3, 4, and a third for all other numbers.

To handle this diversity, the PO format implements plural messages. The example above in reality looks like this:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] ""
msgstr[1] ""

The English singular form is given by the msgid field, and the plural form by the msgid_plural field. There are now several msgstr fields, with zero-based indices in square brackets, so that you can write as many translations as there are plural forms in your language. By default there will be two msgstr fields, but you may simply insert the line with the third one (index 2), and so on. Then, the Spanish translation, which has the same plural forms as English, looks like:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Tiempo: %1 segundo"
msgstr[1] "Tiempo: %1 segundos"

while the Polish translation, which needs three plural forms, is:

#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Czas: %1 sekunda"
msgstr[1] "Czas: %1 sekundy"
msgstr[2] "Czas: %1 sekund"

But, how should the application know which form corresponds to which numbers? The specification for this is written within the PO file itself, in the header (more on PO headers below); it consists of the number of plural forms which every plural message in the given PO file shall have, and a computable logical expression, which for any given number, computes the index of the plural form to be used. This expression is quite cryptic-looking, but you do not have to really understand how it works. Since it is constant for a given language, you can just copy it from any other previously translated PO file in your language, and by looking at plural messages in that other file, you will clearly see which form (by index of msgstr) is used in which situation. Bearing this in mind, just to complete the examples, here is the plural specification for Spanish:

nplurals=2; plural=n != 1;

and for the more complicated Polish plural:

nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);

The nplurals field tells how many forms there are, and plural is the expression which computes the index of the msgstr field for the given number n (if the syntax is familiar to you, that's because you know some C).
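
For readers who do not know C, the two expressions above can be transcribed by hand into Python. The function names are made up for this sketch; the logic follows the header expressions term by term, and the %1 directive is replaced here with a plain string replacement only to show the result.

```python
def plural_index_spanish(n):
    # nplurals=2; plural=n != 1;
    return int(n != 1)


def plural_index_polish(n):
    # nplurals=3;
    # plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


polish_forms = ["Czas: %1 sekunda", "Czas: %1 sekundy", "Czas: %1 sekund"]

for n in (1, 2, 5, 12, 22, 25):
    form = polish_forms[plural_index_polish(n)]
    print(form.replace("%1", str(n)))
```

Running the loop shows, for example, that 22 selects the same form as 2, while 12 selects the same form as 5.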

Sometimes you will come upon a message, or pair of messages which are just like the first, hypothetical example above -- having a number in it, but not presented as plural message, when you clearly see it should be. In most environments today (e.g. in KDE or Gnome), this simply means that the programmer forgot to use the plural message. Since this is to be considered a bug, you should inform application authors to replace the ordinary with the plural message. In some environments, however, applications are not capable of handling plurals, mostly when PO format is used as intermediate (e.g. for OpenOffice). If that is the case, you can only try to translate the message in a "least bad" way.

At the time when KDE was introducing plural messages, PO format's native support for them was still very new. Thus, as with disambiguating contexts, in KDE 3 plural messages were embedded in the ordinary messages. Since you may still get to translate a few stray KDE3 PO files, here is how the previously shown Polish-translated message would look in one:

#: mainwindow.cpp:127
msgid ""
"_n: Time: %n second\n"
"Time: %n seconds"
msgstr ""
"Czas: %n sekunda\n"
"Czas: %n sekundy\n"
"Czas: %n sekund"

The starting _n: in the msgid determines that the message is plural, and plural forms are separated by newlines, in both the original and the translation. Instead of an ordinary numbered placeholder, a special %n placeholder is used for the number.

Omitting The Number

Quite frequently English singular form will omit the number, that is, only the plural form will contain the format directive for the number:

#: modes/typesdialog.cpp:425
#, kde-format
msgid "Are you sure you want to delete this type?"
msgid_plural "Are you sure you want to delete these %1 types?"
msgstr[0] ""
msgstr[1] ""

It depends on the environment whether it is allowed to omit the number like this. For example, in KDE applications (kde-format flag) it is always possible, and so it is in Gnome (c-format), but not in pure Qt (qt-format). In the translation, if the environment supports omission, you can omit or retain the number in singular according to what is better language-wise, and regardless of whether or not it was omitted in the original. More precisely, you can omit the number in any form that is used for exactly one number. Conversely, if all forms are used for more than one number (e.g. the "singular" form is used for all numbers ending in digit 1), you cannot omit the number at all.
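
The rule about which forms may drop the number can be checked mechanically for a given plural expression. The sketch below is an illustration only: the function name is hypothetical, the check covers just a finite range of numbers, and the second rule is an assumed example of the "all numbers ending in 1" kind rather than the official expression of any particular language.

```python
from collections import defaultdict


def forms_that_may_omit_number(plural_index, limit=1000):
    """Return indices of plural forms used for exactly one number below limit."""
    numbers_by_form = defaultdict(list)
    for n in range(limit):
        numbers_by_form[plural_index(n)].append(n)
    return sorted(form for form, numbers in numbers_by_form.items()
                  if len(numbers) == 1)


def english_like(n):
    # Form 0 is used only for the number 1.
    return int(n != 1)


def ends_in_one_like(n):
    # Assumed example: form 0 is used for 1, 21, 31, ... (but not 11).
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


print(forms_that_may_omit_number(english_like))      # [0]
print(forms_that_may_omit_number(ends_in_one_like))  # []
```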

On rare occasions a plural message will have no number in either English singular or plural, when the programmer merely wanted to choose between the forms for "one" and "several". This is perfectly valid:

#: kgpg.cpp:498
msgid "Decryption of this file failed:"
msgid_plural "Decryption of these files failed:"
msgstr[0] ""
msgstr[1] ""

In such cases, in translation you should just use the same plural text for all forms but the one which is used for unity (if there is any such).

In old embedded plurals in KDE3 PO files, the %n placeholder can be omitted following the same rules.

Merging With Templates

At one point you will have translated the whole PO file, every message in it, and sent it back to the source where it is used. As time passes by, however, the original text at the source is going to change. Applications will get bug fixes and new features, which will require both new strings in the user interface, and modifications to some existing. Documentation will get new chapters, old chapters expanded, old paragraphs modified to better style. At some point you will want to update your old translation, so that the source is again fully translated into your language.

This is done in the following way. On the one side, there is your last translated version of the PO file. On the other side, there is the latest pristine PO, with non-translated messages corresponding to the current state of the source. Pristine PO files are actually called templates, and have the .pot extension, unlike the .po extension of translated POs. The translated PO file and the template are then merged in a special way, producing a new, partially translated PO for you to work on. The technicalities of merging are not so important at first, as in any established translation project you can just fetch the latest merged PO files; what is more important is what you can expect to see in a merged PO file.

In general, merged PO files contain three categories of messages. First are those messages which were present in the PO file when you last worked on it, in the sense of having unchanged msgctxt and msgid fields since then. As expected, their translations (msgstr fields) are as you left them, so there is nothing new for you to do about these messages. The second category are entirely new messages, added in the source in the meantime, which you should now translate. New messages won't be added in an arbitrary way, for example simply appended to the end of the PO file. Instead they will be interspersed with translated messages, following the order of appearance of messages in the current source. This allows you to infer contexts by considering the preceding and following messages, same as you did when you were translating the PO from scratch. For example:

#: fitshistogram.cpp:347
msgid "Auto Scale"
msgstr ""
⁠
#: fitshistogram.cpp:350
msgid "Linear Scale"
msgstr "linearna skala"
⁠
#: fitshistogram.cpp:353
msgid "Logarithmic Scale"
msgstr "logaritamska skala"

The first message is a new one, untranslated, and the other two are old, translated earlier. From these two you can see that the new message is one among a selection of scales (possibly for a diagram axis), and not e.g. a command or option to change the size of something, as in "scale automatically".
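The first two categories can be illustrated with a minimal sketch. It assumes messages are already parsed into simple records; the `Message` type and the `merge_exact` function are hypothetical names made up for this illustration, not part of any real merging tool, and the sketch deliberately ignores fuzzy matching.

```python
from typing import NamedTuple, Optional


class Message(NamedTuple):
    msgctxt: Optional[str]
    msgid: str
    msgstr: str = ""


def merge_exact(old_po, template):
    """Carry over translations for messages whose msgctxt and msgid are unchanged."""
    known = {(m.msgctxt, m.msgid): m.msgstr for m in old_po}
    merged = []
    for t in template:  # the template order follows the current source
        # Unknown (new) messages get an empty msgstr and stay in place.
        merged.append(Message(t.msgctxt, t.msgid, known.get((t.msgctxt, t.msgid), "")))
    return merged


old_po = [
    Message(None, "Linear Scale", "linearna skala"),
    Message(None, "Logarithmic Scale", "logaritamska skala"),
]
template = [
    Message(None, "Auto Scale"),
    Message(None, "Linear Scale"),
    Message(None, "Logarithmic Scale"),
]
merged = merge_exact(old_po, template)
for m in merged:
    print(m.msgid, "->", repr(m.msgstr))
```

Because the loop walks the template rather than the old PO, the new message ends up between its neighbours from the source, not appended at the end.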

Fuzzy Messages

The most interesting, however, is the third category of messages in a merged PO file. These are the old messages which were somewhat modified in the meantime, i.e. one or both of their msgctxt and msgid fields have changed. Or, this can also be a new message, but very similar to one of the old ones. There is actually no way to tell the two apart; it is only by similarity to one of the old messages that a modified or new message falls into this category. Either way, such a message is called fuzzy, and looks like this:

#: src/somwidget_impl.cpp:120
#, fuzzy
#| msgid "Elements with boiling point around this temperature:"
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"

The fuzzy flag states that the message is fuzzy. The comment starting with #| is called the previous-field comment, as it contains the previous value of the msgid field, which corresponds to the translation as given by the msgstr. This translation is, however, not valid for the current (non-commented) msgid field. By comparing the previous and current msgid, you can see that the word "boiling" was replaced with "melting", and you can adjust the translation accordingly. Once you have done that, to unfuzzy the message you should remove the fuzzy flag and the previous-field (#|) comments, so that the final updated message is:

#: src/somwidget_impl.cpp:120
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom topljenja u blizini ove temperature:"
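Dedicated PO editors do this cleanup for you, but the operation itself is simple. The following is a rough sketch, assuming one message is given as a list of its text lines; the `unfuzzy` function is a hypothetical helper written for this illustration.

```python
def unfuzzy(lines):
    """Drop the fuzzy flag and previous-field (#|) comments from one message."""
    result = []
    for line in lines:
        if line.startswith("#|"):
            continue  # previous msgctxt/msgid are no longer needed
        if line.startswith("#,"):
            flags = [f.strip() for f in line[2:].split(",")]
            flags = [f for f in flags if f and f != "fuzzy"]
            if not flags:
                continue  # fuzzy was the only flag, so drop the whole line
            line = "#, " + ", ".join(flags)  # keep other flags, e.g. *-format
        result.append(line)
    return result


message = [
    '#: src/somwidget_impl.cpp:120',
    '#, fuzzy',
    '#| msgid "Elements with boiling point around this temperature:"',
    'msgid "Elements with melting point around this temperature:"',
    'msgstr "Elementi s tačkom topljenja u blizini ove temperature:"',
]
print("\n".join(unfuzzy(message)))
```

Note that the flag line is only removed when fuzzy was its sole flag; any other flags on the same line are preserved.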

The previous-field comments are also a relatively new addition to the PO format, so in some translation environments you will not see them in merged POs. The fuzzy message would then be presented only with the fuzzy flag:

#: src/somwidget_impl.cpp:120
#, fuzzy
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"

It may seem that this is no great loss: so long as you are visually comparing texts, instead of comparing the previous (here missing) and current msgid, you might as well compare the current msgid and the old translation given in msgstr, and adjust the translation based on that. However, there are two disadvantages to this. Less importantly, it may not always be easy to spot a difference by comparing the new original and the old translation. For example, only a typo or a missing dot may have been fixed in the original, leaving you to wonder if you are missing something. More importantly, a dedicated PO editor can use the previous and current msgid to highlight differences between them, which makes it that much easier for you to see them. Even if you are working with an ordinary text editor, there are command-line scripts which can embed differences into the previous msgid, again making them easier to spot. And the bigger the message, the more important it is to have automatic highlighting -- think of a long paragraph where only one word has been changed. For these reasons, if the merged PO files you work on do not have previous-field comments, do inquire with the authors if they can enable them (they may simply not know about this possibility, as it is not the default behavior on merging).
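To give an idea of what such difference embedding involves, here is a small sketch built on Python's standard difflib module. The `embed_diff` function is a hypothetical name, and the {-...-} and {+...+} wrappers are simply one possible notation for removed and added words; actual tools may segment and mark the text differently.

```python
import difflib


def embed_diff(previous, current):
    """Mark words removed from `previous` as {-...-} and words added as {+...+}."""
    old_words = previous.split()
    new_words = current.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            out.append("{-" + " ".join(old_words[i1:i2]) + "-}")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
        if tag == "equal":
            out.append(" ".join(old_words[i1:i2]))
    return " ".join(out)


print(embed_diff(
    "Elements with boiling point around this temperature:",
    "Elements with melting point around this temperature:",
))
# prints: Elements with {-boiling-} {+melting+} point around this temperature:
```

Comparing word by word rather than character by character keeps the markers readable; the trade-off is that a one-letter typo fix marks the whole word as changed.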

Aside from msgid, the msgctxt field can also feature in the previous-field comment. Whether one or both of the msgctxt and msgid have been changed, both will be given in previous-field comments:

#: kstarsinit.cpp:451
#, fuzzy
#| msgctxt "Constellation Line"
#| msgid "Constell. Line"
msgctxt "Toggle Constellation Lines in the display"
msgid "Const. Lines"
msgstr "Linija sazvežđa"

But in particular, a message will be fuzzied if it previously had no msgctxt and got one after merging, or had one and lost it. In the first case, the previous-field comments will contain only the msgid, although it may be the same as the current one; by this you will know that the change was only the adding of context. In the second case, the previous-field comments will contain both the msgctxt and the msgid fields, while there will be no current msgctxt. Here are the two examples:

#: kstarsinit.cpp:444
#, fuzzy
#| msgid "Solar System"
msgctxt "Toggle Solar System objects in the display"
msgid "Solar System"
msgstr "Sunčev sistem"
⁠
#: finddialog.cpp:102
#, fuzzy
#| msgctxt "object name (optional)"
#| msgid "Andromeda Galaxy"
msgid "Andromeda Galaxy"
msgstr "Andromeda, galaksija"

It is important for a message to become fuzzy when only the disambiguating context is added or removed, because this has been done precisely to shed some light on the original text, which may require modification of the translation.

Treatment of Fuzzy Messages

Fuzzy messages are a special category only from translators' viewpoint. Consumers of PO files (applications, etc.) will treat them as ordinary untranslated messages, i.e. they will use the English original instead of the old translation. This is necessary, as there is no telling how inappropriate the old translation may be for the current original. The algorithm that produces fuzzy messages will sometimes turn out rather strange pairings, which to you or to the user may not look similar at all.
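How a similarity-based pairing might come about can be sketched with Python's difflib. This is only a stand-in: the actual merging tools use their own similarity measure, and both the `find_fuzzy_source` function and the 0.6 cutoff are assumptions chosen for this illustration.

```python
import difflib


def find_fuzzy_source(new_msgid, old_translations, cutoff=0.6):
    """Return (old_msgid, old_msgstr) of the closest old message, or None."""
    close = difflib.get_close_matches(
        new_msgid, list(old_translations), n=1, cutoff=cutoff
    )
    if not close:
        return None  # nothing similar enough: the message stays untranslated
    return close[0], old_translations[close[0]]


old_translations = {
    "Elements with boiling point around this temperature:":
        "Elementi s tačkom ključanja u blizini ove temperature:",
    "Linear Scale": "linearna skala",
}
match = find_fuzzy_source(
    "Elements with melting point around this temperature:", old_translations
)
print(match)
```

Since the pairing is purely textual, two messages with many characters in common can be matched even when their meanings are unrelated, which is where the strange pairings come from.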

That a fuzzy message is treated as untranslated is important to keep in mind. Fresh translators will sometimes manually add the fuzzy flag to a message to mark that they are not entirely sure the translation is proper, not knowing that this will totally exclude the translation from being used. Thus, you should manually add the fuzzy flag only when you are so unsure of the meaning of the message that you explicitly want to prevent the translation from being used. This is fairly rarely needed. Instead, when you just want to mark the message so that you or someone else can check it later, you should write your doubts in a translator comment.
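The consumer's side of this rule can be pictured with a short sketch. The dictionary layout of `catalog` and the `lookup` function are hypothetical, made up for this illustration; real applications normally read compiled catalogs rather than PO files directly.

```python
def lookup(catalog, msgid):
    """Return the translation to display, falling back to the original text."""
    entry = catalog.get(msgid)
    if entry is None or entry["fuzzy"] or not entry["msgstr"]:
        return msgid  # missing, fuzzy, or empty: show the English original
    return entry["msgstr"]


catalog = {
    "Linear Scale": {"msgstr": "linearna skala", "fuzzy": False},
    "Elements with melting point around this temperature:": {
        "msgstr": "Elementi s tačkom ključanja u blizini ove temperature:",
        "fuzzy": True,
    },
    "Auto Scale": {"msgstr": "", "fuzzy": False},
}
print(lookup(catalog, "Linear Scale"))
print(lookup(catalog, "Elements with melting point around this temperature:"))
```

The second call returns the English text even though a msgstr exists, which is exactly what happens to a translation you flag as fuzzy by hand.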

Starting a New PO file

In light of the translation maintenance through the merging process, you can think of starting to work on a never-before translated PO file as just the "initial merge": you will have to take the template and rename it to something with the .po extension, and work from there on. What you rename it to depends on the environment, but it is usually one of two things: either the same name as that of the template but with the .po extension (like in KDE), or your language code with the .po extension (like in Gnome). This basically depends on the organization of the particular translation project.
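The two naming conventions can be written down as a tiny helper. This is only a sketch: the `new_po_name` function and the convention labels "template-name" and "language-code" are invented for this illustration, and the language code "sr" is just an example.

```python
from pathlib import Path


def new_po_name(template, lang_code, convention):
    """Pick the file name for a fresh PO started from a template."""
    template = Path(template)
    if convention == "template-name":  # same name as the template (like in KDE)
        return template.with_suffix(".po").name
    if convention == "language-code":  # language code as the name (like in Gnome)
        return lang_code + ".po"
    raise ValueError("unknown convention: " + convention)


print(new_po_name("kstars.pot", "sr", "template-name"))  # prints: kstars.po
print(new_po_name("kstars.pot", "sr", "language-code"))  # prints: sr.po
```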

On the other hand, sometimes for each template in the project an empty PO for your language will have been created and put in a proper place in the source tree, so that you can just start translating it when you get to it.

At any rate, when you start working on a PO file from scratch, the first thing you should do is fill out its header.

PO Header

The very first message in each PO file is not a real message, but the header, which records many administrative and technical pieces of information about the PO file. Here is one pristine header, before any translation on the PO file has been done:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR This_file_is_part_of_KDE
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-03 10:09+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <[email protected]>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=INTEGER; plural=EXPRESSION;\n"

The header consists of introductory comments, followed by the empty msgid, and by the msgstr which contains header fields. The header comments, similar to those of normal messages, are not entirely free form, but have some structure to them. The msgstr is divided by newlines (\n) into fields of name: value form (name of the piece of information and the information itself). Although the header is pristine, some of the environment-dependent values are typically already supplied, e.g. wherever KDE is mentioned above. The fuzzy flag tells that the PO file has not been translated earlier. All-uppercase text segments are placeholders which you should replace with real values. The header updated to reflect the translation state could look like this:

# Translation of kstars.po into Spanish.
# This file is distributed under the same license as the kdeedu package.
# Pablo de Vicente <[email protected]>, 2005, 2006, 2007, 2008.
# Eloy Cuadra <[email protected]>, 2007, 2008.
msgid ""
msgstr ""
"Project-Id-Version: kstars\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-01 09:37+0200\n"
"PO-Revision-Date: 2008-07-22 18:13+0200\n"
"Last-Translator: Eloy Cuadra <[email protected]>\n"
"Language-Team: Spanish <[email protected]>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"

Even if this particular header has been slightly abridged for clarity, it probably still looks menacing, with a lot of data -- are you supposed to manually get all that correct? Not really. If you are using a dedicated PO editor, it will have a nice configuration dialog where you can enter data about yourself, your language, etc., and whenever you save a PO file, the editor will automatically fill out the header. If you are using a plain text editor, there are command line tools to similarly fill out the header automatically. But even with such aids, it is worth giving a few general directions about header comments and fields.

The first comment usually has the title role, saying something about what is translated into which language. The second comment tells something about licensing. The following comments each list a translator who at one time worked on this particular PO file, with name, email address, and years of contribution. After that, any freeform comments may be added. The fuzzy flag has been removed, as the file has been worked on.

The Project-Id-Version header field states the name and possibly the version of what is translated, Report-Msgid-Bugs-To gives the address to write to when you discover problems in the original text, POT-Creation-Date the time when the catalog template was created, PO-Revision-Date the time when the PO file was last edited by a translator, Last-Translator the name and address of the last translator who worked on the file, and Language-Team the name and address of the translation team (if any) which the last translator is part of. The fields MIME-Version, Content-Type, and Content-Transfer-Encoding are pretty much always, and for any language, as given above, so they are not interesting (though you could change the encoding to something other than UTF-8, in this day and age really think thrice before you do that). The final field, Plural-Forms, is where you write the plural specification for your language (as explained in the section on plural forms).
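Since the header msgstr is just a sequence of name: value lines, reading it programmatically is straightforward. The following is a minimal sketch; `parse_header` is a hypothetical helper, and it assumes the msgstr has already been unquoted into a single string with real newline characters.

```python
def parse_header(msgstr):
    """Split the header msgstr into a dict of field name -> value."""
    fields = {}
    for line in msgstr.split("\n"):
        if not line.strip():
            continue  # skip the empty piece after the final newline
        # Split on the first colon only; values such as URLs contain more.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = (
    "Project-Id-Version: kstars\n"
    "Report-Msgid-Bugs-To: http://bugs.kde.org\n"
    "PO-Revision-Date: 2008-07-22 18:13+0200\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Plural-Forms: nplurals=2; plural=n != 1;\n"
)
fields = parse_header(header)
print(fields["Report-Msgid-Bugs-To"])  # prints: http://bugs.kde.org
```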

Of the presented comments and fields, almost all of them are set when the PO file is translated for the first time. When you come back to a certain PO to update translation, if no one else worked on that PO in the meantime, you should only update the PO-Revision-Date field. If someone has worked on it, you will also have to put your data in Last-Translator field. If you get to work on a PO file for the first time after someone else has already worked on it, you should add yourself in the translator list in comments. (If you are using a dedicated PO editor, it will perform all these updates for you whenever you save the file.)
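The bookkeeping on each save amounts to refreshing two fields. Here is a sketch under the assumption that the header is held as a dictionary of fields; `touch_header` is a hypothetical helper, and the translator name and address are placeholders. Adding yourself to the translator list in the comments is left out.

```python
from datetime import datetime, timedelta, timezone


def touch_header(fields, translator, when):
    """Return a copy of the header fields updated for a new editing session."""
    updated = dict(fields)
    # Same layout as in the headers above: YEAR-MO-DA HO:MI+ZONE
    updated["PO-Revision-Date"] = when.strftime("%Y-%m-%d %H:%M%z")
    # A no-op if you were already the last translator.
    updated["Last-Translator"] = translator
    return updated


old_fields = {
    "PO-Revision-Date": "2008-07-22 18:13+0200",
    "Last-Translator": "Some Translator <someone@example.org>",  # placeholder
}
when = datetime(2008, 9, 5, 14, 30, tzinfo=timezone(timedelta(hours=2)))
new_fields = touch_header(old_fields, "Jane Doe <jane@example.org>", when)
print(new_fields["PO-Revision-Date"])  # prints: 2008-09-05 14:30+0200
```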

Note that everything in the header is supposed to be in English, readable by anyone, not just by your native language speakers. Aside from comments being in English, this also means that the name of the language and the language team should be in English, and your own name and names of other translators in their romanized equivalents. This is because, for example, people from other languages may need to contact you or your team about any technical problems in the translation (e.g. application maintainers). Keep this in mind also when you are setting up your data in a PO editor.

Aside from standard header fields, you may encounter some custom ones, whose names begin with X-. These fields are added by various PO processing tools. One typical custom field is X-Generator, where the dedicated PO editor which you use will write its name and version. Another custom field sometimes seen is X-Accelerator-Marker, which states the character used as the accelerator marker (recognized by some tools e.g. for searching through PO files, when otherwise the accelerator marker could "mask" a word by being in the middle of it). Aside from these more general custom fields, different translation environments may add various environment-specific ones.
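To see why a search tool would want to know the accelerator marker, consider this rough sketch. The `strip_accelerator` function is hypothetical, the default marker "&" is an assumption for the example, and the simple rule used here (drop a marker that directly precedes a letter or digit) would also mangle HTML entities such as &nbsp;, so a real tool needs more care.

```python
def strip_accelerator(text, marker="&"):
    """Remove accelerator markers so that plain words can be searched for."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == marker and i + 1 < len(text) and text[i + 1].isalnum():
            i += 1  # skip the marker, keep the accelerated character
            continue
        out.append(ch)
        i += 1
    return "".join(out)


label = "Globular C&lusters"
print("Clusters" in label)                     # prints: False
print("Clusters" in strip_accelerator(label))  # prints: True
```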

Representation in Editors

When you translate PO files using a plain text editor, all the message elements will be displayed in it as we have seen in the examples so far; you can edit them at will, including invalidating the very syntax if you are not careful. Most capable text editors nowadays have syntax highlighting for the PO format, albeit with different levels of specificity. On the other hand, dedicated PO editors will provide you with much more automation, but each will have its own ways of presenting and means of editing different elements of a message.

This section will show how PO messages are represented in several widespread editors. Note this should not be understood as a review of PO editors in general, nor that any remarks are there to imply that one editor is better than the other. It merely serves to relate the elements of the PO format to what is seen in each editor.

Each editor is presented by a few remarks, and one or more annotated screenshots. Message elements on the screenshot are marked with a black circle and a number in it, corresponding to the following:

  • (1) msgid field (original text)
  • (2) msgstr field (translated text)
  • (3) msgctxt field (disambiguating context)
  • (4) extracted comments (context as comment)
  • (5) source references (source file/line of the message)
  • (6) flags (fuzzy, *-format, etc.)
  • (7) fuzzy state (although among flags, usually gets special attention)
  • (8) previous-fields (msgctxt and msgid)
  • (9) translator comments (those which you add manually)
  • (10) position context (preceding and following messages)

For any message element not seen in the screenshot, a red circle with the corresponding number will be given in the lower right corner.

The following contrived message is used as the exemplar for the screenshots:

# Do we have a better translation for 'froobaz'?
#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.
#: contrivance.cpp:42
#, fuzzy, kde-format
#| msgctxt "control station: alpha"
#| msgid ""
#| "<p>Froobaz \"%1\" asks for attention.</p>\n"
#| "<p>Priority&nbsp;A message follows: <i>%2</i></p>"
msgctxt "control station: alpha"
msgid ""
"<p>Froobaz \"%1\" demands immediate attention.</p>\n"
"<p>Priority&nbsp;A message follows: <i>%2</i></p>"
msgstr ""
"<p>Frubaz „%1“ traži pažnju.</p>\n"
"<p>Poruka prioriteta&nbsp;A sledi: <i>%2</i></p>"

Aside from having all the numbered elements, this message sports various constructive substrings in the text, which allows you to see the editor's highlighting capabilities within text fields as well. (We didn't choose a plural message to avoid clutter; plural messages are a small part of all messages, and any dedicated PO editor will present them in a reasonable way, e.g. using tabs in the original and translation fields.)
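An editor tells these elements apart mostly by the first few characters of each line. The following simplified sketch shows the idea; `classify_line` is a hypothetical helper that ignores plural fields and obsolete messages, and it is not how any particular editor is implemented.

```python
def classify_line(line):
    """Name the PO element that a single line of a message belongs to."""
    prefixes = [
        ("#|", "previous field"),
        ("#,", "flags"),
        ("#:", "source reference"),
        ("#.", "extracted comment"),
        ("#", "translator comment"),  # must come after the two-character prefixes
        ("msgctxt", "msgctxt"),
        ("msgid", "msgid"),
        ("msgstr", "msgstr"),
        ('"', "continuation"),  # further lines of a multi-line text field
    ]
    for prefix, kind in prefixes:
        if line.startswith(prefix):
            return kind
    return "unknown"


for line in [
    "# Do we have a better translation for 'froobaz'?",
    "#. i18n: 'Froobaz' is short for 'froolimatic bazzier'.",
    "#: contrivance.cpp:42",
    "#, fuzzy, kde-format",
    '#| msgctxt "control station: alpha"',
    'msgctxt "control station: alpha"',
]:
    print(classify_line(line), "<-", line)
```

The order of the prefix table matters: the bare # of translator comments would otherwise swallow every other comment type.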

Kate

PO message in Kate 3.1.0

KWrite and Kate are KDE's standard low-high team of text editors, which share the same text editing component. The syntax highlighting for the PO format shown on the screenshot was introduced in version 3.1.0 of Kate (released with KDE 4.1.0), while earlier versions had simpler highlighting. However, the new PO highlighting definition works equally well with versions from 2.4.0 onwards, so you can fetch it if you are using an older Kate.

The embedded differences seen in previous-fields, text segments wrapped in {+...+} and {-...-}, can be produced by piping the PO file through the diff-previous sieve of Pology.

Lokalize

PO message in Lokalize 0.2

Lokalize is the new dedicated PO editor (a general translation application in fact) for KDE 4, replacing KBabel in that role. The layout on the screenshot is only the default; you can rearrange the display and editing widgets in any way you like.

You can observe how Lokalize uses previous-fields to automatically show differences between the current and previous original (lower left pane, number 8). In the translation editing pane (right center, numbers 2 and 7), when a message is fuzzy it will give the text in italic, making it very easy for you to discern fuzzy from translated messages (though you can enable the more classical LEDs like in KBabel).

Gtranslator

PO message in Gtranslator 1.1.8

Gtranslator is a dedicated PO editor for Gnome. In the versions prior to the current 1.1.8 it was not able to open a PO file with msgctxt fields (since these are the newest addition to PO format), and in the current stable version it will open such files, but it will not display the content of msgctxt to the user (hence the red number 3 lower right). This is about to be implemented in the upcoming releases, as msgctxt is starting to get used in Gnome POs themselves.

Poedit

PO message in Poedit 1.4.1

Poedit is a multiplatform dedicated PO editor. It supports translation memories and plural forms. It can open PO files containing msgctxt fields, but does not display them to the user as of version 1.4.1. Source references can be seen by right-clicking on the message in the list.

Poedit's primary visual feature is its compact layout. It can also work in full-screen mode.

Contacting Authors

In the preceding text, we have mentioned several situations when you may want to get in contact with the authors of the content which you are translating. You could report typos and other problems in the original text, request addition of context (especially disambiguating contexts), point out when a plural message is needed, warn of sentences split through several messages, etc.

Obviously, you should contact the authors when you need something changed in the source (from which the PO template is produced and merged with your translated PO), to be able to translate the message properly. However, sometimes even if you can translate a given message just fine, there is still reason to request some modifications. For example, if you have understood the meaning of a difficult message only after you had looked into the code, you may want to tell the authors to add context into the PO file even if you yourself don't need it any more. Or, if there was a bad case of a split sentence which you were able to work around, you may nevertheless ask the authors to fix it properly. The rationale for this is simple: if translators from different languages all help to improve messages at the source, they are efficiently helping each other out. While you handle improvement of one message, other translators will have done the same for messages which will cross your path at a later point.

It depends on the translation environment which channels of communication are used for localization issues. For small projects you may simply contact the author directly by email; the contact address may even be given in the Report-Msgid-Bugs-To header field of the PO template. Larger projects will have a mailing list dedicated to localization and a bug tracker. In KDE, for example, you can either write directly to the mailing list, or file a bug report against the application which you are translating; in Gnome, filing a bug report seems to be the preferred way. In general, the less sure you are how the message should be improved, and how it may affect other languages, the more reason to write to the mailing list where the issue can be discussed with translators from other languages.

Once you get the correction through, bear in mind that it may not appear immediately (i.e. within the week or month) in the source and your merged PO. This is due to the so-called message freezes, the periods of time prior to the release of the source content (e.g. an application) when only changes of utmost urgency are accepted. Remember that modifying a message will make it fuzzy, which means untranslated for the consumer of the PO file. If a message were changed e.g. two days prior to the release, it would leave a day or less for dozens of language teams to update it. So, while the very next release may not contain the correction, one of those that follow will.