Localization/Tools/Pology/PO Embedded Diffing

Latest revision as of 16:35, 17 July 2012

Embedded Diffing with Pology

On Localization	Tools
Prerequisites	Pology
Related Articles	n/a
External Reading	n/a

Diffing of PO So Far

Line-level diffing assumes that the maximal generally recognizable unit of a text file is one line, that each line carries a meaning of significant standalone value, and that the ordering of lines has been deliberately chosen and cannot be arbitrary. This is typical of program code, for example.

Superficially, PO files could also be considered "a programming language of translations", and thus amenable to the same treatment when diffing. However, some of the above assumptions that make line-level diffing viable are violated by the PO format. Firstly, the minimal unit of a PO file is one message, whereas a single line is of negligible semantic value. Secondly, the ordering of messages can in principle be arbitrary (e.g. dependent on the order of extraction from source code), such that two PO files which differ greatly line-wise are actually equivalent from the translator's viewpoint. And thirdly, a good number of lines in a PO file are auxiliary, neither original text nor translation, generated either automatically or by the programmer (e.g. source references, extracted comments), all of which are outside the translator's scope for modification.

Thus, the common way to use line-level diffing with PO files so far has been only for review, and with some preparation. Due to the myriad of equivalent but line-wise different representations of PO content, it is quite useless to send line diffs as patches; translators are instructed to always send full PO files to the reviewer/committer, no matter the amount of modifications. Then, the reviewer merges the received PO file (new version), and possibly the original (old version), with the current template, without wrapping of long lines in text fields. This "normalizes" the old and new files with respect to all semantically non-significant elements, and only then can a line diff be performed. Additionally, since a long non-wrapped line of translated text may differ only in some segments, a dedicated diff viewer which can highlight word-level differences has to be used; ordinary diff syntax highlighting (e.g. in a shell, or in a general text editor) is not sufficient.

Even with such preparations and a dedicated diff viewer at hand, there is at least one significant case which is still not reasonably covered: when a fuzzy message which had previous fields (the PO was merged with the --previous option to msgmerge) has been updated and unfuzzied. For example:

old

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

diff

  #: main.c:110
- #, fuzzy
- #| msgid "The Record of The Witch River"
  msgid "Records of The Witch River"
- msgstr "Beleška o Veštičjoj reci"
+ msgstr "Beleške o Veštičjoj reci"

Here, the diff viewer will know to show word-level diff for modified translation, but it cannot know that it should also show word-level diff between the removed previous and current msgid fields, so that reviewer can see what has changed in the original text (i.e. why the message became fuzzy), and based on that judge the change in translation.

Finally, a dedicated PO editor may be able to show the truly proper, message-level difference (such as Lokalize can, operating in merge mode). Even then, however, there is still the need to send around full PO files, and, to some extent, to normalize them before comparing in the editor. The diff format becomes tied to and defined in terms of the given PO editor (which you may not want to use in general), instead of being intrinsically defined and modularly processable (such as line diffs are).

The rest of this article will therefore propose a format and semantics for self-contained, message-level diffing of PO files -- the embedded diff -- and present two Pology scripts which embody it as proof of concept (but are also quite practically applicable).

The Embedded Diff

The difference between two PO messages should be primarily, though not exclusively, composed of differences between their text fields (msgid, msgctxt, etc.). Then, to be easily judged, differences in text should be presented as succinctly and locally as possible -- think of a long paragraph where only some spelling or punctuation has changed. Finally, the format of the complete message diff should be easily comprehensible to translators used to the PO format itself and, as far as possible, to existing PO tools too.

Previous considerations lead to the following decision: PO message diff will also be a PO message. Or, in other words, the diff will be embedded into the regular parts of a PO message. An embedded diff (ediff for short) message should be at least syntactically a valid PO entry, if not always semantically (i.e. if part of PO file, it should pass msgfmt, though not necessarily msgfmt --check). To enable ediffs to be passed around as patches for PO files, the embedding should be automatically resolvable (up to significant message parts) to the old and new messages from which the diff was created.

In this way, if ediff messages are packed into a PO file (an ediff PO), existing PO tools can be used to review and modify the diff. For example, highlighting in a text editor will need only minimal upgrades to show the embedded differences (more on that below), and otherwise it will already highlight ediff message parts as usual.

To complete the definition of ediffs in detail, the following questions should be answered:

How to represent embedded differences in text?

Which parts of the message should be diffed?

How to pair for diffing messages from two files?

How to present collection of diffed messages?

Embedding Differences into Text

Once the difference between the old and new strings, the word-level difference, has been determined, it should be decided how to embed it into the new (or, equivalently, old) string. One possibility is that of wdiff(1), where removed and added text segments are by default wrapped with [-...-] and {+...+}, respectively:

old	"The Record of The Witch River"
new	"Records of The Witch River"
diff	"[-The Record-]{+Records+} of The Witch River"

However, search through PO files of several translation projects (KDE, Gnome, OpenOffice, Fedora, Mozilla) reveals the [- character combination to be frequently encountered, e.g. in synopsis of command usage. While there must exist an escaping mechanism for cases when diff wrappers are encountered in the original text, to enable unambiguous resolving of ediffs, it is nevertheless prudent to pick wrappers which reduce frequency of escaping (e.g. syntax highlighting may not be able to discern non-diff wrapper-like segments). Searching through same collection of PO files produces no hits for {-, so this combination is picked instead as wrapper of removed text segments:

old	"The Record of The Witch River"
new	"Records of The Witch River"
ediff	"{-The Record-}{+Records+} of The Witch River"

If the text itself contains a wrapper-like combination, it is escaped by inserting tilde (~) between the brace and plus/minus sign:

old	"Foo {+ bar"
new	"Foo {+ qwyx"
ediff	"Foo {~+ {-bar-}{+qwyx+}"

If the text already contains a substring with brace-tilde-plus/minus, then another tilde is inserted, and so on. Thus the ediff can be unambiguously resolved to old and new versions of the string. Inserting the tilde between the two characters of wrapper combinations also makes it easier on the syntax highlighting, as the difference highlighting trigger is automatically removed.
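
The escaping rule can be sketched in Python as follows. This is only an illustrative sketch, not Pology's actual code: the function names `escape_wrappers` and `unescape_wrappers` are hypothetical, and treating the closing combinations (`+}`, `-}`) symmetrically to the opening ones is an assumption.

```python
import re

# Opening wrapper-like sequence: "{", any number of tildes, then "+" or "-".
_OPEN = re.compile(r"\{(~*)([+-])")
# Closing wrapper-like sequence: "+" or "-", any number of tildes, then "}".
_CLOSE = re.compile(r"([+-])(~*)\}")


def escape_wrappers(text):
    """Insert one more tilde into every wrapper-like sequence of plain text."""
    text = _OPEN.sub(r"{~\1\2", text)
    return _CLOSE.sub(r"\1~\2}", text)


def unescape_wrappers(text):
    """Remove one tilde from every escaped wrapper-like sequence."""
    text = re.sub(r"\{~(~*)([+-])", r"{\1\2", text)
    return re.sub(r"([+-])~(~*)\}", r"\1\2}", text)
```

Escaping is applied to the plain text before difference wrappers are embedded, and unescaping after the wrappers have been resolved, so that the two never get confused.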

It may happen that a given string is not merely empty in the old or new PO message, but that it does not exist at all (e.g. msgctxt field). Thus an ediff can be made between existing and non-existing strings too, in which case a tilde is appended to the very end of the ediff:

old
new	"a-context-note"
ediff	"{+a-context-note+}~"

Again, escaping is provided for by inserting further tildes if the ediff between two existing strings would result in trailing tilde (old: "~", new: "foo~", ediff: "{+foo+}~~").
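
The trailing-tilde convention can be sketched like this. The helper names `mark_existence` and `read_existence` are hypothetical, and the rule used for reading (one trailing tilde means a missing string, two or more mean an escaped tilde) is an inference from the examples above rather than a documented algorithm.

```python
def mark_existence(ediff_body, old_exists, new_exists):
    """Finish an ediff string according to the trailing-tilde convention."""
    if not (old_exists and new_exists):
        # One of the strings does not exist at all: append the marker tilde.
        return ediff_body + "~"
    if ediff_body.endswith("~"):
        # Both strings exist, but the text ends in a tilde: escape it.
        return ediff_body + "~"
    return ediff_body


def read_existence(ediff):
    """Undo mark_existence(); return (ediff_body, both_strings_exist)."""
    trailing = len(ediff) - len(ediff.rstrip("~"))
    if trailing == 0:
        return ediff, True
    if trailing == 1:
        return ediff[:-1], False
    return ediff[:-1], True
```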

How exactly the difference between two strings is formed may be left to the implementation. In fact, an implementation may allow the translator to select between several diffing algorithms, depending on personal taste and situation. For example, the default algorithm of poediff does the following: words are diffed as wholes, all non-word segments (punctuation, markup tags, etc.) character by character, and equal non-word segments between different words are taken into the difference segment. Hence the above ediff

"{-The Record-}{+Records+} of The Witch River"

instead of smaller

"{-The -}Record{+s+} of The Witch River"

as the former is (tentatively) easier to comprehend.
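
A word-level ediff of this kind can be approximated with Python's standard `difflib`. The sketch below is not poediff's default algorithm: it simply diffs a token list in which words are whole tokens and every non-word character is its own token, and it does no escaping of wrapper-like text. The names `tokenize` and `word_ediff` are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into whole words and single non-word characters."""
    return re.findall(r"\w+|\W", text)


def word_ediff(old, new):
    """Embed the token-level difference of old and new into one string."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i2 > i1:  # tokens removed from the old string
            out.append("{-" + "".join(a[i1:i2]) + "-}")
        if j2 > j1:  # tokens added in the new string
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


print(word_ediff("The Record of The Witch River", "Records of The Witch River"))
# {-The Record-}{+Records+} of The Witch River
```

Because the space between "The" and "Record" lies inside the replaced run of tokens, it ends up in the removed segment, which gives the larger but more readable form shown above.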

Every difference segment in an ediff PO message will be represented like this, thus it is sufficient to upgrade PO syntax highlighting of an editor to indiscriminately highlight {-...-} and {+...+} segments everywhere in the message.

Message Parts Included in Diffing

A PO message consists of several types of parts: text fields, comments, flags, source references, etc. It would not be very constructive to diff all of them; for example, while msgstr fields should clearly be included into diffing, source references most probably should not. In order not to consider pros and cons of inclusion for each and every message part, there already exists a clear split of message parts into two groups, one of which will be taken into diffing, and the other ignored. These two groups are:

extraction-invariant parts -- those which do not depend on placement (or even presence) of message in the sources, such as msgid field, msgstr fields, manual comments, etc.

extraction-prescribed parts -- those which cannot exist independently of the source from which the message is extracted, such as format flags or extracted comments.

Extraction-invariant parts are the ones which will be diffed. Working definition of exactly which parts belong into this group is provided by obsolete messages in PO files. Thus, these parts are:

current original: msgctxt, msgid, and msgid_plural fields

previous original: commented #| msgctxt, #| msgid, and #| msgid_plural fields

translation: msgstr fields

translator (manual) comments

fuzzy state (whether the fuzzy flag is present)

obsolete state (whether the message is obsolete)
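
As a rough illustration of this split, a message could be modelled as below. The `Message` class and its attribute names are hypothetical and are not Pology's own message class; the grouping of attributes follows the two groups described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    # Extraction-invariant parts: these take part in diffing.
    msgid: str
    msgstr: List[str] = field(default_factory=list)
    msgctxt: Optional[str] = None
    msgid_plural: Optional[str] = None
    msgctxt_previous: Optional[str] = None
    msgid_previous: Optional[str] = None
    msgid_plural_previous: Optional[str] = None
    manual_comments: List[str] = field(default_factory=list)
    fuzzy: bool = False
    obsolete: bool = False
    # Extraction-prescribed parts: kept for context, never diffed.
    source_refs: List[str] = field(default_factory=list)
    auto_comments: List[str] = field(default_factory=list)
    format_flags: List[str] = field(default_factory=list)


INVARIANT = (
    "msgctxt", "msgid", "msgid_plural",
    "msgctxt_previous", "msgid_previous", "msgid_plural_previous",
    "msgstr", "manual_comments", "fuzzy", "obsolete",
)


def equal_by_invariant(a, b):
    """Compare two messages by their extraction-invariant parts only."""
    return all(getattr(a, name) == getattr(b, name) for name in INVARIANT)
```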

All the listed fields and manual comments are presented in ediff message as wrapped word-level differences, as described earlier. Change in states, fuzzy and obsolete, is represented slightly differently. A special "extracted" (automatic) comment is added to the ediff message, starting with ediff: and listing any extra info needed to describe the ediff, including the state changes. Here is an example of two messages and the ediff they would produce (whether two messages such as these would get paired for diffing in the first place, will be discussed later on):

old

#, fuzzy
#~| msgid "Accurate subpolar weather cycles"
#~ msgid "Accurate subpolar climate cycles"
#~ msgstr "Tačni ciklusi subpolarnog vremena"

new

#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate tropical weather cycles"
msgctxt "some-superfluous-context"
msgid "Accurate tropical climate cycles"
msgstr "Tačni ciklusi tropskog vremena"

ediff

#. ediff: state {-obsolete-}
#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate {-subpolar-}{+tropical+} weather cycles"
msgctxt "{+some-superfluous-context+}~"
msgid "Accurate {-subpolar-}{+tropical+} climate cycles"
msgstr "Tačni ciklusi {-subpolarnog-}{+tropskog+} vremena"

KWrite and Kate can display this ediff message with ediff-aware syntax highlighting, starting with KDE 4.2.

(In fact, ediff-aware PO syntax highlighting definition for Kate can be used with Kate versions as early as KDE 3.4. The definition file should simply be placed as ~/.kde/share/apps/katepart/syntax/gettext.xml, and it will override the system definition.)

First thing to note is that the ediff message contains not only extraction-invariant parts, but also copies verbatim extraction-prescribed parts from the new message. Effectively, the ediff is embedded into the copy of new message. Extraction-prescribed parts are not simply discarded in order to provide more context when reviewing the diff. Here, for example, the extracted comment states that the text is a checkbox label, which may be important for the style of translation.

The other important element is the #. ediff: dummy extracted comment, which here indicates that the obsolete state has been "removed", i.e. the message was unobsoleted going from old to new version. Aside from state changes, few other indicators may rarely be present, as described in API documentation of pology.misc.diff module, function msg_ediff. Some of these indicators will be mentioned later on. The ediff comment will be present only when necessary, if there are any indicators to show.

If diffing of two messages were always conducted part for part, for all parts which are taken into diffing, then in some cases the resulting ediff would not be very useful. Consider how the first example in the article, the line-level diff of a fuzzy and a translated message, would look as an ediff if performed part for part:

old

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

ediff

#. ediff: state {-fuzzy-}
#: main.c:110
#| msgid "{-The Record of The Witch River-}~"
msgid "Records of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

The same problem as with the line-level diff persists: instead of showing the difference from previous to current msgid field, the current msgid is left untouched, and previous msgid is simply shown to have been entirely removed.

Therefore, instead of straightforward, part for part diffing, a special transformation will take place when exactly one of the two diffed messages is fuzzy and equipped with previous original text fields. This splits into two directions: from fuzzy to non-fuzzy, and from non-fuzzy to fuzzy.

Diffing from fuzzy to non-fuzzy message, such as above, is the more usual of the two directions. It is typical of translation updated after merging with template, where some fuzzied messages will have been resolved. In this case, the original old and new messages are transformed thusly prior to diffing (*-rest denotes all diffed parts that are neither original text nor fuzzy state):

old

fuzzy                   -->     fuzzy
old-previous-fields     -->     old-previous-fields
old-current-fields      -->     old-previous-fields
old-rest                -->     old-rest

new

-                       -->     -
-                       -->     old-current-fields
new-current-fields      -->     new-current-fields
new-rest                -->     new-rest

In this way, the ediff message's current fields will show the important difference, that between the previous fields of the old fuzzy message and the current fields of the new non-fuzzy message. The ediff message's previous fields will show the less important difference between the old fuzzy message's previous and current fields, but only if it is not equal to the difference in current fields; otherwise it is eliminated. This may sound confusing, but the final ediff produced in this way is quite intuitive:

old

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

ediff

#. ediff: state {-fuzzy-}
#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

The reviewer here sees that the message was unfuzzied, the change in the original text that caused the message to become fuzzy, and the change made in translation to unfuzzy it. Old version (in removed and equal segments) of original and translation is that of the message before it got fuzzied, and new version (in added and equal segments) is that of the message after it was unfuzzied.

From non-fuzzy to fuzzy message should be the less frequent direction of diffing. It corresponds e.g. to case where the diff is taken from older fully complete translation to the one just after merging with newest template. In this case, the transformation is as follows:

old

-                       -->     -
-                       -->     new-previous-fields
old-current-fields      -->     old-current-fields
old-rest                -->     old-rest

new

fuzzy                   -->     fuzzy
new-previous-fields     -->     new-current-fields
new-current-fields      -->     new-current-fields
new-rest                -->     new-rest

Again, the difference presented by the ediff message's current fields will be the most important one, and the difference in previous fields the less important one, eliminated if equal to the other. Here is what this will do if applied to one step earlier (just after merging) of the running example:

old

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

ediff

#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "Beleška o Veštičjoj reci"

The reviewer here sees that the message became fuzzy going from old to new, and the change in the original text that caused this. (On a side note, remember that the ediff message is constructed by embedding differences into a copy of the new message, so the source reference and the fuzzy flag from the new message appear here in the ediff message.)
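
The two transformations can be sketched together in Python. Here each message is reduced to a dict with the hypothetical keys `fuzzy`, `previous` and `current` (the original text fields collapsed into one string each). Comparing the two pairs of strings stands in for comparing the two differences, which is a simplification.

```python
def transform_for_diffing(old, new):
    """Return transformed copies of old and new, following the two tables."""
    old, new = dict(old), dict(new)
    old_pivot = old["fuzzy"] and old["previous"] is not None
    new_pivot = new["fuzzy"] and new["previous"] is not None
    if old_pivot and not new["fuzzy"]:
        # From fuzzy to non-fuzzy.
        new["previous"] = old["current"]
        old["current"] = old["previous"]
    elif new_pivot and not old["fuzzy"]:
        # From non-fuzzy to fuzzy.
        old["previous"] = new["previous"]
        new["previous"] = new["current"]
    return old, new


def fields_to_diff(old, new):
    """Return (current_pair, previous_pair); previous_pair is None if dropped."""
    told, tnew = transform_for_diffing(old, new)
    current = (told["current"], tnew["current"])
    previous = (told["previous"], tnew["previous"])
    if previous == current:
        previous = None  # equal to the difference in current fields
    return current, previous
```

For the running example, both directions yield the single pair ("The Record of The Witch River", "Records of The Witch River") for the current fields, and no separate difference for the previous fields.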

To be able to use the ediff as a patch, it is necessary to reconstruct the original old and new messages after resolving the ediff into transformed old and new messages. This step is fortunately unambiguous; one just needs to check whether the non-fuzzy of the two resolved messages has previous fields, or, if not (due to elimination of equal difference in previous fields, or because the template was merged without the --previous option), whether the current fields are equal. Then the back-transformation to the original old and new messages can be performed.

Pairing Messages From Two Catalogs

So far the article has described how to make an embedded diff out of two messages, once it has been decided that those messages should be diffed. However, on the lowest level, the user decides not which messages to diff, but which two PO files to diff. The implementation should then automatically pair messages from the two catalogs for diffing, and this section describes several possible ways to do this.

In the first instance, messages should obviously be paired by key (primary pairing), the unique combination of msgctxt and msgid fields. In the most usual case -- reviewing an ediff from incomplete catalog with some fuzzy and untranslated messages, to an updated catalog with some or all of those messages translated -- pairing by key will be fully sufficient, as both catalogs contain exactly the same set of messages by keys. These two messages will be plainly paired by key:

old

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

But what should happen if some messages cannot be paired by key? Consider the earlier example where diff was taken from older fully translated, to newer merged catalog:

old

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

The keys, here just the current msgid fields, of the two messages do not match, so they cannot be paired by key. Yet it would be rather ungainly to represent the old message as fully removed in the ediff, and the new message as fully added:

ediff

#: main.c:89
msgid "{-The Record of The Witch River-}~"
msgstr "{-Beleška o Veštičjoj reci-}~"

#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
#| msgid "{+The Record of The Witch River+}~"
msgid "{+Records of The Witch River+}~"
msgstr "{+Beleška o Veštičjoj reci+}~"

(That the message has been fully added or removed can be seen by trailing tilde in the msgid field, which indicates that the old or new msgid does not exist at all, and so neither the message with it.)

Instead, messages left unpaired by key should be tested for pairing by pivoting around previous fields (secondary pairing). The two messages above will thus be paired due to the fact that the current msgid of the old message is equal to the previous msgid of the new message, and will produce a single ediff message as shown earlier.
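
Primary and secondary pairing can be sketched as below. Messages are plain dicts with the hypothetical keys `msgctxt`, `msgid`, `prev_msgctxt` and `prev_msgid`; the function name and return shape are assumptions made for illustration, and only the pivoting direction described above (old current equal to new previous) is implemented.

```python
def msg_key(msg):
    """Key of a message: the combination of msgctxt and msgid."""
    return (msg.get("msgctxt"), msg["msgid"])


def previous_key(msg):
    """Key formed from previous fields, or None if there are none."""
    if msg.get("prev_msgid") is None:
        return None
    return (msg.get("prev_msgctxt"), msg["prev_msgid"])


def pair_messages(old_msgs, new_msgs):
    """Return (old, new) pairs; None stands for a missing message."""
    new_by_key = {msg_key(m): i for i, m in enumerate(new_msgs)}
    new_by_prev = {}
    for i, m in enumerate(new_msgs):
        pkey = previous_key(m)
        if pkey is not None:
            new_by_prev.setdefault(pkey, i)
    used, pairs, leftover_old = set(), [], []
    # Primary pairing, by key.
    for om in old_msgs:
        i = new_by_key.get(msg_key(om))
        if i is not None:
            pairs.append((om, new_msgs[i]))
            used.add(i)
        else:
            leftover_old.append(om)
    # Secondary pairing, by pivoting around previous fields.
    for om in leftover_old:
        i = new_by_prev.get(msg_key(om))
        if i is not None and i not in used:
            pairs.append((om, new_msgs[i]))
            used.add(i)
        else:
            pairs.append((om, None))
    # New messages left unpaired count as fully added.
    pairs.extend((None, m) for i, m in enumerate(new_msgs) if i not in used)
    return pairs
```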

Finally, consider the third related combination, when the old catalog has not yet been merged with the template, while the new catalog has both been merged and its translation updated:

old

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

Once again, it would be a waste to present the old message as fully removed and the new message as fully added in the ediff. When a message is left unpaired after both pairing by key and pairing by pivoting, then the two catalogs should be merged in the background -- as if the new is the template for the old, and vice versa -- and then tested for chained pairing by pivoting and by key with merged catalog as intermediary. This pairing by merging, or tertiary pairing, will then produce another natural ediff:

ediff

#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

Implementations can decide which pairing modes beyond the primary, by key, to use. There should not be much reason not to perform secondary pairing, by pivoting, too. If tertiary pairing, by merging, is implemented, the user should be able to disable it (sometimes this pairing may produce strange results, and it needs msgmerge to be available on the system).

Collecting Diffed Messages

For ediff of two PO files to also be a syntactically valid PO file, constructed ediff messages should be preceded by a PO header on output. At first glance, this PO header could be itself the ediff of PO headers of the catalogs which were diffed. However, there are several issues with this approach:

The reviewer of the ediff PO file would not be informed at once whether there was any difference between the headers. Headers tend to be long, and a point change in one of the fields may go visually unnoticed.

Depending on the amount of changes between the two headers, the resulting ediff message of the header could be too badly formed to represent the header as such (e.g. if header fields in msgstr were added or removed, embedded difference wrappers would invalidate MIME-header format of msgstr), and thus confuse the PO tools.

How would the diff of two collections of PO files (e.g. directories) be packed into a single ediff PO, such as can normally be done with line-level diffing?

To avert these difficulties, the following is done instead. First, the header of ediff PO is constructed as a minimal valid header (i.e. one that msgfmt would not complain about) independent of the content of original headers. poediff will produce something like:

# +- ediff -+
msgid ""
msgstr ""
"Project-Id-Version: ediff\n"
"PO-Revision-Date: 2009-02-08 01:20+0100\n"
"Last-Translator: J. Random Translator\n"
"Language-Team: Differs\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Ediff-Header-Context: ~\n"

PO-Revision-Date header field is naturally set to the date when the ediff is made. Last-Translator and Language-Team can be somehow pulled from environment (poediff will source them from ~/.pologyrc, or set some dummy values as above if not present). Encoding of the ediff PO can be chosen at will by the implementation, so long as all following ediff messages can be encoded with it (poediff will always use UTF-8). The purpose of the X-Ediff-Header-Context field will be explained shortly.

It is the very next message in the ediff PO that will actually be the ediff of headers of the two diffed PO files. Headers are diffed just like any other message, but the resulting ediff is equipped with a few extra decorations:

# =========================================================
# Translation of The Witch River into Serbian.
# Koja Kojic <[email protected]>, 2008.
# {+Era Eric <[email protected]>, 2008.+}~
msgctxt "~"
msgid ""
"- l10n-wr/sr/wriver-main.po\n"
"+ l10n-wr/sr-mod/wriver-main.po\n"
msgstr ""
"Project-Id-Version: wriver 0.1\n"
"POT-Creation-Date: 2008-09-22 09:17+0200\n"
"PO-Revision-Date: 2008-09-{-25 20:44-}{+28 21:49+}+0100\n"
"Last-Translator: {-Koja Kojic <koja.kojic@nedohodnik-}"
"{+Era Eric <era.eric@ledopad+}.net>\n"
"Language-Team: Serbian\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

First observe the normal ediff segments: translator comment with a new translator who updated the PO file has been added, and PO-Revision-Date and Last-Translator header fields contain ediffs reflecting the update. These are the only actual differences between the two headers.

More interesting are the extra decorations:

The very first translator comment (here a long line of equality signs) can be anything, and serves to give good visual distinction to the header ediff. This is all the more convenient when the ediff PO contains diffs of several pairs of PO files.

That a particular message in the ediff PO is a header ediff is indicated by the msgctxt set to a special value, here a single tilde. This value is given up front by the X-Ediff-Header-Context field of the ediff PO header. It should be computed during diffing such that it does not conflict with the msgctxt of any of the normal message ediffs (e.g. it may simply be a sufficiently long sequence of tildes).

The msgid field of the header ediff contains newline-separated paths of the diffed PO files. More precisely, the two lines of the msgid field are in form
```
[+-] file-path[ <<< comment]\n
```
(The trailing newline of the second file path is elided if the msgstr does not end in newline, to prevent msgfmt from complaining.) The optional, <<<-separated comment to the file path can be used for any purpose, one of which will be demonstrated when describing the functionality of poediff.
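
The last two decorations lend themselves to a short sketch. Both function names below are hypothetical: `pick_header_context` grows a sequence of tildes until it no longer collides with an existing msgctxt, and `parse_header_paths` reads the file lines of a header ediff's msgid in the form given above. Allowing an empty file path in the pattern is an assumption.

```python
import re

# One line of the msgid: sign, file path, optional <<<-separated comment.
_PATH_LINE = re.compile(r"^([+-]) ?(.*?)(?: <<< (.*))?$")


def pick_header_context(existing_msgctxts):
    """Return a sequence of tildes not used as msgctxt by any message."""
    context = "~"
    while context in existing_msgctxts:
        context += "~"
    return context


def parse_header_paths(msgid):
    """Return a list of (sign, file_path, comment) tuples from the msgid."""
    entries = []
    for line in msgid.splitlines():
        match = _PATH_LINE.match(line)
        if match is None:
            raise ValueError("not a header ediff file line: %r" % line)
        entries.append(match.groups())
    return entries
```

A missing comment is reported as `None`, so that it can be told apart from an empty one.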

Although when a PO file is properly updated there should always be some difference in its header, it may happen that there is none. In such case, the header ediff message is still added, but such that it only contains the extra decorations -- the visual separator comment, special msgctxt, and msgid with file paths. All other comments and msgstr should be empty, and the empty msgstr indicates that there was no difference in headers. Presence of this "empty" header ediff is necessary to provide diffed file paths and, if several POs were diffed, to separate them in the ediff PO.

After the header ediff message, ordinary ediff messages follow. As already obvious by now, when several POs were diffed to construct a single ediff PO, each next PO in the ediff simply opens with a new header ediff message.

Especially when diffing several PO files, it may happen that two ediff messages have equal keys (msgid and msgctxt fields) and thus cannot be both added as such to the ediff PO. In such case, the ediff message which was added after the first with the same key, will have its msgctxt field padded by few random alphanumerics, such as to make its key unique. This padding sequence will be recorded in the ediff comment. For example:

# =========================================================
msgctxt "~"
msgid "...(first PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"

# =========================================================
msgctxt "~"
msgid "...(second PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}, ctxtpad q9ac3
msgctxt "|q9ac3~"
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"

The padding sequence is appended to the original msgctxt, separated by the pipe character. If there was no original msgctxt, the padding sequence is further extended by a tilde.
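
A minimal sketch of this padding step follows. The function name `pad_context`, the pad length of five characters and the use of Python's `random` module are assumptions chosen to mirror the example above, not a description of poediff's internals.

```python
import random
import string

_ALPHABET = string.ascii_lowercase + string.digits


def pad_context(msgctxt, msgid, taken_keys, rng=random):
    """Return (padded_msgctxt, pad) so that the key is not in taken_keys."""
    while True:
        pad = "".join(rng.choice(_ALPHABET) for _ in range(5))
        if msgctxt is None:
            # No original context: pipe, pad, and an extra trailing tilde.
            padded = "|" + pad + "~"
        else:
            # Original context, then pipe and pad.
            padded = msgctxt + "|" + pad
        if (padded, msgid) not in taken_keys:
            return padded, pad
```

The returned pad is what would be recorded after `ctxtpad` in the ediff comment.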

Producing Ediffs: `poediff`

Pology's poediff script implements embedded diffing as outlined in the previous section. To diff two PO files, executing the usual:

$ poediff orig/foo.po mod/foo.po

will write the ediff PO (if there is any difference) to standard output, with some basic shell highlighting of difference segments. Option -o can be used to output the ediff PO to file instead. Other options include --no-merge (-n) to not perform pairing by merging, and --strip-headers (-s) to take a quick look at differences without headers in the way (output with stripped headers is not a valid PO, and cannot be used as patch).

It is equally simple to make ediff of directories:

$ poediff orig/ mod/

Diffing of directories is by default recursive and takes into ediff added and removed catalogs. When a catalog has been added or removed, the msgid of corresponding header ediff will have one of the file paths, new or old, empty.

To have non-dummy translator name and email address added to the header of ediff PO, set them in Pology's configuration file, ~/.pologyrc:

[user]
name = Koja Kojic
original-name = Која Којић
email = [email protected]

(These entries cover all contexts in Pology where such information is of use; original-name field is used when the name is to be written differently in English and translator's native language.)

This is pretty much where the story about poediff would end, if not for its ability to take into account the underlying VCS, if one is used to control PO files.

Diffing With Underlying VCS

When PO files are under version control, poediff can operate in VCS mode using the option -c VCSKEY (--vcs), where VCSKEY is the keyword of one of VCS known to poediff. Then, instead of giving two paths to diff, any number of version-controlled paths (files or directories) is given. Without other options, all modified PO files in these paths are diffed against the last commit known to local repository. For example, if an application uses Subversion repository, PO files in its po/ directory can be diffed with:

$ poediff -c svn app/po/

Some other set of revisions to diff can be given by the option -r REV1[:REV2] (--revision). REV1 and REV2 are not necessarily proper revision IDs, but any strings that the underlying VCS will be able to convert into revision IDs. If REV2 is omitted, diffing is performed from REV1 to the current working copy.

When ediff is made in VCS mode, msgid fields of header ediffs will use <<<-separated comments to file paths to indicate revision IDs:

# =========================================================
# ...
msgctxt "~"
msgid ""
"- app/po/lang.po <<< 20537\n"
"+ app/po/lang.po"
msgstr "..."

As for supported VCS, Pology currently knows about Subversion (svn) and Git (git). To add a new VCS, its functionality should be wrapped for Pology's use in pology.misc.vcs, using the interface defined by VcsBase class.

Ediffs as Patches: `poepatch`

For basic functionality, applying an ediff patch is implementationally even easier than applying a line-level patch. Each ediff message in turn is resolved into originating old and new message, and if either the old or new message exists in target PO and is equal by extraction-invariant parts, then the message patch is applied, otherwise rejected.

Applying the patch means overwriting extraction-invariant parts of the target message with those of the new message from the ediff, and leaving other parts untouched. If the target message is already equal to new message by extraction-invariant parts, then the patch is silently ignored. This means that if the same patch is applied twice, second application makes no modifications to the target catalog. Likewise if, by chance, the changes given by the patch were already independently introduced in the target catalog.
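The apply, ignore and reject decisions described above can be sketched in Python as follows. This is a simplified model, not poepatch's implementation: the Msg class and function names are hypothetical, and only msgctxt, msgid, msgstr and the fuzzy state are treated as extraction-invariant, which is a reduced subset chosen for brevity.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Msg:
    msgid: str
    msgstr: str
    fuzzy: bool = False
    msgctxt: Optional[str] = None
    source_ref: str = ""  # extraction-prescribed part, left untouched


def invariant(m: Msg):
    """Extraction-invariant parts (simplified subset, an assumption)."""
    return (m.msgctxt, m.msgid, m.msgstr, m.fuzzy)


def apply_message_patch(target: List[Msg], old: Msg, new: Msg) -> str:
    """Apply one resolved ediff message to the target list in place."""
    # Target already equals the new message: nothing to do.
    if any(invariant(m) == invariant(new) for m in target):
        return "ignored"
    for i, m in enumerate(target):
        if invariant(m) == invariant(old):
            # Overwrite invariant parts only; keep source_ref as it was.
            target[i] = replace(m, msgctxt=new.msgctxt, msgid=new.msgid,
                                msgstr=new.msgstr, fuzzy=new.fuzzy)
            return "applied"
    return "rejected"
```

Applying the same old/new pair twice returns "applied" the first time and "ignored" the second, which mirrors the idempotence described in the text.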

The command-line interface of poepatch is much like that of patch(1), sans the myriad of its more obscure options. There is the -p option to strip leading elements of file paths in the ediff, and the -d option to prepend to them a directory path where target POs are to be found. If the files were diffed with underlying VCS as in the previous example, then the ediff could be applied in any of the following ways:

$ cd repos/app/po && poepatch <ediff.po
$ cd repos/ && poepatch -p0 <ediff.po
$ poepatch -d repos/app/po <ediff.po
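How the three invocations arrive at the same target file can be sketched in Python. The helper below is hypothetical and models the behavior on patch(1): with no -p the bare file name is used, with -pN the first N path elements are stripped, and the -d directory is put in front of the result. This model is an assumption that is consistent with the three commands above, not a description of poepatch internals.

```python
import posixpath
from typing import Optional


def resolve_target(path: str, strip: Optional[int] = None,
                   directory: Optional[str] = None) -> str:
    """Resolve a path from the ediff header to a target PO path.

    Hypothetical model of -p (strip) and -d (directory), after patch(1).
    """
    parts = path.split("/")
    if strip is None:
        rel = parts[-1]               # no -p: use the file name only
    else:
        rel = "/".join(parts[strip:])  # -pN: drop N leading elements
    return posixpath.join(directory, rel) if directory else rel
```

With the header path app/po/lang.po, the calls resolve_target(p), resolve_target(p, strip=0) and resolve_target(p, directory="repos/app/po") give lang.po, app/po/lang.po and repos/app/po/lang.po respectively, matching the working directories used in the three commands.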

Header patch (coming from the header ediff message) is applied in a slightly more relaxed fashion: some of the header fields are ignored when checking whether the patch is applicable. These are the fields which are known to be volatile as the PO file goes through different translators, and do not influence the processing of the catalog directed by the header (such as encoding or plural forms). Currently, these fields are: POT-Creation-Date, PO-Revision-Date, Last-Translator, X-Generator. When a header patch is accepted, these fields in the target header are overwritten with those from the patch (including being added or removed).
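The relaxed header check can be modeled with a short Python sketch. Headers are represented here as plain dictionaries, which is a simplification (field order is not preserved), and the function names are hypothetical. The list of volatile fields is the one given above.

```python
from typing import Dict, Optional

VOLATILE_FIELDS = {
    "POT-Creation-Date", "PO-Revision-Date", "Last-Translator", "X-Generator",
}


def stable(header: Dict[str, str]) -> Dict[str, str]:
    """Header fields that take part in the applicability check."""
    return {k: v for k, v in header.items() if k not in VOLATILE_FIELDS}


def apply_header_patch(target: Dict[str, str], old: Dict[str, str],
                       new: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the patched header, or None if the patch is rejected."""
    if stable(target) not in (stable(old), stable(new)):
        return None
    # Accepted: volatile fields come from the patch as well, so fields
    # absent from the new header disappear from the result.
    return dict(new)
```

In this model a target whose PO-Revision-Date differs from the old header still accepts the patch, while a target with a different Content-Type does not.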

Handling Rejected Ediffs

Any rejected ediff messages will be written out to stdin.rej.po if the patch was read from standard input, or to <catname>.rej.po if it was given as file through -i option (e.g. ediff.rej.po for input ediff.po).
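The naming rule can be written down as a small Python sketch. The function name is hypothetical, and the sketch returns only the file name, since the text above does not say in which directory the rejects file is placed.

```python
import os
from typing import Optional


def rejects_filename(ediff_path: Optional[str]) -> str:
    """Name of the rejects file for a given -i path (None means stdin)."""
    if ediff_path is None:
        return "stdin.rej.po"
    base = os.path.basename(ediff_path)
    # <catname> is taken to be the file name without the .po extension.
    catname = base[:-3] if base.endswith(".po") else base
    return catname + ".rej.po"
```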

File with rejected ediff messages will again be an ediff PO file. It will have the header as before, except that its title comment will mention, in free prose, that this particular ediff PO contains rejects of some patching operation. Afterwards, ediff messages rejected as patch will follow. Header ediff messages will be present whether rejected or not, for the same purpose of separation and provision of file paths, but they will be stripped of comments and msgstr when the header patch itself was not rejected.

Furthermore, to every straight-out rejected ediff message an ediff-no-match flag will be added. This is done, naturally, because some ediff messages may not be rejected straight-out, but go through some post-processing instead. Consider the following scenario. A catalog has been merged to produce the fuzzy message:

old

#: tools/power.c:348
msgid "Active sonar low frequency"
msgstr "Niska frekvencija aktivnog sonara"

new

#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency"
msgstr "Niska frekvencija aktivnog sonara"

Translator updates the translation, which produces the usual ediff message on update from fuzzy to translated:

ediff

#. ediff: state {-fuzzy-}
#: tools/power.c:361
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

However, before this patch could have been applied, the programmer adds a trailing colon to the same message, and the catalog is merged again to produce:

new2

#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency:"
msgstr "Niska frekvencija aktivnog sonara"

The patch can no longer be cleanly applied, due to the extra colon added in the meantime to the msgid, so it has to be rejected. If nothing else is done, it would appear in the file of rejects as:

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-no-match
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

It seems a bit wasteful to reject such a near-match patch without any indication that it could be easily adapted to suit the latest message in the target PO. Therefore, when an ediff message is rejected, the following analysis is performed: by trying out message pairings as described for diffing, could the old message from the patch be paired with a current message from the target PO, and that current message with the new message from the patch? Or, in other words, can an existing message in the target PO be "fitted in between" the old and new messages defined by the patch? If this is the case, instead of the original, two special ediff messages -- split rejects -- are constructed and written out: one from the old to the current message, and another from the current to the new message. They are flagged as ediff-to-cur and ediff-to-new, respectively:

#: tools/power.c:361
#, fuzzy, ediff-to-cur
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency{+:+}"
msgstr "Niska frekvencija aktivnog sonara"
⁠
#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-to-new
#| msgid "Active sonar {-low-}{+high+} frequency{+:+}"
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

There are more ways to interpret split rejects, depending on the circumstances. In this example, from ediff-to-cur message reviewer can see what had changed in the target message after the translator made the ediff. This can also be seen by comparing difference embedded into previous and current msgid fields in the ediff-to-new message. With slightly more extensive editing, the reviewer can fold these two messages into an applicable patch:

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff
msgid "Active sonar {-low-}{+high+} frequency:"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

Given that the file of rejected ediffs is also an ediff PO, after edits to make some of the rejected patches applicable, it can be reapplied as patch. If that is done, poepatch will silently ignore all ediff messages having ediff-no-match and ediff-to-new flag, as these have already been determined inapplicable. That is why in the example above the reviewer has replaced the ediff-to-new with the plain ediff flag on the folded ediff.
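The "fitting in between" analysis that decides whether a reject is split can be approximated in Python. The sketch below is only an analogy: it compares msgid strings with difflib's similarity ratio and an arbitrary threshold of 0.8, both of which are assumptions standing in for Pology's actual pairing method, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def find_in_between(old_msgid: str, new_msgid: str,
                    target_msgids: Iterable[str],
                    threshold: float = 0.8) -> Optional[str]:
    """Pick a current msgid that pairs with both the old and new msgid.

    Returns None when no candidate is similar enough to both, in which
    case the reject would stay a straight-out one.
    """
    best, best_score = None, 0.0
    for cur in target_msgids:
        to_old = SequenceMatcher(None, old_msgid, cur).ratio()
        to_new = SequenceMatcher(None, cur, new_msgid).ratio()
        score = min(to_old, to_new)
        if score >= threshold and score > best_score:
            best, best_score = cur, score
    return best
```

For the sonar example, the msgid with the trailing colon is close enough to the msgid carried by the patch to be selected, whereas an unrelated message in the same catalog is not.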

Embedding Patches

Depending on the type of text which is being translated, and distance of translation language's grammar, orthography and style from English, it may be difficult to review an ediff in isolation. In general, messages in ediff PO will lack positional context, which is in the full PO provided by messages immediately preceding and following the target message. For example, a long passage from documentation probably needs no positional context. But a short newly added message such as "Crimson" could very well need one, if it has neither msgctxt nor extracted comment describing it. Is it really a color? What grammatical ending should it have, in a language which matches adjective to noun gender? Several messages around it in the full PO could easily show whether it is just another color in a row, and their grammatical endings (determined by a translator earlier) would show the needed ending for the new color.

Then, if an ediff message needs some editing before being applied, it may not be easy to do this directly in the ediff PO. Everything is fine so long as only added text segments ({+...+}) are edited, but if the sentence needs to be restructured more thoroughly, reviewer would have to make sure to put all additions into existing or new {+...+} segments, and to wrap all removals as {-...-} segments. If this is not carefully performed, the patch will not be applicable any more, as old message resolved from it will no longer exactly match a message in the target PO.
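To see why careless editing breaks the patch, it helps to look at how an embedded difference resolves into its old and new text. The Python sketch below does this for the {-...-} and {+...+} wrappers only. It deliberately ignores any escaping of literal wrapper sequences, so it is a simplified illustration rather than the resolution that poepatch performs, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Either a removed segment {-...-} or an added segment {+...+}.
_SEGMENT = re.compile(r"\{-(.*?)-\}|\{\+(.*?)\+\}", re.S)


def resolve_ediff_text(text: str) -> Tuple[str, str]:
    """Return (old, new) text resolved from an embedded difference."""
    old, new = [], []
    pos = 0
    for m in _SEGMENT.finditer(text):
        plain = text[pos:m.start()]   # unwrapped text belongs to both
        old.append(plain)
        new.append(plain)
        if m.group(1) is not None:
            old.append(m.group(1))    # removed segment: old text only
        else:
            new.append(m.group(2))    # added segment: new text only
        pos = m.end()
    tail = text[pos:]
    old.append(tail)
    new.append(tail)
    return "".join(old), "".join(new)
```

Any edit made outside the wrappers changes both resolved versions, so the old version no longer equals the message in the target PO and the patch is rejected.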

For these reasons, poepatch can apply the patch such as not to resolve the ediff, but to set all its extraction-invariant fields to the message in the patched PO. In effect, the patched PO becomes an ediff PO by itself, but only in the messages which were actually patched. To mark these messages for lookup, ediff flag is added to them. If the message in the ediff PO was:

#: title.c:274
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"

then when the patch is successfully applied with embedding, the patched message in target PO will look like this:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
⁠
#: title.c:292
#, ediff
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"
⁠
#: title.c:328
msgid "Start the Expedition"
msgstr "Pođi u ekspediciju"

Other than the addition of the ediff flag, note that the patched message also kept its source reference, rather than having it overwritten by the one from the ediff PO. The same holds for all extraction-prescribed parts.

The reviewer can now jump from ediff to ediff flag, always having the full positional context for each patched message, and being able to edit it to heart's content, with only minimal care not to invalidate the ediff format. Wrapped difference segments can be entirely removed, non-wrapped segments can be freely edited; it should only not happen that a wrapped segment loses its opening or closing sequence. But this does not mean that the reviewer needs to remove or touch difference segments at all, that is, to unembed patched messages by hand -- poepatch will do that automatically, when run on embedded-patched POs with a particular option.

A patch is applied with embedding by issuing the -e (--embed) option (E in parenthesis in progress output indicates that embedding is engaged):

$ poepatch -e <ediff.po
patched (E): lang.po
$

When the patched PO has been reviewed and patched messages possibly edited, all remaining embedded differences are removed, i.e. resolved to new versions, by running:

$ poepatch -u lang.po

Only those messages having the ediff flag are resolved, therefore the reviewer should never remove the flag (unless manually unembedding the whole patched message).

What happens with rejected patches when embedding is engaged? They are also added into the target PO, with heuristic positioning, and no separate ediff PO with rejects is created. Same as with plain patching, straight-out rejects will have ediff-no-match flag, and split rejects ediff-to-cur and ediff-to-new. If these are not manually resolved during the review (ediff-no-match messages removed, ediff-to-* messages removed or folded), when poepatch is run to unembed the patch, it will remove all ediff-no-match and ediff-to-new messages, and resolve ediff-to-cur messages to new version (effectively rejecting patches from which split rejects had originated).
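The per-flag behavior on unembedding can be summarized in a Python sketch. Messages are modeled as (flags, msgid, msgstr) tuples, and the names are hypothetical. Escaping of literal wrapper sequences and state changes recorded in #. ediff: comments are left out, so this only illustrates the policy described above.

```python
import re
from typing import List, Set, Tuple

Message = Tuple[Set[str], str, str]  # (flags, msgid, msgstr)


def to_new(text: str) -> str:
    """Resolve embedded differences to the new version of the text."""
    text = re.sub(r"\{-.*?-\}", "", text, flags=re.S)
    return re.sub(r"\{\+(.*?)\+\}", r"\1", text, flags=re.S)


def unembed(messages: List[Message]) -> List[Message]:
    out = []
    for flags, msgid, msgstr in messages:
        if flags & {"ediff-no-match", "ediff-to-new"}:
            continue  # unresolved rejects are dropped from the catalog
        if flags & {"ediff", "ediff-to-cur"}:
            # Resolve to the new version and drop the marker flag.
            flags = flags - {"ediff", "ediff-to-cur"}
            msgid, msgstr = to_new(msgid), to_new(msgstr)
        out.append((flags, msgid, msgstr))
    return out
```

An ediff-to-cur message resolves to the message that was already current in the target PO, which is why leaving split rejects unresolved amounts to rejecting the original patch.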

Lightweight Diffing when Updating Translation

Rather than for reviewing changes and sending patches, the translator may want to have only a convenient diff between the previous and new original text when updating a catalog merged with --previous option. Repeating the (by now) usual example message:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

Obviously, a dedicated PO editor could automatically present appropriate diff for messages like these (e.g. Lokalize can do it since inception), so for translators using such an editor, diffing of fuzzies is either already available or a strong candidate for a feature request. The rest of this section therefore applies only to translators who stick to general text editors (with or without some power-assists for editing PO files).

One could use poediff and poepatch for the purpose of diffing of fuzzies, by first diffing from the previous complete catalog to the new merged catalog, embedding the patch into the merged catalog, updating the translation, and then unembedding ediffs. One drawback of this, however, is that two versions of the catalog are necessary, when the merged one alone contains all the information. Further drawbacks are that the catalog has to be cleared of ediffs after updating, and that each diffed message contains more information than necessary (ediff flag, #. ediff: comment) since all ediffs are of one and the same type.

In short, full embedded diffing is too heavyweight as an assist to updating fuzzy messages. Instead, there is a Pology sieve intended for this particular purpose, diff-previous, which embeds the difference from previous to current original text into previous fields. Executing:

$ posieve diff-previous lang.po

will modify fuzzy messages in-place in the processed catalog, to the following partial ediff:

#: main.c:110
#, fuzzy
#| msgid "{-The Record-}{+Records+} of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

The format of embedding is the same as before, so the ediff-aware syntax highlighting works here too. When the message is updated and properly unfuzzied (both fuzzy flag and previous fields removed), none of the ediff remains, so there is no need for post-unembedding. In other words, the translator updates the translation just as before, with the free benefit of having the ediff of the original text.
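The kind of word-level embedding that appears in the previous field can be imitated with Python's difflib. This is a rough sketch under stated assumptions, not the algorithm used by the diff-previous sieve: it splits on whitespace, joins words with single spaces, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def embed_word_diff(old: str, new: str) -> str:
    """Embed a word-level difference from old to new into one string."""
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = " ".join(old_words[i1:i2])
        added = " ".join(new_words[j1:j2])
        if tag == "equal":
            pieces.append(removed)
            continue
        piece = ""
        if removed:
            piece += "{-" + removed + "-}"
        if added:
            piece += "{+" + added + "+}"
        pieces.append(piece)
    return " ".join(pieces)
```

Called with the previous and current msgid of the example message, it produces the same text as the previous field shown above. Because whitespace is normalized, the sketch does not guarantee that the original strings can be recovered exactly for arbitrary input.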

Note that no embedding should remain in previous fields by the time the catalog is merged again, as it would throw off msgmerge when matching to create new fuzzies. This will normally not be an issue, as the translator will have unfuzzied all messages when updating the catalog after last merging. However, if there were a lot of fuzzy messages, and translator didn't have the time to update all of them, diff-previous sieve can also be used to unembed any remaining ediffs, restoring the original values of previous fields. It just needs to be run on the catalog with extra parameter strip:

$ posieve diff-previous -sstrip lang.po
