Localization/Concepts/Transcript

Revision as of 07:56, 18 August 2011

Translation Scripting with Transcript

On Localization	Concepts
Prerequisites	The PO Format
Related Articles	n/a
External Reading	JavaScript Guide and Reference

Translation Scripting?

The current state of affairs in the localization of user-visible strings (messages) in application interfaces is such that the translator is sometimes forced to supply an inadequate translation. This problem typically occurs when a message contains a placeholder to be substituted at runtime, or when two unrelated strings become related by placement in the interface. In either case, the modest requirements of the English language on the congruence of words in a sentence allow the original string to remain grammatically correct, while in many other languages this is not the case.

Translators and programmers sometimes try to work out a change in the code which would provide a more workable alternative. However, this process is difficult, and worse yet, the outcome is still language-dependent: solving the problem for a few languages does not necessarily solve it for all of them.

One way to overcome these problems in a more general and compartmentalized manner is to give translators a way to modify translated strings at runtime, depending on the context (e.g. particular placeholder substitutes). In other words, to script the translation. The translator should be able to operate on any interface string he wishes, while the programmer should not bear any extra burden (or even know about translation scripting at all).

The Transcript Engine

KDE4 comes ready with a translation scripting system, called Transcript. Several strategic choices were made in its design:

programmers are unaware of scripting, which means that the translator can script any message without outside coordination

unless the translator wants to script a message, he is faced with the familiar standard Gettext PO environment (i.e. translators, too, can ignore scripting)

scripting is a low-level bolt-on to Gettext environment, to keep existing PO tools in the game

in order to be powerful enough for unforeseen needs, a general-purpose scripting language is provided: JavaScript (with extensions for interfacing with Transcript)

Note

For comparison, another translation scripting approach taking some different decisions (new translation environment down to the file format, scripting facilities more specialized, etc.) is brewing at http://wiki.mozilla.org/L20n

To script a particular message, the translator writes short scripting calls into the msgstr in the PO file, which expand into parts of the msgstr (interpolations), and puts the JavaScript code which defines these calls into an accompanying Transcript module file (e.g. foo.po can be augmented with foo.js).

When the application draws translations from several PO files, scripting calls defined in one of the Transcript modules are available in all the PO files used. Since every KDE app uses kdelibs.po, calls defined in kdelibs.js are available everywhere.

The scripting process is illustrated by several examples. More detailed explanations of the elements are given in the following sections.

A Useless Example

In Nevernessian it is impolite to speak a greeting with the same tone of voice throughout; instead, the name of the person must be shouted out. Hence, the translator wants to capitalize the placeholder substitute in the following login greeting in neverness.po:

#: neverness_login.cpp:10
msgid "Hello, %1!"
msgstr "Heelyy, %1!"

So the translator adds a scripted msgstr, with an interpolation:

#: neverness_login.cpp:10
msgid "Hello, %1!"
msgstr "Heelyy, %1!"
"|/|"
"Heelyy, $[shout %1]!"

The first thing to note is that, while a bit longer, msgstr is still a proper PO msgstr, which means that it can be edited and processed by the usual PO tools.

The first part of msgstr is the same as before, and is called the fallback in this context: if the scripted part happens to fail in some way, the fallback translation is used. The fallback is followed by the fence |/|, which separates the fallback from the scripted translation (and also indicates that this message is scripted).

Finally, there is the scripted translation after the fence. It differs from the fallback in that it contains the interpolation $[shout %1], which is supposed to evaluate to a capitalized version of the placeholder substitute. It is composed of the call name, shout, and one argument to it, the %1 placeholder, which will be replaced by its substitute. The syntax and expansion rules for interpolations are similar to those of a Unix shell.
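The division of a scripted msgstr at the fence can be pictured with a few lines of plain JavaScript. This is only an illustration of the layout described above, not the engine's actual code; the names FENCE and splitAtFence are made up for this sketch.

```javascript
// Illustrative only: not the actual KDE implementation.
// Splits a msgstr at the first fence into fallback and scripted parts.
var FENCE = "|/|";

function splitAtFence(msgstr) {
    var pos = msgstr.indexOf(FENCE);
    if (pos < 0) {
        // No fence: an ordinary, unscripted message.
        return { fallback: msgstr, scripted: null };
    }
    return {
        fallback: msgstr.slice(0, pos),
        scripted: msgstr.slice(pos + FENCE.length)
    };
}

var parts = splitAtFence("Heelyy, %1!|/|Heelyy, $[shout %1]!");
// parts.fallback is "Heelyy, %1!"
// parts.scripted is "Heelyy, $[shout %1]!"
```

Note that the PO tools see the three quoted lines of the msgstr as one concatenated string, which is why the sketch works on a single string.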

The call shout itself is defined in the Transcript module neverness.js, which contains only these lines:

function capitalize (str) {
    return str.toUpperCase();
}

Ts.setcall("shout", capitalize);

Here the function capitalize is an ordinary JavaScript function which takes a string argument and returns an all-caps version of it.

The link with the PO file is established by the call to Ts.setcall() -- the Transcript interface is represented by the property functions of the Ts object. In this variant, the Ts.setcall() takes the name of the call for the interpolations in the PO messages (a string), and the JavaScript function which will actually be invoked (bound to the call).

That's it: now the fair Nevernesse folks are greeted properly.
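A module like this can be tried outside KDE with a small stand-in for the Ts object. The sketch below is an assumption made for testing purposes: the real Ts object is supplied by the Transcript engine, and the calls table is an invention of this stub. Only setcall, hascall and acall are imitated.

```javascript
// Hypothetical stand-in for the Ts object, for testing outside KDE only.
// The "calls" table is an invention of this sketch.
var Ts = {
    calls: {},
    setcall: function (name, func, obj) {
        this.calls[name] = { func: func, obj: obj };
    },
    hascall: function (name) {
        return Object.prototype.hasOwnProperty.call(this.calls, name);
    },
    acall: function (name) {
        var call = this.calls[name];
        var args = Array.prototype.slice.call(arguments, 1);
        return call.func.apply(call.obj, args);
    }
};

// --- contents of neverness.js, unchanged ---
function capitalize (str) {
    return str.toUpperCase();
}
Ts.setcall("shout", capitalize);

// --- roughly what happens for "Heelyy, $[shout %1]!" with %1 = "Alice" ---
var greeting = "Heelyy, " + Ts.acall("shout", "Alice") + "!";
// greeting is "Heelyy, ALICE!"
```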

Basic Case Resolution

One problem frequently encountered is use of the wrong noun case when the placeholder is substituted in the msgstr. For example, in many languages every KDE app has such a problem in the Help menu, with one or both of "About %1..." and "%1 &Handbook". This can be scripted in kdelibs.po like this:

msgid "&About %1"
msgstr "&O %1"
"|/|"
"&O $[get-case dative %1]"

The get-case interpolation is supposed to get the dative case of whatever app name the %1 happens to be. The Transcript module kdelibs.js contains the definition of get-case, as well as the dictionary of cases:

function getProperty (prop, key) {
    return _dict_[key][prop];
}
Ts.setcall("get-case", getProperty);

_dict_ = {};
function addDictCases (key, gen, dat, acc, ins) {
    if (!_dict_[key])
        _dict_[key] = {};
    _dict_[key]["genitive"]     = gen;
    _dict_[key]["dative"]       = dat;
    _dict_[key]["accusative"]   = acc;
    _dict_[key]["instrumental"] = ins;
}

// dictionary entries follow:
addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");
addDictCases("Konsole", "Konsole", "Konsoli", "Konsolu", "Konsolom");
...

Function getProperty, bound to get-case call, simply returns the entry from the dictionary of forms. Function addDictCases is responsible for adding the static entries (name and its cases) into the dictionary, which is done in the final few lines for all apps of interest.
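The dictionary logic can be checked on its own, without Transcript. The sketch below repeats the two functions from the example and adds getPropertyOrKey, a hypothetical defensive variant (not part of the original module) which returns the name unchanged when the dictionary has no entry for it, instead of raising an error.

```javascript
var _dict_ = {};

function addDictCases (key, gen, dat, acc, ins) {
    if (!_dict_[key])
        _dict_[key] = {};
    _dict_[key]["genitive"]     = gen;
    _dict_[key]["dative"]       = dat;
    _dict_[key]["accusative"]   = acc;
    _dict_[key]["instrumental"] = ins;
}

function getProperty (prop, key) {
    return _dict_[key][prop];
}

// Hypothetical variant: falls back to the key itself for unknown names.
function getPropertyOrKey (prop, key) {
    var entry = _dict_[key];
    return (entry && entry[prop] !== undefined) ? entry[prop] : key;
}

addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");

var dative  = getProperty("dative", "KWrite");        // "KWriteu"
var unknown = getPropertyOrKey("dative", "Dolphin");  // "Dolphin"
```

With the original getProperty, an unknown name raises a TypeError inside the script, and the scripted part of the message fails.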

This completes the example, but for better modularization, it is also possible to split out the dictionary insertion into a separate file, e.g. appdict.js:

// appdict.js
addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");
addDictCases("Konsole", "Konsole", "Konsoli", "Konsolu", "Konsolom");
...

and use the Transcript interface to load this file in kdelibs.js:

// kdelibs.js
...
...
...
Ts.load("appdict");

Note that Ts.load() takes the filename without extension, and assumes its location is relative to the folder of the parent file (i.e. in this case kdelibs.js and appdict.js should be in the same folder).

Dynamic Case Setting

The previous scripted example solves the original problem, but introduces the burden of maintaining the dictionary insertion file. There is no way around this when the placeholder substitutes are "dead" strings from outside (e.g. from .desktop files), but when they come from KDE's PO files at runtime, this burden can be removed.

The app name in KDE's Help menu indeed comes from the app PO file, and it is of course encountered at runtime before the menu strings come into focus. This allows setting the cases of app name in the PO msgstr which contains it. For example, katepart.po contains the "KWrite" string, and the forms could be set at that point:

msgid "KWrite"
msgstr "KWrite"
"|/|"
"$[set-cases KWritea KWriteu KWrite KWriteom]"

The set-cases call is a side-effect interpolation: it should set the dictionary entries, but this particular message should in any case use the ordinary translation. Assuming all the definitions from the previous example are still in effect, here is how set-cases could be defined in kdelibs.js:

function dynamicSetCases (gen, dat, acc, ins) {
    addDictCases(Ts.msgstrf(), gen, dat, acc, ins);
    Ts.fallback();
}
Ts.setcall("set-cases", dynamicSetCases);

In other words, this is little more than a wrapper around the "static" addDictCases from the previous example, but two new elements of the Transcript interface appear. The first is the Ts.msgstrf() function, which returns the finalized ordinary translation (placeholders substituted), and which is needed in this case as the dictionary key. The second is the Ts.fallback() function, which signals the Transcript engine to disregard the result of the scripted part of msgstr and use the ordinary translation.

Admittedly, the use of Ts.fallback() is not necessary in this case, and is shown only for introductory purposes; dynamicSetCases might as well return the ordinary translation via Ts.msgstrf().

The PO Shell

This section gives the details of how the interpolations in the PO msgstr are expanded before evaluation.

The interpolations are parts of msgstr between $[...], and are parsed into a number of strings. The first string is the name of the call registered in the scripting module via Ts.setcall(), and the rest are the arguments to the bound JavaScript function. The arguments are typically passed as JavaScript type String, except in a special case detailed below.

The special characters in the interpolation are whitespace, single quote (') and backslash (\). Whitespace separates arguments, whereas single quote can be used for text which contains whitespaces. The backslash is used as escape; it can escape whitespace in non-quoted text, or single quotes in quoted text. This is pretty much like a typical Unix shell.

Double quotes are not special. Single quotes are used instead of double quotes because it makes it easier to edit interpolations in PO files, where double quotes would have to be escaped. This also means that when escape is needed in the interpolation, it must be escaped once itself for the PO msgstr.

The biggest difference from the shell expansion is that unlike with shell variables, the placeholders are expanded such that no characters within them are treated as special. Otherwise, many tricky problems could arise with whitespace or single quotes contained inside.
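The quoting rules above can be sketched as a small tokenizer. This is a simplified model written for illustration, not the engine's parser: it works on the text between $[ and ], treats a backslash as escaping whatever character follows it, and does not handle placeholders or nested interpolations. The name splitInterpolation is made up for this sketch.

```javascript
// Simplified model of interpolation splitting (illustrative only).
function splitInterpolation (body) {
    var args = [];
    var current = "";
    var inQuote = false;
    var started = false; // true once the current argument has begun
    for (var i = 0; i < body.length; i++) {
        var c = body.charAt(i);
        if (c === "\\" && i + 1 < body.length) {
            // Escape: take the next character literally.
            current += body.charAt(++i);
            started = true;
        } else if (c === "'") {
            // Toggle quoting; '' yields an empty argument.
            inQuote = !inQuote;
            started = true;
        } else if (!inQuote && /\s/.test(c)) {
            // Unquoted whitespace ends the current argument.
            if (started) {
                args.push(current);
                current = "";
                started = false;
            }
        } else {
            current += c;
            started = true;
        }
    }
    if (started)
        args.push(current);
    return args;
}

var words = splitInterpolation("by-context last-what 'd|m' 'thee leestee'");
// words[0] is the call name, the rest are its arguments.
```

Here words has four elements: the call name by-context, then last-what, d|m and thee leestee.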

The call name bound to a JavaScript function using Ts.setcall() does not have to be a proper JavaScript identifier, but any Unicode string not containing the interpolation-special characters. This means that more "natural" call names can be used inside the msgstr, like those with dashes or non-Latin1 characters. E.g. call names can be in the language of the PO file, for an aesthetic impression.

Sub-interpolations, $[... $[...] ...], are also possible. If put inside single quotes, they will be treated as ordinary text. Thus, if a literal closing square bracket is needed as an argument inside the interpolation, it can be given within single quotes.

If an argument is given as ^number, it will evaluate to the value supplied for the corresponding placeholder -- unlike the %number form, which evaluates to the placeholder substitution string. In that case the argument type as seen by JavaScript need not be String, but will correspond to the value substituted (e.g. Number). This distinction may be important in some situations. Even if the value is originally a string, the substitution string may have extra formatting (padding, tags...), which may not be desirable. In particular, number substitutes may come tagged or locale-formatted, such that they cannot be easily parsed back into Number. In these cases, the ^number form can be used to get the raw values.
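The following sketch shows why the raw value matters for a call that does arithmetic. The function selectByThreshold and the phrases are hypothetical; the point is only that a locale-formatted string such as "1,234" does not convert back to a Number, while the value passed via ^1 arrives as a Number already. Note that literal arguments written in the interpolation (here the limit) still arrive as strings.

```javascript
// Hypothetical call body: picks a phrase by comparing a numeric value
// against a limit. Intended use in msgstr (an assumption of this sketch):
//   $[by-threshold ^1 1000 'smaal' 'beeg']
function selectByThreshold (value, limit, below, atOrAbove) {
    if (typeof value !== "number") {
        throw Error("selectByThreshold expects a Number; use ^N, not %N.");
    }
    // The limit is typed literally in the msgstr, so it is a String.
    return value < Number(limit) ? below : atOrAbove;
}

// What %1 might deliver for a locale-formatted number:
var parsed = Number("1,234");   // NaN, the formatting gets in the way

// What ^1 delivers: the raw value.
var phrase = selectByThreshold(1234, "1000", "smaal", "beeg");
```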

The Transcript Interface

Transcript provides extensions to JavaScript, which interface with the PO file and the Transcript environment. They are all function properties of the Ts object, accessible as Ts.func(args).

setcall (name, func)

setcall (name, func, obj)

Binds the call name to the JavaScript function, for use in the interpolations inside the PO file.

name name of the call. Can be any Unicode string, for ease of use in the msgstr

func function object

obj object to act as this inside the function; if omitted, this refers to global object

Returns Undefined.

hascall (name)

Check if the call for use in PO file has been set.

name name of the call

Returns true if set, false otherwise.

acall (name, arg*)

Applies previously set PO call to a list of arguments, within the scripting module itself. Indirect calls can be made this way, e.g. by passing a call name as a string within PO file interpolation.

name name of the call

arg* arguments to the call

Returns what the indirect call would return.
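As a sketch of an indirect call, the hypothetical on-first-word call below receives the name of another call as its first argument and applies it to the first word of the text. The Ts object here is a minimal stand-in written for this example (the real one comes from the engine), and the call names are made up.

```javascript
// Hypothetical stand-in for Ts, for running this sketch outside KDE.
var Ts = {
    calls: {},
    setcall: function (name, func, obj) {
        this.calls[name] = { func: func, obj: obj };
    },
    hascall: function (name) {
        return Object.prototype.hasOwnProperty.call(this.calls, name);
    },
    acall: function (name) {
        var call = this.calls[name];
        return call.func.apply(call.obj, Array.prototype.slice.call(arguments, 1));
    }
};

Ts.setcall("shout", function (str) { return str.toUpperCase(); });

// Applies the named call to the first word only.
// Intended use in msgstr: $[on-first-word shout %1]
function applyToFirstWord (callName, text) {
    if (!Ts.hascall(callName))
        return text; // unknown call: leave the text as it is
    var pos = text.indexOf(" ");
    if (pos < 0)
        return Ts.acall(callName, text);
    return Ts.acall(callName, text.slice(0, pos)) + text.slice(pos);
}
Ts.setcall("on-first-word", applyToFirstWord);

var result = Ts.acall("on-first-word", "shout", "hello there");
// result is "HELLO there"
```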

load (file*)

Evaluates the code in the specified files, in the left to right order. File paths are expected to be relative to current module's folder.

file file name without extension

Returns Undefined.

fallback ()

Forces Transcript to use the ordinary translation, regardless of whether the interpolation evaluates successfully or fails. The evaluation of the script is not aborted; any other interpolations will evaluate too.

Returns Undefined.

msgid (): Returns msgid of the last message, with placeholders intact.

msgstrf (): Returns finalized ordinary translation of the last message, with placeholders substituted.

msgctxt (): Returns msgctxt of the last message, with placeholders intact.

msgkey (): Returns a String which is an implementation-dependent combination of msgctxt and msgid with placeholders intact. Used to uniquely identify the message within the PO file; useful as a dictionary key.

nsubs (): Returns the number of substitutes provided for placeholders in the last message. It is equal to the highest-numbered placeholder for a proper i18n call in the application code, but i18n calls do not have to be quite proper.

subs (index)

Used to access values of placeholder substitutes provided to the last message. Numbering is zero-based.

index index of placeholder substitute

Returns String, regardless of the value type substituted in the application code.

vals (index)

Used to access the raw values provided for placeholders in the last message. Numbering is zero-based.

index index of placeholder value

Returns the matching JavaScript type to the value type used in the application code. E.g. strings will be String, integers and doubles Number. May also return Undefined, in case the value cannot be represented as a reasonable JavaScript type.

dynctxt (key)

Programmers may append dynamic context to the message, which is a pair of strings: a context key and its value. Use this function to retrieve such contexts.

key context key string

Returns String, or Undefined if dynamic context with the given key does not exist.

dbgputs (msg)

Outputs a debug message in the shell. The message is seen only if KDE libraries have been compiled in debug mode.

msg message string

Returns Undefined.

warnputs (msg)

Outputs a warning message in the shell.

msg message string

Returns Undefined.

Since KDE 4.7.1.

setcallForall (name, func)

setcallForall (name, func, obj)

Similar to setcall(), but the function is executed on every message after it has been finalized (if the message was explicitly scripted, the function is executed after the script has been evaluated). The function receives no arguments, and its return value is ignored, i.e. it cannot change the finalized message; the call is used for side-effects. The name parameter is used only for reporting errors. When several setcallForall() calls have been issued, the functions are invoked in the order of issue.

More precisely, these calls are made on every message after the Transcript engine has been initialized, which happens on the first explicitly scripted message. Thus, if earlier application of the calls is necessary, Transcript can be jump-started by scripting one early encountered message with only a single empty interpolation: msgstr "...|/|$[]"; this will do nothing for that message, but it will start the engine.

Returns Undefined.

toUpperFirst (text)

Converts the first letter in the string into upper case, even if the first character in the string is not a letter. Unicode specification is used to determine what a letter is (i.e. works for non-ASCII letters).

text string to process

Returns String.

toLowerFirst (text)

Converts the first letter in the string into lower case, under the same assumptions as for toUpperFirst.

text string to process

Returns String.
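On one reading of the descriptions above, both functions skip any leading non-letters and change the case of the first actual letter. The sketch below imitates that reading in plain JavaScript using Unicode property escapes (which need an ES2018 engine); it is an approximation for illustration, and the name changeFirstLetter is made up. The real Ts.toUpperFirst() and Ts.toLowerFirst() may differ in details.

```javascript
// Approximation of toUpperFirst/toLowerFirst semantics (illustrative only).
function changeFirstLetter (text, toUpper) {
    var m = /\p{L}/u.exec(text);   // first Unicode letter, if any
    if (!m)
        return text;               // no letter at all: nothing to change
    var letter = toUpper ? m[0].toUpperCase() : m[0].toLowerCase();
    return text.slice(0, m.index) + letter + text.slice(m.index + m[0].length);
}

var upper = changeFirstLetter("(kwrite)", true);  // "(Kwrite)"
var lower = changeFirstLetter("Čaj", false);      // "čaj"
```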

loadProps (file*)

Loads property maps from the specified files. File paths are expected to be relative to current module's folder.

file file name without extension

Returns Undefined.

getProp (phrase, prop)

Fetches the property value of the phrase, as defined by property map.

phrase text for which the property is requested

prop property key string

Returns String, or Undefined if the given text has no such property.

setProp (phrase, prop, value)

Sets the property value of the phrase. The phrase and property key will be automatically normalized, as they would have been when loaded from a property map file.

phrase text for which the property is set

prop property key string

value value of the property (string)

Returns Undefined.

normKey (phrase)

Normalizes the text by removing all whitespace, removing the accelerator, and lowercasing it, e.g. for use in near-match lookups.

phrase text to be normalized

Returns String.
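The normalization described for normKey can be approximated as below. This sketch assumes that the accelerator is marked by the first "&" in the text and ignores other conventions (such as a doubled "&&" for a literal ampersand); the actual Ts.normKey() may treat such cases differently. The name normKeySketch is made up.

```javascript
// Rough approximation of the described normalization (illustrative only).
function normKeySketch (phrase) {
    return phrase
        .replace(/\s+/g, "")   // remove all whitespace
        .replace(/&/, "")      // remove the first "&" (assumed accelerator marker)
        .toLowerCase();
}

var key1 = normKeySketch("&File Manager");  // "filemanager"
var key2 = normKeySketch("file  manager");  // "filemanager"
```

Both spellings reduce to the same key, which is what makes near-match lookups possible.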

getConfString (key)

getConfString (key, value)

getConfBool (key)

getConfBool (key, value)

getConfNumber (key)

getConfNumber (key, value)

Fetches a value of the requested type from the user-configuration file. A default value of the same type can be stated too (or else it defaults to undefined).

key the name of the field in the configuration file

value the default value, in case the key is not found

Returns the configuration value if the key is found, or the default value or undefined otherwise. If the configuration string is not a valid representation of the requested type, again the default or Undefined is returned instead.

Property Maps

There is a frequent need to attach different properties to particular pieces of text. For example, case, gender, etc. may be considered properties of a noun, or of a phrase with a subject/object role in a sentence. Sometimes it may be convenient (or necessary) to define and maintain the phrases and their properties in an external file. Property maps are Transcript's built-in (i.e. efficient) way to handle such definitions.

Property map files are text files in a simple dictionary format for writing down phrases with their properties (filenames must end in .pmap). The format is best shown by example:

# cities.pmap
=:Athens:Atina:nom=Atina:gen=Atine:dat=Atini:acc=Atinu::
=:Paris:Pariz:nom=Pariz:gen=Pariza:dat=Parizu:acc=Pariz::

Each entry starts with two characters, =: in this example, which may be chosen arbitrarily (but must be non-letters, and non-#), and can change from entry to entry. The first character defines the separator in the property key-value pair, and the second character is the separator of pairs. E.g. the gen=Atine segment above is defining the genitive case of the localized form of Athens. The "pairs" which actually lack the property key (first two in the entries above) are considered phrase identifiers. The completely empty pair (:: in the example) terminates the entry.
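To make the entry layout concrete, here is a sketch of a parser for a single entry. It is a simplified illustration, not Transcript's own parser: it handles one complete entry given as a string, and ignores comments and the whitespace-stripping rules. The name parsePmapEntry is made up.

```javascript
// Parses one property map entry (illustrative, simplified).
function parsePmapEntry (entry) {
    var kvSep = entry.charAt(0);    // separates key from value
    var pairSep = entry.charAt(1);  // separates pairs
    var phrases = [];
    var props = {};
    var segments = entry.slice(2).split(pairSep);
    for (var i = 0; i < segments.length; i++) {
        var seg = segments[i];
        if (seg === "")
            break;                  // empty pair terminates the entry
        var pos = seg.indexOf(kvSep);
        if (pos < 0) {
            phrases.push(seg);      // no key: a phrase identifier
        } else {
            props[seg.slice(0, pos)] = seg.slice(pos + 1);
        }
    }
    return { phrases: phrases, props: props };
}

var athens = parsePmapEntry(
    "=:Athens:Atina:nom=Atina:gen=Atine:dat=Atini:acc=Atinu::");
// athens.phrases is ["Athens", "Atina"]
// athens.props.gen is "Atine"
```

Because the separators are read from the entry itself, the same function handles an entry written with other separator characters, e.g. "/|Athens|gen/Atine||".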

The property map is loaded inside the scripting module using the Ts.loadProps() call. The map file should normally reside in the same folder as the scripting module that uses it; if so, the map from the above example is loaded simply with Ts.loadProps("cities"). The property values are obtained in the scripts using the Ts.getProp() call. E.g. the genitive form of Athens would be obtained either by Ts.getProp("Athens", "gen") or Ts.getProp("Atina", "gen") -- because the map defined both as phrase identifiers for Athens. The phrase/key lookup is performed using normalized values; this normalization can be checked, or used for one's own purposes, with the Ts.normKey() call.

The comments are a bit of an issue in property map files. Since the phrases and property values can be anything (hence the possibility to choose separators), fixing any single character as your typical to-the-end-of-line comment would introduce limitations. Instead, the #-comments are allowed only between the entries, span to the end of line, and # cannot be a separator character.

The whitespace in property values is mostly preserved (for the keys it does not matter, as it gets stripped for lookups). However, if the leading sequence of whitespace contains a newline, all whitespace to the first newline and including it is stripped; symmetrically, in a trailing whitespace sequence, the last newline (if any) and the whitespace following it are removed.

User Configuration

Using a configuration file located at $HOME/.transcriptrc, the user can be allowed to configure certain aspects of Transcript operation. The configuration file is in ini-style, where group names are language codes to which the selections apply. For example:

# Settings for German.
[de]
serve-sausages = no
wine-instead-of-beer = yes

# Settings for Italian.
[it]
with-mozzarella = no

Within scripting modules, the configuration values can be fetched using the Ts.getConf*() series of calls. These calls will automatically look for the configuration fields under the proper language group, as the scripting modules are language-dependent too.

For boolean-valued keys, 0, no, and false can be used to indicate negative (false) values, and everything else is treated as positive (true). Also, the boolean values are not case sensitive.
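The boolean rule above can be written down as a small function. This is a sketch of the stated rule only, not the code behind Ts.getConfBool(); trimming of surrounding whitespace is an assumption, and the name parseConfBool is made up.

```javascript
// Interprets a configuration string as a boolean, per the rule above.
// A missing value yields the supplied default.
function parseConfBool (raw, dflt) {
    if (raw === undefined || raw === null)
        return dflt;
    var v = String(raw).trim().toLowerCase();  // trimming is an assumption
    return !(v === "0" || v === "no" || v === "false");
}

var sausages = parseConfBool("no", true);    // false
var wine     = parseConfBool("yes", false);  // true
var missing  = parseConfBool(undefined, false);  // false (the default)
```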

Repository Organization

For the PO file in the KDE repository LL/messages/kdemodule/foo.po, the corresponding Transcript module should be located at LL/scripts/kdemodule/foo/foo.js. Note the extra subfolder in the module path, named like the basename of the PO/JS file. This subfolder is introduced because the module's main JS file might need other supporting files, which can then be located conveniently in the same folder.
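The path rule can be expressed as a small helper, for instance for use in a maintenance script. The function below is a hypothetical illustration of the mapping stated above and is not part of Transcript or of the KDE build system.

```javascript
// Maps LL/messages/<module>/<name>.po to LL/scripts/<module>/<name>/<name>.js.
// Returns null if the path does not have the expected shape.
function modulePathForPo (poPath) {
    var m = /^(.*)\/messages\/(.+)\/([^\/]+)\.po$/.exec(poPath);
    if (!m)
        return null;
    var lang = m[1], kdemodule = m[2], base = m[3];
    return lang + "/scripts/" + kdemodule + "/" + base + "/" + base + ".js";
}

var jsPath = modulePathForPo("LL/messages/kdemodule/foo.po");
// jsPath is "LL/scripts/kdemodule/foo/foo.js"
```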

The autogen.sh script, used for some time already to generate build system support for PO files, will also generate build support for Transcript modules. In particular, it must be rerun whenever a new module subfolder is added (e.g. the LL/scripts/kdemodule/foo), but not when particular files within it are added or removed. Every file in the module subfolder will be installed automatically, so put there only what is needed at runtime.

Real-Life Examples

CJK Accelerator Keys

In CJK environments, accelerator keys are wrapped in parentheses and appended to the text. For example, the translation of "&File" is "파일(&F)" in Korean and "ファイル(&F)" in Japanese. The name of an action is shared among toolbar icon names, the menu bar, and so on.

However, in toolbar icon names, the "Configure shortcut" dialog, etc., these accelerator markers remain in the name, which makes it look a little awkward. In this situation, Transcript can be used to get rid of them. Currently, the code for this is in the Japanese and Korean kdelibs4.js files.
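The kind of cleanup involved can be sketched as below. This is not the code from the Japanese or Korean kdelibs4.js, only a hypothetical illustration: it removes one parenthesized accelerator of the form (&X), where X is an ASCII letter or digit.

```javascript
// Removes a CJK-style accelerator marker such as "(&F)" (illustrative only).
function stripCjkAccelerator (text) {
    return text.replace(/\s*\(&[A-Za-z0-9]\)/, "");
}

var korean   = stripCjkAccelerator("파일(&F)");      // "파일"
var japanese = stripCjkAccelerator("ファイル(&F)");  // "ファイル"
```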

Korean Postpositions

In Korean, postpositions are widely used. Eight of them come in pairs, and which form of the pair is used depends on the preceding word. Because the PO file itself does not know what the substituted word (such as %1) will be, the proper postposition cannot be chosen in the translation.

Fortunately, the rule is programmable, and the proper postposition can be added via Transcript. At present, a small part of the postposition rules is programmed into the Korean version of kdelibs4.js.
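A minimal sketch of such a rule follows. It is not the code from the Korean kdelibs4.js, and covers only the simple pairs (such as 이/가, 은/는, 을/를) whose choice depends on whether the last syllable ends in a consonant. For precomposed Hangul syllables (U+AC00 to U+D7A3) this can be computed from the code point. Writing both forms, as in 이(가), when the word does not end in Hangul is an assumption of this sketch, as are the function names.

```javascript
// Returns true/false for a word ending in a Hangul syllable with/without
// a final consonant, or null if the last character is not Hangul.
function hasFinalConsonant (word) {
    if (word.length === 0)
        return null;
    var code = word.charCodeAt(word.length - 1);
    if (code < 0xAC00 || code > 0xD7A3)
        return null;
    // Each initial+vowel combination has 28 variants; variant 0 has no final.
    return (code - 0xAC00) % 28 !== 0;
}

// Appends the matching form of a postposition pair.
function withPostposition (word, afterConsonant, afterVowel) {
    var fin = hasFinalConsonant(word);
    if (fin === null)
        return word + afterConsonant + "(" + afterVowel + ")";
    return word + (fin ? afterConsonant : afterVowel);
}

var person = withPostposition("사람", "이", "가");  // "사람이"
var tree   = withPostposition("나무", "이", "가");  // "나무가"
```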

Automatic Property Calls

Leveraging the property map interface, we can generalize the dynamic case setting example. We want to set named properties by key-value pairs for a given message, and then use calls named after the property keys to retrieve the values when that message is used as a placeholder replacement in another message.

For example, Nevernessians need to match the noun case of "balloon" when it is inserted into the sentence "I see a %1", and additionally the adjective form of "red" to balloon in "I see a red %1", so they do:

msgid "balloon"
msgstr ""
"beeluun"
"|/|"
"$[properties see-it beeluuno see-red raado]"

msgid "I see a %1"
msgstr ""
"O sii e %1"
"|/|"
"O sii e $[see-it %1]"

msgid "I see a red %1"
msgstr ""
"O sii e raad %1"
"|/|"
"O sii e $[see-red %1] $[see-it %1]"

The properties call, which does all the work -- setting properties, setting new calls by property keys -- looks like this:

// Set properties of the phrase given by the finalized msgstr in the PO file.
// The arguments to the call are consecutive pairs of keys and values,
// as many as needed (i.e. total number of arguments must be even).
//
// The property keys are registered as PO calls taking single argument,
// which can be used to retrieve the property values for this msgstr
// when it is later used as placeholder replacement in another message.
//
// Always signals fallback.
//
function setMsgstrProps (/*KEY1, VALUE1, ...*/)
{
    if (arguments.length % 2 != 0)
        throw Error("Property setter given odd number of arguments.");

    // Collect finalized msgstr.
    var phrase = Ts.msgstrf();

    // Go through all key-value pairs.
    for (var i = 0; i < arguments.length; i += 2) {
        var pkey = arguments[i];
        var pval = arguments[i + 1];

        // Set the value of the property for this phrase.
        Ts.setProp(phrase, pkey, pval);

        // Set the PO call for getting this property, if not already set.
        if (!Ts.hascall(pkey)) {
            Ts.setcall(pkey,
                       function (phr) { return Ts.getProp(phr, this.pkey) },
                       {"pkey" : pkey});
        }
    }

    throw Ts.fallback();
}
Ts.setcall("properties", setMsgstrProps);
// NOTE: You can replace "properties" in the line above with any UTF-8 string,
// e.g. one in your language so that it blends nicely inside POs.

Labels Dependent on Dynamic Values

Sometimes the text of a standalone label should grammatically conform to a changing value elsewhere in the interface. The typical example of this is when a sentence-like row of different widgets is used to specify an action:

Search in files modified:
( ) ...
(*) during the last {day|week|month|year}
( ) ...

( ) stand for radio buttons, and {...|...|...} for a list box with several choices. The label "during the last" contains the adjective "last", which in some languages needs to have different forms according to the noun it describes (one of the choices from list box). Whenever the list box value changes, the label should change to match it. But this label is a standalone message in the PO file, so based on what can its translation be scripted?

The solution is to ask the programmer to make the label refresh on every change of the list box value, and to add a dynamic context to the label's message (using the inContext method of KLocalizedString). The dynamic context is a key-value string pair which indicates the current list box value; the key and possible values should be documented in the message's context. If the programmer did this properly, the translator should now see in the PO file:

msgctxt ""
"Part of logical sentence 'Search in files modified >during the last< "
"day|week|...'; provides dynamic context 'last-what': 'd' for day, "
"'w' for week, 'm' for month, 'y' for year."
msgid "during the last"

The translation can now be scripted based on the last-what dynamic context:

msgstr ""
"duureeng thee leestee"
"|/|"
"duureeng $[by-context last-what 'd|m' 'thee leestee' 'w|y' 'thaa leestaa']"

The by-context call takes the context key, followed by pairs of context value matches (regular expressions) and corresponding phrases; in this example, one form is selected for context values d and m, and another form for w and y. The definition of this call is:

// Select a phrase according to dynamic context.
// The first argument is the context keyword,
// followed by arbitrary number of pairs of context values and
// corresponding phrases, optionally followed by default phrase.
//
// Context values are actually regular expressions, so that if one phrase
// corresponds to several context values, it does not have to be repeated
// but its context value can be given e.g. as 'foo|bar|...'.
//
// If the context was not matched, default phrase is returned if
// it was given; otherwise fallback is signaled.
//
function selectByContext (/* ctxt_key,
                             valrx_1, phrase_1, ..., valrx_N, phrase_N
                             [, default_phrase] */)
{
    if (arguments.length < 1) {
        throw Error("Selector by context needs at least one argument.");
    }

    // Collect context value for the given key.
    var ctxtkey = arguments[0];
    var ctxtval = Ts.dynctxt(ctxtkey);

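    // Match the context and select the corresponding phrase.
    // A missing dynamic context (undefined) matches nothing, and only
    // complete value-phrase pairs are tried (not the default phrase).
    var phrase;
    for (var i = 1; i + 1 < arguments.length; i += 2) {
        if (ctxtval !== undefined && ctxtval.match(RegExp(arguments[i]))) {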
            phrase = arguments[i + 1];
            break;
        }
    }

    // If context was not matched, select default phrase or signal fallback.
    if (phrase == undefined) {
        if (arguments.length % 2 == 0) {
            phrase = arguments[arguments.length - 1];
        } else {
            throw Ts.fallback();
        }
    }

    return phrase;
}
Ts.setcall("by-context", selectByContext);
// NOTE: You can replace "by-context" in the line above with any UTF-8 string,
// e.g. one in your language so that it blends nicely inside POs.