Localization/Concepts/Transcript: Difference between revisions

From KDE TechBase
Revision as of 19:16, 5 December 2007



Where does this article belong to? Should there be a Translation section on Techbase?

There is already a section, it is called Localization. So put it to Development/Tutorials/Localization/Transcript --Dhaumann

Translation Scripting?

The current state of affairs in localization of user-visible strings (messages) in application interfaces is such that the translator is sometimes forced to supply an inadequate translation. This problem typically occurs when a message contains a placeholder to be substituted at runtime, or when two unrelated strings become related by placement in the interface. In either case, the modest requirements of the English language on agreement between words in a sentence allow the original string to remain grammatically correct, while this is not so in many other languages.

Translators and programmers sometimes try to work out a change in the code which would provide a more workable alternative. However, this process is difficult, and worse yet, the outcome is still language-dependent: solving the problem for a few languages does not necessarily solve it for all others.

One way to overcome these problems in a more general and compartmentalized manner is to provide translators with a way to modify translated strings at runtime, depending on the context (e.g. particular placeholder substitutes). In other words, to script the translation. The translator should be able to operate on any interface string he wishes, while the programmer should not bear any extra burden (or know about translation scripting at all).

The Transcript Engine

KDE4 comes with a translation scripting system called Transcript. Several strategic choices were made in its design:

  • programmers are unaware of scripting, which means that a translator can script any message without outside coordination
  • unless a translator wants to script a message, he is faced with the familiar, standard Gettext PO environment (i.e. translators too can be scripting-agnostic)
  • scripting is a low-level bolt-on to the Gettext environment, so that existing PO tools remain usable
  • to have enough power for unforeseen needs, a general-purpose scripting language is provided: JavaScript (with extensions for interfacing with Transcript)
Note
For comparison, another translation scripting approach taking some different decisions (new translation environment down to the file format, scripting facilities more specialized, etc.) is brewing at http://wiki.mozilla.org/L20n


To script a particular message, the translator writes short scripting calls (interpolations) into the msgstr in the PO file, which expand into parts of the final text, and puts the JavaScript code which defines these calls into an accompanying Transcript module file (e.g. foo.po can be augmented with foo.js).

In case the application draws translations from several PO files, scripting calls defined in one of the Transcript modules are available in all used PO files. Since every KDE app uses kdelibs.po, calls defined in kdelibs.js are available everywhere.

The scripting process is illustrated by several examples. More detailed explanations of the elements are given in following sections.

A Useless Example

In Nevernessian it is impolite to speak a greeting in the same tone of voice throughout; instead, the name of the person must be shouted out. Hence, the translator wants to capitalize the placeholder substitute in the following login greeting in neverness.po:

#: neverness_login.cpp:10
msgid "Hello, %1!"
msgstr "Heelyy, %1!"

So the translator adds a scripted msgstr, with an interpolation:

#: neverness_login.cpp:10
msgid "Hello, %1!"
msgstr "Heelyy, %1!"
"|/|"
"Heelyy, $[shout %1]!"

The first thing to note is that, while a bit longer, msgstr is still a proper PO msgstr, which means that it can be edited and processed by the usual PO tools.

The first part of the msgstr is the same as before, and is called the fallback in this context: if the scripted part happens to fail in some way, the fallback translation is used. The fallback is followed by the fence |/|, which separates the fallback from the scripted translation (and, for that matter, indicates that this message is scripted).
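To make the role of the fence concrete, here is a minimal sketch of how a msgstr could be divided into its two parts. The helper name splitAtFence is hypothetical and is not part of Transcript; the real engine may treat edge cases differently.

```javascript
// Hypothetical helper, not part of the Transcript interface:
// split a msgstr into fallback and scripted parts at the first fence.
function splitAtFence(msgstr) {
    var fence = "|/|";
    var pos = msgstr.indexOf(fence);
    if (pos < 0) {
        // No fence: an ordinary, unscripted message.
        return { fallback: msgstr, scripted: null };
    }
    return {
        fallback: msgstr.slice(0, pos),
        scripted: msgstr.slice(pos + fence.length)
    };
}
```

For the greeting above, the fallback part would be "Heelyy, %1!" and the scripted part "Heelyy, $[shout %1]!".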

Finally, there is the scripted translation after the fence. Compared to fallback, it contains the interpolation $[shout %1], which is supposed to evaluate to a capitalized version of the placeholder substitute. It is composed of the call name, shout, and one argument to it, the %1 placeholder which will be replaced by its substitute. The syntax and expansion rules for interpolations are similar to Unix shell.

The call shout itself is defined in the Transcript module neverness.js, which contains only these lines:

function capitalize (str) {
    return str.toUpperCase();
}

Ts.setcall("shout", capitalize);

Here the function capitalize is an ordinary JavaScript function which takes a string argument and returns an all-caps version of it.

The link with the PO file is established by the call to Ts.setcall() -- the Transcript interface is represented by the property functions of the Ts object. In this variant, the Ts.setcall() takes the name of the call for the interpolations in the PO messages (a string), and the JavaScript function which will actually be invoked (bound to the call).

That's it, now the fair Nevernesse folks are greeted properly.

Basic Case Resolution

One frequently encountered problem is a wrong noun case when a placeholder is substituted in the msgstr. For example, in many languages every KDE app has such a problem in the Help menu, with one or both of "About %1..." and "%1 &Handbook". This can be scripted in kdelibs.po like this:

msgid "&About %1"
msgstr "&O %1"
"|/|"
"&O $[get-case dative %1]"

The get-case interpolation is supposed to get the dative case of whatever app name the %1 happens to be. The Transcript module kdelibs.js contains the definition of get-case, as well as the dictionary of cases:

function getProperty (prop, key) {
    // Look up the requested form (e.g. "dative") of the given name.
    return _dict_[key][prop];
}
Ts.setcall("get-case", getProperty);

// Dictionary of forms, keyed by name.
_dict_ = {};
function addDictCases (key, gen, dat, acc, ins) {
    if (!_dict_[key])
        _dict_[key] = {};
    _dict_[key]["genitive"]     = gen;
    _dict_[key]["dative"]       = dat;
    _dict_[key]["accusative"]   = acc;
    _dict_[key]["instrumental"] = ins;
}

// dictionary entries follow:
addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");
addDictCases("Konsole", "Konsole", "Konsoli", "Konsolu", "Konsolom");
...

Function getProperty, bound to get-case call, simply returns the entry from the dictionary of forms. Function addDictCases is responsible for adding the static entries (name and its cases) into the dictionary, which is done in the final few lines for all apps of interest.
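When the name is not in the dictionary, getProperty raises an error on the missing entry; since a failing scripted part leads to the fallback being used, that is presumably harmless. A more explicit variant could request the fallback itself. The sketch below is only an illustration: the function name getPropertyOrFallback is hypothetical, and the Ts object defined at the top is a stand-in so that the code runs outside Transcript (inside a real module, Ts is provided and the stub must be omitted).

```javascript
// Stand-in for the Transcript-provided Ts object (assumption, for running
// this sketch outside KDE); only the two functions used here are mimicked.
var Ts = {
    calls: {},
    fellBack: false,
    setcall: function (name, func) { this.calls[name] = func; },
    fallback: function () { this.fellBack = true; }
};

var _dict_ = {};
function addDictCases(key, gen, dat, acc, ins) {
    if (!_dict_[key])
        _dict_[key] = {};
    _dict_[key]["genitive"]     = gen;
    _dict_[key]["dative"]       = dat;
    _dict_[key]["accusative"]   = acc;
    _dict_[key]["instrumental"] = ins;
}

// Hypothetical defensive variant of getProperty: if the name or the form
// is unknown, ask explicitly for the ordinary translation.
function getPropertyOrFallback(prop, key) {
    var entry = _dict_[key];
    if (!entry || entry[prop] === undefined) {
        Ts.fallback();
        return "";
    }
    return entry[prop];
}
Ts.setcall("get-case", getPropertyOrFallback);

addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");
```

Whether the explicit check is worth the extra lines is a matter of taste; it mainly documents the intent for the next person reading the module.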

This completes the example, but for better modularization, it is also possible to split out the dictionary insertion into a separate file, e.g. appdict.js:

// appdict.js
addDictCases("KWrite", "KWritea", "KWriteu", "KWrite", "KWriteom");
addDictCases("Konsole", "Konsole", "Konsoli", "Konsolu", "Konsolom");
...

and use the Transcript interface to load this file in kdelibs.js:

// kdelibs.js
...
...
...
Ts.load("appdict");

Note that Ts.load() takes the file name without extension, and assumes that its location is relative to the folder of the parent file (i.e. in this case kdelibs.js and appdict.js should be in the same folder).

Dynamic Case Setting

The previous scripted example solves the original problem, but introduces the burden of maintaining the dictionary insertion file. There is no way around this when the placeholder substitutes are "dead" strings from outside (e.g. from .desktop files), but when they come from KDE's PO files at runtime, this burden can be removed.

The app name in KDE's Help menu indeed comes from the app PO file, and it is of course encountered at runtime before the menu strings come into focus. This allows setting the cases of app name in the PO msgstr which contains it. For example, katepart.po contains the "KWrite" string, and the forms could be set at that point:

msgid "KWrite"
msgstr "KWrite"
"|/|"
"$[set-cases KWritea KWriteu KWrite KWriteom]"

The set-cases is a side-effect interpolation: it should set the dictionary entries, but this particular message should in any case use the ordinary translation. Assuming all the definitions from previous example are still in effect, here is how set-cases could be defined in kdelibs.js:

function dynamicSetCases (gen, dat, acc, ins) {
    addDictCases(Ts.msgstrf(), gen, dat, acc, ins);
    Ts.fallback();
}
Ts.setcall("set-cases", dynamicSetCases);

In other words, this is little more than a wrapper around the "static" addDictCases from the previous example, but two new elements of the Transcript interface appear. The first is the Ts.msgstrf() function, which returns the finalized ordinary translation (placeholders substituted), and which is needed in this case as the dictionary key. The second is the Ts.fallback() function, which signals the Transcript engine to disregard the result of the scripted part of the msgstr and use the ordinary translation.

Admittedly, the use of Ts.fallback() is not necessary in this case, and is shown only by way of introduction; dynamicSetCases might as well return the ordinary translation via Ts.msgstrf().
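The alternative just mentioned might look like the sketch below. The name dynamicSetCasesReturning is hypothetical, and the Ts object at the top is only a stand-in so that the code runs outside Transcript; its msgstrf() is hard-wired to return "KWrite", as it would for the message from katepart.po.

```javascript
// Stand-in for the Transcript-provided Ts object (assumption, for running
// this sketch outside KDE). In a real module, Ts.msgstrf() returns the
// finalized ordinary translation of the current message.
var Ts = {
    calls: {},
    setcall: function (name, func) { this.calls[name] = func; },
    msgstrf: function () { return "KWrite"; }
};

var _dict_ = {};
function addDictCases(key, gen, dat, acc, ins) {
    if (!_dict_[key])
        _dict_[key] = {};
    _dict_[key]["genitive"]     = gen;
    _dict_[key]["dative"]       = dat;
    _dict_[key]["accusative"]   = acc;
    _dict_[key]["instrumental"] = ins;
}

// Variant without Ts.fallback(): record the forms, then return the
// ordinary translation as the value of the interpolation.
function dynamicSetCasesReturning(gen, dat, acc, ins) {
    var text = Ts.msgstrf();
    addDictCases(text, gen, dat, acc, ins);
    return text;
}
Ts.setcall("set-cases", dynamicSetCasesReturning);
```

Since the scripted msgstr consists of nothing but the interpolation, its result is the ordinary translation either way.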

The PO Shell

This section gives the details of how the interpolations in the PO msgstr are expanded before evaluation.

The interpolations are the parts of the msgstr between $[...], and are parsed into a number of strings. The first string is the name of the call registered in the scripting module via Ts.setcall(), and the rest are the arguments to the bound JavaScript function. The arguments are normally passed as JavaScript type String, except in a special case detailed below.

The special characters in the interpolation are whitespace, single quote (') and backslash (\). Whitespace separates arguments, whereas single quotes can be used to enclose text which contains whitespace. The backslash is used as an escape; it can escape whitespace in non-quoted text, or single quotes in quoted text. This is much like a typical Unix shell.
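The splitting rules can be illustrated with a small tokenizer. This is only a sketch, not the parser used by Transcript: the function name splitInterpolation is hypothetical, it works on the text between $[ and ], it lets a backslash escape any following character (a simplification), and it ignores placeholders and sub-interpolations.

```javascript
// Hypothetical illustration of the splitting rules; not Transcript's parser.
// Takes the text between "$[" and "]" and returns the list of strings
// (call name first, then arguments).
function splitInterpolation(body) {
    var args = [];
    var current = "";
    var inToken = false;   // true once the current argument has started
    var inQuote = false;   // true between single quotes
    for (var i = 0; i < body.length; i++) {
        var c = body[i];
        if (c === "\\" && i + 1 < body.length) {
            // Escape: take the next character literally.
            current += body[++i];
            inToken = true;
        } else if (c === "'") {
            // Quotes toggle quoting and are not part of the argument.
            inQuote = !inQuote;
            inToken = true;
        } else if (!inQuote && /\s/.test(c)) {
            // Unquoted whitespace ends the current argument, if any.
            if (inToken) {
                args.push(current);
                current = "";
                inToken = false;
            }
        } else {
            current += c;
            inToken = true;
        }
    }
    if (inToken) {
        args.push(current);
    }
    return args;
}
```

With this sketch, the body get-case dative 'K Write' splits into the three strings get-case, dative and K Write.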

Double quotes are not special. Single quotes are used instead because double quotes would have to be escaped in PO files, which would make interpolations harder to edit. For the same reason, a backslash needed as an escape within an interpolation must itself be escaped once more for the PO msgstr (i.e. written as \\).

The biggest difference from the shell expansion is that unlike with shell variables, the placeholders are expanded such that no characters within them are treated as special. Otherwise, many tricky problems could arise with whitespace or single quotes contained inside.

The call name bound to a JavaScript function using Ts.setcall() does not have to be a proper JavaScript identifier; it can be any Unicode string not containing the interpolation-special characters. This means that more "natural" call names can be used inside the msgstr, such as those with dashes or non-Latin1 characters. For example, call names can be in the language of the PO file, for aesthetic reasons.

Sub-interpolations, $[... $[...] ...], are also possible. If put inside single quotes, they will be treated as ordinary text. Thus, if a literal closing square bracket is needed as an argument inside the interpolation, it can be given within single quotes.

If an argument is given as ^number (e.g. ^1), it will evaluate to the value supplied for the corresponding placeholder -- unlike the %number form, which evaluates to the placeholder substitution string. Also, the argument type as seen by JavaScript need no longer be String, but will correspond to the value substituted (e.g. Number). This distinction may be important in some situations. Even if the value is originally a string, the substitution string may have extra formatting (padding, tags...), which may not be desirable. In particular, number substitutes may come tagged or locale-formatted, such that they cannot be easily parsed back into a Number. In these cases, the ^number form can be used to get the raw values.
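As an illustration of why the raw value matters, consider a call that chooses between two texts depending on a count. The call name by-count and the function chooseByCount are hypothetical, and the Ts object is only a stand-in so that the sketch runs outside Transcript. It is an assumption here that an exception thrown by the function counts as a failed evaluation, so that the fallback would be used.

```javascript
// Stand-in for the Transcript-provided Ts object (assumption, for running
// this sketch outside KDE).
var Ts = {
    calls: {},
    setcall: function (name, func) { this.calls[name] = func; }
};

// Hypothetical call: return one of two texts depending on a count.
// It insists on a raw Number, i.e. on being used as $[by-count ^1 ...];
// with %1 it would receive a String such as "1,000" instead.
function chooseByCount(n, one, many) {
    if (typeof n !== "number") {
        throw new TypeError("by-count expects a raw value, e.g. ^1");
    }
    return n === 1 ? one : many;
}
Ts.setcall("by-count", chooseByCount);
```

In a msgstr this might be used as $[by-count ^1 file files]. Note that parsing a locale-formatted string such as "1,000" with parseInt() would yield 1, which is exactly the kind of problem the ^number form avoids.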

The Transcript Interface

Transcript provides extensions to JavaScript, which interface with the PO file and the Transcript environment. They are all function properties of the Ts object, accessible as Ts.func(args).

setcall (name, func)
setcall (name, func, obj)
Binds the call name to the JavaScript function, for use in the interpolations inside the PO file.
name: name of the call. Can be any Unicode string, for ease of use in the msgstr
func: function object
obj: object to act as this inside the function; if omitted, this refers to the global object
Returns Undefined.
load (file*)
Evaluates the code in the specified files, in the left to right order. File paths are expected to be relative to current module's folder.
file: file name without extension
Returns Undefined.
fallback ()
Forces Transcript to use the ordinary translation, regardless of whether the interpolation evaluates successfully or fails. The evaluation of the script is not aborted; any other interpolations will evaluate too.
 
Returns Undefined.
msgid ()
Returns msgid of the last message, with placeholders intact.
msgstrf ()
Returns finalized ordinary translation of the last message, with placeholders substituted.
msgctxt ()
Returns msgctxt of the last message, with placeholders intact.
msgkey ()
Returns a String which is an implementation-dependent combination of msgctxt and msgid with placeholders intact. It uniquely identifies the message within the PO file, which makes it useful as a dictionary key.
nsubs ()
Returns the number of substitutes provided for placeholders in the last message. It is equal to the highest-numbered placeholder for a proper i18n call in the application code, but i18n calls do not have to be quite proper.
subs (index)
Used to access values of placeholder substitutes provided to the last message. Numbering is zero-based.
index: index of placeholder substitute
Returns String, regardless of the value type substituted in the application code.
vals (index)
Used to access the raw values supplied for placeholders in the last message. Numbering is zero-based.
index: index of placeholder value
Returns the matching JavaScript type to the value type used in the application code. E.g. strings will be String, integers and doubles Number. May also return Undefined, in case the value cannot be represented as a reasonable JavaScript type.
dbgputs (msg)
Outputs a debug message in the shell when KDE has been compiled with the debug option.
msg: message string
Returns Undefined.
callForall (name, func, obj)
callForall (name, func)
Similar to setcall(), but the function is executed on every message after it has been finalized (if the message was explicitly scripted, the function is executed after the script has been evaluated). The function receives no arguments, and its return value is ignored, i.e. it cannot change the finalized message; the call is used for side-effects. The name parameter is used only for reporting errors. When several callForall() have been issued, the functions are invoked in the order of issue.
 
More precisely, these calls are made on every message after the Transcript engine has been initialized, which happens on first explicitly scripted message. Thus, if earlier application of the calls is necessary, Transcript can be jump-started by scripting one early encountered message with only a single empty interpolation: msgstr "...|/|$[]"; this will do nothing for that message, but it will start the engine.
 
Returns Undefined.
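As a small illustration of callForall() with an object bound as this, the sketch below records every finalized message. The names seenLog and record-seen are hypothetical. The Ts object is only a stand-in so that the code runs outside Transcript, and its finalize() function is purely part of the stub, imitating the engine finishing one message.

```javascript
// Stand-in for the Transcript-provided Ts object (assumption, for running
// this sketch outside KDE). finalize() does not exist in Transcript; it
// only imitates the engine finalizing a message and invoking the
// registered functions in the order of issue.
var Ts = {
    forall: [],
    current: "",
    callForall: function (name, func, obj) {
        this.forall.push({ name: name, func: func, obj: obj });
    },
    msgstrf: function () { return this.current; },
    finalize: function (text) {
        this.current = text;
        this.forall.forEach(function (c) { c.func.call(c.obj); });
    }
};

// Hypothetical side-effect object: remembers finalized translations.
var seenLog = {
    seen: [],
    record: function () {
        // "this" is seenLog because it was passed as the obj argument.
        this.seen.push(Ts.msgstrf());
    }
};
Ts.callForall("record-seen", seenLog.record, seenLog);
```

Because the return value of such a function is ignored, anything it computes has to be stored somewhere, here in the seen array, for later use by ordinary calls.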

Repository Organization

For the PO file in the KDE repository LL/messages/kdemodule/foo.po, the corresponding Transcript module should be located at LL/scripts/kdemodule/foo/foo.js. Note the extra subfolder in the module path, named like the basename of the PO/JS file. This subfolder is introduced because the module's main JS file might need other supporting files, which can then be located conveniently in the same folder.

The autogen.sh script, used for some time already to generate build system support for PO files, will also generate build support for Transcript modules. In particular, it must be rerun whenever a new module subfolder is added (e.g. the LL/scripts/kdemodule/foo), but not when particular files within it are added or removed. Every file in the module subfolder will be installed automatically, so put there only what is needed at runtime.

Real-Life Examples

CJK Accelerator Keys

In CJK environments, accelerator keys are wrapped in parentheses. For example, the translation of "&File" is "파일(&F)" in Korean and "ファイル(&F)" in Japanese. Action names are shared among toolbar icon names, the menu bar, and so on.

However, in toolbar icon names, the "Configure shortcut" dialog, etc., these accelerator keys remain in the name, which makes it look a bit awkward. In this situation, Transcript can be used to get rid of the accelerator keys. Currently, the code is in the Japanese and Korean kdelibs4.js files.
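The core of such a script could be a function like the one below. This is only a sketch of the idea, not the code from the Japanese or Korean kdelibs4.js; the function name stripCjkAccelerator is hypothetical, and it assumes that the accelerator is a single Latin letter or digit.

```javascript
// Hypothetical sketch, not the actual kdelibs4.js code: remove a
// parenthesized accelerator marker such as "(&F)" from a text.
function stripCjkAccelerator(text) {
    // Optional whitespace, then "(&", one letter or digit, then ")".
    return text.replace(/\s*\(&[A-Za-z0-9]\)/, "");
}
```

Applied to "파일(&F)" this gives "파일", and text without such a marker is returned unchanged.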

Korean Postpositions

In the Korean language, postpositions are widely used. Among those, 8 postpositions come in pairs and change form according to the word they follow. Because the PO file itself doesn't know what the substituted word (like %1) is, the proper postposition cannot be written into the string.

Fortunately, the rule is programmable, and proper postpositions can be added via Transcript. Right now a small part of the postposition rules is programmed into the Korean version of kdelibs4.js.
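To give an idea of what "programmable" means here, the sketch below picks the object postposition 을 or 를 depending on whether the last syllable of the word ends in a consonant. It is an illustration only, not the code from the Korean kdelibs4.js; both function names are hypothetical. It relies on the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3, 28 possible final positions per syllable, of which position 0 means no final consonant).

```javascript
// Hypothetical sketch, not the actual kdelibs4.js code.
// Returns true/false for a word ending in a Hangul syllable with/without
// a final consonant, or null if the last character is not such a syllable.
function hasFinalConsonant(word) {
    if (!word) {
        return null;
    }
    var code = word.charCodeAt(word.length - 1);
    if (code < 0xAC00 || code > 0xD7A3) {
        return null;
    }
    return (code - 0xAC00) % 28 !== 0;
}

// Append the object postposition: 을 after a final consonant, 를 otherwise.
// When the ending cannot be determined, write both forms.
function withObjectParticle(word) {
    var f = hasFinalConsonant(word);
    if (f === null) {
        return word + "을(를)";
    }
    return word + (f ? "을" : "를");
}
```

For example, "파일" becomes "파일을" and "메뉴" becomes "메뉴를". Other pairs follow the same pattern, though some have additional exceptions that this sketch does not cover.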