Development/Tutorials/Localization/Unicode: Difference between revisions

Latest revision as of 12:46, 13 July 2012

Introduction to Unicode

Tutorial Series	Localization
Previous	None
What's Next	Writing Applications With Localization In Mind
Further Reading	The Unicode website; Unicode article at Wikipedia; Unicode for Unix/Linux FAQ

Abstract

Unicode is a standard that provides a unique number for every character and symbol in the various writing systems used around the world, which makes it the basis for translatable software and international content creation. This tutorial provides a brief introduction to Unicode and shows how to work with Unicode data using Qt.

What Is Unicode?

Unicode is independent of hardware, development platform and language. As of version 5 there are codepoints assigned up to 0x10FFFF (but not all are used), the first 256 entries being identical to ISO 8859-1, a.k.a. ISO Latin-1. This also means that the first 128 characters are identical to the familiar ASCII mapping. Codepoints were assigned with ease of implementation for legacy character sets in mind, so Unicode is littered with such coincidences, which keep translation tables and logic short.
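The overlap with ISO 8859-1 can be illustrated with a minimal sketch in plain C++ (standard library only, no Qt). The function name latin1ToCodePoints is made up for this illustration; the point is only that no translation table is needed, because each Latin-1 byte value already is the Unicode codepoint.

```cpp
#include <string>

// Decode ISO 8859-1 (Latin-1) bytes into Unicode codepoints.
// The first 256 codepoints are identical to Latin-1, so every
// byte value can be used directly as the codepoint.
// (latin1ToCodePoints is a hypothetical helper name.)
std::u32string latin1ToCodePoints(const std::string &bytes)
{
    std::u32string result;
    result.reserve(bytes.size());
    for (unsigned char byte : bytes) {
        result.push_back(static_cast<char32_t>(byte));
    }
    return result;
}
```

Calling latin1ToCodePoints("b") yields the single codepoint 0x62, the same value the letter has in ASCII.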

Why use Unicode?

If you use a legacy character set like ISO 8859-15, it is difficult to create documents containing more than a couple of languages, and you can interchange them only with people using the same character set. Even then, many writing systems contain more characters than can be uniquely addressed by an 8-bit number.

Therefore Unicode is the best of the sensible ways to mix different scripts and languages in a document and to interchange documents between people with different locales. Unicode makes it possible to do things such as writing a Russian-Hungarian dictionary.

UTF-8 and UTF-16

UTF stands for "Unicode Transformation Format". The two most commonly used variants, UTF-8 and UTF-16, define how to express a Unicode character's assigned codepoint in bits.

In UTF-16, every character with a codepoint up to 0xFFFF is represented by the 16-bit value of its Unicode number. For example, in UTF-16 b (LATIN SMALL LETTER B) has a hex representation of 0x0062. Characters with codepoints beyond 0xFFFF are encoded as surrogate pairs, that is, as two 16-bit values.
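As a rough sketch of the UTF-16 rules, the following plain C++ function (standard library only) turns one codepoint into one or two 16-bit code units. The name encodeUtf16 is an assumption made for this example and is not part of Qt, which handles this internally in QString.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode codepoint as UTF-16 code units.
// (encodeUtf16 is a hypothetical helper name.)
std::u16string encodeUtf16(char32_t codePoint)
{
    // Valid codepoints end at 0x10FFFF and exclude the surrogate range.
    assert(codePoint <= 0x10FFFF);
    assert(codePoint < 0xD800 || codePoint > 0xDFFF);

    std::u16string units;
    if (codePoint <= 0xFFFF) {
        // One code unit: the 16-bit value of the codepoint itself.
        units.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Surrogate pair: split the 20-bit offset into two 10-bit halves.
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return units;
}
```

For 0x0062 this returns the single unit 0x0062; for 0x1F600 it returns the pair 0xD83D, 0xDE00.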

In UTF-8, Unicode characters are represented by a stream of bytes in a variable-length encoding scheme described in [1] and [2]. Unlike UTF-16, it is compatible with ASCII: the first 128 codepoints are encoded as single bytes with their ASCII values. Higher codepoints take two or three bytes to encode; only codepoints beyond 0xFFFF need four.
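The byte layout can be sketched in plain C++ (standard library only). The function name encodeUtf8 is made up for this example; in a Qt application you would normally leave this work to QString and QTextCodec.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode codepoint as a UTF-8 byte sequence.
// (encodeUtf8 is a hypothetical helper name.)
std::string encodeUtf8(char32_t cp)
{
    // Valid codepoints end at 0x10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    std::string bytes;
    if (cp <= 0x7F) {
        // 1 byte: 0vvvvvvv (identical to ASCII)
        bytes.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110vvvvv 10vvvvvv
        bytes.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        // 3 bytes: 1110vvvv 10vvvvvv 10vvvvvv
        bytes.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
        bytes.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return bytes;
}
```

The letter b (0x62) stays a single byte, while the euro sign (0x20AC) becomes the three bytes 0xE2 0x82 0xAC.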

Unicode in KDE Applications

Storing Unicode strings in memory and displaying Unicode on screen is very easy with Qt: QString and QChar provide full Unicode support transparently to your application.

QString and QChar both represent characters internally using UTF-16. When you read text from, or write text to, an 8-bit source through a QTextStream, you must use a QTextCodec to convert the text between representations. Normally you would use the UTF-8 codec to convert a Unicode string between a QByteArray and a QString.

Getting an appropriate Unicode QTextCodec is quite simple:

// codecForName() returns a pointer to a codec owned by Qt; do not delete it.
QTextCodec *Utf8Codec  = QTextCodec::codecForName("utf-8");
QTextCodec *Utf16Codec = QTextCodec::codecForName("utf-16");

Display

Any widget that uses QString or QChar when painting text will therefore be able to show Unicode characters as long as the user has an appropriate font available.

For flexible text editing and display, consult the Qt documentation on rich text processing. The Scribe system provides a set of classes centered around the QTextDocument class that makes it easy to accurately display and print formatted text, including Unicode.

In-Memory Buffers

To read Unicode data from an in-memory buffer, such as a QByteArray, one can use the codecs created earlier in this section:

QByteArray buffer; // raw bytes to decode

// UTF-16
QString utf16String = Utf16Codec->toUnicode(buffer);

// UTF-8
QString utf8String = Utf8Codec->toUnicode(buffer);
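To give an idea of what decoding a UTF-8 buffer involves, here is a simplified sketch in plain C++ (standard library only). It is not Qt's implementation, and the name decodeUtf8 is invented for this example. It checks lead and continuation bytes but, as a stated simplification, does not reject overlong encodings or surrogate codepoints as a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte buffer into Unicode codepoints.
// Simplified sketch: overlong forms and surrogates are not rejected.
// (decodeUtf8 is a hypothetical helper name.)
std::u32string decodeUtf8(const std::string &bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F); // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Decoding the three bytes 0xE2 0x82 0xAC, for instance, yields the single codepoint 0x20AC.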

Writing to a buffer looks very similar:

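QString string; // my Unicode text
QByteArray array; // Buffer to store UTF-16 Unicode
QTextStream textStream(&array, QIODevice::WriteOnly);

textStream.setCodec("UTF-16"); // Qt 4 API; Qt 3 used setEncoding(QTextStream::Unicode)
textStream << string;
textStream.flush(); // make sure the data has reached the buffer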

File Handling

For storage to and retrieval from a file, KDE applications should provide the option to store text data in Unicode instead of the locale character set. This way the user of your application can read and save documents with the locale charset as the default while having the option to store them in Unicode. Both UTF-8 and UTF-16 should be supported. European users will generally prefer UTF-8 while Asian users may prefer UTF-16.
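The preference mentioned above comes down to storage size, which the following plain C++ sketch (standard library only) makes concrete. The helper names utf8ByteCount and utf16ByteCount are invented for this illustration.

```cpp
#include <cstddef>

// Number of bytes one codepoint occupies in UTF-8.
// (Hypothetical helper name.)
std::size_t utf8ByteCount(char32_t cp)
{
    if (cp <= 0x7F)   return 1; // ASCII
    if (cp <= 0x7FF)  return 2; // e.g. accented Latin, Greek, Cyrillic
    if (cp <= 0xFFFF) return 3; // e.g. most CJK characters
    return 4;                   // beyond 0xFFFF
}

// Number of bytes one codepoint occupies in UTF-16.
// (Hypothetical helper name.)
std::size_t utf16ByteCount(char32_t cp)
{
    return cp <= 0xFFFF ? 2 : 4; // one code unit or a surrogate pair
}
```

A Latin letter such as b (0x62) takes 1 byte in UTF-8 but 2 in UTF-16, whereas a CJK character such as 0x4E2D takes 3 bytes in UTF-8 but only 2 in UTF-16.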

Reading Unicode data from a text file is demonstrated in the following example using the QTextCodecs we created earlier for UTF-8 data:

QTextStream textStream(&file); // file: a QFile already opened for reading
QString line;

// UTF-16: if the file begins with a Unicode byte order mark,
// the default auto-detection is sufficient, otherwise:
// textStream.setCodec("UTF-16");
line = textStream.readLine();

// UTF-8
textStream.setCodec(Utf8Codec);
line = textStream.readLine();
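The "Unicode mark" mentioned in the comment is the byte order mark at the start of a file. As a minimal sketch in plain C++ (standard library only, not Qt's detection code), the check could look like this. The names BomKind and detectBom are assumptions for this example, and UTF-32 marks are deliberately ignored to keep it short.

```cpp
#include <string>

// Encodings that can be recognised from a leading byte order mark.
// (BomKind and detectBom are hypothetical names; UTF-32 is ignored.)
enum class BomKind { None, Utf8, Utf16BigEndian, Utf16LittleEndian };

// Inspect the first bytes of a buffer for a byte order mark.
BomKind detectBom(const std::string &data)
{
    if (data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return BomKind::Utf8;              // EF BB BF
    if (data.size() >= 2 && data.compare(0, 2, "\xFE\xFF") == 0)
        return BomKind::Utf16BigEndian;    // FE FF
    if (data.size() >= 2 && data.compare(0, 2, "\xFF\xFE") == 0)
        return BomKind::Utf16LittleEndian; // FF FE
    return BomKind::None;                  // no mark: encoding must be chosen explicitly
}
```

When the result is BomKind::None, the caller has to pick the codec itself, as the reading example does for UTF-8.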

Writing is equally straightforward, also using the QTextCodecs we created earlier:

QTextStream textStream(&file); // file: a QFile already opened for writing
QString line;

// UTF-16
textStream.setCodec("UTF-16"); // Qt 4 API; Qt 3 used setEncoding(QTextStream::Unicode)
textStream << line;

// UTF-8
textStream.setCodec(Utf8Codec);
textStream << line;

Unicode Character Input

Being able to load, store and display Unicode is only part of the puzzle. One of the final pieces is allowing the user to input Unicode characters. This is generally solved using platform-specific multibyte input methods (IM), such as XIM on X11. KDE provides good support for these input methods via Qt and your application should not need to do anything special to enable such input.

Sample Unicode Files and Fonts

An easy way to get Unicode sample data is to take the kde-i18n package and convert some files from there to Unicode using the recode command line tool:

recode KOI8-R..UTF-8      ru/messages/kdelibs.po
recode ISO-8859-1..UTF-16 de/messages/kdelibs.po
recode ISO-8859-3..UTF-8  eo/messages/kdelibs.po

The majority of operating systems today come packaged with Unicode TrueType fonts; see [3].

@@ Line 1: / Line 1: @@
-=== What Is Unicode? ===
-Unicode is a standard that provides a unique number for every character and symbol in the various languages used around the world. It is hardware, development platform and language independant. The current assigned codes are all within the range of a 16-bit integer with the first 256 entries being identical to [http://en.wikipedia.org/wiki/ISO_8859 ISO 8859 or Latin-1]. This also means that the first 128 characters are identical to the familiar ASCII mapping.
+{{TutorialBrowser|
-{{note|For further details visit the [http://www.unicode.org/ Unicode website]. There is also a good [http://www.cl.cam.ac.uk/~mgk25/unicode.html Unicode for Unix/Linux FAQ] available online.}}
+series=Localization|
-=== Why use Unicode? ===
+name=Introduction to Unicode|
-If you use a local 8-bit character set you can only create single language documents and interchange them only with people using the same charset. Even then, many languages contain more symbols than can be uniquely address by an 8-bit number.
+next=[[../i18n|Writing Applications With Localization In Mind]]|
-Therefore Unicode is the only sensible way to mix languages in a document and to interchange documents between people with different locales. Unicode makes it possible to do things such as easily write a Russian-Hungarian dictionary or store documents in Asian languages which may have thousands of possible symbols.
+reading=[http://www.unicode.org/ The Unicode website]<br>[http://en.wikipedia.org/wiki/Unicode Unicode article at Wikipedia]<br>[http://www.cl.cam.ac.uk/~mgk25/unicode.html Unicode for Unix/Linux FAQ]
+}}
-=== UTF-8 and UTF-16 ===
+== Abstract ==
-UTF stands for "Unicode Transfer Format". The two variants, UTF-8 and UTF-16, define how to express unicode characters in bits.
+Unicode is a standard that provides a unique number for every character and symbol in various writing systems used around the world which makes it the basis for translatable software and international content creation. This tutorial provides a brief introduction to Unicode and how to work with Unicode data using Qt.
-UTF-16 signifies that every character is represented by the 16bit value of its Unicode number. For example, the Latin-1 characters in UTF-16 have a hex representation of 00nn where nn is the hexadecimal representation in Latin-1.
+== What Is Unicode? ==
-UTF-8 signifies that the Unicode characters are represented by a stream of bytes. An 8-bit value whose value lies between 0 to 127 coresponds to an ASCII character. Other characters are represeted by more than one byte. Because the characters "\0" and "/" cannot occur in a multibyte character in UTF-8, you can treat UTF-8 strings as null-terminated C-Strings.
+Unicode is hardware, development platform and language independent. As of version 5 there are codepoints assigned up to 0x10FFFF (but not all are used), the first 256 entries being identical to [http://en.wikipedia.org/wiki/ISO_8859-1 ISO 8859-1 a.k.a. ISO Latin-1]. This also means that the first 128 characters are identical to the familiar [http://en.wikipedia.org/wiki/ISO_646 ASCII] mapping. Codepoints were assigned with ease of implementation for legacy character sets in mind, so Unicode is littered with such coincidences to keep translation tables and logic short.
-Below is a simple depiction showing how the the bits of Unicode character codes are put into UTF-8 bytes:
-{|
+== Why use Unicode? ==
-|+ UTF-8 Representations
-|-
-! Bytes !! Usable Bits !! Representation
-|-
-| 1 || 7 || 0vvvvvvv
-|-
-| 2 || 11 || 110vvvvv 10vvvvvv
-|-
-| 3 || 16 || 1110vvvv 10vvvvvv 10vvvvvv
-|-
-| 4 || 21 || 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
-|}
+If you use a legacy character set like ISO 8859-15 it is difficult to create documents containing more than only a couple of languages. You can interchange them only with people using the same character set. Even then, many writing systems contain more characters than can be uniquely addressed by an 8-bit number.
-=== The Unicode Byte Order Mark ===
+Therefore Unicode is the best of the sensible ways to mix different scripts and languages in a document and to interchange documents between people with different locales. Unicode makes it possible to do things such as writing a Russian-Hungarian dictionary.
-Files stored in Unicode should start with two bytes containing the hexidecimal value FEFF if stored in big-endian byte order and FFFE if stored in little-endian bye order. This marks the content as being Unicode and conveniently notes the byte order used when storing the file. Note that big-endian is the recommended and prefered byte order for storage.
+== UTF-8 and UTF-16 ==
-=== Unicode in KDE Applications ===
+UTF stands for "Unicode Transformation Format". The two most commonly used variants, UTF-8 and UTF-16, define how to express a Unicode character's assigned codepoint in bits.
+UTF-16 signifies that every character is represented by the 16-bit value of its Unicode number. For example, in UTF-16 <tt>b</tt> (LATIN SMALL LETTER B) has a hex representation of 0x0062. There are also surrogate pairs used to encode characters with codepoints beyond 0xFFFF.
+UTF-8 signifies that the Unicode characters are represented by a stream of bytes in a special encoding scheme described in [http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt] and [http://tools.ietf.org/html/rfc3629]. Unlike UTF-16, it is compatible to ASCII. Higher codepoints take two or three bytes to encode, rarely more than that.
+== Unicode in KDE Applications ==
 Storing Unicode strings in memory and displaying Unicode on screen is very easy with Qt: QString and QChar provide full Unicode support transparently to your application.
-QString and QChar both internally represent characters using their 16bit value. If you read in text pieces from or write it to an 8bit source such as QTextStream you must use a QTextCodec to convert the text from or to its 16bit representation. Normally you would use the UTF-8 codec to convert a Unicode string between a QByteArray and a QString.
+QString and QChar both internally represent characters using UTF-16. If you read in text pieces from or write it to an 8-bit source such as QTextStream you must use a QTextCodec to convert the text between representations. Normally you would use the UTF-8 codec to convert a Unicode string between a QByteArray and a QString.
 Getting an appropriate Unicode QTextCodec is quite simple:
-<code cppqt n>QTextCodec Utf8Codec  = QTextCodec::codecForName("utf-8");
+<syntaxhighlight lang="cpp-qt">
-QTextCodec Utf16Codec = QTextCodec::codecForName("utf-16");</code>
+QTextCodec Utf8Codec  = QTextCodec::codecForName("utf-8");
+QTextCodec Utf16Codec = QTextCodec::codecForName("utf-16");
+</syntaxhighlight>
-=== Display ===
+== Display ==
 Any widget that uses QString or QChar when painting text will therefore be able to show Unicode characters as long as the user has an appropriate font available.
@@ Line 57: / Line 52: @@
 For flexible text editing and display, consult the Qt documentation on rich text processing. The Scribe system provides a set of classes centered around the QTextDocument class that makes it easy to accurately display and print formated text, including Unicode.
-=== In-Memory Buffers ===
+== In-Memory Buffers ==
 To read Unicode data from an in-memory buffer, such as a QByteArray, one can use the codecs created earlier in this section:
-<code cppqt n>
+<syntaxhighlight lang="cpp-qt">
 QByteArray buffer;
@@ Line 68: / Line 63: @@
 // UTF-8
-QString string = Utf8Codec->toUnicode(buffer);</code>
+QString string = Utf8Codec->toUnicode(buffer);
+</syntaxhighlight>
 Writing to a buffer looks very similar:
-<code cppqt n>
+<syntaxhighlight lang="cpp-qt">
 QString string; // my Unicode text
 QByteArray array; // Buffer to store UTF-16 Unicode
@@ Line 78: / Line 74: @@
 textStream.setEncoding(QTextStream::Unicode);
-textStream << string;</code>
+textStream << string;
+</syntaxhighlight>
-=== File Handling ===
+== File Handling ==
-For storage and retrieval to and from a file KDE applications should provide the option to store text data in Unicode instead of the locale character set. This way the user of your application can read and save documents with the locale charset as the default while having the option to store it in Unicode as an option. Both UTF-8 and UTF-16 should be supported. European users will generally prefer UTF-8 while Asian users may prefer UTF-16.
+For storage and retrieval to and from a file KDE applications should provide the option to store text data in Unicode instead of the locale character set. This way the user of your application can read and save documents with the locale charset as the default while having the option to store it in Unicode. Both UTF-8 and UTF-16 should be supported. European users will generally prefer UTF-8 while Asian users may prefer UTF-16.
 Reading Unicode data from a text file is demonstrated in the following example using the QTextCodecs we created earlier for UTF-8 data:
-<code cppqt n>QTextStream textStream;
+<syntaxhighlight lang="cpp-qt">
+QTextStream textStream;
 QString line;
@@ Line 96: / Line 94: @@
 // UTF-8
 textStream.setCodec(Utf8Codec);
-line = textStream.readLine();</code>
+line = textStream.readLine();
+</syntaxhighlight>
 Writing is equally straight-forward, also using the QTextCodecs we created earlier:
-<code cppqt n>QTextStream textStream;
+<syntaxhighlight lang="cpp-qt">
+QTextStream textStream;
 QString line;
@@ Line 109: / Line 109: @@
 // UTF-8
 textStream.setCodec(Utf8Codec);
-textStream << line;</code>
+textStream << line;
+</syntaxhighlight>
+== Unicode Character Input ==
-=== Unicode Character Input ===
 Being able to load, store and display Unicode is only part of the puzzle. One of the final pieces is allowing the user to input Unicode characters. This is generally solved using platform-specific multibyte input methods (IM), such as XIM on X11. KDE provides good support for these input methods via Qt and your application should not need to do anything special to enable such input.
-=== Sample Unicode Files and Fonts ===
+== Sample Unicode Files and Fonts ==
 An easy way to get Unicode sample data is to take the kde-i18n package and convert some files from there to Unicode using the <tt>recode</tt> command line tool:
-<code>recode KOI8-R..UTF-8      ru/messages/kdelibs.po
+<syntaxhighlight lang="bash">
+recode KOI8-R..UTF-8      ru/messages/kdelibs.po
 recode ISO-8859-1..UTF-16 de/messages/kdelibs.po
-recode ISO-8859-3..UTF-8  eo/messages/kdelibs.po</code>
+recode ISO-8859-3..UTF-8  eo/messages/kdelibs.po
+</syntaxhighlight>
+The majority of operating systems today come packaged with Unicode TrueType fonts, see [http://en.wikipedia.org/wiki/Unicode_font#List_of_Unicode_fonts].
-The majority of operating systems today come packaged with TrueType fonts that cover the full spectrum of the Unicode character range. If you do not have an Unicode fonts on your system there are several that can be downloaded from the Internet and a quick Internet search should provide links to various options.
+[[Category:Programming]]