Development/Tutorials/Localization/Unicode

    From KDE TechBase
    Revision as of 21:13, 29 December 2006 by Aseigo (talk | contribs)

    What Is Unicode?

    Unicode is a standard that provides a unique number for every character and symbol in the various languages used around the world. It is independent of hardware, development platform and language. The most commonly used characters fall within the range of a 16-bit integer, with the first 256 code points being identical to ISO 8859-1 (Latin-1). This also means that the first 128 characters are identical to the familiar ASCII mapping. Unicode additionally assigns code points above U+FFFF, up to U+10FFFF, for less common scripts and symbols.

    Note
    For details see http://www.unicode.org/.


    Why use Unicode?

    If you use a local 8-bit character set, you can create documents in only a single language and exchange them only with people using the same charset. Even then, many languages contain more symbols than can be uniquely addressed by an 8-bit number.

    Therefore Unicode is the only sensible way to mix languages in a document and to interchange documents between people with different locales. Unicode makes it possible to do things such as easily write a Russian-Hungarian dictionary or store documents in Asian languages which may have thousands of possible symbols.

    UTF-8 and UTF-16

    UTF stands for "Unicode Transformation Format". The two most common variants, UTF-8 and UTF-16, define how to express Unicode characters in bits.

    UTF-16 signifies that every character in the 16-bit range is represented by the 16-bit value of its Unicode number; characters above U+FFFF are represented by a pair of 16-bit values called a surrogate pair. For example, the Latin-1 characters in UTF-16 have a hex representation of 00nn, where nn is the character's hexadecimal value in Latin-1.

    UTF-8 signifies that the Unicode characters are represented by a stream of bytes. A byte whose value lies between 0 and 127 corresponds to an ASCII character. All other characters are represented by sequences of two or more bytes, each with its high bit set. Because the bytes "\0" and "/" cannot occur inside a multibyte sequence, you can treat UTF-8 strings as null-terminated C strings.

    Below is a simple depiction showing how the bits of a Unicode character code are put into UTF-8 bytes (v marks the payload bits):

    UTF-8 Representations

    Bytes   Usable Bits   Representation
    1       7             0vvvvvvv
    2       11            110vvvvv 10vvvvvv
    3       16            1110vvvv 10vvvvvv 10vvvvvv
    4       21            11110vvv 10vvvvvv 10vvvvvv 10vvvvvv