Wikipedia: UTF-8

Showing revision 6

UTF-8 is a variable-length encoding that represents Unicode text as a stream of bytes.

Description

UTF-8 is currently standardized as [RFC 2279: UTF-8, a transformation format of ISO 10646], which is quite extensive and detailed. A short summary is given below for readers interested only in a general overview.

Characters with values smaller than 128 are encoded with a single byte that contains their value. All other characters require several bytes. The upper bit of each of these bytes is always 1, so that they cannot be mistaken for 7-bit ASCII characters (control characters included). The bits of the character's value are divided into several groups, which are then placed in the lower bit positions of the bytes.

For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way:

It falls into the range 0x80 to 0x7FF, so it has to be encoded using 2 bytes of the form 110xxxxx 10xxxxxx.
Hexadecimal 0x5D0 is equivalent to binary 101-1101-0000.
These 11 bits are placed, in order, into the positions marked by "x": 11010111 10010000.
The final result is two bytes, more conveniently written in hexadecimal as 0xD7 0x90. That is the letter alef in UTF-8.
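
The steps above can be sketched in code. The following Python function is only an illustration under stated assumptions, not a reference implementation: the name utf8_encode is hypothetical, and the sketch handles only sequences of 1 to 4 bytes, leaving out the longer forms.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit layout described above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # Values below 128 are stored directly in a single byte.
        return bytes([code_point])

    # (upper limit of the range, number of bytes, marker bits of the first byte)
    layouts = [
        (0x7FF, 2, 0b11000000),     # 110xxxxx 10xxxxxx
        (0xFFFF, 3, 0b11100000),    # 1110xxxx 10xxxxxx 10xxxxxx
        (0x1FFFFF, 4, 0b11110000),  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    ]
    for limit, length, lead_marker in layouts:
        if code_point <= limit:
            break
    else:
        raise ValueError("code point too large for this sketch")

    out = []
    value = code_point
    # Peel off 6 bits at a time for the trailing bytes (10xxxxxx), last byte first.
    for _ in range(length - 1):
        out.append(0b10000000 | (value & 0b00111111))
        value >>= 6
    # The remaining high bits go into the first byte, after its marker bits.
    out.append(lead_marker | value)
    return bytes(reversed(out))


print(utf8_encode(0x05D0).hex())  # d790
```

For alef (0x05D0) the function produces the same two bytes, 0xD7 0x90, that were derived by hand.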

Advantages

In the fixed-width Unicode encodings, a symbol takes 2 or 4 bytes. In UTF-8, some symbols (including the English alphabet) take as little as 1 byte, although others may take up to 6. As a result, UTF-8 generally saves some space, especially for text that consists mostly of such 1-byte symbols.
Most existing computer software (including whole operating systems) was not written with Unicode in mind, and using Unicode with it may create compatibility issues. For example, the C standard library marks the end of a string with a character that has the 1-byte code 0x00 (hexadecimal). In 2-byte Unicode the English letter "A" is coded as 0x0041. The library will treat the first byte, 0x00, as the end of the string and ignore everything after it. UTF-8, however, is designed so that each byte "makes sense" independently: a byte inside a multi-byte sequence never has the value 0x00 or any other value below 128. That is why UTF-8 is unlikely to suffer from problems as severe as the one described above.
UTF-8 strings can be sorted using standard byte-oriented sorting routines (however, such routines have no notion of lowercase and capital letters for characters with values of 128 and above).
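
The three points above can be checked with a short Python sketch. It rests on some assumptions: Python's "utf-16-be" codec stands in for the "2-byte Unicode" mentioned in the text, the helper c_strlen is a hypothetical imitation of how a C routine finds the end of a string, and the sample words are arbitrary.

```python
# Space: English letters take one byte each in UTF-8, two in a 2-byte encoding.
text = "Hello"
utf8_size = len(text.encode("utf-8"))
two_byte_size = len(text.encode("utf-16-be"))


def c_strlen(buf: bytes) -> int:
    """Imitate C's strlen: count the bytes before the first 0x00."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


# Zero bytes: the 2-byte form of "A" starts with 0x00, the UTF-8 form does not.
a_two_byte = "A".encode("utf-16-be")  # bytes 0x00 0x41
a_utf8 = "A".encode("utf-8")          # byte 0x41

# Sorting: comparing UTF-8 strings byte by byte gives the same order as
# comparing the characters' Unicode values (which is what sorted() does
# for Python strings).
words = ["zebra", "א", "apple", "Éclair"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_points = sorted(words)

print(utf8_size, two_byte_size)                # 5 10
print(c_strlen(a_two_byte), c_strlen(a_utf8))  # 0 1
print(by_bytes == by_code_points)              # True
```

Note that "Éclair" sorts after "zebra" in both orderings, because the comparison follows numeric values and knows nothing about alphabets or letter case.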

Disadvantages

UTF-8 is variable-length; that means that different characters take sequences of different lengths to encode, so a character at a given position cannot be found by simple byte indexing. This problem can be reduced, however, by creating an abstract interface for working with UTF-8 strings that makes the encoding transparent to the user.
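
One possible shape for such an abstract interface is sketched below in Python. The class name Utf8String and its design are hypothetical, chosen only for illustration, and the sketch assumes that the bytes it is given are valid UTF-8.

```python
class Utf8String:
    """Hypothetical wrapper that presents UTF-8 bytes as a sequence of characters."""

    def __init__(self, data: bytes):
        self._data = data
        # Every byte that does not have the form 10xxxxxx begins a new character,
        # so recording those offsets once allows direct access by character index.
        self._starts = [
            i for i, b in enumerate(data) if (b & 0b11000000) != 0b10000000
        ]

    def __len__(self) -> int:
        # Length in characters, not in bytes.
        return len(self._starts)

    def __getitem__(self, index: int) -> str:
        if not -len(self) <= index < len(self):
            raise IndexError("character index out of range")
        index %= len(self)  # support negative indices
        start = self._starts[index]
        if index + 1 < len(self):
            end = self._starts[index + 1]
        else:
            end = len(self._data)
        return self._data[start:end].decode("utf-8")


s = Utf8String("aאb".encode("utf-8"))
print(len(s))  # 3 characters, stored in 4 bytes
print(s[1])    # א
```

The table of starting offsets trades extra memory for fast indexing; an interface that scans the bytes on every access would need no table but would be slower.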