UTF-8


Difference (from prior author revision)

Changed: 18c18
* Most of the existing computer software (including whole operating systems) was not written with Unicode in mind, and using Unicode with them might create some compatibility issues. For example, the C standard library marks the end of a string with a character that has an 1-byte code 0x00 (hexadecimal). In 2-byte Unicode the English letter "A" will be coded as 0x0041. The library will consider the first byte 0x00 as the end of the string and will ignore anything after it. UTF-8, however, is designed so that each byte "makes sense" independently. That's why UTF-8 will probably not suffer from such severe problems as is presented above.
* Most existing computer software (including operating systems) was not written with Unicode in mind, and using Unicode with it might create compatibility issues. For example, the C standard library marks the end of a string with a character whose 1-byte code is 0x00 (hexadecimal). In 2-byte Unicode the English letter "A" is coded as 0x0041. The library will treat the first byte, 0x00, as the end of the string and ignore anything after it. UTF-8, however, is designed so that the bytes of a multi-byte character never take any ASCII value, including the control characters, which prevents this and similar problems.
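
The truncation described in this change can be illustrated with a small Python sketch. The helper c_strlen is hypothetical: it only mimics how a C routine such as strlen scans for the first 0x00 byte. Big-endian 2-byte Unicode is assumed here, via Python's utf-16-be codec.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen: count the bytes before the first 0x00 (hypothetical helper)."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

text = "A"
two_byte = text.encode("utf-16-be")  # b"\x00A": the 2-byte form 0x0041
utf8 = text.encode("utf-8")          # b"A": the single byte 0x41

print(c_strlen(two_byte))  # 0: the leading 0x00 ends the string immediately
print(c_strlen(utf8))      # 1
```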

Added: 20a21
* Although encoded characters are variable length, their encoding is such that their boundaries can be delineated without elaborate parsing.
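
A minimal sketch of how character boundaries might be found: in UTF-8, every byte after the first byte of a multi-byte character has the bit pattern 10xxxxxx, so testing the top two bits of each byte is enough. The function name split_characters is illustrative, and the input is assumed to be well-formed UTF-8.

```python
def split_characters(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into per-character chunks using only the top two bits."""
    chars: list[bytes] = []
    for byte in data:
        is_continuation = (byte & 0xC0) == 0x80  # bit pattern 10xxxxxx
        if is_continuation and chars:
            chars[-1] += bytes([byte])   # belongs to the current character
        else:
            chars.append(bytes([byte]))  # starts a new character
    return chars

sample = "Aאb".encode("utf-8")
print(split_characters(sample))  # [b'A', b'\xd7\x90', b'b']
```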

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that is used to represent Unicode-encoded text using a stream of bytes.

Description

UTF-8 is currently standardized as RFC 2279 (UTF-8, a transformation format of ISO 10646), which is extensive and detailed. A short summary follows for readers interested only in a general overview.

Characters with a value below 128 are encoded as a single byte containing that value: these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes. The upper bit of each of these bytes is always 1, so that every byte is 128 or greater and cannot be mistaken for any of the 7-bit ASCII characters (particularly the ones used for control, e.g. Carriage Return). The bits of the encoded character are divided into several groups, which are then distributed among the lower bit positions of these bytes.
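
The high-bit rule can be checked with a short Python sketch that relies on the language's built-in UTF-8 codec. The helper name all_high_bits_set and the sample characters are illustrative choices, not part of the standard.

```python
def all_high_bits_set(encoded: bytes) -> bool:
    """Return True if every byte has its upper bit set (value 128 or greater)."""
    return all((byte & 0x80) != 0 for byte in encoded)

for ch in ["A", "\r", "é", "א"]:
    encoded = ch.encode("utf-8")
    if ord(ch) < 128:
        # ASCII characters are stored unchanged in a single byte.
        assert encoded == bytes([ord(ch)])
    else:
        # Every byte of a multi-byte character lies outside the ASCII range.
        assert len(encoded) > 1 and all_high_bits_set(encoded)
    print(repr(ch), encoded.hex())
```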

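For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way:

* 0x05D0 is greater than 127 and fits in 11 bits, so it needs two bytes, of the form 110xxxxx 10xxxxxx.
* In binary, 0x05D0 is 101 1101 0000.
* Placing these 11 bits, in order, into the positions marked x gives 11010111 10010000.
* The result is the two bytes 0xD7 0x90.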
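
The bit arithmetic behind the alef example can be sketched in Python. The function encode_two_byte is a hypothetical helper that handles only the two-byte case (values 0x80 through 0x7FF); the last line compares its output with Python's built-in encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a value in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("value is outside the two-byte range")
    lead = 0xC0 | (code_point >> 6)     # 110xxxxx carries the upper 5 bits
    trail = 0x80 | (code_point & 0x3F)  # 10xxxxxx carries the lower 6 bits
    return bytes([lead, trail])

alef = 0x05D0
encoded = encode_two_byte(alef)
print(encoded.hex())                   # d790
print(encoded == "א".encode("utf-8"))  # True
```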

Thus the first 128 characters need one byte. The next 1920 characters need two bytes; these include the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The remaining UCS-2 characters use three bytes. Representing the full UCS-4 range (which is currently not used, since even its subset UCS-2 is not yet completely filled) may require up to 6 bytes.
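
The byte counts above can be summarized as a lookup over value ranges. This is a sketch based on the ranges laid out in RFC 2279; the names LIMITS and utf8_length are illustrative.

```python
# Largest value that fits in 1, 2, 3, 4, 5 and 6 bytes, per RFC 2279.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

def utf8_length(code_point: int) -> int:
    """Return how many bytes the given character value needs."""
    if code_point < 0:
        raise ValueError("character values are non-negative")
    for length, limit in enumerate(LIMITS, start=1):
        if code_point <= limit:
            return length
    raise ValueError("value does not fit in 31 bits")

print(utf8_length(0x41))    # 1
print(utf8_length(0x05D0))  # 2
print(utf8_length(0xFFFF))  # 3
print(0x7FF - 0x80 + 1)     # 1920, the size of the two-byte range
```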

Advantages

Disadvantages


Last edited December 8, 2001 7:10 pm by 213.121.100.xxx