UTF-8


Showing revision 12
UTF-8 is a variable-length encoding that represents Unicode text as a stream of bytes.

Description

UTF-8 is currently standardized as [RFC 2279: UTF-8, a transformation format of ISO 10646], which is quite extensive and detailed. A short summary is given below for readers who want only a general overview.

Characters with values below 128 are encoded as a single byte containing that value: these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes. The upper bit of each of these bytes is always 1, so that none of them can be mistaken for an ASCII control character or any other 7-bit ASCII character.
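
Both properties can be checked with Python's built-in `str.encode`. This is only a small sketch: the sample strings are arbitrary choices made for illustration, not part of the standard.

```python
# Sample inputs (hypothetical, chosen only for illustration).
ascii_text = "Hello"
accented = "\u00e9"  # e with acute accent, value 0xE9 (above 127)

ascii_bytes = ascii_text.encode("utf-8")
accented_bytes = accented.encode("utf-8")

# ASCII characters: one byte each, equal to the character's value.
assert list(ascii_bytes) == [ord(c) for c in ascii_text]

# A character above 127: several bytes, each with the upper bit set.
assert len(accented_bytes) > 1
assert all(b & 0x80 for b in accented_bytes)

print(ascii_bytes.hex(), accented_bytes.hex())
```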

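The bits of the character's value are split into several groups, which are then distributed among the low-order bit positions of the bytes. The high-order bits of each byte are fixed markers: the first byte begins with as many 1 bits as there are bytes in the sequence, followed by a 0, and every following byte begins with the bits 10.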

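For example, the character alef (א), which is Unicode 0x05D0, is 101 1101 0000 in binary (11 significant bits). Its upper five bits, 10111, go into a first byte of the form 110xxxxx, and its lower six bits, 010000, go into a second byte of the form 10xxxxxx, giving the two bytes 0xD7 0x90.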
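
As a minimal sketch of this splitting, the Python function below encodes a single character value by hand. The name `utf8_encode` is hypothetical, and the sketch assumes values no larger than 0x10FFFF (forms of at most four bytes). RFC 2279 also defines longer forms, which are left out here, and no check is made for values that are not valid characters.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one character value as UTF-8 (hypothetical helper, up to 4 bytes)."""
    if code_point < 0:
        raise ValueError("character value must be non-negative")
    if code_point < 0x80:
        # 7 bits: a single byte holding the value itself (plain ASCII).
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x110000:
        # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("value outside the range handled by this sketch")


alef = utf8_encode(0x05D0)
print(alef.hex())  # d790
```

The result can be compared against Python's own encoder with `"\u05d0".encode("utf-8")`, which returns the same two bytes.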

Advantages

Disadvantages


Edited November 24, 2001 6:19 am by 213.253.39.xxx