Wikipedia: UTF-8

Showing revision 3

UTF-8 is a variable-length encoding that represents Unicode text as a stream of bytes. This matters for two reasons:

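In the fixed-width Unicode encodings, every symbol takes 2 bytes (UCS-2) or 4 bytes (UCS-4). In UTF-8, some symbols (including the English alphabet) take as little as 1 byte, while others take 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4). As a result, UTF-8 generally saves space for text made up mostly of such 1-byte symbols.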
Most existing computer software (including entire operating systems) was not written with Unicode in mind, and using Unicode with it can create compatibility problems. For example, the C standard library marks the end of a string with a 1-byte character whose code is 0x00 (hexadecimal). In 2-byte Unicode, stored most significant byte first, the English letter "A" is coded as 0x00 0x41. The library treats the first byte, 0x00, as the end of the string and ignores everything after it. UTF-8, however, is designed so that each byte "makes sense" on its own: ASCII characters keep their 1-byte codes, and the bytes of a multi-byte sequence never fall in the ASCII range, so 0x00 appears only as the encoding of the null character itself. That is why UTF-8 does not suffer from the severe problem described above.
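
Both points can be checked with a short Python sketch. It is an illustration only: the sample characters are arbitrary choices, and `c_strlen` is a hypothetical helper that imitates how C's `strlen` scans for the first 0x00 byte, not a call into the real C library.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

# Point 1: UTF-8 uses 1 to 4 bytes per symbol; UTF-32 always uses 4.
for name, ch in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-be"))
    print(f"U+{ord(ch):04X} {name}: UTF-8 {utf8_len}, UTF-32 {utf32_len}")


def c_strlen(data: bytes) -> int:
    """Hypothetical stand-in for C's strlen: count bytes before the first 0x00."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


# Point 2: a 2-byte big-endian encoding puts 0x00 in front of every ASCII letter.
text = "ABC"
two_byte = text.encode("utf-16-be")  # b'\x00A\x00B\x00C'
utf8 = text.encode("utf-8")          # b'ABC'

print("2-byte encoding, length seen by C:", c_strlen(two_byte))
print("UTF-8, length seen by C:", c_strlen(utf8))

# Multi-byte UTF-8 sequences never contain a 0x00 byte.
non_ascii = "".join(samples.values()).encode("utf-8")
print("0x00 present in UTF-8 bytes:", b"\x00" in non_ascii)
```

In the 2-byte encoding the scan stops at the very first byte, so a C routine would see an empty string, while the UTF-8 bytes are read through to the end.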