Character encodings in HTML

HTML has been in use since 1991, but the first standardized version with a reasonably complete treatment of international characters was version 4.0, which was not published until 1997. Creating HTML pages that contain characters outside the ASCII range requires considerable care to meet two goals: preserving the integrity of the information stored in the HTML document, and ensuring that the document displays properly in the largest possible variety of browsers.

The Document Character Set

When HTML documents are served to the viewer, there are two ways to tell the browser what specific character encoding is used. First, HTTP headers can be sent by the server along with each page. A typical header looks like this:

Content-Type: text/html; charset=ISO-8859-1
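As a rough illustration of how a receiver might read the character set out of such a header, the following Python sketch parses the header value with the standard library's `email.message.Message`. The function name `charset_from_content_type` is invented for this example; it is a minimal sketch, not a description of what any particular browser does.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The second call shows the case of a server that sends the media type but no character set.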

The other method is for the HTML document itself to carry this information in a META element inside its HEAD element:

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
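A receiver could look for this declaration with an ordinary HTML parser. The sketch below uses Python's `html.parser.HTMLParser`; the class `MetaCharsetFinder` and the function `charset_from_meta` are hypothetical names chosen for the example. It assumes the bytes up to the META element are ASCII-compatible, so a preliminary ASCII decode is enough to find the tag.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first <meta http-equiv="Content-Type"> seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)  # HTMLParser lowercases attribute names
        if (attr.get("http-equiv") or "").lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = attr.get("content") or ""
            self.charset = msg.get_content_charset()


def charset_from_meta(html_bytes):
    """Return the charset declared in the document itself, or None."""
    finder = MetaCharsetFinder()
    # Assumption: the markup around the declaration is ASCII-compatible.
    finder.feed(html_bytes.decode("ascii", errors="replace"))
    return finder.charset


sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=US-ASCII"></head></html>')
print(charset_from_meta(sample))
```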

Either method advises the receiver that the file being sent uses the specified character set. Sending incorrect information is, of course, a very bad idea, and a server cannot always vouch for its files. For example, a server on which multiple users place files created on different machines cannot promise that every file it sends conforms to a single declaration, since some users' machines may use different character sets. For this reason, many servers simply do not send the information at all, to avoid making false promises.

Browsers receiving a file with no character set information must guess. The safest guess is probably ISO-8859-1, but it is also common for browsers to assume the character set native to the machine on which they are running. If the guess is wrong, characters outside the printable ASCII range (codes 32 to 126) may be displayed incorrectly. This presents few problems for English-language text, but most other European languages require characters outside that range for everyday use.
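The effect of a wrong guess can be reproduced in a few lines of Python. This is only a sketch: the helper `decode_page` is a made-up name, and the DOS code page `cp437` stands in for "some machine-native character set".

```python
def decode_page(data, declared=None, assumed="iso-8859-1"):
    """Decode page bytes with the declared charset, or with a guess if none was sent."""
    return data.decode(declared or assumed, errors="replace")


# The author saved the word "café" as ISO-8859-1; \u00e9 is e with acute accent.
page = "caf\u00e9".encode("iso-8859-1")

right = decode_page(page)                    # the guess matches the author's encoding
wrong = decode_page(page, assumed="cp437")   # the guess is a different native set
ascii_only = decode_page(b"plain text", assumed="cp437")

print(right)
print(wrong)       # the accented letter comes out as a different character
print(ascii_only)  # pure ASCII text is unaffected by the wrong guess
```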

It is important to point out that successful viewing of a page is not necessarily an indication that it is encoded correctly. If the creator of a page and the reader are both assuming the same machine-specific character set, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets will not.
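This situation can be simulated as follows. The example assumes, purely for illustration, a creator whose machine uses the classic Macintosh character set (`mac_roman` in Python) and an undeclared page; the function name `looks_correct` is hypothetical.

```python
def looks_correct(intended, data, reader_charset):
    """Report whether a reader assuming reader_charset sees the intended text."""
    return data.decode(reader_charset, errors="replace") == intended


intended = "r\u00e9sum\u00e9"            # "résumé"
page = intended.encode("mac_roman")      # saved in the creator's native set, undeclared

same_platform = looks_correct(intended, page, "mac_roman")
other_platform = looks_correct(intended, page, "iso-8859-1")

print(same_platform)    # a reader with the same native set sees the page as intended
print(other_platform)   # a reader assuming ISO-8859-1 does not
```

The first reader's success says nothing about whether the page is labeled correctly; it only shows that both ends happened to make the same assumption.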

Character Entity References

to be continued...


Last edited March 24, 2001 7:43 am by Lee Daniel Crocker