Wikipedia: Old English poetry/Talk

Certainly have, I just recently bought myself a new copy, the old one was falling apart. One of the great critical analyses of English history, and considerably more accurate than many serious takes on the subject. sjc

This unicode stuff: what are the pros and cons? sjc

Well, the pros are that it's much more standard and will hopefully render correctly on all modern browsers. Inserting the characters directly as ISO-8859-1 bytes, which is what you did, is technically wrong: that charset isn't declared as the page's charset in the headers returned by the wiki script, so it won't work for people who browse the Web with a default encoding other than ISO-8859-1. For instance, since I read and write a lot of Russian material on the web, I normally browse with the default encoding set to Cyrillic, and I saw strange Cyrillic characters instead of thorn and eth on the Old English poetry page.
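The mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the post does not say which Cyrillic encoding the browser was set to, so `cp1251` (Windows Cyrillic) is an assumption chosen for illustration.

```python
# Thorn and eth stored as raw single bytes in ISO-8859-1,
# which is how the characters were originally put on the page.
text = "\u00fe\u00f0"                 # thorn, eth
raw = text.encode("iso-8859-1")       # two bytes: 0xFE 0xF0

# A reader whose default encoding is Cyrillic decodes the very same bytes
# differently. cp1251 is an assumed example; the post does not name one.
as_cyrillic = raw.decode("cp1251")

print(raw)          # the bytes that travel over the wire
print(as_cyrillic)  # Cyrillic letters instead of thorn and eth
```

The bytes never change; only the table used to interpret them does, which is why the page looked right to one reader and wrong to another.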

Using Unicode characters encoded as HTML entities also lets you have many languages on the page simultaneously, not just Latin characters with some diacritics, as in ISO-8859-1.

The cons are that some browsers (I believe only very old ones by now, and perhaps some extremely lightweight ones, e.g. on PDAs) won't process and show HTML entities correctly.

Does anybody want to add to this?

--AV

I basically agree with this, although it's not correct to refer to HTML character entity references as "Unicode". &aelig;, &eth; and &thorn; have been around for a long time, so even fairly old browsers handle them correctly. I tried the page in the oldest browser I could find (Netscape 3.0) and it displayed fine. --Zundark, 2001 Oct 14

: You're right that historically it's not correct to refer to entity refs as Unicode. However, since the emergence of HTML 4.0, XML, XHTML etc., the HTML and XHTML standards really treat them as convenient aliases for the numeric character references, and those reference Unicode directly. That's why I think it is useful nowadays to consider things like &eth; as aliases standing directly for the appropriate Unicode character. --AV
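The point about named entities being aliases for numeric character references can be checked with Python's standard `html` module, which follows the HTML entity tables. A minimal sketch, using the three characters from the Old English poetry page:

```python
import html
from html.entities import name2codepoint

# Each named entity resolves to one Unicode code point...
for name in ("aelig", "eth", "thorn"):
    cp = name2codepoint[name]
    print(f"&{name};  ->  &#{cp};  ->  U+{cp:04X}")

# ...so the named, decimal and hexadecimal forms decode to the same character.
named = html.unescape("&eth;")
decimal = html.unescape("&#240;")
hexadecimal = html.unescape("&#xF0;")
print(named == decimal == hexadecimal)
```

Whether one calls `&eth;` "Unicode" or not, the character a conforming parser produces for it is U+00F0.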

Well, it looks like we should be using Unicode throughout then. Anyone know where we can get a definitive list of them for pretty much any language we might need? It might be a useful page to have up here with links. sjc

They are not Unicode (although, as AV says, modern browsers will generally map them to Unicode). There's a complete list (for HTML4.01) at http://www.w3.org/TR/html4/sgml/entities.html . Those in the first table should work even in fairly old browsers, but most of those in the other two tables are less well supported. Other characters can be obtained by using Unicode instead, and you can get a complete list of Unicode codes from http://www.unicode.org/Public/UNIDATA/ by downloading the UnicodeData.txt file (plus the huge Unihan.txt file if you want Chinese-Japanese-Korean characters too). --Zundark, 2001 Oct 14
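For looking up individual characters without downloading the data files, Python's standard `unicodedata` module (which is compiled from the Unicode Character Database, including UnicodeData.txt) gives both directions: code point to official name, and name to character. A small sketch, again using the characters discussed here:

```python
import unicodedata

# Code point -> official Unicode name, plus the numeric reference
# one could type into a page.
for ch in ("\u00e6", "\u00f0", "\u00fe"):
    print(f"U+{ord(ch):04X}  &#{ord(ch)};  {unicodedata.name(ch)}")

# Official name -> character.
thorn = unicodedata.lookup("LATIN SMALL LETTER THORN")
print(f"&#{ord(thorn)};")
```

The same loop works for any character that has no named entity in the HTML tables, since numeric references are not limited to that list.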

Thanks muchly. sjc

: Please note also Unicode and HTML, and especially Wiki special characters. Hmmph, the Unicode and HTML page could be usefully extended with the official Unicode names for characters, not just numbers. --AV

My description will be simplistic, since we are discussing the matter from a purely practical point of view. Pieces of software in different (human) languages (e.g. Hebrew and English Windows) often do not agree upon the representation of characters that do not belong to the English alphabet, punctuation or digits. So, feeding Hebrew Windows a symbol which looks like a thorn in American Windows will not necessarily result in a thorn.

Using Unicode named entities is a way to bypass this restriction. By writing a name, you're no longer assuming that the software that was used to write the page agrees with the software displaying it. That assumption is often incorrect, since the browser does not have enough information about the software used to author the text to figure out how it chose to encode some of the symbols. With Unicode, however, you say explicitly to the web browser "give me a thorn". A good (HTML 4.0-compliant) browser should then look up the font table and try to display the symbol, no matter how it is represented internally.

The conclusion is that using Unicode is the only fully correct practice. It enables people from all over the world to see the page correctly the moment they arrive at it. It is also extensible, so it allows several alphabets to be used on a single page. Unicode named entities are supported by most recent browsers (both IE and Netscape since version 4). Although older browsers might have problems with Unicode, any solution optimized for them will break a much larger number of Unicode-compliant systems that for some reason do not use the same encodings for some characters.

--Uriyan
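The "several alphabets on a single page" point can be sketched as follows: text mixing Latin, Cyrillic and Hebrew letters is written out as plain ASCII with numeric character references, so the result does not depend on whichever encoding the reader's browser assumes. The sample string is made up for illustration.

```python
import html

# One Old English, one Cyrillic and one Hebrew letter (illustrative sample).
mixed = "\u00fe \u0416 \u05d0"

# Encode to pure ASCII; anything outside ASCII becomes a numeric reference.
page_bytes = mixed.encode("ascii", "xmlcharrefreplace")
print(page_bytes)

# Any decoder that understands character references recovers the original,
# because the bytes themselves are plain ASCII.
recovered = html.unescape(page_bytes.decode("ascii"))
print(recovered == mixed)
```

Since every byte is ASCII, the page reads the same whether the browser defaults to ISO-8859-1, a Cyrillic charset, or a Hebrew one.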

Have you read 1066 And All That, Steve? There're some great parodies of Old English poetry there ;)

Sing a song of Saxons
In the Wapentake of Rye
Four and twenty eaoldormen
Too eaold to die ...

That's more like Middle English, I suppose, but they also have a great take on the Beowulf there. --AV