Wikipedia: Wiki Canonization

Wiki Canonization is the algorithm by which the text of Free Links is converted to a URL.

The current algorithm is essentially

convert all spaces to underscores
uppercase the first letter
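
As a rough illustration only, the two steps above can be sketched in Python. The function name `canonize_current` is made up for this example and is not taken from the UseModWiki source.

```python
def canonize_current(free_link: str) -> str:
    """Sketch of the two steps listed above (hypothetical helper, not UseModWiki code)."""
    # Step 1: convert all spaces to underscores.
    name = free_link.replace(" ", "_")
    # Step 2: uppercase the first letter and leave the rest untouched.
    return name[:1].upper() + name[1:]


print(canonize_current("naming conventions"))
```

Note that only the first character is touched, so links that differ in the case of any later letter still lead to different pages.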

This is correct as of version 0.91 (March 2001). The uppercase first letter is not yet strictly enforced.

A better strategy might be

1. remove all accents, umlauts and other diacriticals
2. convert all non-alphanumerics to underscores
3. delete all leading and trailing underscores, and replace any consecutive underscores with a single underscore
4. uppercase the first letter, and any letter that follows an underscore
5. lowercase all characters that were not explicitly uppercased in #4
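
To make the proposal concrete, here is a minimal Python sketch of the five steps. It is an illustration under assumptions, not an implementation anyone has agreed on: the name `canonize_proposed` is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, and accents are removed by Unicode decomposition, so letters that have no decomposition (for example "ø") simply fall through to the underscore step.

```python
import re
import unicodedata


def canonize_proposed(free_link: str) -> str:
    """Sketch of the five proposed steps (hypothetical helper)."""
    # 1. Remove diacriticals: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", free_link)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 2. Convert every non-alphanumeric (ASCII assumed) to an underscore.
    name = re.sub(r"[^A-Za-z0-9]", "_", name)
    # 3. Collapse runs of underscores, then strip leading and trailing ones.
    name = re.sub(r"_+", "_", name).strip("_")
    # 5 and 4. Lowercase everything first, then uppercase the first letter of
    # each underscore-separated word; the result is the same as doing the
    # uppercasing and lowercasing in the order given in the list.
    return "_".join(word[:1].upper() + word[1:] for word in name.lower().split("_"))


for link in ["Kurt Gödel", "WiKiNaMe", "  war and peace ", "Plato's Republic"]:
    print(link, "->", canonize_proposed(link))
```

Under these assumptions "Plato's Republic" comes out as "Plato_S_Republic", which shows how aggressive step 2 is with punctuation.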

This keeps coming up, so let's chat.

What I would most enjoy hearing is an example of two distinct Free Links that, by my suggestion, would canonize to the same url, but really shouldn't.

I don't think there are any. Indeed, canonizing different things to the same thing is often good, even when it leads to ambiguity. The ambiguity can be resolved on the canonical page.

It is unfortunate but true that we have to be careful to support what has been done in the past. We only used CrammedTogetherWords? links for a month or so, and we still haven't gotten rid of them completely, and I expect to keep finding them a few months from now.

: At what stage can we expect them to be phased out of the software? I keep running into things like MacIntosh? ...

A reply from the UseModWiki author (CliffordAdams):

1. remove all accents, umlauts and other diacriticals

: I am not sure this would be best. The current approach is to either ignore non-English characters (if $NonEnglish = 0), or treat them as letters (if $NonEnglish = 1). I would rather not try to recognize accented characters just to remove the accents. Also, some conversions may not be that simple, like "Gödel", where the preferred English version is "Goedel".

I was, indeed, suggesting that we do the wrong thing, and turn [[Gödel]] into [[Godel]]. I still believe that doing so would make for fewer user mistakes than not doing so.

2. convert all non-alphanumerics to underscores

: Currently, only three URL-safe non-alphanumeric characters are allowed (comma ",", dash "-", and period "."). I might consider converting other punctuation to underscores in the future. For now, however, I would rather restrict the range of links.

3. delete all leading and trailing underscores, and replace any consecutive underscores with a single underscore

: There are some attempts to do this already in the code, but it should (and will) be stricter.

4. uppercase the first letter, and any letter that follows an underscore

: This may happen in the next release, at least as a site option. It will require a conversion script to convert the old mostly-lowercase page names.

5. lowercase all characters that were not explicitly uppercased in #4

: This is unlikely, as I want the old-style WikiName links to be allowed in the Free Link format.

I would assert that WikiName, WiKiNaMe, [[Wikiname]], and [[WIKINAME]] should all point to the same page.

March 28, 2001 update:

The development version of UseModWiki now converts all separate "words" in a page name to start with uppercase letters. Words are separated by spaces/underscores or punctuation. (In this version the 5 punctuation characters (),.- are allowed in page titles.)

Links to pages (either within pages or in URLs) will be able to use lowercase words. For instance: naming conventions [naming Conventions]? Naming conventions [Naming Conventions]? ...will all link to the same page (which will be titled "Naming Conventions"). One could link to http.../naming_conventions or http.../naming_Conventions, etc.
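
The rule described in this update can be sketched roughly as follows in Python. This is a guess at the behavior from the description above, not the development code: `page_title` and `page_address` are hypothetical names, and the separator set (space, underscore, and the five punctuation characters) is an assumption.

```python
import re

# Assumption: a "word" starts at the beginning of the name or right after a
# space, an underscore, or one of the five allowed punctuation characters.
_WORD_START = re.compile(r"(^|[ _(),.\-])([a-z])")


def page_title(link: str) -> str:
    """Uppercase the first letter of every word; leave everything else as typed."""
    return _WORD_START.sub(lambda m: m.group(1) + m.group(2).upper(), link)


def page_address(link: str) -> str:
    """Title with spaces turned into underscores, as used in the URL."""
    return page_title(link).replace(" ", "_")


for link in ["naming conventions", "naming Conventions", "Naming conventions"]:
    print(repr(link), "->", page_title(link), "/", page_address(link))
```

Because letters inside a word are never lowercased here, old-style names with capitals in the middle of a word are left alone.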

The only drawback of this solution that I can find is that irrelevant words (like and, of, or a) will also be capitalized in page titles. For instance, the page The Canon of Scripture will be titled [The Canon Of Scripture]?. The page title (fully capitalized) is what is shown on the Recent Changes page, the search results page, and the top of the actual wiki page itself.

Before the code can be changed, a conversion script must be run to convert all the old pages with lowercase words to uppercase names. In most cases this will be easy, but there are a few pages with names that differ only in case. Links will not have to be converted--they will work as-is.
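
A conversion pass of this kind might look something like the Python sketch below, which sorts old titles into safe renames and names that would collide. It is only an outline under assumptions: the real script and the page storage format are not described here, the helper names are invented, the sample titles are made up, and the capitalization rule is approximated as "uppercase any a-z letter not preceded by a letter or digit".

```python
import re
from collections import defaultdict


def new_style_title(old_title: str) -> str:
    # Assumed approximation of the new rule: uppercase each a-z letter that
    # starts the title or follows a non-alphanumeric character.
    return re.sub(r"(?<![A-Za-z0-9])[a-z]", lambda m: m.group().upper(), old_title)


def plan_renames(old_titles):
    """Return (renames, collisions) for a list of existing page titles."""
    by_new = defaultdict(list)
    for old in old_titles:
        by_new[new_style_title(old)].append(old)
    # Safe renames: exactly one old page maps to the new title.
    renames = {olds[0]: new for new, olds in by_new.items()
               if len(olds) == 1 and olds[0] != new}
    # Collisions: several old pages would end up under one new title.
    collisions = {new: olds for new, olds in by_new.items() if len(olds) > 1}
    return renames, collisions


renames, collisions = plan_renames(
    ["Foo Bar", "foo bar", "naming conventions", "Main Page"]
)
print("rename:", renames)
print("needs a human decision:", collisions)
```

The collisions are the pages whose names differ only in case; they would have to be merged or renamed by hand before the switch.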

Does anyone strongly object to this plan? Eventually further canonization may occur, but this simple step should solve some recent problems. --CliffordAdams

Sounds great! I have no objections, but I think it would be nice if ' could be used in page titles, since possessives are so often used. I'm vaguely aware that this might mess up the use of italics or bold, so why not at least permit ' as long as there is no ' adjacent to it? I'd like to be able to write [Plato's Republic]. -- LMS

: I looked for any reason to disallow the ' character, and I don't see any major problems. (I had some concerns about handling the files using UNIX shell utilities, but that is also a concern for the () characters.) If someone uses two or three ' characters in a row it will simply look strange (and can be fixed with a quick edit). --CliffordAdams

First of all, thanks for the parentheses, that will be a big help. And I don't think your present proposal will cause any major problems.

There are two separate functions here that are currently closely tied, but that perhaps could be separated at some point: (1) the page "address" and (2) the displayed page title. The address should be such that accidental or ad hoc links are maximized. For example, if I'm typing some text on another page and use the word "republic", I should be able to put brackets around it on a whim and hope it goes somewhere useful. It would be nice to do the same thing with "Plato's ''Republic''", "Kurt Gödel", and "War and Peace". Removing punctuation, extra spaces, and markup; anglicizing foreign characters (or maybe encoding them); and standardizing case achieves this purpose nicely. But doing the same thing to the displayed title of a page makes it ugly and (more seriously) less accurate.

I can imagine two ways to produce cleaner titles and simple ad hoc links: (1) Rather than simply removing punctuation and foreign characters, encode them in a way that allows them to be reproduced for the title, perhaps URL-encoding. That takes care of "Plato%27s" and "G%F6del", but not the capitalization of "War and Peace". (2) Allow the page itself to override the address-title with its own display-title. This takes care of all of the above, at the risk of possibly confusing users by allowing titles that don't relate to the address at all. Perhaps allow only certain changes; for example, the given title must canonize to the same address. Also, I assume this is harder to implement.
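
Option (1) could be sketched in Python with the standard library's percent-encoding, as below. This is an illustration, not a proposal for the actual code: the helper names are hypothetical, and Latin-1 is assumed as the byte encoding because that is what yields "G%F6del" (characters outside Latin-1 would raise an error in this sketch).

```python
from urllib.parse import quote, unquote


def encode_address(title: str) -> str:
    """Turn a display title into an address that can be decoded again."""
    # Spaces become underscores; anything outside the URL-safe set is
    # percent-encoded as Latin-1 bytes (an assumption of this sketch).
    return quote(title.replace(" ", "_"), safe="", encoding="latin-1")


def decode_title(address: str) -> str:
    """Recover the display title from an encoded address."""
    return unquote(address, encoding="latin-1").replace("_", " ")


for title in ["Plato's Republic", "Kurt Gödel", "War and Peace"]:
    address = encode_address(title)
    print(title, "->", address, "->", decode_title(address))
```

The round trip brings the punctuation and the umlaut back, but it does nothing about case: "War and Peace" and "War And Peace" still encode to different addresses.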

If you decide to eventually adopt one of these schemes or a similar one, it might be a good idea to use a near-term solution that does not conflict with its later adoption.

: Eventually I would like to fully separate content storage from titles. Rather than the current system where a page has one true title, I would like for pages to have multiple names. This is not likely to happen soon. The display-title idea is a good one, but I think it would be more confusing than helpful. --CliffordAdams

Also, it is unclear from your description of the proposed capitalization scheme whether "Nirvana (band)" would remain as is or become "Nirvana (Band)". --Lee Daniel Crocker

: It will become "Nirvana (Band)". Any [a-z] character after a space or punctuation character will be uppercased. --CliffordAdams

: Except for single quotes, please?? :-) "Plato'S Republic" is going to look kinda funny. --LMS
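
The difference the single-quote exception makes can be shown with a small Python sketch. The function and the two separator sets are assumptions made for illustration; the only rule taken from the discussion is that an [a-z] character after a space or punctuation character is uppercased.

```python
def capitalize_after(title: str, separators: str) -> str:
    """Uppercase each a-z letter at the start or right after a separator."""
    out = []
    after_separator = True  # the start of the title counts as a word start
    for ch in title:
        out.append(ch.upper() if after_separator and "a" <= ch <= "z" else ch)
        after_separator = ch in separators
    return "".join(out)


# Hypothetical separator sets: one treats ' like other punctuation, one does not.
WITH_QUOTE = " _(),.-'"
WITHOUT_QUOTE = " _(),.-"

print(capitalize_after("plato's republic", WITH_QUOTE))
print(capitalize_after("plato's republic", WITHOUT_QUOTE))
print(capitalize_after("Nirvana (band)", WITHOUT_QUOTE))
```

Leaving the single quote out of the separator set is enough to keep the possessive "s" lowercase, while the parenthesis case still behaves as described.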

Please see CliffordAdams' remarks at Wikipedia bugs; as well as similar discussions on Naming conventions and Feature requests.

I have the impression that the new wiki canonization has started to be used at http://es.wikipedia.com (and nobody told me!). As a result, I had some problems finding some pages I had written. Please, before using it here, remember to run the script that capitalizes all word initials. --AstroNomer

It appears that the same software change that let us handle foreign characters better (and parentheses! We finally have parentheses, at least in the foreign ones) also updated to the new canonization, which was planned for a long time. It did "lose" all of the old pages with lowercase words in them, so we definitely can't do this to the main Wikipedia without changing every page title first. You can get to the old pages by typing the URL manually, e.g., http://es.wikipedia.com/wiki.cgi?action=edit&id=Cómo_se_edita_una_página; then you can copy the text into the new page. A pain in the ass, but since there are only a dozen pages or so, it's not that bad. --LDC

: Yes, I already did it with some of them, that's why I said "I had some problems" and not "I lost them!" :) --AstroNomer

So does this mean that the articles at the English Wikipedia will convert to all leading uppercase?

Opinions, perhaps strongly worded (grin)...

I think that page-name (the 'address') and page-title must be decoupled sooner or later. Delay only makes the task more difficult. These have different functions and it seems that much of the discontent with current practice is caused by trying to balance "good" page-names against "good" page-titles, causing both to be less than ideal.

Titles can default to the page-name in some form, but only on creation of the page. After that, they should be subject to editing--after all, you trust the community to maintain the rest of the page's content. Requiring the title to canonify (?) to the address could be a possible 'feature'.
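
The possible 'feature' mentioned above, accepting a title edit only if it still canonifies to the page's address, might look roughly like this in Python. Everything here is an assumption for the sake of illustration: `canonify` is a deliberately crude stand-in (lowercase, runs of non-alphanumerics become one underscore), and `accept_title_edit` is a hypothetical check, not an existing wiki function.

```python
import re


def canonify(title: str) -> str:
    """Stand-in canonizer (assumed): lowercase, non-alphanumeric runs -> '_'."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def accept_title_edit(address: str, proposed_title: str) -> bool:
    """Allow a new display title only if it still maps to the same address."""
    return canonify(proposed_title) == address


print(accept_title_edit("war_and_peace", "War and Peace"))
print(accept_title_edit("plato_s_republic", "Plato's ''Republic''"))
print(accept_title_edit("war_and_peace", "Something Else Entirely"))
```

With a check like this, editors can fix capitalization, punctuation and markup in a title, but cannot make it drift away from the address.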

Page-names/free-links should be canonified with an aggressive algorithm. Keep in mind that the primary goals of the page-name are to provide linkage into the physical implementation of the Wiki and to provide a minimally ambiguous (sorry) 'address' based on arbitrarily complex titles.

A non-goal for page-names is to provide a 'user-friendly' alternative to entering a search string or selecting a free link within a page. A page-name that 'looks' similar to the page-title would be nice, but not a strict requirement.

Granted, this would make the software more complex, but I think the users would be happier. As the project grows, I suspect most of the users will be casual accessors who will be somewhat put off by strangely formatted titles and links.

--loh

What finally happened with the wiki canonization discussed here? The new scheme was implemented on the non-English wikis, but the new version (Magnus' one) used the old scheme. What will happen? (Especially since the links and article names in the non-English Wikipedias are made using the all-uppercase scheme.) --AN

Basically, we never switched to the newest version of UseModWiki, because we never solved the problem of the namespace crunch, where "Foo Bar" and "foo bar" were distinct articles in Wikipedia but one of which would be lost in converting to the newest version of UseModWiki.

After working on the [Nupedia Chalkboard]?, which does use the latest version of UseModWiki, I've more or less come to the conclusion that it's rather better to have the old format, which does distinguish between upper and lower case in titles. I'm not directly responsible for the fact that Magnus' software does distinguish between upper and lower case in titles, but neither did I speak up when I noticed that it still does. Frankly, I was a bit relieved that it did, because the titles do look nicer according to the older standard. I think we'd have even more of a problem (a wetware problem) with people making links All Upper Case, since they'd think that was necessary for the link to work. --LMS