This is a complete rewrite of this page. I'm going to work through everyone's comments. I will simply delete a comment if I've taken care of it, or reply if there's some reason why I haven't or don't intend to take care of it.
To keep this all very simple, I'm going to unattribute all of the comments and questions and just list them, like an FAQ or something.
The current version returns results from the full text of all the articles in Wikipedia. It is currently updated when I run a script, which I do frequently while I'm working on it. After today, it will be updated either every few hours or every night, depending on what I decide based on the server load.
The new version that I have written is fast -- it uses FastCGI and a btree file. It also has a semi-crude but semi-clever ranking algorithm for helping to push the best match to the top. The algorithm may be tweaked if we notice major empirical problems with it. It counts words in the title of an article much more strongly than words in the body of the article.
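To give a rough idea of what "counts words in the title much more strongly" could look like, here is a minimal sketch in Python. It is not the actual code (which isn't released yet); the weights, the whitespace tokenizer, and the function names are all assumptions made up for illustration.

```python
from collections import Counter

# Assumed weights: the post only says title words count "much more strongly".
TITLE_WEIGHT = 10
BODY_WEIGHT = 1


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def score(query, title, body):
    """Sum the weighted occurrence counts of each query word in title and body."""
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    return sum(
        TITLE_WEIGHT * title_counts[word] + BODY_WEIGHT * body_counts[word]
        for word in tokenize(query)
    )


def rank(query, articles):
    """Return the titles of matching articles, highest score first."""
    scored = [(score(query, title, body), title) for title, body in articles]
    return [title for s, title in sorted(scored, key=lambda p: -p[0]) if s > 0]
```

With these assumed weights, an article titled "Horse" outranks an article that only mentions a horse in its body, and articles that never mention the word are dropped.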
The code will be released tomorrow morning, I hope. I need to clean it up a bit; it's sloppy.
Google search box added, simplified. Ignore the placement, I'll rearrange later. Do we still need this, given the fact that I'm doing fulltext? Perhaps this should be an option with the main search box, or perhaps I should just link to this search?
REDIRECT pages are completely ignored. Empirically, most of them are simple respellings that clutter the results. This has a cost, as per someone's "mountain lion" and "puma" example, but since we're doing fulltext, that cost has been minimized.
Please tell me again. It might have gotten lost or maybe I thought I fixed it. Unless it is mentioned here, I think I fixed it.
I plan to add this later today. Actually what I plan to do is make this a radiobutton option.
The reason for this usually isn't to generate more pageviews for more ads; it's good practice not to put too much on a single page for people with slow modems (i.e. most people). After I get things stabilized, I will look carefully at what the optimal default should be (10? 15? 50?). Ideally, we should store your preference in the preferences cookie and respect that. So you can get 200 at a time if you want, and other people can get the standard default.
Oh, I see what you mean. Where I had it in the code before, it was showing up on pages that you can't really edit. Now it is not at the top of pages that you can edit. I'll study this.
(It isn't really about the search engine, though.)
Tough call. I'm thinking about removing them, but having the text on them still work and point to the main page. So if someone mentions a word on a /Talk page, you'll be sent to the main page instead when you search on that word.
That's problematic, of course. But currently, we return a lot of talk pages that are probably unnecessary.
The other thing to do is simply exclude all /Talk pages, period. I actually prefer this solution myself, but...
Anyhow, I think we should identify /Talk strictly as pages which are named such-and-such/Talk.
I disagree. I can make them a little longer, but remember that we want the page to load quickly for people. The idea is not to read the page here, but to just get a quick idea of whether this is your context. Look at what Google does -- I'm already returning significantly more.
I think the italics look nice. :-)
I think that a better solution will come when we move to a MySQL solution. Certain pages can be flagged as personal, for example, and then handled differently. For now, this is a lot of work for minimal benefit.
All of the things listed there will change soon, so as to lean more towards encyclopedias. That was just cut and pasted from another site I own.
I can't do substring searches with my current setup, period. I should emphasize that I can't do it; I'm not saying that it can't be done.
However, the right thing for a search engine to do with your example is to list all the Márquez articles as such, but ALSO to list them under Marquez by squashing the fancy 'a' down to a regular 'a'. In this way, people can type either and get decent results. Right now, I don't do that. Anything that people type goes into the system as-is. This is good in a way but bad in a way.
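Squashing the fancy letters down could be sketched like this in Python, using the standard unicodedata module. This is only an illustration of the idea, not what the search engine does today; fold_accents and index_keys are hypothetical names.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so that 'Márquez' becomes 'Marquez'."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(word):
    """Return the keys a word would be indexed under: as typed, plus folded."""
    return {word.lower(), fold_accents(word).lower()}
```

Indexing each word under both keys means a search for either spelling finds the article. Letters that have no decomposition (such as 'ø') pass through unchanged, so they would need a separate mapping.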
One thing to keep in mind is that at least on the English language wikis, most people won't have the least clue how to type in those fancy foreign letters. I don't. (I had to cut and paste yours to include it above!)
Ah, ha. That's funny. Bomis, which is my main site, and the site that pays the bills for all our Wikipedia and Nupedia fun and games, is a web 'ring' search engine. So 'ring' is a stopword there.
The cause of the problem you identified is that I did a cute trick with 'ing', basically causing the search engine to treat 'thinking' and 'think' in the same way. I do this with 's' at the end, too. With 's', this cute trick eliminates all the woes of singulars and plurals, so that "horses" and "horse" return the same thing, which is good.
There are some funny side-effects, I see. I'll make an exception for 'ring' and 'thing' on the next revision!
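For the curious, the trick with 'ing' and 's', plus the two exceptions, might look roughly like this in Python. It is a simplified sketch and not the Bomis code; the function name and the exact rules are assumptions.

```python
# The two exceptions named above; a real list would probably grow over time.
EXCEPTIONS = {"ring", "thing"}


def normalize(word):
    """Silence a trailing 'ing' or 's' unless the word is a listed exception."""
    word = word.lower()
    if word in EXCEPTIONS:
        return word
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("s"):
        return word[:-1]
    return word
```

Here normalize("thinking") and normalize("think") both give "think", and normalize("horses") and normalize("horse") both give "horse", while "ring" and "thing" are left alone. The same function would have to be applied to titles, bodies, and queries alike for the matches to line up.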
See the previous entry for a clue as to the cause. My immediate thought is that I have a bug -- in the body, I silence a closing 's', and in the title, I don't. So your search for 'Gauss' actually searches for 'Gaus' which, in the body, is equivalent to 'Gauss'. But I didn't do the trick correctly in the title.
This is kind of fun. I'm giving away all my "secrets" from the Bomis search engine. I've never thought they were all that valuable as secrets, but they have been secret for a few years now. Cute tricks, mostly. :-)
I agree that it should be explained somewhere. But I think that it's not annoying behavior. Like many other design choices, it's really an empirical matter. Does it help more often than it hurts? In my experience, it helps on the vast majority of searches, but hurts only sometimes.
The horse example is a good one. It seems very unlikely to me that someone would really care much about 'horse' versus 'horses'. Other examples, though, illustrate the downside more clearly. For a few words, chopping off the 's' smashes together two words of very different meanings, thus cluttering the results.
But more generally, and this is particularly true of a site with an overall smallish set of data (and Wikipedia still is, despite our fast progress, pretty small as compared to the web as a whole!), it helps a *lot*. If you're searching for information about Aleutian indians or Aleutians, you're searching for the same general sort of thing, and there's room in our search results to show both. (Basically, we don't have an article on either, so we'd better show you 'Aleutian Islands', which we do have.)
A better thing to do, and perhaps this is a useful compromise that I can think about, is to keep the two separate in the database, but upon searching, to search for both forms at the same time, and then to blend the results. Exact matches are given more weight than inexact matches. This should drive your 'horses' article to the top, if it exists, while also returning the 'horse' articles ranked lower. Or vice-versa as the case may be.
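Here is a minimal sketch of that compromise in Python: the index keeps the exact words, and the search looks up both forms and blends them. The weights, the tiny in-memory index, and all of the names are assumptions for illustration only.

```python
# Assumed weights: the idea is only that exact matches count for more.
EXACT_WEIGHT = 2.0
INEXACT_WEIGHT = 1.0


def stem(word):
    """Crude stand-in for the 's' trick: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def build_index(articles):
    """Map each exact lowercase word to the set of titles containing it."""
    index = {}
    for title, body in articles.items():
        for word in (title + " " + body).lower().split():
            index.setdefault(word, set()).add(title)
    return index


def search(query, index):
    """Look up the query as typed plus its other form, then blend the scores."""
    query = query.lower()
    base = stem(query)
    forms = {query, base, base + "s"}
    scores = {}
    for form in forms:
        weight = EXACT_WEIGHT if form == query else INEXACT_WEIGHT
        for title in index.get(form, ()):
            scores[title] = scores.get(title, 0.0) + weight
    # Highest score first; ties are broken alphabetically by title.
    return sorted(scores, key=lambda t: (-scores[t], t))
```

With one article titled "Horses" and another titled "Horse", a search for "horses" lists "Horses" first and "Horse" second, and a search for "horse" gives the reverse order.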
And I also agree that there should be a way to turn off any behavior that anyone doesn't like. But this is somewhat advanced for now. Still, and this is especially true once I publish the code (Tuesday, I bet), we can all pitch in to polish it up.
The main thing to remember is that search engines need to return what people are really looking for, even when they aren't good at formulating a proper request. So we want to 'fail gracefully'. If someone enters 'Aleutians' we want to give them something potentially useful, and not pretend that 'Aleutian' isn't relevantly similar. So crushing 's' is surely a part of any valid search strategy.
Search for "Paris" does not return the Paris page, but does return articles with the word "paring". This would appear to be because the search engine is dropping the final 's' in the word "Paris". - Tim
So if I want to search for cooking and I don't want to wade through a billion matches for cook, how do I do that? Maybe we could have "do what I actually want" and "do what Larry thinks I want" as checkboxes ;) -- Greg Lindahl
The current version returns results from the full text of all the articles in Wikipedia. It is currently updated when I run a script, which I do frequently while I'm working on it. After today, it will be updated either every few hours or every night, depending on what I decide based on the server load.
The "today" mentioned was quite some time back. Apparently the search index has not been updated for weeks. Is the cron job broken or something?