History of Wikipedia commentary/Proposal for an Encyclopedian Recycling Endeavor


Revision 2 . . November 10, 2001 3:24 pm by ManningBartlett [*moved]
Revision 1 . . October 6, 2001 6:08 pm by The Cunctator [*moved from Encyclopdian]
  

Today, August 26th, marks the completion of a distributed endeavor to copy into Wikipedia
the articles from a 1911 encyclopedia that someone had thoughtfully digitized and placed into
the Project Gutenberg archives. This endeavor had been suggested by someone back when
Wikipedia first started, and I had taken the initiative to get the first 20 or so items
posted (starting, of course, with the letter 'A'). 20 items was just a drop in
the bucket, though; check out Alan Millar/Status to see them all. Thankfully, folks
like Alan, with lots more endurance than I, were available to continue posting these
articles, and today I stand amazed that we completed it! Even though it was only
one volume, that's still a lot of articles!

In my opinion, these articles greatly augment Wikipedia, with necessary data that is
unlikely to just "happen" to be entered by visitors. Consider, for example,
Alphonso X of Spain, a medieval Spanish king. Certainly worthy of mention, both
as a world leader and because of his early involvement in astronomy. But would an
entry on this fellow just happen to show up through normal Wikipedia processes? Maybe,
but probably not. Needless to say there are thousands of moderately important people
like Alphonso who *should* be listed in Wikipedia, yet most likely *won't*; at least not
anytime soon.

I don't mean to disparage Wikipedia; quite the contrary. Wikipedia has a number of
strengths that the proprietary encyclopedias will likely *never* have. Of these many
strengths, let me choose just one for elaboration: Timeliness. Let me digress a bit
with a little example.

This week a new
planetoid-thingee was discovered out in the comet belt. Very important scientific
discovery, but let's say within a few weeks your daughter needs to write a report on it
for high school, and needs more in-depth info than available in those terse CNN news
items. That hard-bound dead-tree encyclopedia might have some useful articles on
asteroids and the solar system, and probably will only be a few years out of date, but
it certainly won't have anything useful on this newly discovered planetoid. Fortunately
you bought your daughter a new computer today and it came with a digital CD ROM
encyclopedia. Unfortunately, due to space constraints, this encyclopedia's asteroid
article is extremely terse (though it does have a photo of an asteroid, but it's
copyrighted with full legal protections of course). And since the CD ROM was published
months before the new planetoid thingee was discovered, you're not likely to find it
there either. You decide to try the online encyclopedias, yet these appear to be just
online versions of the CD ROM you bought; maybe the pay-to-view site can afford to stay
more up to date, but you've paid for two encyclopedias, and aren't too thrilled about
paying for another. Surely there must be another option...

Your daughter giggles at you. "Silly, just go to Wikipedia!" You do so, and she looks
at the Recent Changes page to see if anyone's been keeping up to date with the news. Sure
enough: Just today (the 26th of August, only a few days after the discovery), folks have
been busily posting away on all manner of astronomical topics. Asteroid is rather
terse, but at least includes mention of that planetoid thingee, "asteriod found in 2001,
identified as 2001 KX76". Asteroid/Talk includes some extra interesting info not yet
incorporated into the main article. Trans-Neptunian object, Kuiper Belt, Planet,
Near-Earth asteroid, Solar system, and Planet X have also been updated (or newly
added). Clicking around, she's able to find much more information on astronomers, other
recent discoveries, and historical information to fill in her report. She's even able to
make contact with some other students and teachers interested in this newly found body, and
thereby learn of further sources of information on it, available through the web. When she
is finally done and turns in her report, she decides to also post it up to Wikipedia, as
article 2001 KX76. :-)

Wonderful! Wikipedia comes to the rescue and serves its role in the passing of knowledge
to those who need it.

But you notice two things that are kind of odd. First, of course, there are very few photos in
Wikipedia, but that's a whole 'nother topic of discussion. At least there's a photo of
Galileo on his page. Second, and perhaps more importantly,
many of the supporting articles seem to be rather terse. For example, you compare your
proprietary encyclopedia's lengthy dissertation on Pluto with Wikipedia's dinky Pluto entry.
Saturn is not much better. Charon doesn't even exist... Hmm...

Timeliness may be a strength of Wikipedia, but depth may be its weakness. Certainly
we can expect better articles on the planets to come; after all they're big and will
always be there and new things are likely to be discovered about them. But what about
other, older topics? Say you needed to know about the origins of astronomy in the 13th
century. Luckily there's that aforementioned article on Alphonso X of Spain, but what of
historical figures with names starting with B-Z?

The 'A' encyclopedia was digitized by hand, by someone who happened to have a 1911 edition
on hand. Digitization is a lot of work, but it can be done; Project Gutenberg's been at it
for years, and when you think about it, they're not so very different, organizationally,
from us.

So here is my proposal.

I think we could turn our distributed, collaborative talents and processes towards a
mini-Project Gutenberg endeavor to digitize and copy into Wikipedia a full set of
out-of-copyright encyclopedias.

I think in the interests of practicality and to make distribution of efforts a bit simpler, we may want to allow variation in years... We could have the 1911 A, 1922 B, 1909 C, etc. I think if we allow this, it eliminates a lot of need for coordination. There will of course be variability in where one volume stops and another starts, and we might have a few articles slip through, but if those articles are important, we may "pick them up" through usual Wikipedia evolution.

There are several steps we'd need to take:

a) First, we would need to determine a rough cut-off date for copyright. Is 1911 the only year open to us, or could a 1920 or 1930 edition be used?

b) Next, we need folks to keep an eye out, as they go about their lives, for old encyclopedia sets. Look in grandparents' closets and bookshelves, a dusty corner office in an old college, book-heavy garage sales, and used bookstores. The set needn't be complete, but it's important that the print quality be good enough to scan. See if you can buy, or be given, one or more of the books.

c) Now, even if an encyclopedia is in the right date range, we still need to pause and verify that the particular edition in hand *is* in the public domain. Copyright law can be complicated, and Wikipedia *certainly* doesn't want to do anything that could risk a lawsuit from a jealous encyclopedia company some day.

d) I think the easiest and fastest way to scan an encyclopedia volume is also rather destructive: Tear off the cover, break the binding, and cut the pages into loose-leaf, then run them through a scanner. I bet a multi-page feeding scanner would letcha get through a bunch of volumes at once. I suppose one could justify the ruining of an antique book in the knowledge that it's probably near the end of its life in paper form anyway and is bound for a new and even more meaningful life in an electronic form. Note that by leveraging the US Postal System (or FedEx), this step need not necessarily be done by the same fellow who did step b. :-)

e) Next is the hard part: Proofreading. But maybe this step could be skipped, if the scanner is good enough. I've noticed that with the volume 'A' articles, spelling and format correction is quick to occur when the article appears in wikipedia. So maybe this step could be just a quick QA to ensure the page isn't garbled and in need of re-scanning.

f) With the article digitized, the next step is to get it into wikipedia. This is the step we already know how to do very well, so nothing more need be said. Judging by how quickly articles have been submitted lately, I'm guessing someone has developed a tool or process we could reuse.

Steps b-f can proceed in parallel; someone in New Jersey could be working on volume 9 of a 1919 encyclopedia, while someone else does volume 21 of a 1912 edition. Coordination can be done peer-to-peer, as folks ask who is working on what, and can see from what's *not* in wikipedia what still needs to be done.

BryceHarrington




I agree that this is a very important project, and should definitely be done. I am pretty sure that 1911 is the newest public domain version of EB; it is also a very good one (I believe EB was sold to the US afterwards and for a while not much material was added), so sticking with it would be fine.

I think that if we indeed get the scanning project going, we should also donate it to Project Gutenberg, since they gave us volume 1. This means we release it in the public domain I guess.

However, I'm wondering if Project Gutenberg is scanning the other volumes right now. Does anybody know who scanned volume 1? Those people should be contacted. --AxelBoldt

The rest of the 1911 Encyclopedia Britannica is available on CD-ROM as image files. See http://www.classiceb.com/
So the physical scanning part is already done (no destruction of
books required :-), and all that remains
is the OCR. The files could be then given to Project Gutenberg
and also used here. --Alan Millar



May I suggest we try doing something the "Christian Classics Ethereal Library" has done -- you set up a website with the scanned-in image of each page, and a text-area for people to transcribe it into. Or if you OCR it in, the OCR won't always be perfect, so you can set this up so people can correct the OCR'd text from the original image.

The problem with using Wikipedia to correct things is that it is unlikely to result in an exact transcription of EB, which is what Project Gutenberg would want.

Finally, I might note that a lot of texts that are public domain (such as EB1911) in the US may still be copyrighted elsewhere, due to past differences in copyright law -- but then so long as Wikipedia is located on a US server, that shouldn't be a problem.

Also, http://www.classiceb.com claims their scanned-in images are copyrighted. As per Public Domain Resources/Talk, they are (probably) wrong -- mere scanning is not of sufficient novelty to create copyright. But they still might cause legal hassles -- so maybe we'd better just scan it in ourselves.

-- Simon J Kissane

:They claim copyright on their images and CDs, but not on the text itself; maybe we should just contact them to ask about OCR. They'll probably allow it but won't allow us to put up their images on a web site for people to grab and OCR. Everybody who wanted to participate in the OCR effort would have to buy their own CD set, at $109 a pop. --AxelBoldt



If you don't want to rip a volume apart you can do what I did reasonably successfully with a lot of the Catholic Encyclopedia articles I scanned - photocopy the page, then scan it, OCR it, then proofread it. Of course, I was avoiding working on my dissertation and had free access to copying machines... --MichaelTinkler



Text from: http://www.classiceb.com/faqs.html

However, while the original text of these older editions of the Encyclopedia Britannica are in the public domain, the ClassicEB® CDs are copyrighted by ClassicEB.com in all respects except for the original text. This means that purchasers of any ClassicEB® CDs may not copy our CDs in any way under penalty of violating copyright laws. Users may, of course, print off on paper any or all images contained on the CDs, in unlimited amounts, without violating ClassicEB.com's rights or copyright laws.

If you own your own physical set of the public domain editions of the Encyclopedia Britannica, you may scan your set onto CD and offer them yourself without being in violation of copyright laws. But you may not copy the images contained on the ClassicEB® CDs, nor may you use in any way the manual or any of the index tools contained on the CDs which were designed by ClassicEB.com. Violators of ClassicEB.com's rights will be prosecuted.

:They claim it's copyrighted, but I doubt that the images in fact can be copyrighted, since they contain no novelty. (Unlike photographs, there is no creativity in the arrangement, selection of view or lighting.) Just because they say it is copyrighted doesn't mean it is. Though of course, them suing us (even if they lose) wouldn't be fun. -- Simon J Kissane

:One other thing they probably can't sue you for is re-encoding their scans in a new format, reorganizing them with your own indexes, and burning your own CD from that. Even if you take the raw data from their scans rather than rescanning the paper yourself, you're probably within the scope of Feist. I had planned to buy their CD anyway, just so I can do some manual edits on what are clearly some bad OCRs of the original. When I get the CD I'll do some back-of-the-envelope estimates on what it would take to recode and reburn it for us. --LDC




One thing nobody has mentioned is that a lot of things have changed
since 1911, leading to a lot of misleading ideas in articles
scanned directly. A disclaimer at the bottom of the page helps,
but knowing that it's an old text doesn't fix everything, and
it can be hard to expunge all the wrong data. Articles on
physics, for example, would need to be carefully reviewed by
somebody fully up-to-date, lest we bring relativity back into
debate...

Even things like the article on King Alphonso might be considered apocryphal nowadays - it's my understanding that
history gets revised now and again (even by non-revisionists :))


:I think this is an issue with all content in the Wikipedia, not just entries from old encyclopedias. It is being discussed significantly in a number of different places on the site here. (Before you can expunge wrong data, you have to define "wrong".) Not to pick on anyone in particular, but I don't perceive any substantial science behind the entry for Pheromones, for example.

:But that just goes back to the basic scheme of Wikipedia: It's not Nupedia, and it isn't intended to be. It's a free-for-all, and anyone can fix anything. --Alan Millar


Yesterday, I just found the complete 1911 Encyclopaedia Britannica at my favourite used bookstore, only $180 (Canadian). I don't have the funds for it at the moment (and my scanner is broken), but I'm keeping it in the back of my mind...

Working on a project like this gets me seriously annoyed with the current state of copyright law. I have a World Book Encyclopedia set from the 1970s that I can't even give away. The company that makes World Book won't make a cent off of this material, but because of the copyright, we can't use it for Wikipedia; instead we can only reuse material that hasn't been current for almost a century. *sigh* -- STG



I've actually got a paper set of the 9th and 10th edition EB, dating 1870-1900 or so. Many of the articles are so old that they might as well be rewritten from scratch, but some seem worth putting up for editing... the volumes are in too good a condition to contemplate wholesale destruction as exhorted above, so I'll tend to be adding shorter articles typed by hand on random subjects. At the very least, it'll help fill in details on obscure topics (and I was impressed by the speed at which extra information was filled in on the Fifth monarchy men) -- Malcolm Farmer


My opinion is that timeliness is one of our strengths, as those things that interest us get added. I expect that almost every subject will be covered in the next twenty years, and about half of the information will change as well. If we remember that this GREAT project started in January 2001, then compared to that other work, we are accomplishing a great amount. Gutenberg is trying to get Public Domain works into electronic form; we are creating a new work, which I feel is a different and (in my opinion) more satisfying thing. The Old Stuff is a good start, but those of us living have been creating and discovering a great amount of New Stuff, which needs documentation as well.--mike dill
contents moved to [1]
