Human-Powered Internet Archive Book Project
Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."
Sorta. (Score:5, Informative)
If you look at the current books on Distributed Proofreaders [pgdp.net], you'll see that some of them credit the Million Books Project for the page scans.
Re:Diffrent? (Score:1, Informative)
bullshit
Re:Why not join the Gutenberg Project (Score:4, Informative)
Project Gutenberg and the Open Content Alliance are working on two slightly different things:
The OCA is making available the images of scanned pages. That's fine for reading an entire book, but you can't search it, nor copy a section of text into a document of your own.
Project Gutenberg makes available plain text, usually illustrated HTML, and occasionally other versions, of public domain books, which can be used by anyone for no cost.
If you'd like to help prepare public domain ebooks, visit Distributed Proofreaders [pgdp.net] and proofread a page a day (or more!).
Re:Diffrent? (Score:4, Informative)
The Open Content Alliance is a consortium of non-profit and for-profit groups dedicated to building a free archive of digital text and multimedia. It was conceived in 2005 by Yahoo and the Internet Archive in response to Google Print's closed nature, and aims to keep public domain works in the public domain on-line. The resulting archive will then be used in the search results of participating search engines. You can see a sample of the open content at openlibrary.org.
A large difference between the OCA's approach and that of Google Print is that the OCA intends to ask a copyright holder before digitising a work that is still under copyright, while Google Print will digitise any book unless explicitly told not to do so by November 1, 2005.
So, Google Print will almost certainly be better when searching for copyrighted material. For public domain works, we'll have to wait and see.
IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.
Re:Diffrent? (Score:5, Informative)
Re:Why not join the Gutenberg Project (Score:1, Informative)
Re:Can only be a good thing (Score:3, Informative)
That is called periodic storage, and for anything you wish to preserve, it is necessary. Your argument is a bit weak, considering that any information in book or electronic format needs to be recopied periodically. Books need to be recopied less often than electronic copies, but electronic copies are cheaper and easier to store, which offsets the costs.
The OP wasn't saying to burn the paper books after they're stored, merely to put them in electronic format ASAP because some of them might not be around for too long (funny how those books that are in danger of becoming extinct haven't been backed up in paper format, even though paper lasts for so long).
Re:Contributing to Gutenberg (Score:5, Informative)
I maintain several lists that show the DP harvesting status of several image collections, including The Internet Archive's Canadian Libraries collection [ntlworld.com], Google Print [ntlworld.com], and Early Canadiana Online [ntlworld.com]. As you can see, we will not be running short of material to work on for a very long time, even without any of these recently announced initiatives. That said, it's always great to see more material be made freely available, rather than locked up behind expensive subscription services like Jstor and EEBO.
Re:Good Bad Ugly (Score:3, Informative)
20000 dollars, 40-50 weeks a year, 40-50 hours a week
yep, that's 10 dollars an hour...
Does that mean all the PhD students should be kicked out of their labs and shouldn't be able to handle expensive books?
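For anyone who wants to check the arithmetic above, here is a quick sketch. The function name `hourly_rate` is made up for illustration, and the only inputs are the figures quoted in this comment (20000 dollars, 40-50 weeks, 40-50 hours).

```python
def hourly_rate(annual_pay, weeks_per_year, hours_per_week):
    """Return pay per hour given annual pay and a working schedule."""
    return annual_pay / (weeks_per_year * hours_per_week)


# The three corners of the quoted range, using only the numbers in the comment.
lightest = hourly_rate(20000, 40, 40)   # fewest weeks, fewest hours
middle = hourly_rate(20000, 50, 40)     # 50 weeks at 40 hours
heaviest = hourly_rate(20000, 50, 50)   # most weeks, most hours

print(lightest, middle, heaviest)
```

The "10 dollars an hour" figure corresponds to 50 weeks at 40 hours; the other ends of the quoted ranges move it a couple of dollars either way.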
Re:Good Bad Ugly (Score:4, Informative)
Actually, you can buy a robotic book scanner [kirtas-tech.com] (there's a demo video of it). No doubt it costs an arm and a leg, although it may be worth it if you're scanning a large enough volume of books.
Scanning with precision is difficult (Score:2, Informative)
The (Jack) Vance Integral Edition [vanceintegral.com] was a volunteer effort to produce a limited edition 42 volume set of the complete works of Jack Vance, restored to as close to the author's original manuscripts as possible.
(The project is complete, and an amazing success.)
The team scanned and edited many of Jack's early works for which there was no good clean manuscript. They developed software tools that would compare scans from different editions to automatically find errors. It turns out that even the best human editors still missed "scanos" (typos produced by the scanning process) that the automated tools found.
Even so, in the final books there were a handful of errors that slipped through, despite extremely careful editing by hundreds of volunteers.
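To give a rough idea of how comparing scans of different editions can surface scanos, here is a minimal sketch. It is not the VIE team's actual tooling, which isn't described here; it just uses Python's standard `difflib` to align two OCR texts word by word and report where they disagree. The function name `find_disagreements` and the sample sentences are made up for illustration.

```python
import difflib


def find_disagreements(text_a, text_b):
    """Align two OCR texts word by word and return the spans that differ.

    Each result is a (words_from_a, words_from_b) pair; either side may be
    an empty string when one text has words the other lacks.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    suspects = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            suspects.append((" ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])))
    return suspects


# Hypothetical scans of the same sentence from two editions.
edition_a = "The modern world was dark and cold"
edition_b = "The modem world was dark and cold"

for a_words, b_words in find_disagreements(edition_a, edition_b):
    print(repr(a_words), "vs", repr(b_words))
```

A disagreement only says that at least one scan differs from the other at that spot. It can't tell a scanning error from a genuine difference between editions, so a human still has to look at the page image and decide.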
Re:How can I help? (Score:3, Informative)
As a few others have mentioned, jump in to Distributed Proofreaders [pgdp.net]. We take the raw images (either scanned specifically for DP or taken from scanning projects like this) and produce checked, corrected text, which then goes to Project Gutenberg [gutenberg.org]. A few hours a week can help a lot.