Human-Powered Internet Archive Book Project


Posted by Zonk on Saturday November 12, 2005 @02:39AM from the hope-she-likes-the-way-books-smell dept.

Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."


This discussion has been archived. No new comments can be posted.


Contributing to Gutenberg (Score:2, Interesting)

by watermodem ( 714738 ) writes: on Saturday November 12, 2005 @02:45AM (#14014239)

Will the scans be added to the Project Gutenberg collection?

It's lighter! (Score:4, Interesting)

by HolyCrapSCOsux ( 700114 ) writes: on Saturday November 12, 2005 @02:46AM (#14014244)

Last time I moved, it took many VERY HEAVY boxes to move all my books. Maybe I'll scan them all...

Although anything useful has to be illegal... :(

Re:Diffrent? (Score:1, Interesting)

by Anonymous Coward writes: on Saturday November 12, 2005 @02:53AM (#14014264)

The Internet Archive is a non-profit. As for sufficiently old books, they're out of copyright anyway, and neither Google nor the Archive will have problems. Meanwhile, as an author, I would be fairly happy for a non-profit such as this to scan my publications, providing search and excerpts. I don't think I'd even be too up in arms if they used opt-out - they are, after all, just extending the role of the library. On the other hand, when Google does it, with the aim of making its shareholders richer (whether that's by providing for-pay services, advertising, or merely control that provides potential for revenue) I most certainly will refuse. (captcha: archival. hehe)

Can only be a good thing (Score:2, Interesting)

by LordofEntropy ( 250334 ) writes: on Saturday November 12, 2005 @02:55AM (#14014270)

Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.

In Stanislaw Lem's science fiction book "Memoirs Found in a Bathtub", all the paper in the world gets eaten by a virus and chaos ensues. Interesting read if you've missed it, has made me paranoid about how much the world still depends on paper.

Re:Why not join the Gutenberg Project (Score:2, Interesting)

by TWooster ( 696270 ) writes: <twoosterNO@SPAMgmail.com> on Saturday November 12, 2005 @03:03AM (#14014296)

That's a good question, but I can't help but wonder if this is the miracle of capitalism at work. Right now we're in the eeeearly stages of this sort of thing, and the copyright laws, the mechanics, et al are still rather unexplored. Besides, I have to think -- the scanned images themselves are probably copyrighted by those who scanned, but chances are the plaintext isn't (considering they're copying it already, and not reinterpreting it). So the more people who want to scan whatever, the better, even if they overlap. Consider it error checking.

The real test and business opportunity comes in the distribution phase. The first person to have a huge library of old books, and contracts with publishing houses for new books (with "purchases" by the end users, and DRM encumbered, of course) is the person who will win the market and define the (capitalistic) best way to scan and distribute.

And come the semantic web, things get really interesting. Already we have tons of sites that do cross-referencing between academic papers -- at least, the citations, as well as categorization by topic. When we can start doing this for books based not only on genre, but on topic, specific references to persons, or general concepts ("Book X mentions technology Y on page Z. Click here for link!")... well, things will become far more informative. I suspect that in this field, the information -- the texts -- may become free, but the computerized (and human-assisted) analysis, linking, and value-added stuff will be the new commodity. He who has the best algorithm wins.
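To make the "Book X mentions technology Y on page Z" idea concrete, here is a minimal sketch of a concept index over scanned text. Everything in it is an assumption for illustration: the `build_index` name, the sample books, and the plain substring matching, which a real system would replace with something far smarter.

```python
from collections import defaultdict


def build_index(books, concepts):
    """Map each concept to the (title, page) pairs where it appears.

    `books` maps a title to a list of page texts (hypothetical layout).
    Matching is a naive case-insensitive substring test.
    """
    index = defaultdict(list)
    for title, pages in books.items():
        for page_number, text in enumerate(pages, start=1):
            lowered = text.lower()
            for concept in concepts:
                if concept.lower() in lowered:
                    index[concept].append((title, page_number))
    return dict(index)


# Hypothetical sample data: two "books", each a list of page texts.
books = {
    "Book X": ["An early chapter on printing.", "The telegraph changed everything."],
    "Book W": ["Nothing relevant here.", "Printing and the telegraph compared."],
}

index = build_index(books, ["telegraph", "printing"])
for title, page in index["telegraph"]:
    print(f"{title} mentions telegraph on page {page}")
```

The texts themselves are the free part here; the value sits in how good the matching and linking are, which is exactly the piece this sketch leaves naive.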

I guess information has always wanted to be free, but the analysis of said information lies firmly in the realm of economics.

Scanner: I want. (Score:3, Interesting)

by sakusha ( 441986 ) writes: on Saturday November 12, 2005 @03:04AM (#14014298)

Wow, that book scanner rig is just what I've been dreaming of for years. I've been thinking about mounting a couple of glass plates at a 90-degree angle, and then I could put the open book on the apex of the glass, then photograph it with a couple of cameras underneath. This rig is just exactly what I was thinking of, but upside down and even cleverer, with a foot pedal to lift the glass up and down onto the book. A very nice piece of design work.
The obvious advantage of this rig is that you don't have to open the spine 180 degrees and smash the book flat onto a single glass plane. You don't have to open the book more than 90 degrees, so it's gentle on the spines of fragile old books. And the glass wedge is always self-centering against the spine of the book. The only way this scheme could work better is if there were a way to turn the pages automatically. But these are old and presumably valuable works, so it's safer to let paid low-wage drones do the work than to risk mechanical damage.

Book Scanners (Score:3, Interesting)

by jab ( 9153 ) writes: on Saturday November 12, 2005 @04:11AM (#14014425) Homepage

Here's a list of book scanning equipment [harvard.edu]. I've seen the one from Kirtas in action, it's fun to watch.

How can I help? (Score:1, Interesting)

by Anonymous Coward writes: on Saturday November 12, 2005 @04:18AM (#14014442)

How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.

Manual seems safer to me.... (Score:3, Interesting)

by fantomas ( 94850 ) writes: on Saturday November 12, 2005 @08:14AM (#14014861)

"It seems a pity to use such a manual method"

Interesting - I don't understand your line of thinking - interested to hear more. Is the argument that automated page turning is *cheaper*, so it's a pity that the project spends a lot on labour charges (manual scanning)? Or is the argument that automated page turning is easier on the fragile old books? I'd appreciate it if you could offer more details about the technology - the company's demo video shows a vacuum device lifting pages, but both examples are with modern books. Honest question: surely the advantage here is a low labour cost method of scanning huge numbers of pages (like the telephone directory example they show). But if you have fragile books, surely the advantage of a human is that they can see that individual pages might be particularly fragile, maybe even needing support or repair to scan, while the pre-set vacuum device will plough on regardless, since it can't make a decision on the quality of the pages. Does it have any sensing devices built in? My experience of older books (e.g. nineteenth century) is that in some cases the paper can be very brittle.

Re:RTFA? (Score:3, Interesting)

by commbat ( 50622 ) writes: on Saturday November 12, 2005 @11:39AM (#14015371) Homepage

I'd RTFA if the black text didn't overlap a black image. IE-only web designers should be shot.

This is when the 'remove this object' firefox extension [mozilla.org] comes in handy. Just remove the image and the text is readable. 'Undo last remove' to get the image back.

I don't think you should have been modded down.

Re:Diffrent? (Score:2, Interesting)

by Chubby_C ( 874060 ) writes: on Saturday November 12, 2005 @12:08PM (#14015490)

With all these companies now deciding they want to scan books (Google, Amazon), why not partner up on this project? It would greatly reduce the overall costs, since otherwise each company will end up scanning the same books as the others.
At least partner up for the scanning process, even if they have different plans for what to do with the scans.

Libraries (Score:2, Interesting)

by andrewburt ( 856855 ) writes: on Saturday November 12, 2005 @02:40PM (#14016154)

Borrowing from a library or reading in a bookstore is hugely different from reading a scanned copy online, for these reasons:
(1) The library paid for the copy you're borrowing. (Or somebody paid for it, in case the book was donated to the library.) Thus the author was paid for that copy. If you read a whole copyrighted book via a Content Display Site (CDS - Google Print, Amazon Search Inside, etc.) and never buy the book, the author wasn't paid. Copyright law is about creating new copies; you're not creating a new copy when you read in a store or from a library.
(2) Browsing in a bookstore is pretty inconvenient. You can't take the copy with you to look at any time you want. (Unless you buy it! That's sort of the point.) Bookstores know that few people really read entire books in the store -- else they'd go out of business. However, reading a book from a CDS doesn't have that limitation: You can take it with you, on your laptop, etc. This is particularly critical in light of digital paper, when the digital copy is the paper copy.
(3) Library and bookstore reading isn't anywhere near free: You have to move your physical body to the library or bookstore to read. For one thing, you can't likely do that at 3am. (And certainly not in your pajamas.) You can't do it from your bed, couch, or desk, without getting up. You have to spend time to move your body down there, which might be 10min-30min each way; 20-60min round trip, plus say 10min to find the book, a place to sit, etc; call it 30-70min. If you value your time at say, $10/hr, that's $5-12. Then there's the cost of transportation. If the library/bookstore is three miles away, 6mi. round trip, and gas costs $2.50/gal., and you get 20mi/gal., that's another $.75. The IRS figures driving a car costs $.405/mile in repairs, wearing it out, etc., so that's about another $2.40. So you're at something like $8-15 to go read a "free" book.
Really -- if it were that free, people would do a lot more of it.
Yet reading a free copy from a CDS doesn't have those limitations. It is much closer to $0, actually and truly free. THAT's the problem.
(4) You can't pass on a "free" copy you read in the store or from the library. You have to leave the book at the bookstore (or buy it); you have to return the book to the library. If you read a book in digital form that was stolen from a CDS, you could pass that copy on to others by email, via a web page, P2P software, etc.
So, bottom line, bookstore/library reading isn't really free. CDS copies are essentially free, and that's the problem. They're too convenient to read free.
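The arithmetic in point (3) can be checked with a short sketch. It uses only the figures quoted there ($10/hr, a 6-mile round trip, $2.50/gal., 20mi/gal., $.405/mile) and, like the post, adds gas and the per-mile figure as separate costs; the `trip_cost` name and its parameters are made up for illustration.

```python
def trip_cost(minutes, hourly_rate=10.00, round_trip_miles=6.0,
              gas_price=2.50, miles_per_gallon=20.0, wear_per_mile=0.405):
    """Return the dollar cost of one trip to read a "free" book."""
    time_cost = minutes / 60 * hourly_rate                       # value of your time
    gas_cost = round_trip_miles / miles_per_gallon * gas_price   # fuel burned
    wear_cost = round_trip_miles * wear_per_mile                 # per-mile figure from the post
    return time_cost + gas_cost + wear_cost


low = trip_cost(30)   # shortest visit in the post: 30 minutes
high = trip_cost(70)  # longest visit in the post: 70 minutes
print(f"${low:.2f} to ${high:.2f}")  # prints "$8.18 to $14.85"
```

That lands in the same $8-15 range as the rounded numbers above.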
This is one of the reasons we formed the COCOA Association ( http://www.copyrightaccess.com/ [copyrightaccess.com] ), to make more copyrighted work available. (Note, COCOA does not inhibit indexing and searching and returning text snippet search results -- just what page images can be displayed.) If you support this, please sign our petition at http://www.petitiononline.com/cocoa/petition.html [petitiononline.com] -- thanks!
Dr. Andrew Burt,
Chair, The COCOA Association
