Human-Powered Internet Archive Book Project

Posted by Zonk
from the hope-she-likes-the-way-books-smell dept.
Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."
This discussion has been archived. No new comments can be posted.

  • How is this different from something like Google's project? And how will the copyright holders feel? Also, this could be pretty useful...
    • Re:Diffrent? (Score:4, Insightful)

      by way2trivial (601132) on Saturday November 12, 2005 @02:47AM (#14014247) Homepage Journal
      Stories over 75 years old don't have the same copyright protections..

      anyone can do 'A Christmas Carol' because its copyright has expired..

      using however, someone's PRECISE arrangement of the text is not permissible however- that has its own copyright...
      so if I buy a current day copy from amazon, I can't scan it in... but if I buy a copy whose last edition/print was more than 75 years ago, it is fair game.

      • Re:Diffrent? (Score:1, Informative)

        by Anonymous Coward
        > so if I buy a current day copy from amazon, I can't scan it in

        bullshit
      • If you bought a copy of any classic book that is out of copyright, and it's a literal republication of the original (not a 'modern interpretation' or new translation or anything else) then you could, I believe, scan, OCR, and distribute the resulting text. The literary work -- be it Shakespeare's, Clemens', Dickens', etc. -- is no longer protected by copyright.

        You could not, however, scan the book and distribute the images of the pages. Because although the original author's text is not under copyright prot
      • using however, someone's PRECISE arrangement of the text is not permissible however- that has its own copyright...

        That is true in the UK and Commonwealth countries, but not in the U.S., so far as I can tell. The UK has something called "typographical arrangement copyright" which is what you were referring to. This lasts for 25 years, independent of any copyright of the text itself.

        The U.S., however, has no explicit equivalent stated in its copyright laws. I suppose one might make a claim that normal cop

    • And how will the Copyright holders feel?

      There probably aren't any. Copyrights do expire.

      • Regarding the article, they are only scanning books from before 1923, for which the copyright has expired. Copyright USED to expire - the whole point of both copyrights and patents was that they grant the author a SHORT period of exclusivity to encourage creation.

        But that's not true anymore. Currently copyrights have an expiration date, but the expiration date has consistently gotten farther away faster than it has gotten closer.

        Essentially NOTHING expires now unless somebody didn't do their paperwork.

        Tha
    • Re:Diffrent? (Score:1, Interesting)

      by Anonymous Coward
      The Internet Archive is a non-profit. As for sufficiently old books, they're out of copyright anyway, and neither Google nor the Archive will have problems. Meanwhile, as an author, I would be fairly happy for a non-profit such as this to scan my publications, providing search and excerpts. I don't think I'd even be too up in arms if they used opt-out - they are, after all, just extending the role of the library. On the other hand, when Google does it, with the aim of making its shareholders richer (wheth
    • Re:Diffrent? (Score:4, Informative)

      by arrrrg (902404) on Saturday November 12, 2005 @03:05AM (#14014299)
      From the Wikipedia article on the Open Content Alliance [wikipedia.org]:

      The Open Content Alliance is a consortium of non-profit and for-profit groups which is dedicated to building a free archive of digital text and multimedia. It was conceived in 2005 by Yahoo and the Internet Archive. It was conceived in response to Google Print's closed nature, and aims to keep public domain works in the public domain on-line. These results will then be used in the search results of participating search engines. You can see a sample of the open content at openlibrary.org

      A large difference between the OCA's approach and that of Google Print is that the OCA intends to ask a copyright holder before digitising a work that is still under copyright, while Google Print will digitise any book unless explicitly told not to do so by November 1, 2005.


      So, Google Print will almost certainly be better when searching for copyrighted material. For public domain works, we'll have to wait and see.

      IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.
      • IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.

        This is very true. However you see this sort of thing in a lot of emerging industries -- two competitors will duplicate each other's work until eventually one defeats the other in the marketplace and buys up their work at fire-sale prices. As long as either one thinks that they can "win," there's little incentive to help.

        Too
    • Re:Diffrent? (Score:5, Informative)

      by Dave114 (168228) on Saturday November 12, 2005 @03:11AM (#14014317)
      It's different. Take a look at the Open Content Alliance's FAQ [opencontentalliance.org]. Below are a few excerpts from it:

      What can people do with materials contained in the OCA archive?

      The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors. Generally, textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF. Contributors to the OCA will determine the appropriate level of access to their content.

      How will the OCA deal with copyrighted content?

      The OCA is committed to respecting the copyrights of content owners. All content providers who contribute to the OCA must agree with the founding principles of the OCA, contained in the OCA Call for Participation, which describes how their materials and associated metadata will be accessed and used. Further, all contributors of collections can specify use restrictions on material that they contribute.

      Will copyrighted content be digitized or placed in the OCA archive without explicit permission from rights-holders?

      No. OCA contributors must secure the permission of all concerned copyright holders prior to submitting materials to the OCA for digitization or inclusion in the archive.

    • Re:Diffrent? (Score:2, Interesting)

      by Chubby_C (874060)
      With all these companies now deciding they want to scan books (Google, Amazon), why not partner up on this project? It would greatly reduce the overall costs, since otherwise each company would scan the same books as the others.

      At least partner up for the process of scanning even if they have different plans as to what to do with the scans

  • Will the scans be added to the Project Gutenberg collection?
    • Sorta. (Score:5, Informative)

      by Grendel Drago (41496) on Saturday November 12, 2005 @02:52AM (#14014260) Homepage
      Project Gutenberg frequently makes use of the page scans for source material. What PG does is to run the images through OCR, proofread and post-process it. It's more useful than a stack of page images, but considerably more work.

      If you look at the current books on Distributed Proofreaders [pgdp.net], you'll see that some of them credit the Million Books Project for the page scans.
      • Re:Sorta. (Score:1, Insightful)

        by spxero (782496)
        But at the same time, wouldn't it be better if this outfit did the scanning for PG and PG edited and finalized? What is the point of racing against another organization to provide the same works without making a profit?*

        *Until advertisement factors in. Advertisement ALWAYS factors in...
        • I'm pretty sure PG has no advertising, so there is no profit factor. And I'm fairly certain that someone could come along for PG and use the Internet Archive's images to convert into text.

          Internet Archive doesn't have to specifically give them the images though.
        • The focuses of OCA and PG are really quite different: PG is most interested in preserving the essential information of a book (i.e., its text), while OCA's interest is in preserving the form of the book (i.e., its fonts, page format, coloration, even down to the yellowing of the pages). That having been said, there's a lot each can do for the other (and has!).

          The Archive has archived most of PG's material, because even though the Books department of The Archive is focussed mostly on preserving books, The Ar

    • by jonathan_ingram (30440) on Saturday November 12, 2005 @04:17AM (#14014440) Homepage
      The scans won't be added to Project Gutenberg, but it's very likely that the scans will be used by Project Gutenberg's Distributed Proofreading [pgdp.net] project, which I'm involved in. We're already 'harvesting' images from quite a few sites, as well as all the images our volunteers scan. Now that there are several large and relatively well funded scanning operations getting off the ground, I imagine that over time an ever increasing proportion of the works that go through DP will be based on harvested images.

      I maintain several lists that show the DP harvesting status of several image collections, including The Internet Archive's Canadian Libraries collection [ntlworld.com], Google Print [ntlworld.com], and Early Canadiana Online [ntlworld.com]. As you can see, we will not be running short of material to work on for a very long time, even without any of these recently announced initiatives. That said, it's always great to see more material be made freely available, rather than locked up behind expensive subscription services like Jstor and EEBO.
  • It's lighter! (Score:4, Interesting)

    by HolyCrapSCOsux (700114) on Saturday November 12, 2005 @02:46AM (#14014244)
    Last time I moved, it took many VERY HEAVY boxes to move all my books. Maybe I'll scan them all..

    Although anything useful has to be illegal... :(
    • Re:It's lighter! (Score:4, Insightful)

      by Hosiah (849792) on Saturday November 12, 2005 @03:11AM (#14014318)
      Ahem: years ago, I made up the "moving time" rule that books *must* be packed in the smallest available boxes. Anything of dimensions around 2x1x1 feet. After straining on the book boxes previously, it occurred to me that it's human nature to (a) pack books first, reasoning that you're not going to be doing much reading in the next couple days anyway... and (b) upon first beginning to pack, grab the biggest box to start with.
    • Are you familiar with the WorldWide TV Show and Event "The Secret" [whatisthesecret.tv]
      It is based on old teachings and books that were banned for many reasons.
      Somebody felt threatened...for some reason!!!

      I say, "let me have the access to everything"
      and I am a big girl and can make up my own mind! :-)

      Pat

  • From TFA, they are only scanning works that are out of copyright and in the public domain, so this is not the same as what Google is doing.
  • Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.

    In Stanislaw Lem's science fiction book "Memoirs Found in a Bathtub", all the paper in the world gets eaten by a virus and chaos ensues. Interesting read if you've missed it, has made me paranoid about how much the world still depends on paper.
    • Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.

      Preservation?

      Do you really think your magnetic/optical/flash/etc storage will last as long as printed paper...even assuming you can find a CD reader in 50 years? Maybe you mean to recopy the data every few years, but if something gets lost for a few decades, it's lost for ever.
      • Maybe you mean to recopy the data every few years,

        That is called periodic storage, and for anything you wish to preserve, it is necessary. Your argument is a bit weak, considering that any information in book or electronic format needs to be recopied periodically. Books need to be recopied less often than electronic copies; however, electronic copies are cheaper and easier to store, which offsets the costs.

        The OP wasn't saying to burn the paper books after they're stored, merely to put them in electronic format AS
  • by stev3 (640425) <sasper&gmail,com> on Saturday November 12, 2005 @02:56AM (#14014272) Homepage Journal
    Why hello, Ms. Liz Ridolfo. I'm happy to see you are into computers (at least I'll tell myself that) and you like to put your pictures online.

    Please email me at superdesperateteengeek@needtogetlaid.net
    • Re:Hey there... (Score:2, Insightful)

      This had to be about as funny as the US PATRIOT Act.

      No, I think that actually has a leg up on this comment.

    • $10/hour? The whole thing could be made sexier if Suicide Girls / Geek Girls scanned the books for $100/hour in the nude.

      ====
      "Sexy? What's wrong with being sexy?" -- Spinal Tap

  • Good Bad Ugly (Score:5, Insightful)

    by mpapet (761907) on Saturday November 12, 2005 @03:01AM (#14014285) Homepage
    The good:
    Old books prior to copyright laws are being scanned.

    The bad:
    Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

    The ugly:
    The digital camera contraption costs $30,000!! There's a few scanner manufacturers left in the world and none of them have exploited this niche. Shame on them.
    • Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

      May we assume that you will therefore be donating additional funds, up to the level of your concern or the amount you can afford (whichever is less)?
    • Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books.

      I would tend to think this is a good thing. It means that the people doing it aren't necessarily in it for the money. Being paid by the hour also gives them an incentive to take their time about it. ;)

      As long as the people hired are screened for at least a medium-high level of respect for old books, I don't see a problem here.

    • Re:Good Bad Ugly (Score:3, Informative)

      by rm999 (775449)
      Let's look at the average PhD student:
      $20,000 a year, 40-50 weeks a year, 40-50 hours a week

      yep, that's about 10 dollars an hour...

      Does that mean all the PhD students should be kicked out of their labs and shouldn't be able to handle expensive books?
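The arithmetic behind that figure can be checked with a short sketch. It assumes the 20,000 dollars is a yearly stipend and that pay is spread evenly over the hours worked; the name `hourly_rate` is made up for illustration.

```python
# Hourly wage implied by a yearly stipend, using the ranges quoted above.
STIPEND_USD = 20_000  # yearly figure from the comment (assumed to be per year)


def hourly_rate(stipend, weeks_per_year, hours_per_week):
    """Return the stipend divided by the total hours worked in a year."""
    return stipend / (weeks_per_year * hours_per_week)


# Lightest schedule in the quoted range: 40 weeks of 40 hours.
lightest = hourly_rate(STIPEND_USD, 40, 40)
# Heaviest schedule in the quoted range: 50 weeks of 50 hours.
heaviest = hourly_rate(STIPEND_USD, 50, 50)
# A mixed schedule: 50 weeks of 40 hours.
middle = hourly_rate(STIPEND_USD, 50, 40)

print(f"lightest: ${lightest:.2f}/h, middle: ${middle:.2f}/h, heaviest: ${heaviest:.2f}/h")
```

Under those assumptions the quoted ranges bracket the 10 dollars an hour mentioned in the comment, with the exact value depending on which end of each range applies.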
    • Re:Good Bad Ugly (Score:4, Informative)

      by Dave114 (168228) on Saturday November 12, 2005 @05:41AM (#14014605)
      There's a few scanner manufacturers left in the world and none of them have exploited this niche.

      Actually, you can buy a robotic book scanner [kirtas-tech.com] (there's a demo video of it). No doubt it costs an arm and a leg although it may be worth it if you're scanning a large enough volume of books.

    • Actually, as this is being done in association with university libraries, I think they shouldn't have any problems getting reliable help at $10 USD an hour, because that's significantly more than a lot of other on-campus jobs pay. I know from personal experience that many of the students that get paid to videotape campus events and have access to thousands of dollars of semi-pro videography gear are only paid $8-10 an hour. Same for stage electricians, scene shop carpenters and painters, and audio technician
  • Scanner: I want. (Score:3, Interesting)

    by sakusha (441986) on Saturday November 12, 2005 @03:04AM (#14014298)
    Wow, that book scanner rig is just what I've been dreaming of for years. I've been thinking about mounting a couple of glass plates at a 90 degree angle, and then I could put the open book on the apex of the glass, then photograph it with a couple of cameras underneath. This rig is just exactly what I was thinking of, but upside down and even cleverer, with a footpedal to lift the glass up and down onto the book. A very nice piece of design work.
    The obvious advantage of this rig is that you don't have to open the spine 180 degrees and smash the books flat onto a single glass plane. You don't have to open the book up more than 90 degrees, so it's gentle on the spine of fragile old books. And the glass wedge is always self-centering against the spine of the book. The only way this scheme could work better is if there was a way to turn the pages automatically. But these are old and presumably valuable works, so it's safer to let paid low-wage drones do the work than risk mechanical damage.
  • Will it automatically provide full text or scanned image files for works that have gone out of copyright? And do the restrictions against scanning, storage or reproduction also lapse when copyright lapses? This would be massive. Lots of publishers just reissue old work with new copyrights attached to them.

    Personally I've read lots of old science fiction from copyright lapsed works, there is some in Gutenberg, and like it quite a bit, though I'd like to find more of them.

    For example I'm looking for Perry

    • Will it automatically provide full text or scanned image files for works that have gone out of copyright?


      If by "automatic" you mean after it's been scanned by someone, the images processed, placed onto the server and put into the system, then yes, it will automatically provide the scanned image files.

      And do the restrictions against scanning, storage or reproduction also lapse when copyright lapses?

      Yes, because it becomes a public domain work. You can do anything (from publishing it unchanged, creating y
  • It seems a pity to use such a manual method. This... http://www.kirtas-tech.com/ [kirtas-tech.com] is designed to scan books, especially old and fragile books, automatically. It handles the pages even more gently than a trained person. It's not cheap, but it does around 1,000 pages per hour, and the operator just loads books in and takes them out when they're done. I looked at the company a couple of years ago (I'm a VC) and get regular updates from them. A LOT of libraries are using them now.
    • "It seems a pity to use such a manual method"

      Interesting - I don't understand your line of thinking - interested to hear more. Is the argument that automated page turning is *cheaper* so it's a pity that the project spends a lot on labour charges (manual scanning)? Or is the argument that the automated page turning is easier on the fragile old books? I'd appreciate if you could offer more details about the technology - the company's demo video shows a vacuum device lifting pages, but both examples are with

      • You make good points. My argument is that the Kirtas scanner is cheaper, because it's at least twice as fast as a human operator, and because one operator can run several machines at once. When I spent time with the company, I was impressed at just how gently it handled pages: it fans a page out using gentle puffs of air, and lifts pages using a large-area (dry) sponge. But, you're right: if a book is really fragile, or on the point of disintegration, a manual approach is better.

  • My ad-blocking software wouldn't let me at the page, which confused me. So I disabled it.

    http://online.wsj.com/public/article_print/SB113111987803688478-VNpw62xi_JA4avE8cxOZf0pf_nM_20061109.html [wsj.com]

    And of course, direct linkage to the picture of the girl.
    Because that's the only reason 90% of you would click on the link anyways

    The Girl Has Nice Shoes [wsj.com]

    As an aside, cigarettes + old books = bad
    "This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and di

    • Not to mention that girls + cigarette smell = not terribly attractive

      Unless the 'girl' is Lauren Bacall [ntropie.de].

    • Was I the only one who appreciated the irony when she said "This book almost killed me," and then went out to smoke a cigarette?

      Oh, well ... she did mention that her previous job involved stapling sticks of gum to flyers for a club, so I guess she's moving up in the world.
    • This is Liz. I don't smoke in the library for god's sake. The fact that the author even discussed what I do on my break upset me because it has nothing to do with my job. They also forgot to mention that the mind-numbing jobs that I had were to put myself through university, and that I love books and what I do. There is so much going on here that they left out. At least my shoes were accurately represented.
  • What's the legality difference between this and say regular libraries? Don't regular libraries loan material freely? What changes when it becomes electronic, it just means that the people will be able to keep them for longer or as long as they want, no? IMHO, I like the idea of doing this. It'll make doing books for school much easier knowing that there's a backup copy of it floating around somewhere on the interweb.
    • For one, copyright laws tend to have a special clause inside them for public libraries. This group hasn't been classified as a public library ;) Another point is that libraries (as a rule) don't photocopy books and then give them away to people indefinitely. Instead they legally buy copies of a book, which they lend out for a finite period of time.
    • Libraries (Score:2, Interesting)

      by andrewburt (856855)
      Borrowing from a library or reading in a bookstore are hugely different, for these reasons:

      (1) The library paid for the copy you're borrowing. (Or somebody paid for it, in case the book was donated to the library.) Thus the author was paid for that copy. If you read a whole copyrighted book via a Content Display Site (CDS - Google Print, Amazon Search Inside, etc.) and never buy the book, the author wasn't paid. Copyright law is about creating new copies; you're not creating a new copy when you read i

  • How can I help? (Score:1, Interesting)

    by Anonymous Coward
    How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.
    • Re:How can I help? (Score:3, Informative)

      by jnik (1733)
      How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.

      As a few others have mentioned, jump in to Distributed Proofreaders [pgdp.net]. We take the raw images (either scanned specifically for DP or taken from scanning projects like this) and produce checked, corrected text, which then goes to Project Gutenberg [gutenberg.org]. A few hours a week can help a lot.

  • $10/hr is crazy for scanning books.

    Send the books to India or Eastern Europe to be scanned for a fraction of the price. I mean really. This seems to be a serious operation - why not maximize the use of available resources? Spending $10/hr on scanning is just dumb.

    • You'd send hundred-year-old books overseas to be handled by extremely cheap (and poor) labourers? Are you operating under the assumption that once they're scanned, you won't need the originals any more?
      • No, I'm operating under the assumption that it's far less expensive to do the transfer and pay for the necessary oversight, given the number of books involved. Those hourly costs very quickly get very expensive.

        I am not saying send them to a random person with a scanner. However, this can be done competently.

    • $10/hr isn't very much (~20k a year). Why risk sending old (possibly valuable) books overseas to be scanned by unskilled cheap foreign labor when you can have it done under your supervision locally while employing local people? Not spending $10/hr on scanning is just dumb.
      • Why risk sending old books overseas to be scanned by unskilled cheap foreign labor

        You're right, there is plenty of skilled US citizens that work for $10/hr.
        • You're right, there is plenty of skilled US citizens that work for $10/hr.

          There ARE plenty of (reasonably) skilled US citizens, in the form of college students, willing to work for $10 an hour. I'm sure you wouldn't have any trouble finding people to work the scanner at that rate at any large university. Especially if you offered flexible/non-daytime hours: the most popular campus jobs in my experience were always the ones that you could work in the evening or at night.

          In fact I'm a little disappointed that
          • By that same standard, you can find plenty of equally skilled foreign college students willing to work for $2 an hour. I am on a business trip now in Eastern Europe. Lord knows there are plenty of responsible college students here who would do the scanning work for 1/5th the price or less. The savings would easily pay for adequate supervision and shipment as necessary.
  • by Cow Jones (615566) on Saturday November 12, 2005 @05:30AM (#14014578)

    employees making about $10 an hour to manually scan volumes -- some more than a century old

    I think that if they hired younger people to scan the books, it might go a little faster.
    Imagine a 100 year old at this job...

    "...(mumble mumble) in my day we used priests to copy books (mumble mumble) oh dear, I tore another page, darn Parkinson (mumble mumble)"

  • I can't help but think of midgets in a running wheel. Is that an improvement over a "hamster-powered" book project?
  • $10 per hour for the humans, tens of thousands for the scanners. Damn you machine-overlords!

    On the other hand, the whole project is funded by Microsoft and Yahoo, which creates the usual good (open content!) / evil (paid for with the devil's money!) dilemma. ...

    That's enough coffee for me, I suppose...
  • The (Jack) Vance Integral Edition [vanceintegral.com] was a volunteer effort to produce a limited edition 42 volume set of the complete works of Jack Vance, restored to as close to the author's original manuscripts as possible.

    (The project is complete, and an amazing success.)

    The team scanned and edited many of Jack's early works for which there was no good clean manuscript. They developed software tools that would compare scans from different editions to automatically find errors. It turns out that even the best human edi

  • I do hope they're not duplicating efforts... and I wonder whether they even know about Project Gutenberg. http://www.promo.net/pg/

              mark
  • At the University of Toronto the Internet Archive pays $12 Canadian / hour to the scanners, and $11-$12 American in the US. The exchange rates keep changing so judging Canadian pay by US translations is a bit confusing. With experience the Archive will adapt as well, but the Archive is interested in maintaining a reasonable wage while keeping the overall cost cheaper than most commercial offerings. The reason for that is to encourage the open nature that the Archive supports.

    What would be the equivale
    • What would be the equivalent local rate for scanners in Europe?

      Probably about $35 an hour, they'd only work seven hours, three days a week, and they'd be on strike half the year anyway. And you can't fire any of them. ;-)
  • I remember being put in charge (still not sure how it happened, but it did) of my HS senior class's slideshow thing for the end of the year banquet, and everyone brought in 2 or 3 pictures to be scanned for it..... It wasn't fun........ But then again, for $10/hr, it couldn't have been any worse than most other crappy jobs....
  • So that's why the Bible is rewritten a thousand times? I wanna see Jesus and the apostles sue the televangelists.

