Google To Digitize Much of Harvard's Library

Posted by timothy on Tuesday December 14, 2004 @02:46AM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

This discussion has been archived. No new comments can be posted.

Nice! (Score:2)

by mind21_98 ( 18647 ) writes:

But aren't there projects that are already doing this?
- Re:Nice! (Score:2)
  
  by ravenspear ( 756059 ) writes:
  
  Already digitizing the Harvard library?
  
  No.
- Re:Nice! (Score:5, Informative)
  
  by RollingThunder ( 88952 ) writes: on Tuesday December 14, 2004 @03:18AM (#11079531)
  
  Well, there's the Distributed Proofreaders [pgdp.net] project for Project Gutenberg [gutenberg.net]... but PG doesn't have a "we must be the source" attitude, from what I've seen. As far as PG is concerned, the more eBooks, the better.
  
  DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.
  
  - Re:Nice! (Score:3, Insightful)
    
    by dvdeug ( 5033 ) writes:
    
    DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.
    
    Are they really going to provide proofread texts? A novel might only take a couple hours to process, but math is going to take hand markup, and some of the more complex critical editions are a bear. Even at only 2 hours a book (and that's not including scanning time), 4 million volumes adds up to 8 million man-hours or a million man-days. At seven bucks an hour that's
    - Re:Nice! (Score:2)
      
      by RollingThunder ( 88952 ) writes:
      
      Very true - DP demonstrates that it's not trivial to OCR these texts without error. Part of me was assuming that Google must have some amazing improvements up their sleeve to manage it in a completely automated fashion.
      
      Having looked at their Catalogs beta, I still suspect they just may have... not only is the text OCR'd well enough to search it, but it even highlights the words in the text. They could certainly have hand-proofed, but that strikes me as not a very Google thing to do.
      - Re:Nice! (Score:2)
        
        by advocate_one ( 662832 ) writes:
        
        scan it twice on different hardware, OCR both scans and do a diff on the text files to find the errors.
  - - Re:Nice! (Score:2)
      
      by RollingThunder ( 88952 ) writes:
      
      The PG books may be full of errors, but hopefully the raw scans from DP will be kept after the OCR, doubleproof, and postproduction are complete. That way you can still go back and see what the heck a given item really said.
      
      I should scrounge around their forums and see if they state what the final disposition of the scans is.
      - Re:Nice! (Score:4, Informative)
        
        by Charles Franks ( 686911 ) writes: on Tuesday December 14, 2004 @10:06AM (#11080753)
        
        Actually we do save the images. Many of the initial projects' images are saved on CDs, but anything from the last few years will make its way to the 'Open Library System', which is an image archive of the DP page scans. You can find a pre-alpha version at: http://www.pgdp.org/ols [pgdp.org]. There are images for about 1,000 projects there, with many more pending me getting around to importing them. Lots of work to be done, developers welcome. Charles Franks, Founder, Distributed Proofreaders [pgdp.net]
        
- Re:Nice! (Score:5, Informative)
  
  by happyemoticon ( 543015 ) writes: on Tuesday December 14, 2004 @03:36AM (#11079576) Homepage
  
  I happen to work for one.
  
  It's focused on putting otherwise one-of-a-kind materials online for preservation and ease of access, rather than Byron: The Critical Anthology or Cather on the Rye. It's kind of a mammoth, innefficient beaurocracy, though; I don't agree with some of the practices (such as sending texts off to India to be scrivened, rather than just using OCR software), they're very, very slow to incorporate data, and there are a lot of other problems which stem from the fact that most of them are not computer people, but MIMS holders (librarians).
  
  The fact that Google is doing it gives me hope. Hell, maybe I can jump ship.
  
  - - Re:Nice! (Score:2)
      
      by kalidasa ( 577403 ) * writes:
      
      He just OCRed it, that's all. Seriously, there's a reason that stuff is sent to Asia to be hand-input rather than OCRed: someone typing a language they don't know is always more accurate than OCR, and therefore the practice reduces editing time. (Someone typing a language they DO know is a different issue: they tend to "improve" things, albeit unintentionally: they read "affect" and type "effect," etc.)
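advocate_one's suggestion above (scan twice on different hardware, OCR both scans, diff the text files) can be sketched in a few lines of Python. This is only an illustration, not anything Google or Distributed Proofreaders is known to use: the two sample strings are made-up OCR output, and `ocr_disagreements` is a hypothetical helper built on the standard library's `difflib`.

```python
import difflib


def ocr_disagreements(pass_a: str, pass_b: str):
    """Return (words_from_a, words_from_b) pairs where two OCR passes differ."""
    a_words = pass_a.split()
    b_words = pass_b.split()
    # Compare word lists rather than characters so each hit is a whole word.
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((a_words[i1:i2], b_words[j1:j2]))
    return disagreements


# Hypothetical output of two OCR runs over the same page.
scan_a = "It was the best of tirnes, it was the worst of times"
scan_b = "It was the best of times, it was the w0rst of times"

for left, right in ocr_disagreements(scan_a, scan_b):
    print(left, "vs", right)
```

A diff like this only flags places where the two passes disagree. An error that both OCR runs make identically would still slip through, so the flagged spots are candidates for a human proofreader rather than a replacement for one.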
One more reason... (Score:2, Insightful)

by Anonymous Coward writes:

to never leave my apartment.
ads (Score:5, Funny)

by clovercase ( 707041 ) writes: on Tuesday December 14, 2004 @02:52AM (#11079439) Homepage

will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?

- Re:ads (Score:5, Funny)
  
  by IntelliTubbie ( 29947 ) writes: on Tuesday December 14, 2004 @03:48AM (#11079612)
  
  will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?
  
  Yes, but it'll be mixed in with ads for V14gr4, male "enhancement", and Nigerian wealth opportunities. When the scientists complain, the humanities faculty will protest that spam is a perfectly valid epistemology, and that the scientists' attempt to impose an orthodoxy of "truth" in advertising is simply a power grab to extend Western, white male hegemony. At which point, the scientists will defect to MIT's library down the street.
  
  Cheers,
  IT
  
  - Re:ads (Score:3, Funny)
    
    by tsm_sf ( 545316 ) * writes:
    
    Yeah, Theodoric of York [jt.org] has always held himself in pretty high esteem.
Google Cars (Score:2, Funny)

by Zilverfire ( 819134 ) writes:

Google is diversifying extravagantly; pretty soon all of us geeks will be driving Google cars that can cross-reference the Library of Congress.
Will it be like google scholar? (Score:5, Interesting)

by baronben ( 322394 ) writes: <`moc.liamg' `ta' `legips.neb'> on Tuesday December 14, 2004 @02:53AM (#11079448) Homepage

Ever since they introduced Google Scholar [google.com], I've been wanting something like this for my university [utoronto.ca]. For those of you who don't know, finding articles on a subject can be a pain in the ass, as subjects are indexed on several different systems (depending on subject, date, and journal). None of them, not one, has a decent interface or gets results that are as good as Google's. Google Scholar lets you search through academic texts, but it's limited to what's available, usually working papers or pre-published drafts. If there is some way that google could team up with Academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.
I think this is a great start. There's incredible profit here too: universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.

- Re:Will it be like google scholar? (Score:2, Interesting)
  
  by ISEENOEVIL ( 206770 ) * writes:
  
  As long as we don't end up with Google picking up these prestigious library resources, Yahoo getting another set, and Microsoft picking up still more. I have a feeling some of these resources want to be universally accessible. This is one step closer, but still not close enough if you have to use 3+ different major search engines. My library fees that are tacked onto tuition would actually be used if I could use my preferred search engine to access everything my university
- Re:Will it be like google scholar? (Score:5, Interesting)
  
  by Txiasaeia ( 581598 ) writes: on Tuesday December 14, 2004 @03:11AM (#11079520)
  
  "If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr."
  Or finding that perfect article in the MLA database, only to find out that nobody in Canada subscribes to the journal, nor does anybody have the journal on fulltext. I'd rather have a more comprehensive fulltext database in plaintext rather than digitalised copies of everything anyway - makes searching a hellova lot easier.
  
  - Re:Will it be like google scholar? (Score:5, Insightful)
    
    by baronben ( 322394 ) writes: <`moc.liamg' `ta' `legips.neb'> on Tuesday December 14, 2004 @03:22AM (#11079539) Homepage
    
    That's a great point that I think should be addressed (it has been a bit, with some free online journals, but nothing major). In the world of digital publishing, why do journals cost thousands of dollars a year? It's certainly not the costs: academics pay the journals to defray the cost of publishing, and editors and referees generally get only an honorarium, if anything.
    
    Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?
    
    - Re:Will it be like google scholar? (Score:2)
      
      by blincoln ( 592401 ) writes:
      
      Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?
      
      Printing a publication is expensive. Journals are advertisement-free, which is why they cost so much. I used to work for a student newspaper and it was ridiculous how much money we were paid for ads. Without that rev
      - Why journals are expensive. (Score:5, Interesting)
        
        by commodoresloat ( 172735 ) writes: on Tuesday December 14, 2004 @06:02AM (#11079979)
        
        The reason there are so few copies is because they are so expensive. Chicken and Egg.
        No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
        No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at kinkos, expect them to pay 35 cents a page or more for the copyrights alone.
        And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases, but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology going into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
        I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress toward something really useful to scholars in a systematic way. At least until those perspectives change significantly.
        
        
        Re:Why journals are expensive. (Score:3, Informative)
        
        by Anonymous Coward writes:
        
        A link which backs the "greedy bastards" theory :
        http://math.berkeley.edu/~kirby/journals.html [berkeley.edu]
        
        Re:Why journals are expensive. (Score:5, Informative)
        
        by commodoresloat ( 172735 ) writes: on Tuesday December 14, 2004 @02:00PM (#11082997)
        
        The prestige of a journal is related to the difficulty of getting an article past peer review, not to the fact of the journal being available online or in paper. So there is no "trick" at all other than for the prestigious journals that already exist to start making content available online or in other electronic form.
        As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.
        
        
        Re:Why journals are expensive. (Score:3, Insightful)
        
        by DarkSarin ( 651985 ) writes:
        
        I wasn't saying that the prestige of the journal had anything to do with the medium, but that there is a lot of name recognition.
        
        JSTOR varies in quality from journal to journal--some are actually okay, while others suck. I know that I have gotten pdf's from JSTOR, but I wonder if that is a function of JSTOR or the amount that a person/institution is paying for access.
        
        Most journals that I have dealt with online where I had to pay (because the university wasn't a subscriber) wanted between $15 and $25 for
- Re:Will it be like google scholar? (Score:3, Interesting)
  
  by belg4mit ( 152620 ) writes:
  
  Also try Scirus [scirus.com] from the facts at FAST [fastsearch.com]. I've often had better luck there than on google.
  - Re:Will it be like google scholar? (Score:2)
    
    by belg4mit ( 152620 ) writes:
    
    s/fact/folk/
- Re:Will it be like google scholar? (Score:3, Interesting)
  
  by Rich0 ( 548339 ) writes:
  
  The one thing that something like Google is lacking is persistent result sets. When I do serious searching I usually start with broad terms and figure out what it takes to narrow things down to a scale that I'm willing to work with.
  
  Good quality search engines have lots of qualities that Google lacks. You could search for two words located within 3 words of each other. You could search for these two words within 3 words of each other while two other words don't occur within 6 words of each other. Index
  - Re:Will it be like google scholar? (Score:3, Insightful)
    
    by tootlemonde ( 579170 ) writes:
    Good quality search engines have lots of qualities that Google lacks.
    One solution is to use google to locate a superset of the target articles and then use a more powerful search engine to winnow the google result set. For an individual, this approach would mean maintaining a personal index of the articles but that is a problem of storage space and bandwidth which is relatively cheap.
    The two main problems that google solves are
    
    having access to the articles in the first place
    
    reducing the number of poss
- Re:Will it be like google scholar? (Score:2, Informative)
  
  by treerex ( 743007 ) writes:
  
  I've been using CiteSeer [citeseer.com] for years in my research, and still prefer it over Google Scholar [google.com].
  
  For computing research CiteSeer and the ACM DL [acm.org] are the two places to go. Scholar may obviate the need for going to both places, someday, but for now it needs to mature a bit.
- How about Web of Science? (Score:2)
  
  by Bohnanza ( 523456 ) writes:
  
  If there is some way that google could team up with Academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.
  If you're willing to pay, this is exactly what Web of Science [isinet.com] does. It contains just about every article from every journal for the last hundred years.
  WoS uses citation indexing, as ISI has done for many years, since well before Google came into existence. You can find newer articles by finding those which have cited the old article you're lo
- - Re:Will it be like google scholar? (Score:2)
    
    by aussie_a ( 778472 ) writes:
    
    Actually it's quite probable we'll lose the e and print books will be known as "traditional books", "paper books" or "print books" in common language. The e is used to differentiate the minority from the norm, once it is the norm it is likely it'll be lost.
    - Re:Will it be like google scholar? (Score:2)
      
      by zebs ( 105927 ) * writes:
      
      More e-mail is sent than conventional post....probably... but I don't see e-mail being called mail any time soon
      - Re:Will it be like google scholar? (Score:2)
        
        by trifakir ( 792534 ) writes:
        
        I myself use both mail and e-mail for e-mail and snail mail for mail.
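Rich0's proximity queries further up this thread (two words within 3 words of each other, while two other words do not occur within 6 words of each other) can be sketched as below. It is a toy under stated assumptions: "within N words" is taken to mean the word positions differ by at most N, tokenization is a plain regex, and `within` and `matches` are hypothetical names, not any real search engine's API.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def within(words, a, b, max_gap):
    """True if some occurrence of a is at most max_gap positions from some b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [j for j, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def matches(text, near, not_near):
    """near and not_near are (term, term, max_gap) triples."""
    words = tokenize(text)
    a, b, gap = near
    c, d, far = not_near
    return within(words, a, b, gap) and not within(words, c, d, far)


doc = "Google will digitize the Harvard library and index the public domain works"

# "harvard" near "library" (gap 3), and "google" not within 6 words of "works".
print(matches(doc, ("harvard", "library", 3), ("google", "works", 6)))
```

A real engine would answer this from a positional index instead of rescanning every document, but the matching rule itself is no more than this.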
So... (Score:4, Funny)

by Anonymous Coward writes: on Tuesday December 14, 2004 @02:53AM (#11079451)

If I download a book, when do I have to upload it again? What is the late fee if I forget?

Google to cache the Universe (Score:3, Funny)

by sjrstory ( 839289 ) writes: on Tuesday December 14, 2004 @02:56AM (#11079456) Homepage

Seeing as Google cached the entire Internet (the last page of the Internet can be seen here): http://www.google.ca/search?q=cache:dQrQDn0dHW8J:www.1112.net/lastpage.html+the+end+of+the+Internet&hl=en&client=firefox-a [google.ca] Google is now looking to cache everything else in the Universe :)

get your scuba gear... (Score:2, Insightful)

by uighur ( 818297 ) writes:

because it's time to dive into the deep web. Projects like this are the key to unlocking the vast stores of important information which are currently not readily accessible online. Personally I'd like to see a Google-run free access Lexis-Nexis project.
- Re:get your scuba gear... (Score:2)
  
  by burns210 ( 572621 ) writes:
  
  scholar.google.com
  
  They are getting there.
15 million volumes? (Score:3, Funny)

by Anonymous Coward writes: on Tuesday December 14, 2004 @02:58AM (#11079465)

Please, give me the values in standard metrics, like Libraries of Congress!

- Re:15 million volumes? (Score:3, Funny)
  
  by HoneyBunchesOfGoats ( 619017 ) writes:
  
  From Fascinating Facts About the Library of Congress: [loc.gov]
  The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves. The collections include more than 29 million books and other printed materials, 2.7 million recordings, 12 million photographs, 4.8 million maps, 5 million music items and 57 million manuscripts.
  So to answer your question, it's about 0.52 LoC if you count only the books. :)
  - Re:15 million volumes? (Score:2)
    
    by Afrosheen ( 42464 ) writes:
    
    I wonder how many miles of classified documents they or the Pentagon have under wraps, just waiting to be discovered?
  - Re:15 million volumes? (Score:5, Informative)
    
    by pmc ( 40532 ) writes: on Tuesday December 14, 2004 @04:49AM (#11079791) Homepage
    
    The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves.
    
    The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.
    
    For /. readers 1 BL = 1.17 LoC
    
    - Re:15 million volumes? (Score:5, Funny)
      
      by commodoresloat ( 172735 ) writes: on Tuesday December 14, 2004 @06:11AM (#11080010)
      
      The British Library (www.bl.uk) has 150 million items
      He means just books and such. It's not fair counting umbrellas.
      
    - Re:15 million volumes? (Score:2)
      
      by kalidasa ( 577403 ) * writes:
      
      I suspect that on a word-to-word basis, the LoC would come out ahead: many of the items in the BL might be very short (for instance, is each papyrus fragment counted as a separate item? Many of those have only part of one word on them.). At any rate, the claim that Harvard Libraries is second only to the LoC would only be credible if they're talking about the US, because I'm pretty sure that both the BL and the Bibliotheque Nationale are bigger than Harvard Libraries.
    - Re:15 million volumes? (Score:3, Funny)
      
      by clambake ( 37702 ) writes:
      
      The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.
      
      For /. readers 1 BL = 1.17 LoC
      
      Sorry, I still don't understand... Could you express that in terms of how many shuttle explosions would be required to completely destroy one BL?
    - - Re:But it's the same damn book every time (Score:2)
        
        by gnalre ( 323830 ) writes:
        
        No, that's not right.
        
        The penguins were developing WMDs.
      - Re:But it's the same damn book every time (Score:2)
        
        by henrygb ( 668225 ) writes:
        
        The Empire is a little smaller now so South African penguins are now free (and happy in their black and white plumage), but officially the British Antarctic Territory is still part of it.
  - Re:15 million volumes? (Score:2)
    
    by commodoresloat ( 172735 ) writes:
    
    The Library of Congress is the largest library in the world
    How many Libraries of Congress is it?
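The unit conversions quoted in this thread (about 0.52 LoC for Harvard, 1 BL = 1.17 LoC) can be checked with a couple of divisions. The figures below are the rounded ones from the comments above ("more than 29 million" books, "nearly 128 million" items), so the results are approximate, and the constant names are made up for this sketch.

```python
# Rounded counts as quoted in the thread; treat them as approximations.
HARVARD_VOLUMES = 15_000_000
LOC_BOOKS = 29_000_000    # LoC books and other printed materials
LOC_ITEMS = 128_000_000   # all LoC items
BL_ITEMS = 150_000_000    # all British Library items

# Harvard measured in Libraries of Congress, counting books only.
harvard_in_loc = HARVARD_VOLUMES / LOC_BOOKS

# One British Library measured in Libraries of Congress, counting all items.
bl_in_loc = BL_ITEMS / LOC_ITEMS

print(f"Harvard = {harvard_in_loc:.2f} LoC (books only)")
print(f"1 BL = {bl_in_loc:.2f} LoC (all items)")
```

Note that the two ratios use different denominators: the Harvard figure compares volumes to LoC books, while the BL figure compares total items to total items.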
Images and formatting? (Score:3, Insightful)

by MacFury ( 659201 ) writes: <me.johnkramlich@com> on Tuesday December 14, 2004 @03:00AM (#11079476) Homepage

I should RTFA, but what about images and general formatting? I suppose you could find the relevant text, then try and get the physical book... but if you could view the book in its original formatting... that would be sweet.
Just how much storage space will all this data consume? It seems like a massive undertaking.

Are these volumes stored as text or pictures? (Score:3, Insightful)

by wealthychef ( 584778 ) writes: on Tuesday December 14, 2004 @03:03AM (#11079486)

I am ambivalent about this. Will the books be stored as text to enable searching? If so, given that part of a book's character is its font and typesetting, will ALL the flavor of these books really be captured, in the same way that it would be to read them? Something seems likely to be "lost in translation" here.

- Re:Are these volumes stored as text or pictures? (Score:3, Insightful)
  
  by clovercase ( 707041 ) writes:
  
  i think your comments would be salient if they were going to scan the documents and then BURN the originals. putting massive content on the web for free is the best way to push content all over the world. some internet user in sri lanka doesn't have the bandwidth to download images of the pages, and would never have the opportunity to view the actual documents in a library at harvard. if everyone digitized all the valuable content (and i presume that much of the content in harvard's libraries is valuable)
- Re:Are these volumes stored as text or pictures? (Score:4, Interesting)
  
  by robla ( 4860 ) * writes: on Tuesday December 14, 2004 @03:09AM (#11079513) Homepage Journal
  
  I would hope they handle it just like catalog.google.com [google.com]
  
- Re:Are these volumes stored as text or pictures? (Score:5, Insightful)
  
  by Txiasaeia ( 581598 ) writes: on Tuesday December 14, 2004 @03:14AM (#11079526)
  
  I think you're missing the point. I'm not so much concerned with getting rid of dead tree books (I love reading paper books for enjoyment); I would, on the other hand, prefer all my academic sources to be electronic. As I mentioned in reply to another poster, it's a huge pain to look something up on MLA or Expanded ASAP only to find out that my university doesn't carry it and the interlibrary loan system can't get it for two or three weeks because it's backlogged as it is. I could care less about the spiffy fonts and typesetting; give me the plaintext so I get my research done!
  
- Both Images & Uncorrected OCR should be availa (Score:5, Insightful)
  
  by dananderson ( 1880 ) writes: on Tuesday December 14, 2004 @03:38AM (#11079584) Homepage
  
  Typically, both page images and uncorrected OCR are made available. Correcting OCR is too labor-intensive for thousands of books.
  The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.
  The page images are tied with the uncorrected OCR so you can see exactly what's there.
  For an example, see books at University of Michigan's Making of America (MoA) Exhibit [umich.edu], which has thousands of 19th century books and periodicals available.
  
  - Re:Both Images & Uncorrected OCR should be ava (Score:3, Funny)
    
    by drooling-dog ( 189103 ) writes:
    
    For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.
    I see they've recently added the complete run of the Journal of the U.S. Association of Charcoal Iron Workers. If I'd known that, I could've saved a bundle on gift subscriptions...
- Re:Are these volumes stored as text or pictures? (Score:2)
  
  by supabeast! ( 84658 ) writes:
  
  "...will ALL the flavor of these books really be captured, in the same way that it would be to read them?"
  
  For the vast majority of the people who will ever use the tool, that won't matter. Most of the world's libraries don't hold onto old scholarly stuff indefinitely, assuming that they ever bought a lot of the obscure stuff. It seems likely that because this will be limited to public domain works, most of them will be old and hard to find, so anyone looking at them will quite likely have had no way to acc
The Fight against Plagiarism (Score:5, Interesting)

by manmanic ( 662850 ) writes: on Tuesday December 14, 2004 @03:04AM (#11079491)

One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape [copyscape.com] would also be able to check whether content was lifted from printed material, as well as from the web.

- Flipside: The false positive problem (Score:3, Insightful)
  
  by rsborg ( 111459 ) writes:
  
  Ok, so this is just a bit of devil's advocate, but what happens if you just *happen* to have a writing style similar to someone else who was printed before... what if you read something, and unknowingly wrote something in a similar vein in your essay? I assume you could check it yourself, but then that would just introduce extra cost to even write the essay in the first place... or worse, the plagiarists could just "tweak" their papers ensuring that they're "below the radar" by changing enough style to not b
  - False positives can be double-checked manually (Score:2, Insightful)
    
    by wrinkledshirt ( 228541 ) writes:
    
    The professor can just wait until the match comes up, and then double-check at that point.
    
    You'd want to do a thorough overview of any potential instance of cheating anyway. A quick run-through would determine whether or not a paper happened to contain an identical sentence clause or three identical paragraphs.
    
    I think the bigger problem would be the second one you described -- that students could plagiarize and then go through each paragraph, changing the wording slightly so as to avoid positive matches. S
  - Re:Flipside: The false positive problem (Score:2, Interesting)
    
    by Gori ( 526248 ) writes:
    
    Well, there are such things as references.
    
    Using work of other people in academic work is not only possible, but greatly encouraged. Just make sure that it is very clear what comes from whom.
    
    In many ways, science is done exactly as Open Source software. Take what you need, modify and improve it where appropriate, and make sure you give full credit where due.
    
    As a teacher, I have given full points to a paper that has hardly any text of their own, as long as they are properly referenced, and used together to
How will the books be scanned? (Score:2, Interesting)

by supersat ( 639745 ) writes:

About two months ago, Jeff Dean (an employee of Google) gave a talk [washington.edu] at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?
- "Slice and scan" is used for new books only (Score:4, Informative)
  
  by dananderson ( 1880 ) writes: on Tuesday December 14, 2004 @03:31AM (#11079558) Homepage
  
  I'm not familiar with Google Print, but typically "slice and scan" is used for new books only. That's because there are multiple copies of the book available and the paper is usually flat and dust free.
  For older books, most archivists use a cradle and photograph the pages. It's easier on the book, requires no slicing, and there's no scanner to clog with dust.
  The disadvantage is the scanner operators need a little bit more training, but that's not a big problem.
  
But will you be allowed to copy the materials? (Score:2)

by Animats ( 122034 ) writes:

Or will they try to lock them up with an EULA, the DMCA, and some eBook system?
- Re:But will you be allowed to copy the materials? (Score:2)
  
  by QuantumG ( 50515 ) writes:
  
  well even if they do try to lock em up I can't see how they'd win a case if you were copying material that is in the public domain.
Reminds me of the U of Michigan and U. Microfilms (Score:3, Informative)

by Ungrounded Lightning ( 62228 ) writes: on Tuesday December 14, 2004 @03:21AM (#11079538) Journal

Back around the '60s or so the University of Michigan cut a similar deal with University Microfilms.

U Microfilms set up and ran a microfilming operation in the library system, microfilming everything that wasn't under copyright (and much that was with permission of the copyright holders, such as several large newspapers and many magazines and other periodicals), along with much of the University's records. Rare books, etc.

(If I have this right) the U got microfilm prints of the documents for free and didn't have to pay for the microfilming of its records. University Microfilms made its money by selling microfilms of the various publications (forwarding royalties, where appropriate, to the copyright holders). The rare books, for instance, could now be studied on microfilm with no further stress on the original, and their content became available at many other colleges and libraries. Good deal all around.

University Microfilms was founded by a regent, who was later slammed for conflict of interest. He dropped out of the Board of Regents but the business deal continued.

clinton (Score:2)

by sewagemaster ( 466124 ) writes:

if this is clinton's "library" that's to be "googlized" and "digitized", then that'll be an interesting "shot"... ;)

Homer: mmmmmm digitized google....
University of California is anti-digital (Score:5, Informative)

by dananderson ( 1880 ) writes: on Tuesday December 14, 2004 @03:24AM (#11079545) Homepage

This is great. Compare this pro-digitalization attitude of Harvard, Stanford, and others with the University of California's (UC's) anti-digital position.
For books in Special Collections, they won't allow copies to be digitalized unless they are (1) paid a fee to scan the book (fair enough) and (2) paid a royalty to post the book to the web.
The royalty amounts to hundreds or thousands of dollars per book (about $100/page or image). This allows the libraries to act as a "profit center" for the universities. This policy applies to all UC campuses (I've tried UCB, UCLA, UCI, UCSD).
This is true even though the book is in the public domain (because they have physical possession and nobody can make copies until you sign a license agreement). This is true even if you're using the book for non-commercial purposes (such as free posting to the web).
Something is wrong here. People donate to UC libraries (either books or money) for the public good. They don't donate so the library can start a business licensing public-domain books.
Despite that, I have been able to scan many books (by using books in open stacks or purchasing them). These books concern Yosemite history and are at http://www.yosemite.ca.us/history/ [yosemite.ca.us]

- Re:University of California is anti-digital (Score:2)
  
  by rritterson ( 588983 ) * writes:
  
  Too bad I can't mod you up, because I just had to reply instead. I go to UCB and often hear us brag about how we have the second largest 'public' collection in the nation (or is it world?), after the Library of Congress (Harvard is bigger, but is privately owned). It makes me quite sad that is our policy if what you say is true. Donations to the library are down, funding is short, and access to many journals has been cut. Digitizing books would save money and resources, and benefit everyone. Public Unive
- Re:University of California is anti-digital (Score:3, Interesting)
  
  by JoshuaDFranklin ( 147726 ) * writes:
  
  Got a link for that policy?
  
  Ever tried a Freedom of Information Act (FOIA) request? Strange as it may seem, that apparently works in the State of Washington.
  - FOIA fees (Score:3, Informative)
    
    by KMSelf ( 361 ) writes:
    
    FYI, FOIA isn't free, though the fees are pretty nominal [cia.gov]. $0.10/page, $18/hr, after the first 100 pages, with a significant educational discount.
    The thought of having a spook do my photocopying for me just sounds.... Hrm. Ironic?
- Re:University of California is anti-digital (Score:2)
  
  by kalidasa ( 577403 ) * writes:
  
  I don't see how they can impose a royalty on the TEXT of a book from their special collections, if the book was published before 1923. On their images (product of their scanning), sure, as they create the images and so own the copyright on the images; but they don't own the copyright on the original books.
Text of Dec 13th Email (Score:5, Informative)

by olvr ( 840066 ) writes: on Tuesday December 14, 2004 @03:34AM (#11079570)

December 13, 2004

Dear Colleague,

I am writing today with news of an exciting new project within the Harvard libraries. As all of us know, Harvard's is the world's preeminent university library. Its holdings of over 15 million volumes are the result of nearly four centuries of thoughtful and comprehensive collecting. While those holdings are of primary importance to Harvard students and faculty, we have, for several years, been considering ways to make the collections more useful and accessible to scholars around the world. Now we are about to begin a project that can further that global goal and, at the same time, can greatly enhance access to Harvard's vast library resources for our students and faculty.

We have agreed to a pilot project that will result in the digitization of a substantial number of volumes from the Harvard libraries. The pilot will give the University a great deal of important data on a possible future large-scale digitization program for most of the books in the Harvard collections. The pilot is a small but extremely significant first step that can ultimately provide both the Harvard community and the larger public with a revolutionary new information location tool to find materials available in libraries.

The pilot project will be done in collaboration with Google. The project will link Harvard's library collections with Google's resources and its cutting-edge technology. The pilot project, which will be announced officially tomorrow, is the result of more than a year of careful consultation at many levels of the University. We could not have achieved a meaningful pilot project without the efforts of the Harvard Corporation; the President, Provost, Chief Information Officer, and Office of General Counsel; the University Library Council; and senior managers within the College Library and the University Library.

A full description of the pilot program follows here, with further materials available on the Harvard home page tomorrow.

With best regards,
Sidney Verba
Carl H. Pforzheimer University Professor and
Director of the University Library

Project Description:
Harvard's Pilot Project with Google

Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.

The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.

By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greate
Oxford University gets every UK book published (Score:3, Informative)

by aegilops ( 307943 ) writes: on Tuesday December 14, 2004 @03:37AM (#11079579) Homepage

The library of the University of Oxford, i.e. the Bodleian Library [ox.ac.uk], was the first "copyright" library in the UK - one of only three - which means that it automatically gets a copy of every book published in the UK [ox.ac.uk].

Aegilops

- Re:Oxford University gets every UK book published (Score:5, Informative)
  
  by Jon Chatow ( 25684 ) * writes: <slashdot@jdforrester.org> on Tuesday December 14, 2004 @04:33AM (#11079745) Homepage
  
  Actually, they don't automatically get copies. They have the right to get one, but they don't have much space, so they only get copies of publications that they feel like getting. The British Library would be a more interesting one to team up with, as they get a copy of every publication...
  
  - Re:Oxford University gets every UK book published (Score:2, Interesting)
    
    by Andrew Aguecheek ( 767620 ) writes:
    
    Yep, fell foul of this one the other day. The National Library of Wales happens to be situated in Aberystwyth, on the same hill as the University. (Which, by the way, is a bitch to climb in the mornings... do not apply for sea-front residences unless you are sure of your fitness!) Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!
All well and good, except (Score:2, Offtopic)

by sulli ( 195030 ) * writes:

Harvard Sucks [harvardsucks.org]
(they admit it themselves!)
- Re:All well and good, except (Score:2)
  
  by dkleinsc ( 563838 ) writes:
  
  Hey, at least they don't go to a Safety School [safetyschool.org]
Do no evil. (Score:5, Funny)

by nels_tomlinson ( 106413 ) writes: on Tuesday December 14, 2004 @03:48AM (#11079613) Homepage

Their corporate motto is ``do no evil'', and we've all applauded that, but this is such a great thing that I think we could give them a pass on at least one evil act.
Maybe they could do something really evil to Microsoft, and then we could say: ``Well, you digitized Harvard's library, so we'll let it pass this time.''

Amen (Score:3, Informative)

by lavaface ( 685630 ) writes: on Tuesday December 14, 2004 @03:54AM (#11079627) Homepage

It was just a matter of time before a project of this scope got off the ground. I would like to see them team up with Project Gutenberg [gutenberg.org] (and perhaps archive.org [archive.org]) to provide images of the material. Throw in the little transcoder [xanadu.com.au] and perhaps wikipedia and we will soon have a killer information resource that can be cross-referenced to silly proportions. This is a boon for research. Projects like this and the public library of science [plos.org] will add much to collective knowledge. It would also be nice to see them team up with the newspaper project [neh.gov]! Next stop--public domain LOC!!!

It's about Time! (Score:2, Interesting)

by Shafe ( 72598 ) writes:

I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.

Someone hurry up wi
- Re:It's about Time! (Score:2)
  
  by burns210 ( 572621 ) writes:
  
  According to a couple of articles (Google News has a bunch), non-copyrighted works will have their entire content viewable (Oxford, for instance, is only allowing pre-1901 books to be scanned). Books still under copyright will still show up in your results if relevant, but only show the sentence (or two) or page (or two) surrounding the particular search term... With links to buy the book online.
Mailing Lists (Score:2, Interesting)

by lousyd ( 459028 ) writes:

Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".
- Re:Mailing Lists (Score:2)
  
  by burns210 ( 572621 ) writes:
  
  It looks like this Digital Library is going to be part of Google Print, and be a special top-ranked entry on normal web searches...
  
  I would like to see a Library section of all the books scanned in (preferably text, with an image linked to, rather than an image you read off of).
  
  Also, I think it would be neat to see a mailing-lists section as an extension of their new google-groups2 system, if possible.
  
  Lastly, a blog search would be neat, though tricky. Being able to do an 'Opinion' section or
  - Re:Mailing Lists (Score:2)
    
    by burns210 ( 572621 ) writes:
    
    Heck, what is next, irc logs?
second only to the Library of Congress. . . (Score:2, Informative)

by Leonig Mig ( 695104 ) writes:

... are you sure? Doesn't it mean (as is so often the case) "within the United States"? What about the British Library? What about the Bodleian at Oxford?
- Re:second only to the Library of Congress. . . (Score:3, Informative)
  
  by julesh ( 229690 ) writes:
  
  Apparently the Bodleian only has 7.2 million volumes, so this is larger than that collection.
  
  The British Library apparently has "150 million items" according to their web site, but a large number of these are not books (they claim, for instance, to have 8 million stamps). But, I'm pretty sure they have more than 15 million books.
  
  Whether or not they have more books than the Library of Congress is an interesting question.
- Re:second only to the Library of Congress. . . (Score:2)
  
  by burns210 ( 572621 ) writes:
  
  Actually the British Library is something like 30 million books greater than the Library of Congress. Harvard is second largest in the US. First in the US is the Library of Congress, and first worldwide (as far as I know) is the British Library.
- Re:second only to the Library of Congress. . . (Score:3, Informative)
  
  by Steve Cox ( 207680 ) writes:
  
  According to the British Library's website, it contains 150 million items [www.bl.uk] and gains a further 3 million each year (but it doesn't distinguish between items and volumes - they collect any published item, and receive a copy of EVERY published item in the UK and Ireland).
  
  The Bodleian has only 7 million volumes [ox.ac.uk].
  
  I would suspect that the British Library is substantially larger than Stanford's, but the Library of Congress is recognised as the largest library in the world. [guinnessworldrecords.com]
  
  Steve.
U of Michigan (Score:5, Informative)

by truesaer ( 135079 ) writes: on Tuesday December 14, 2004 @04:30AM (#11079736) Homepage

It looks like the largest portion of this will be 7 million items from the University of Michigan (compared to only 40,000 from Harvard). Good article [freep.com] from the Detroit Free Press.

- Re:U of Michigan (Score:3, Interesting)
  
  by truesaer ( 135079 ) writes:
  
  Actually, I see that it is Stanford, with 8 million items, that will get to claim to be the largest, followed by Michigan with 7 million. I don't know why Harvard is getting any props at all with only 40k items. Here is what I found most interesting in the article [freep.com] though:
  The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail --
"second only to the Library of Congress" (Score:2)

by Clansman ( 6514 ) writes:

However not if you take into account libraries such as the British Library which has 150 million items - this is bigger than Congress and Harvard combined.

Granted, some of these are just stamps :-)
Championing the commons (Score:2)

by mankey wanker ( 673345 ) writes:

Every time we capitulate to money and power and grant new extensions to existing IP laws, this is exactly what we lose - we lose that material that belongs to everyone as a whole and to which we all have a right held in common.

I love moves forward like this. Perhaps if people understood what it meant to access knowledge and information at whim they wouldn't be so keen to keep extending privately held rights any further than is reasonable.

I live for the day when people count down the days until something e
New York Times article (Score:4, Informative)

by sporktoast ( 246027 ) writes: on Tuesday December 14, 2004 @10:03AM (#11080740) Homepage

For what it is worth, there was an article in the Painted Lady [nytimes.com] about it today.

- Re:Not Just Harvard (Score:5, Funny)
  
  by BizidyDizidy ( 689383 ) writes: on Tuesday December 14, 2004 @02:53AM (#11079447)
  
  Also according to the summary, Einstein.
  
  - Re:Not Just Harvard (Score:2)
    
    by ravenspear ( 756059 ) writes:
    
    Also according to the summary, Einstein.
    
    Yes but the FS is starting to go the way of the FA as far as the number of actual readers is concerned. I admit to occasionally falling victim to this unfortunate disease myself. Sometimes I only read the headline, and with some of the YRO ones that take up nearly the whole width of my 1280px wide monitor, sometimes I can't even get through all of that.
- Dead authors tell no tales . . . till now (Score:3, Insightful)
  
  by dananderson ( 1880 ) writes:
  
  This will be sweet. I just hope that we don't get too many authors getting pissed.
  Only public-domain books will be scanned. In all or most cases the authors are dead. However, this will revive a great body of work and widen access to many.
  One class of author who may be pissed is those who take older works and just slap a foreword or introduction on the front and collect royalties. I've seen this done for many histories. But authors of today's works can count on royalties for themselves, their childr
- Re:Speaking of education... (Score:2)
  
  by belg4mit ( 152620 ) writes:
  
  It'd only be an if you're lame enough to spell out the acronym. If you pronounce it fak it's clearly a fak not an fak.
- Re:Just what percentage... (Score:2)
  
  by julesh ( 229690 ) writes:
  
  Google will provide online access to the full text of those works that are in the public domain. Just what percentage of the current works are public domain?
  
  With a catalogue that size, probably most of them. The number of new books published per year isn't actually all that huge -- even if you acquired everything published in the US, I would expect it to take a long time for you to reach 15 million items.
  
  Note that, for instance, the LoC has 29 million books, which is understood to be a significant fracti
- Re:I beg your pardon... (Score:2)
  
  by julesh ( 229690 ) writes:
  
  Not to mention the British Library, which is larger than the Library of Congress.
  
  But of course, that doesn't come after the LoC, so whether that makes the story factually inaccurate, or just misleading, is an interesting question.
- Re:how will this be better than 'grep' (Score:2)
  
  by fuzzybunny ( 112938 ) writes:
  
  To use a completely oversimplified analogy, the end result probably won't be much different than a big fat 'grep'.
  
  However, grepping through 15 million volumes of text and making an attempt at ranking results by relevance through a fancy perl script probably would require a bit of time and resources :-)
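
To make the "big fat grep" analogy concrete, here is a minimal sketch (in Python rather than perl) of a scan that keeps only the volumes containing every query term and ranks them by a crude relevance score. The function name `grep_rank`, the in-memory `volumes` dictionary, and the score (total occurrences of the query terms) are all assumptions made for illustration; nothing here reflects how Google actually indexes or ranks text.

```python
import re
from collections import Counter


def grep_rank(volumes, query):
    """Return (title, score) pairs for volumes that contain every query term.

    The score is simply the total number of occurrences of the query terms,
    a deliberately crude stand-in for real relevance ranking.
    """
    terms = [t.lower() for t in query.split()]
    results = []
    for title, text in volumes.items():
        # Count every lowercase word in the volume (one full scan per volume).
        counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
        if all(counts[t] > 0 for t in terms):
            results.append((title, sum(counts[t] for t in terms)))
    # Highest score first; ties keep their original order.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy "library" of three volumes.
volumes = {
    "A": "the whale and the sea and the whale",
    "B": "a whale",
    "C": "no match here",
}
print(grep_rank(volumes, "whale"))
```

Every query rescans every volume in full, which is exactly why this approach would need "a bit of time and resources" at 15 million volumes; a real search system would presumably build an index once and query that instead.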
