Google To Digitize Much of Harvard's Library

Posted by timothy on Tuesday December 14, 2004 @02:46AM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

This discussion has been archived. No new comments can be posted.

Will it be like google scholar? (Score:5, Interesting)

by baronben ( 322394 ) writes: <`moc.liamg' `ta' `legips.neb'> on Tuesday December 14, 2004 @02:53AM (#11079448) Homepage

Ever since they introduced Google Scholar [google.com], I've been wanting something like this for my university [utoronto.ca]. For those of you who don't know, finding articles on a subject can be a pain in the ass, as subjects are indexed on several different systems (depending on subject, date, and journal). None of them, not one, has a decent interface or gets results that are as good as Google's. Google Scholar lets you search through academic texts, but it's limited to what's available, usually working papers or pre-published drafts. If there were some way that Google could team up with academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.
I think this is a great start. There's incredible profit here too: universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.

The Fight against Plagiarism (Score:5, Interesting)

by manmanic ( 662850 ) writes: on Tuesday December 14, 2004 @03:04AM (#11079491)

One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape [copyscape.com] would also be able to check whether content was lifted from printed material, as well as from the web.

Re:Will it be like google scholar? (Score:2, Interesting)

by ISEENOEVIL ( 206770 ) * writes: on Tuesday December 14, 2004 @03:07AM (#11079500) Homepage

As long as we don't end up with Google picking up these prestigious library resources, Yahoo getting another set, and Microsoft picking up still more. I have a feeling some of these resources want to be universally accessible. This is one step closer, but still not close enough if you have to use 3+ different major search engines. My library fees that are tacked onto tuition would actually be used if I could use my preferred search engine to access everything my university is paying so much for in one place. As it stands now, I cringe when I have to navigate our electronic resources.

-Stormy

Re:Are these volumes stored as text or pictures? (Score:4, Interesting)

by robla ( 4860 ) * writes: on Tuesday December 14, 2004 @03:09AM (#11079513) Homepage Journal

I would hope they handle it just like catalog.google.com [google.com]

Re:Will it be like google scholar? (Score:5, Interesting)

by Txiasaeia ( 581598 ) writes: on Tuesday December 14, 2004 @03:11AM (#11079520)

"If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr."
Or finding that perfect article in the MLA database, only to find out that nobody in Canada subscribes to the journal, nor does anybody have the journal on fulltext. I'd rather have a more comprehensive fulltext database in plaintext rather than digitalised copies of everything anyway - makes searching a hellova lot easier.

How will the books be scanned? (Score:2, Interesting)

by supersat ( 639745 ) writes: on Tuesday December 14, 2004 @03:14AM (#11079525)

About two months ago, Jeff Dean (an employee of Google) gave a talk [washington.edu] at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?

Re:Flipside: The false positive problem (Score:2, Interesting)

by Gori ( 526248 ) writes: on Tuesday December 14, 2004 @04:15AM (#11079693) Homepage

Well, there are such things as references.

Using work of other people in academic work is not only possible, but greatly encouraged. Just make sure that it is very clear what comes from whom.

In many ways, science is done exactly as Open Source software. Take what you need, modify and improve it where appropriate, and make sure you give full credit where due.

As a teacher, I have given full points to papers that have hardly any text of their own, as long as the sources are properly referenced and used together to make a valid point not made by any of the sources.

So I do not think students should bother staying below the radar. Just reference everything, and voila, you are doing science.

It's about Time! (Score:2, Interesting)

by Shafe ( 72598 ) writes: on Tuesday December 14, 2004 @04:16AM (#11079698) Homepage

I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.

Someone hurry up with nanostorage so I can store the entire content of human knowledge on a postage stamp (with nanosecond seek time and gigabyte transfer speeds, of course)

Mailing Lists (Score:2, Interesting)

by lousyd ( 459028 ) writes: on Tuesday December 14, 2004 @04:22AM (#11079712)

Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".

Re:Nice! (Score:1, Interesting)

by Anonymous Coward writes: on Tuesday December 14, 2004 @04:29AM (#11079734)

But also: PG books are full of errors, and there is no source info or scans available to fix against in any sort of easy way. Many books, such as Wealth of Nations, went through a number of editions during the author's lifetime. It would be nice to have the various early editions for collation. And oftentimes new editions come out long after the death of the author with bullshit editorial changes in order to claim a new copyright. A library like Harvard will have many of the first several editions of classic works.

Re:University of California is anti-digital (Score:3, Interesting)

by JoshuaDFranklin ( 147726 ) * writes: <joshuadfranklin.NOSPAM@ya h o o .com> on Tuesday December 14, 2004 @04:33AM (#11079744) Homepage

Got a link for that policy?

Ever tried a Freedom of Information Act (FOIA) request? Strange as it may seem, that apparently works in the State of Washington.

Re:U of Michigan (Score:3, Interesting)

by truesaer ( 135079 ) writes: on Tuesday December 14, 2004 @05:16AM (#11079849) Homepage

Actually, I see that it is Stanford, with 8 million items, that will get to claim itself as the largest, followed by Michigan with 7 million. I don't know why Harvard is getting any props at all with only 40k items. Here is what I found most interesting in the article [freep.com] though:

The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project.

"Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes," he said. "Google will do it in six years."

Under the agreement, the library will get a digital copy of every book scanned. With those copies, the library can prepare special research projects, virtual exhibitions and more relevant scholarly and academic material for its students and faculty.

"If we were to do this job ourselves, it would probably cost us $600 million," Wilkin said. "That's just the human cost of preparing the material for scanning, packing it up and sending it out to vendors and then quality-control checking of the results. This is easily a billion-dollar effort."

Items will start appearing in 2005 with completion predicted for 2010. Can you imagine how many libraries there are out there? The information that could be gathered seems endless. I'm guessing they'll come up with a good way to detect duplicates in future libraries, but as anyone who has wandered through a University library knows there are a LOT of shady books that seem like they haven't been widely published and there are a LOT of things that were self published by academics in the University itself (theses, postdoc research, etc).
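To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (7 million volumes, 1,600 years by traditional means, six years with Google); the even spread over a 365-day year is an assumption made purely for illustration, not something the article states.

```python
# Figures quoted from the article above.
VOLUMES = 7_000_000
TRADITIONAL_YEARS = 1_600
GOOGLE_YEARS = 6


def volumes_per_year(volumes, years):
    """Average throughput needed to finish `volumes` in `years`."""
    return volumes / years


# How much faster the six-year schedule is than the traditional estimate.
speedup = TRADITIONAL_YEARS / GOOGLE_YEARS

traditional_rate = volumes_per_year(VOLUMES, TRADITIONAL_YEARS)
google_rate = volumes_per_year(VOLUMES, GOOGLE_YEARS)

# Assumption: scanning runs every day of a 365-day year.
google_per_day = google_rate / 365

print(f"Traditional: {traditional_rate:,.0f} volumes/year")
print(f"Google:      {google_rate:,.0f} volumes/year "
      f"(~{google_per_day:,.0f} per day)")
print(f"Implied speedup: ~{speedup:.0f}x")
```

In other words, the six-year schedule implies a throughput somewhere in the low thousands of volumes per day, a few hundred times the traditional pace.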

Re:Will it be like google scholar? (Score:3, Interesting)

by belg4mit ( 152620 ) writes: on Tuesday December 14, 2004 @05:19AM (#11079858) Homepage

Also try Scirus [scirus.com] from the folks at FAST [fastsearch.com]. I've often had better luck there than on Google.

Why journals are expensive. (Score:5, Interesting)

by commodoresloat ( 172735 ) writes: on Tuesday December 14, 2004 @06:02AM (#11079979)

The reason there are so few copies is because they are so expensive. Chicken and Egg.
No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at kinkos, expect them to pay 35 cents a page or more for the copyrights alone.
And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology goes into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress toward something really useful to scholars in a systematic way. At least until those perspectives change significantly.

Re:Oxford University gets every UK book published (Score:2, Interesting)

by Andrew Aguecheek ( 767620 ) writes: on Tuesday December 14, 2004 @06:47AM (#11080103)

Yep, fell foul of this one the other day. The National Library of Wales happens to be situated in Aberystwyth, on the same hill as the University. (Which, by the way, is a bitch to climb in the mornings... do not apply for sea-front residences unless you are sure of your fitness!) Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!

Re:Will it be like google scholar? (Score:3, Interesting)

by Rich0 ( 548339 ) writes: on Tuesday December 14, 2004 @08:16AM (#11080324) Homepage

The one thing that something like Google is lacking is persistent result sets. When I do serious searching I usually start with broad terms and figure out what it takes to narrow things down to a scale that I'm willing to work with.

Good quality search engines have lots of qualities that Google lacks. You could search for two words located within 3 words of each other. You could search for these two words within 3 words of each other while two other words don't occur within 6 words of each other. Indexes are generally well-thought-out and vocabularies are sometimes controlled.

Google allows many of these features, but they're cumbersome to use. If I ran two searches and I want to merge the results, I have to copy down everything I did and try to concoct some kind of advanced search which combines the two sets of parameters. In a decent professional search tool you just ask it to return "set 1 or set 2" - giving you a set 3 that has any item that appeared in either. This is powerful and easy to use, and there is no comparison with Google.
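For readers who have not used such tools, the two ideas described above (proximity search and persistent result sets that can be combined) can be sketched in a few lines of Python. This is only a toy illustration and not how any real search product works: the `DOCS` collection, the `within` and `search` helpers, and the rule that "within N words" means a position difference of at most N are all assumptions made up for the example.

```python
import re

# Hypothetical three-document collection, invented for illustration.
DOCS = {
    1: "Bumblebee migration patterns were studied over three summers.",
    2: "Migration of birds; a separate chapter much later covers the bumblebee.",
    3: "Library catalogue systems cost universities millions.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(text, a, b, max_gap):
    """True if words a and b occur at most max_gap positions apart."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def search(predicate):
    """Return the set of document ids whose text satisfies predicate."""
    return {doc_id for doc_id, text in DOCS.items() if predicate(text)}


# Set 1: "bumblebee" within 3 words of "migration".
set1 = search(lambda t: within(t, "bumblebee", "migration", 3))
# Set 2: any document mentioning "library".
set2 = search(lambda t: "library" in tokenize(t))
# Set 3: "set 1 or set 2", i.e. a plain set union of the saved results.
set3 = set1 | set2

print(set1, set2, set3)
```

Because each result is kept as an ordinary set, "set 1 or set 2" is a union, "set 1 and set 2" would be an intersection (`set1 & set2`), and exclusion a difference (`set1 - set2`), with no need to rebuild a single combined query.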

Don't get me wrong, I'm glad Google is going into this business. I no longer have free access to just browse the literature any time I feel like it, and this tool would provide that. I just don't think that they'll close down the commercial operations anytime soon.

Personally, I think that all articles written using federal funding should be released into the public domain. The NIH could sponsor journals if none of the commercial journals are willing to publish works that have no copyright. If my tax dollars were used to pay for a study on bumblebee migration patterns, then I should be able to thumb through the report whether or not some bureaucrat thinks that I have a need to know the results. And doing so should not require a trip to some non-public library halfway around the country...

Re:Nice! (Score:0, Interesting)

by Anonymous Coward writes: on Tuesday December 14, 2004 @10:41AM (#11081100)

Worth noting that this project is putting a LOT of people out of work. Literally, they are laying off almost their entire library staff (I know a few..). Wonder if that'll be in the FAQ?
