Google To Digitize Much of Harvard's Library
FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."
One more reason... (Score:2, Insightful)
get your scuba gear... (Score:2, Insightful)
Images and formatting? (Score:3, Insightful)
Just how much storage space will all this data consume? It seems like a massive undertaking.
Are these volumes stored as text or pictures? (Score:3, Insightful)
Re:Are these volumes stored as text or pictures? (Score:3, Insightful)
Re:Are these volumes stored as text or pictures? (Score:5, Insightful)
Flipside: The false positive problem (Score:3, Insightful)
Re:Will it be like google scholar? (Score:5, Insightful)
Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so high that many colleges can't even afford them?
Re:Nice! (Score:3, Insightful)
Are they really going to provide proofread texts? A novel might only take a couple hours to process, but math is going to take hand markup, and some of the more complex critical editions are a bear. Even at only 2 hours a book (and that's not including scanning time), 4 million volumes adds up to 8 million man-hours or a million man-days. At seven bucks an hour that's 56 million dollars. I expect we'll get scans and OCR, but no hand work; there will still be a place for DP. In fact, we'll be better off, with a huge source of scans to work from.
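The arithmetic in the estimate above can be checked with a short sketch. The volume count, hours per book, and wage come from the comment; the 8-hour working day is an assumption implied by the "million man-days" figure, and the function name is made up for illustration.

```python
# Back-of-the-envelope proofreading cost, using the figures from the comment.
VOLUMES = 4_000_000        # volumes assumed in the comment
HOURS_PER_BOOK = 2         # proofreading time per volume, excluding scanning
HOURS_PER_DAY = 8          # assumption: an 8-hour working day
WAGE_PER_HOUR = 7          # dollars per hour


def proofreading_estimate(volumes, hours_per_book, hours_per_day, wage):
    """Return (man_hours, man_days, cost_in_dollars)."""
    man_hours = volumes * hours_per_book
    man_days = man_hours // hours_per_day
    cost = man_hours * wage
    return man_hours, man_days, cost


hours, days, cost = proofreading_estimate(
    VOLUMES, HOURS_PER_BOOK, HOURS_PER_DAY, WAGE_PER_HOUR
)
print(f"{hours:,} man-hours, {days:,} man-days, ${cost:,}")
# prints: 8,000,000 man-hours, 1,000,000 man-days, $56,000,000
```

Changing HOURS_PER_BOOK shows how quickly the total grows for math or critical editions that need hand markup.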
Both Images & Uncorrected OCR should be availa (Score:5, Insightful)
The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.
The page images are tied with the uncorrected OCR so you can see exactly what's there.
For an example, see books at University of Michigan's Making of America (MoA) Exhibit [umich.edu], which has thousands of 19th century books and periodicals available.
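To illustrate why a few OCR typos don't ruin keyword indexing, here is a minimal sketch of a word index tied to page images. The filenames and page text are invented, and this is not how Google or MoA actually build their indexes; it only shows the general idea.

```python
from collections import defaultdict
import re

# Hypothetical uncorrected OCR output, keyed by page-image filename.
# The misspellings ("Iibrary", "tbe") stand in for typical OCR errors.
ocr_pages = {
    "page_001.tif": "The Iibrary of the university holds many volumes",
    "page_002.tif": "Volumes in tbe library are catalogued by subject",
}


def build_index(pages):
    """Map each lowercase word to the set of page images containing it."""
    index = defaultdict(set)
    for image, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(image)
    return index


index = build_index(ocr_pages)
print(sorted(index["volumes"]))  # ['page_001.tif', 'page_002.tif']
print(sorted(index["library"]))  # ['page_002.tif']
```

The search for "library" misses page 1, where the OCR misread the word, but still finds page 2, and every hit points back to a page image so the reader can see exactly what was printed.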
Dead authors tell no tales . . . till now (Score:3, Insightful)
Only public-domain books will be scanned. In all or most cases the authors are dead. However, this will revive a great body of work and widen access to many.
One class of author who may be pissed is the kind who takes older works, just slaps a foreword or introduction on the front, and collects royalties. I've seen this done for many histories. But authors of today's works can count on royalties for themselves, their children, and their grandchildren (if the book is still selling). The copyright term is too long in the U.S., but that's another story . . .
False positives can be double-checked manually (Score:2, Insightful)
You'd want to do a thorough overview of any potential instance of cheating anyway. A quick run-through would determine whether or not a paper happened to contain an identical sentence clause or three identical paragraphs.
I think the bigger problem would be the second one you described -- that students could plagiarize and then go through each paragraph, changing the wording slightly so as to avoid positive matches. Still, you could argue that this is pretty much what academics is anyway, just with footnotes and a bibliography.
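Both points can be seen in a rough sketch of exact-phrase matching. It compares five-word sequences between a submission and a source; the function names, the window of five words, and the sample sentences are all assumptions for illustration, not how any real plagiarism checker is known to work.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(submission, source, n=5):
    """Fraction of the submission's n-word sequences that also occur in the source."""
    sub = shingles(submission, n)
    if not sub:
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)


source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog near the river bank"
reworded = "the fast brown fox leaps over the sleepy dog by the river bank"

print(overlap(copied, source))    # 1.0
print(overlap(reworded, source))  # 0.0
```

A verbatim copy scores 1.0, while swapping a word every few positions drops the score to 0.0, which is exactly the slight-rewording dodge described above.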
Re:Will it be like google scholar? (Score:0, Insightful)
Re:Will it be like google scholar? (Score:3, Insightful)
Good-quality search engines have lots of features that Google lacks.
One solution is to use Google to locate a superset of the target articles and then use a more powerful search engine to winnow the Google result set. For an individual, this approach would mean maintaining a personal index of the articles, but that is a problem of storage space and bandwidth, both of which are relatively cheap.
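The two-pass idea might look something like this sketch. The first pass is faked with a hard-coded list standing in for saved search results, and the second pass applies a proximity test as one example of a feature a plain keyword search may not offer. The article titles, text, and function names are hypothetical.

```python
import re

# Stand-in for a broad first-pass result set (for example, saved from a
# keyword search). Titles and text are made up for illustration.
candidates = [
    {"title": "A", "text": "peer review improves the quality of journal articles"},
    {"title": "B", "text": "journal pricing is high and review copies of each book are rare"},
    {"title": "C", "text": "the review process at this journal relies on peer scholars"},
]


def near(text, a, b, window):
    """True if words a and b occur within `window` words of each other."""
    words = re.findall(r"[a-z]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def winnow(results, a, b, window=1):
    """Second pass: keep only results where a and b are close together."""
    return [r["title"] for r in results if near(r["text"], a, b, window)]


print(winnow(candidates, "peer", "review"))  # ['A']
```

Article C contains both words but too far apart, so only A survives the second pass.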
The two main problems that Google solves are
One could imagine a plugin for browsers that would add the additional search facilities to a Google search. Until then, Google Hacks [oreilly.com] will get you started.
Re:Why journals are expensive. (Score:3, Insightful)
JSTOR varies in quality from journal to journal--some are actually okay, while others suck. I know that I have gotten PDFs from JSTOR, but I wonder if that is a function of JSTOR or the amount that a person/institution is paying for access.
Most journals that I have dealt with online where I had to pay (because the university wasn't a subscriber) wanted between $15 and $25 for a single article. This is a LOT of money, and sometimes (if you aren't in a hurry), it is easier to contact the author and ask for a reprint--they usually have them, and if they are like many researchers, they are glad to send you a copy, provided you explain what you are doing.
There is a trick to it--the current prestigious journals ARE NOT going to go to a low/no cost format for publishing online until there are one or two major competitors who are seen as valid (peer-reviewed) and prestigious. The prestige factor is huge and rests largely on (as you mention) the peer review process AND who is publishing in the journal. Sorry, but Robert Sternberg doesn't generally publish in just any old journal--he has one or two that he will send a manuscript to, and go from there.
When my thesis advisor (who wrote two chapters for the Handbook of Research Methods in Industrial Psychology) publishes, he typically sends stuff first to the Journal of Occupational Behavior, not DarkSarin's Online Journal of Amateur Psychology or Commoderesloat's Journal of Human Weirdness. Why? Because no one has EVER heard of those journals, and if he puts that on his vita, it won't make any difference to the next folks wanting to hire him for his research ability (not that he's going anywhere--he's a full professor).
But when the next university sees that he has published 10 articles in the Journal of Occupational Behavior (JOB), they say, "Hey, this guy is getting published in one of the top 10 journals in Behavioral Psychology, he's probably pretty good!" They will then probably hire him.
But when that same university interviews me, and I put down that I published 123 articles in DarkSarin's Journal of Computer Gaming Psychology, they are going to say, "Wow, I've never heard of that journal--is it peer reviewed? Is it attached to a professional association (APA, MPA, SIOP, etc)? Has anybody here heard of it? Does anyone who's any good publish in that journal?" If you are REALLY lucky, they MIGHT take the time to look up the answers, but chances are slim if the position is getting very many applicants (and if it isn't, it probably isn't paying very well!).
The long and the short of it is that there is little, if any, financial pressure to offer content online for free, and that is unlikely to change without competition. There is unlikely to be much competition, because few young researchers are going to put their career on the line by publishing in any but the most prestigious journals that they can possibly get an article into. Older researchers are already in the habit of sending articles to certain journals, and so they aren't likely to change either.
There isn't a good, quick, easy solution to this, and anyone who says that there is needs to have their head checked. Sorry.