Books Education Google

Google Books As "Train Wreck" For Scholars 160

Following up on our earlier discussion, here's more detail on Geoffrey Nunberg's argument that Google Books could prove detrimental to academics and other scholars. Recently Nunberg gave a talk at a conference claiming that the metadata in Google Books is riddled with errors and is classified in a scheme unfit for scholarly use. This blog post was fleshed out somewhat a few days later in the Chronicle of Higher Education. Quoting from the latter: "Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, [and] Stephen King's Christine... A search on 'internet' in books written before 1950 turns up 527 hits. ... [Google blames some errors on the originating libraries.] ...the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves. ... In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore." The head of metadata for Google Books, Jon Orwant, has responded in detail to Nunberg's complaints in a comment on the original blog post, and says his team has already fixed the errors that Nunberg so helpfully pointed out.

  • by timeOday ( 582209 ) on Monday September 07, 2009 @08:03PM (#29345215)
    Judging from the article, this is mostly a problem for people who are essentially studying trends in the metadata itself, such as the emergence of some particular word over time. As for the "oddball" categorizations, I agree, but why would anybody browse the "technology" section of a collection with millions of titles?

    The odd thing about complaining about this is, what are they comparing to? A hypothetical perfect online database that doesn't exist anyways? The article says google got it wrong in some cases where, e.g. the Harvard Library got it right. OK, that's an issue for all of us deciding whether to search on our nearest computer, or at the Harvard library.

    To me, google's project was a long time coming - somebody had to scan the world's back catalog. Maybe it would be better if governments had done it, but (and this is the point) they didn't. Google is.

  • by QuantumG ( 50515 ) * <qg@biodome.org> on Monday September 07, 2009 @08:04PM (#29345223) Homepage Journal

    We are trying to correctly amalgamate information about all the books in the world. (Which numbered precisely 168,178,719 when we counted them last Friday.)
          - Jon Orwant (Google)

    why does that number seem incredibly low to me?

  • by Aurisor ( 932566 ) on Monday September 07, 2009 @08:22PM (#29345349) Homepage

    As someone who majored in English Literature in college, I can tell you that academics love getting their panties in a bunch over what is Scholarly Publication and what is not. Some teachers will actually have special assignments that have to be written entirely using Scholarly sources, or in response to a Scholarly article.

    Before the advent of the internet, I can see how it might have been useful to have an in-group comprised of people who had some sort of qualifications to write about something, but it seems antiquated in light of the ease with which we can independently verify claims.

    Usually, if someone's going to write something that's actually useful, they'll write an actual book. Soon thereafter, a bunch of "Scholars" will come along and write a bunch of journal articles and tell us all about how the useful work was one of three things: misogynistic, code for a religious statement, or arcane, carefully-hidden innuendo.

    Sorry if I sound bitter, but I spent a lot of time reading this crap, and very little of it was as insightful or interesting as even my classmates' comments.

  • Anonymous Coward (Score:5, Interesting)

    by Anonymous Coward on Monday September 07, 2009 @08:28PM (#29345381)

    Google has scanned many volumes of the Laws of Indiana, which go back to 1816. These are the session laws of the Indiana General Assembly and have never been copyrighted. However, Google has arbitrarily decided not to make most post-1922 volumes it has digitized, and even some pre-1922 volumes (e.g. 1877, 1893, 1895, 1909, 1917 and 1918), available, using the claim of copyright.

    Google has done all the decision-making here. Anyone who might object to the classification of one of these volumes as copyrighted and thus available in "snippet-view only" presumably would have the burden of proving the contrary. (And where would you even start? Who would you contact? I have seen nothing on this.)

    Once (or if) the settlement is approved early this fall, Google's "rights" attach to these volumes. If I understand correctly, at that point any individual who wishes to access one of these volumes of Indiana's session laws not already in "full view" will have to pay for it, and for the money will obtain only individual rights, NOT the right to make it freely available to others.

    Broader implications: Finally, this analysis has been limited to volumes of Indiana session laws, but surely similar situations exist more broadly.

    For more on this, see this Aug. 2, 2009 Indiana Law Blog entry: http://indianalawblog.com/archives/2009/08/courts_my_probl.html

  • by Anonymous Coward on Monday September 07, 2009 @08:29PM (#29345393)

    Harvard's library is about 16 million books, the Library of Congress has about 32 million books, so that total seems reasonable. If this is just a merge of the card catalogs of the world, I'm actually surprised that the number is not smaller. I admit, there are probably some books that are in no database/card catalog, maybe sitting in a cave in Tibet or somewhere in the Middle East, but those that we can find and identify? This seems about right - I would have guessed 100-200 million.

  • by LifesABeach ( 234436 ) on Monday September 07, 2009 @08:31PM (#29345413) Homepage
    With all the class act talent that Google hires right out of college, why can't Google create its own Public Library on the Internet? Chrome could be the entryway to any book that is in the Public Domain, or available by the author's written permission. Turning the page of a book could be as simple as the [Back] or [Next] button. The "Card Catalog" would be a No-Brainer. No Library goes through this many hops. There's even translation to other languages, Braille, and Audio; from my viewpoint, this SHOULD be the challenge, not what word category is or isn't. If it's a case of "buy the book", then buy 10 copies of "Gone with the Wind", and ONLY allow up to 10 readers at a time to read "Gone with the Wind". Google could even have a "Google Online Library Card"; this is where the company hums "Ka-Ching".
  • by Looce ( 1062620 ) * on Monday September 07, 2009 @08:36PM (#29345459) Journal

    ... is that academics can't rely on Google Books to make their bibliographies, because the publication date and authorship information, which are used in all citation styles (MLA, Harvard, etc.), are incorrect on Google Books for an apparently large number of books. Categories aren't used in citations; they're used by searchers.

    Jon Orwant of Google said that 1899 was a placeholder year for unknown publication dates, as provided by some of their metadata providers... which leads me to ask if they sanitise their data or do any research into publication dates themselves!

  • Card catalogs (Score:5, Interesting)

    by dpbsmith ( 263124 ) on Monday September 07, 2009 @08:49PM (#29345533) Homepage

    Tangential, but "card catalogs." Ha! I once had a compelling need to look up an article in the Occasional Papers of the Bingham Oceanographic Collection. So I went to the card catalog.

    It wasn't under O. It wasn't under P. It wasn't under B. It wasn't under C.

    It was under N.

    Why? Because, naturally, as of course everybody knows, the Bingham Oceanographic Collection is part of the Peabody Museum. Which is part of Yale. Which (drum roll...)... ...is in New Haven.

    The great thing here is that you can't even say there was an error in the card catalog, unless filing something under a heading that is perfectly correct, but under which nobody would dream of looking for it, is considered an error.

  • by Potor ( 658520 ) <farker1&gmail,com> on Monday September 07, 2009 @08:49PM (#29345547) Journal

    Exactly. And the whole argument totally ignores the fact that these books are now easily available.

    Shock horror: I am a liberal arts scholar. And Google Books has helped me incredibly in a project I am doing on an 18th-century scholar. I have original texts in various editions at my fingertips, wonderful reference books (including a dozen 18th- and 19th-century Latin grammars), and serious secondary literature. Not all of these are fully posted on Google Books, but now I know what books to check out of the library, or even buy.

    As an arts scholar, I love Google books.

  • Spurious Argument (Score:2, Interesting)

    by mikethicke ( 191964 ) on Monday September 07, 2009 @09:25PM (#29345793)

    As an aspiring academic half way through a philosophy Ph. D., I find Nunberg's argument pretty absurd. Google books is a godsend for academics, and would be much more so if there was full access to their entire catalog rather than "limited previews" for most books. I have used Google books countless times to quickly check out whether a book is relevant to my research, or to get the gist of an author's argument without having to trudge down to the library. I know many others who do this as well. In all this time I've never even looked at Google's metadata. No decent academic would rely on such information, as there are far more reliable methods: such as actually checking what's written in the book, which yes, Google scans in.

  • by moosesocks ( 264553 ) on Monday September 07, 2009 @10:12PM (#29346061) Homepage

    Actually, the GP's got a good point. Back in college, I took a number of humanities courses whenever I could squeeze them into my schedule.

    I can say from firsthand experience that there are a lot of "scholarly" articles that are complete and total crap. When writing papers, I'd frequently peruse JStor [jstor.org] for pertinent articles about my topic, keeping an eye out for particularly good articles, as well as the heinously bad ones. Picking apart and systematically disproving a bad paper published in a "good" journal was an easy ticket to an 'A' on the paper.

    These papers, of course, were certainly the exception. Most scholarly papers I encounter are humbling in their brilliance. However, I've seen more than a few bad journal articles, as well as quite a few blog entries that would be worthy of scholarly publication. It's hard to make any generalizations about the validity of certain sources of information.

    Unfortunately, Physics wasn't quite as easy to bullshit (Random aside: The physical sciences certainly have their fair share of bad journal articles, especially in light of the fact that printed media is a terrible means by which to communicate scientific results. It's a cruel irony that the www was invented to enable collaboration and information exchange between scientists, but is rarely (if ever) used for that purpose. Also, any use of the word 'trivial,' or its synonyms needs to be punishable by death.)

    PS. Don't judge our writing abilities based upon our Slashdot comments. I'm sure the GP had his own reasons for majoring in English, even though literary discourse is often trite and contrived.

  • by syousef ( 465911 ) on Monday September 07, 2009 @10:12PM (#29346063) Journal

    This could be the stupidest and most disingenuous argument I've encountered all year. I guess I'll never know since the metadata is not at my finger tips. This might be a good argument for getting the metadata right. It isn't a good argument for tossing the virtual books out with the bathwater.

    So no I won't get off your lawn. We're better off without scholars who'd rather hoard information. Begone!

  • by Anonymous Coward on Monday September 07, 2009 @11:04PM (#29346489)

    I'm an architect who often works on buildings from the 19th century and I cannot sufficiently express the joy, wonder, and happiness I feel browsing the material Google has made available.

    I am bewildered (but not surprised that it comes from an academic) that someone would suggest that this information, formerly molding away in some special collections department (or being shredded into fodder), should continue to be sealed off from the world because a quickly-obsolescing categorization scheme has not been applied to it with sufficient care: as other posters have noted, since the entire text can be searched for keywords, the meta-tags are largely irrelevant, and since we're talking about complete scans of original sources, the desired data is there, embedded in the source!

    Anyone who calls this new, incredibly rich (yet free!) database "dreck" and "garbage" is an idiot; one need only look at the current state of academia to see where that scow truly sails.

  • by grcumb ( 781340 ) on Tuesday September 08, 2009 @12:26AM (#29347027) Homepage Journal

    And I think he's entirely off-base. Nose-in-the-air "Scholars" like this gentleman fail to recognize that Google's efforts are about making material available to "the rest of us" who don't have access to those major research libraries. And categorical indexing of material makes complete and total sense if you expect to have non-PhD sorts searching for it.

    You're fighting the wrong battle here. It's easy to find any number of legitimately nasty things about 'Scholars' and 'Academics' and elitism in general. But arguing for proper classification in Google Books is not one of them.

    For several years I was an avid amateur of Information Retrieval. Classification (and other useful organisational models) of information into related collections is essential when you don't know what keywords you're looking for. This is especially important with historical works, where the use of 21st Century names, terms and other common keywords is next to useless.

    Google search is useful when you know what you're searching for. But knowing what to look for in Google Books is an entirely different matter. Categorisation matters here.

    By using a classification system that is designed for book sellers, Google's chosen a very poor set of criteria. Not only will most of the titles be poorly characterised (and thus harder to find), the effort required to find them increases with their rarity or uniqueness. These aren't always a measure of importance or interest, but often enough, they are.

    Asking Google to consider a proven, effective and well-understood categorisation system is not being snooty; it's an effort to suggest - as we geeks often do - that there might actually be a correct way to perform this task.

    Sometimes what looks like 'arrogance' is actually the state of being right [imagicity.com] about something when no one else will listen.

  • by Bacon Bits ( 926911 ) on Tuesday September 08, 2009 @01:39AM (#29347455)

    The Wrights didn't start out building toy birds, true. They first tried to use the data from some Russian or European who had modeled wings after birds. They found that the lift his data predicted was so far off from what they observed in their gliders that they could no longer assume that the data hadn't just been made up. Then they went and built a small scale wind tunnel and designed small model wings which could be reformed and shaped and angled easily and a scale which could be used to measure lift from the wing model. So, no, they didn't start out building toy birds. They effectively ended up doing that when they discovered how little data there was on the subject of a wing. They took a step back to toy bird models.

    http://www.hulu.com/watch/23333/nova-wright-brothers%E2%80%99-flying-machine [hulu.com]

  • by Anonymous Coward on Tuesday September 08, 2009 @02:23AM (#29347667)

    Why did they bother?

    1. I call absolute BS on the poor scanning quality. I have looked at 50+ books on Google Books, and not once noticed a problem with the scanning. Certainly a hell of a lot better than *I* would have done.

    2. The cost and time and legal battles required to do the scanning pretty much make it impossible unless a private corporation is leading the charge. What good does it do to try to rely on random-ass people to scan every book in existence, and every book as it comes into existence as fast as it comes? Good luck with that. And what makes you think they'd do a better job than a company that's devoted huge amounts of work to mastering the single repetitive task required to do it practically, and that can apply that practice to every single book?

    3. If you're worried about Google being evil / being too powerful / blah blah, fine, but since you don't mention that, I think you have to honestly believe they just suck. Perhaps you'd rather the US government spend 10s or 100s of millions of dollars to do it instead, because they really need to spend more money right now, and we can trust THEM to do it well.

    4. What does poor metadata have to do with anything? The task of scanning is completely separate from the OCR that goes into metadata. As Google improves their OCR, the metadata will fix itself. Or, you know, since this is a manageable task, maybe people can contribute on their own. Like the authors of this article did, and which Google gladly accepted.

    Since there are ACTUAL problems with Google Books (you know, like the ethical ones), maybe you should complain about those instead of this nonsense.

  • by introspekt.i ( 1233118 ) on Tuesday September 08, 2009 @02:37AM (#29347729)
    You act like the technology and processes used to generate this catalog are going to remain deficient indefinitely. You ignore the fact that consumer demand for better (metadata|accuracy|whathaveyou) will drive improvements in the technology. In the meantime, we get access to the early iterations of the technology and the benefits it can provide today.

    What is needed is an open standard for scanned works, with minimum resolution, minimum quality, and minimum verified metadata such as subject, author, publisher, year etc.

    Necessity is the mother of invention. Wait for one to pop up, or go make one up. Nobody's stopping you.

    All those are trivially listed on the title page of every book. All one has to do is open the damn book and flip a few pages, but that appears to be too hard for some people.

    Opening the covers of every possible resource you use is quite easy when you have a discrete, present set of resources to thumb through. What if your resources aren't present, are high in number, or (lo!) are undefined...because you don't even know what exactly it is you're looking for?

    This is a long term project for humanity. There's absolutely no point in having crappy scans with garbage metadata available quickly today, when it could be available correctly with good quality in say five years.

    I think you're absolutely wrong. It's naive to assume we can just have an instant rubber-meets-the-road system available in x years without rigorous testing and input on the part of users. No point? Hah! This is absolutely the best way to go about things! Let the system work itself out with angry users pushing technicians to improve archives to have the best working system in the end. The Google system is hardly "done" and it's only going to get better with time.

    The current dreck that's online only causes duplication and waste. Take a look someday at archive.org (for example), and see how many copies of the same book are available, if it's a popular book.

    God forbid we have multiple copies of popular books in different archives.

    black and white or colour none of which is truly good quality: broken characters, pages with dark margins, missing pages, typos or incorrect titles, wrong authors etc.

    Quality is relative. Why prohibit use because we lack perfection?

    Why did they bother?

    Why did you bother? Why did I bother? Why does anybody bother? Probably because we all feel like it.

  • by waterbear ( 190559 ) on Tuesday September 08, 2009 @09:19AM (#29350103)

    As long as the books themselves are perfectly fine (which they seem to be),

    Well, some are really good and well scanned, but others are a mess. From some organizations that do the scanning, you get missing pages and mangled pages. You get pages where the person doing the scanning sometimes put their hand between the page and the glass, so you can read the rings on their fingers but not the text on the page. (Books scanned at NY Public Library for example.) If ever there is a fold-out, you get at max half of it.

    The Google Books organization doesn't seem to want to know. There is a mechanism for reporting single-page defects, but when 50 defects occur in a book it gets hard to work through them all using the button-clicks. I tried it for two books and also sent a message to Google Books; there was an automated reply and no action after several months.

    So much for 'As long as the books themselves are perfectly fine ....', I'm afraid.

    -wb-

  • by natehoy ( 1608657 ) on Tuesday September 08, 2009 @09:55AM (#29350523) Journal

    Given a project of this magnitude, there are inevitably going to be bad scans, and bad data, and other issues.

    And, just as inevitably, the problem areas are going to be updated and replaced with good ones when they become available.

    "There's no point in having crappy scans with garbage metadata today" would be indisputably true if every book out there was a crappy scan with garbage metadata. Instead, what we have is a starting point with some good scans and some bad ones, and there's no point holding back the entire project just because some of the books have bad scans or metadata. You go live with what you have, then add/correct as needed.

    Remember, too, that none of these books replace what is available in your local library, they supplement it. If your local library has a copy of a book you want, it's still there. If they don't, Google Books will probably have it. Chances are, their scan will be good, but let's assume it's not. Isn't a barely readable version better than no version whatsoever?

    This isn't a NASA mission. If a book ends up being a crappy scan, it won't explode on re-entry killing its reader.

    This is, however, a for-profit venture. As such, it cannot wait until every page of every tome is pristine before it goes live.

    Sometimes, you go live with what you've got, even if it's not perfect, because it's not only in the best interests of profit, but because there's a benefit to having the product out there. Google Books will start as a supplemental database, and where there are good scans of books with good metadata, this will make books more available and accessible to all. Books will be missing from its catalog, and books will be unreadable at times, and books will be misfiled, but the same is true of any library.

    Google Earth went live long before detailed imagery was readily available for a lot of the world, so those who lived in an area of the world that lacked detailed imagery saw low-res imagery (green fuzzies, with a vague idea of where really big things might be) where the pictures should be. As the imagery became available, they added it to the basemap. But Google Earth made detailed cartography available to the masses in a way that it had never been available before. And, hopefully, Google Books will be able to do the same with the written word.

  • by ajs ( 35943 ) <{ajs} {at} {ajs.com}> on Tuesday September 08, 2009 @02:02PM (#29354269) Homepage Journal

    I hate to be so cynical, but there was a huge uptick in negative articles on Slashdot about Google as soon as Microsoft started their anti-Google PR effort in DC. Now I see at least one anti-Google article on Slashdot every day. Is Slashdot falling for an extensive trolling effort from MS?

    More info available from previous Slashdot article... [slashdot.org]

