Books Education Google

Google Books As "Train Wreck" For Scholars 160

Following up on our earlier discussion, here's more detail on Geoffrey Nunberg's argument that Google Books could prove detrimental to academics and other scholars. Recently Nunberg gave a talk at a conference claiming that the metadata in Google Books is riddled with errors and is classified in a scheme unfit for scholarly use. This blog post was fleshed out somewhat a few days later in the Chronicle of Higher Education. Quoting from the latter: "Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, [and] Stephen King's Christine... A search on 'internet' in books written before 1950 turns up 527 hits. ... [Google blames some errors on the originating libraries.] ...the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves. ... In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore." The head of metadata for Google Books, Jon Orwant, has responded in detail to Nunberg's complaints in a comment on the original blog post, and says his team has already fixed the errors that Nunberg so helpfully pointed out.
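The misdated titles quoted above are the kind of error a simple plausibility check could flag. What follows is only a rough sketch, not a description of how Google or any library catalog actually validates records: the `BookRecord` type, the `implausible_dates` helper, and the minimum-age threshold are all hypothetical, and the author birth years are supplied here for illustration rather than taken from the summary.

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    """Hypothetical catalog record; the field names are illustrative only."""
    title: str
    author: str
    author_birth_year: int  # assumption: supplied for illustration, not from the summary
    listed_year: int        # publication year as listed in the catalog


def implausible_dates(records, min_author_age=15):
    """Return records whose listed year falls before the author could plausibly have written."""
    return [
        r for r in records
        if r.listed_year < r.author_birth_year + min_author_age
    ]


# The three titles the summary says were dated 1899.
records = [
    BookRecord("Killer in the Rain", "Raymond Chandler", 1888, 1899),
    BookRecord("The Portable Dorothy Parker", "Dorothy Parker", 1893, 1899),
    BookRecord("Christine", "Stephen King", 1947, 1899),
]

flagged = implausible_dates(records)
for r in flagged:
    print(f"{r.title}: listed {r.listed_year}, author born {r.author_birth_year}")
```

A check like this only catches dates that are impossible given the author; it would not notice a book listed a few decades off, so it is at best a first filter ahead of manual review.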
This discussion has been archived. No new comments can be posted.

  • by Nefarious Wheel ( 628136 ) on Monday September 07, 2009 @07:54PM (#29345161) Journal
    ...when you have Search? Pick your own keywords.
  • by Artraze ( 600366 ) on Monday September 07, 2009 @08:25PM (#29345367)

    > The odd thing about complaining about this is, what are they comparing
    > to? A hypothetical perfect online database that doesn't exist anyways?

    That's exactly why this article is little more than some long-winded trolling. So the metadata is wrong... As long as the books themselves are perfectly fine (which they seem to be), you can always check the metadata yourself. I must think that as far as Google is concerned (and 99+% of its users) the metadata isn't nearly as important as the data itself. Once the data is collected you can always fix the rest.

    Expect a new "tagging game" in the next year or two to manually correct these errors.

  • by Anonymous Coward on Monday September 07, 2009 @08:30PM (#29345403)

    And this is no exception. Before google books you had access to books from various libraries, books you owned, books you could loan from friends (*shock* *gasp* copyright infringement), books you could buy and books from non-google online sources. Now you have access to all of those and additionally google books. Even if google books is 99% "piece of shit" (which in my experience is simply not true, but nevertheless) you still have the 1% potentially useful material available that wasn't available before, so you win.

  • by mschuyler ( 197441 ) on Monday September 07, 2009 @08:30PM (#29345409) Homepage Journal

    like shelving 'Life of an Iceberg' under biographies, but by and large they strive to be and are correct. If they mess up, some other library will fix the error. Libraries' cataloging data is usually centralized by OCLC so that the data is uniform throughout the country as other libraries pull from this central source for their own catalogs. Libraries also use a recognized and standardized subject scheme with a controlled vocabulary, not just a bunch of meta tags. Cataloging librarians are a rare and little-recognized breed of people who spend their entire professional lives trying to make it easier to gain access to material. The result is an organized body of knowledge--not just a heap of books on the floor in no particular order, like the Internet--and Google. For Google to blame libraries for their troubles is like blaming the Machinist Mates on the Titanic for crashing the ship into an iceberg. There, full circle. How did that happen?

  • Obnoxious (Score:3, Insightful)

    by burgundysizzle ( 1192593 ) on Monday September 07, 2009 @08:33PM (#29345427)

    The inline replies are written with a smug sense of self-entitlement as though he and other "scholars" are the only legitimate users of Google Books. It's NOT about you - you are not going to create enough adsense hits to make this whole thing worthwhile (or turn a profit).

  • by dpbsmith ( 263124 ) on Monday September 07, 2009 @08:44PM (#29345505) Homepage

    This is much like Google itself.

    Google's brilliance, and woe, is its sloppy imprecision.

    You type in a query. It returns a bunch of stuff. Quite a lot of it is irrelevant and is perceived as not meeting the requirements of the search, but you don't mind because all you care about is that it finds what you want, not that it finds other stuff. Unfortunately, Google is so good that it tricks you into believing that it always finds everything that matches your query. But, of course, there's no way to find out what it _missed_.

    I've personally noticed and been puzzled by the publication dates. I'd noticed it particularly with periodicals. What seems to be the case here is that Google is very prone to give the date that a journal began publication as the publication date of every article that has ever appeared in that journal.

    Wikipedia editors are well aware of the dangers of using Google hit counts as data. It's amusing to see that there are 1,930,000 hits on "Ghandi" compared to 22,900,000 for "Gandhi" and conclude that Gandhi's name is misspelled 10% of the time... or to notice, as I have, that that percentage is increasing and project the year in which "Ghandi" must inevitably become the accepted spelling... but it is, as they say, "for amusement purposes only."

  • Re:Obnoxious (Score:5, Insightful)

    by Volante3192 ( 953645 ) on Monday September 07, 2009 @08:45PM (#29345513)

    Definitely. It's like, "Oh, look, I found an error. If I had done this, that error wouldn't be there!!" And to that I respond, then do it yourself. YOU go tack metadata onto the 100 million books they have, you smug egocentric bastard.

    And, of course, he completely ignores the 999,999 proper entries compared to the 1 error. Google seems to know there are lots of problems here, and they're not going to get it right on the first pass. But having a first pass at all is better than nothing.

  • by presidenteloco ( 659168 ) on Monday September 07, 2009 @08:47PM (#29345517)

    Yes, having all of the world's literature available for instant full text search sounds disastrous for scholars.

  • by moon3 ( 1530265 ) on Monday September 07, 2009 @08:49PM (#29345545)
    They pushed the copyright law to over a hundred years (just to make sure they will make money off writers even after they are dead), now comes our big brother Google to the ring to resurrect all the OUT OF COPYRIGHT books -- meaning those dead books that publishers no longer exclusively distribute. What an offense against the poor publishers. Google is creating a real e-Library of enormous proportions of virtually free books, what a threat. I bet I am not the only one who wants to see Newton's books on physics e-published again and searchable.
  • by ahoehn ( 301327 ) <andrew AT hoe DOT hn> on Monday September 07, 2009 @09:08PM (#29345677) Homepage

    > Sorry if I sound bitter, but I spent a lot of time reading this crap, and very little of it was as insightful or interesting as even my classmates' comments.

    That sounds like more of a you problem than an academia problem. If you don't enjoy using a work's minutiae to accuse perfectly innocent authors of misogyny, innuendo, (to add a couple you forgot) blatant colonialism or latent homosexuality, what the fuck were you doing in an English Lit program? The rest of us live for that shit.

    > As someone who should not have majored in English Literature in college

    There. I fixed it for you.

  • by martin-boundary ( 547041 ) on Monday September 07, 2009 @09:31PM (#29345839)

    > The odd thing about complaining about this is, what are they comparing to?

    How about good old fashioned legwork? It *is* possible to make sure that the metadata is consistent with the facts, but that involves doing actual research and verification such as academics have been doing for hundreds of years.

    > To me, google's project was a long time coming - somebody had to scan the world's back catalog.

    Then you have very low standards indeed. There's absolutely no reason why a single entity had to / has to scan all the world's back catalog on their own as fast as they can. It's pure commercial greed, and leads to the garbage we have on the net today.

    What is needed is an open standard for scanned works, with minimum resolution, minimum quality, and minimum verified metadata such as subject, author, publisher, year etc. All those are trivially listed on the title page of every book. All one has to do is open the damn book and flip a few pages, but that appears to be too hard for some people.

    This is a long-term project for humanity. There's absolutely no point in having crappy scans with garbage metadata available quickly today, when it could be available correctly with good quality in, say, five years. It's also a perfect case for crowdsourcing, with some real standards to ensure quality.

    The current dreck that's online only causes duplication and waste. Take a look someday at archive.org (for example), and see how many copies of the same book are available, if it's a popular book. You'll typically find 5-10 scanned versions, by Google, Microsoft, and various local library projects, in black and white or colour, none of which is truly good quality: broken characters, pages with dark margins, missing pages, typos or incorrect titles, wrong authors etc.

    Why did they bother?

  • Re:Obnoxious (Score:3, Insightful)

    by fuzzyfuzzyfungus ( 1223518 ) on Monday September 07, 2009 @10:18PM (#29346107) Journal
    If you were a scholar, writing for an audience of other scholars, why wouldn't you write about the concerns of scholars and from their perspective? I'm sure he knows exactly why Google is doing what it's doing; but that doesn't mean that he can't point out the downsides.

    It's like saying that Slashdot is obnoxious because it is "written with a smug sense of self-entitlement as though he and other 'geeks' are the only legitimate users of the Internet". This is true; but that is because it is a geek website where geeks write about geek stuff. Obviously we know why Comcast is capping and packet shaping; but that doesn't mean we can't whine about the downsides for us.
  • by fuzzyfuzzyfungus ( 1223518 ) on Monday September 07, 2009 @10:23PM (#29346159) Journal
    Which is incredibly helpful for anybody interested in printed materials before 1966...
  • by Anonymous Coward on Monday September 07, 2009 @10:37PM (#29346287)

    The concern is really the Faustian bargain that Google has been willing to strike with trade groups (like the Authors Guild settlement). Google has conceded the point that these groups should be facilitated in their great land grab of out-of-print books, in return for Google's right to index them.

    It is reasonable to question whether those bargains are fair, especially since we have projects like the Internet Archive, which wouldn't make such a concession. It's also reasonable to question whether Google and a trade group even have the legal standing to strike that sort of deal.

  • by bigbigbison ( 104532 ) on Monday September 07, 2009 @11:46PM (#29346787) Homepage
    I don't read him as saying, "any book that can be found in the holdings of a major research library is only of interest to scholars," at all. Rather, I read him as saying that the systems that libraries use to organize books, be they Dewey Decimal, Library of Congress, or some other system, were created to help organize books for users to use them. The BISAC classifications were developed to help companies sell books. Why use that rather than what the libraries -- the source of these books -- use?
  • by julesh ( 229690 ) on Tuesday September 08, 2009 @03:40AM (#29348155)

    The problem is that the existence of google books makes it harder for others working on similar systems (and there are others, this isn't just a pipedream) to become established. A Google Books court-approved class-action copyright settlement would make it harder for somebody else to reach a similar agreement (because the public interest argument will be harder to make). Essentially, this is a field where the first person to do it is likely to end up with a monopoly, and Google have done it badly, thus precluding other people from doing it properly.
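For anyone who wants to check the hit-count arithmetic in dpbsmith's comment above ("Ghandi" vs. "Gandhi"), here is a quick sketch using only the two counts quoted in that comment. The variable names are made up for illustration, and as the comment itself says, hit counts are not reliable data, so this shows only what the quoted numbers imply.

```python
# Hit counts exactly as quoted in the comment; not re-measured.
ghandi_hits = 1_930_000   # misspelling
gandhi_hits = 22_900_000  # correct spelling

# Misspelled hits relative to correctly spelled hits.
ratio_to_correct = ghandi_hits / gandhi_hits

# Misspelled hits as a share of all hits for either spelling.
share_of_total = ghandi_hits / (ghandi_hits + gandhi_hits)

print(f"relative to 'Gandhi' hits: {ratio_to_correct:.1%}")
print(f"share of all hits:         {share_of_total:.1%}")
```

Either way of counting lands a bit under the "10%" in the comment (about 8%), which fits its point that these figures are for amusement only.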

