Google Buys reCAPTCHA For Better Book Scanning 138

Posted by CmdrTaco
from the when-spammers-give-you-lemons dept.
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

  • How slow is searching the internet going to be if you have to fill out a stupid obscured word each time?!
    • Re: (Score:1, Insightful)

      by Anonymous Coward
      As slow as searching most forums
    • Re: (Score:3, Insightful)

      by natehoy (1608657)

      Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.

      Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competiti

      • by mysidia (191772)

        I could see Google using it to protect their account signup/login/new service signup processes, not their search function.

        What's more valuable is other people using reCaptcha technology. Google can now benefit from their use, by using the service to assist their book scanning/OCR efforts.

        • They'd probably not do it because they'd be bound to be sued for it.

          • by mysidia (191772)

            Not likely. They already use Captcha technology to protect their signup pages. It's just a matter of replacing their in-house custom built implementation with the reCaptcha one.

  • Well... (Score:4, Interesting)

    by vikhyat (1593841) on Thursday September 17, 2009 @10:10AM (#29453083)
    This should improve Google's indecipherable CAPTCHA.
  • Why just words? (Score:4, Insightful)

    by Thanshin (1188877) on Thursday September 17, 2009 @10:11AM (#29453095)

    I suppose most people write fast enough to allow sentence captchas already.

    • Re:Why just words? (Score:5, Insightful)

      by Canazza (1428553) on Thursday September 17, 2009 @10:21AM (#29453199)

      no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.
      I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
      Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

      • Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

        Don't hate on the children. Most keyboards are way too big for the li'l ones anyways. We should be getting them netbooks... and maybe cellphone keyboards. They could probably type great on those, with their tiny little fingers.

        Lord knows, I can't do it. :)

        --Jimmy

      • by British (51765)

        I admit, I'm great with a standard QWERTY keyboard, but when it comes to remote controls for cable boxes, VCRs, etc., I slow down to a crawl. Perhaps it's just what you are used to. I almost never look at my keyboard (maybe for typing in tough passwords), but for my VCR remote control (infrequently used), it's a bit more difficult.

      • no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.

        I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her

      • I can touch-type Dvorak at 80+wpm. I'm reduced to hunt-and-peck mode with Qwerty, however. Which proves the superiority of Dvorak of course.
  • Check out this Google book.... about the 7th page down.

    http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false [google.com]

    I thought these were scanned in by robots? If so it looks like it has well kept fingernails.
    • by KDR_11k (778916)

      Presumably the robot wasn't the only one ever to handle that book.

      • Re: (Score:1, Funny)

        by Anonymous Coward
        Presumably the robot wasn't the only one ever to handle that book.

        Maybe not. But I know that when I'm done handling a book I usually don't leave my hands there with it.
    • Humans - the new replacement for robots.

      Why drop half a million dollars on a machine when you can pay someone 25k a year to do the same job!

      But really, they probably do have robots that do some of the work - but to my (very limited) knowledge, even the best are somewhat destructive.

    • by Jared555 (874152)

      They probably also have some that were manually scanned, or there are probably cases where pages stick together and require human intervention. If the robot scans a book and then later it is discovered a page didn't get scanned they probably are going to manually scan it.

  • Good idea, but how? (Score:1, Interesting)

    by Nesa2 (1142511)
    ReCAPTCHA is a free service that usually integrates into forums, blogs, and other such anonymous comment-posting services to help eliminate bot spamming. I think they will not use it on Google search pages, but exploit ReCAPTCHA users of all of those sites that do use it already. Sounds to me like a really good idea...

    I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...
    • by city (1189205)
      There is a really good talk by the reCAPTCHA founder, Von Ahn, describing their method for validating a word and how they are using it to digitize old NYT articles. I think it's this one: http://www.youtube.com/v/Aszl5avDtekhl=en&%23038;fs=1&%23038;rel=0 [youtube.com]
    • by Akral (975984)

      Simple.
      They present two words - one is computer generated and is, in fact, the real CAPTCHA test. The other is a word from a book that OCR failed to read. People fill in both words, because they don't know which is which. They show the same failed OCR word to a hundred people and get a stable result from the majority answer, even if somebody tries to abuse the system and write some bad words instead.
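
The two-word scheme described above can be sketched roughly as follows. This is only a toy model, not reCAPTCHA's actual implementation: the class name, the vote threshold, and the agreement ratio are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


class TwoWordCaptcha:
    """Toy model of a two-word CAPTCHA with majority voting (hypothetical names)."""

    def __init__(self, min_votes=3, min_agreement=0.6):
        # Both thresholds are illustrative assumptions, not real reCAPTCHA values.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        # Maps an unknown-word id to a tally of the answers typed for it.
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Pass or fail the user on the control word only.

        The answer for the unknown word is recorded as a vote, but only
        when the control word was typed correctly.
        """
        if control_answer.strip().lower() != control_truth.lower():
            return False
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def consensus(self, unknown_id):
        """Return the majority transcription, or None if there is no stable result yet."""
        counts = self.votes[unknown_id]
        total = sum(counts.values())
        if total < self.min_votes:
            return None
        word, n = counts.most_common(1)[0]
        return word if n / total >= self.min_agreement else None


captcha = TwoWordCaptcha()
for guess in ["beloved", "beloved", "8cloved", "beloved"]:
    captcha.submit("morning", "morning", "scan-17", guess)
print(captcha.consensus("scan-17"))  # prints: beloved
```

In this sketch a single odd answer is still accepted (the user passed the control word), but it is outvoted once enough other people have typed the same scanned word.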

  • by Kokuyo (549451)

    Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.

  • by NoYob (1630681) on Thursday September 17, 2009 @10:25AM (#29453241)
    As I get older, I find that I'm having a harder time reading from computer monitors and especially captchas. I confuse words all the time. For acample: erection with election. Not so bad, but if Google doesn't pass that unknown to multiple folks, it could get embarrassing. Text from a Bill Clinton bio:

    After Bill Clinton's first erection as President, he proceeded .....

    • by HipToday (883113)
      Or acample with example.
    • I find that reCAPTCHA is MUCH easier to decipher than standard ones. I mean, I have decades of experience deciphering text on the curve of a book, with cheap printing, versus the ones made hard to read on purpose.

      But a few of the reCAPTCHAs are just misprinted and would require someone to read the sentence to figure out what should go there.
    • by natehoy (1608657)

      Most CAPTCHA solutions have at least two ways you can solve them. Some offer an audio version of the words that is only slightly garbled (enough to defeat voice recognition) that you can listen to in addition to or instead of the CAPTCHA word, and some allow you to solve some simple word problem instead of CAPTCHA if your hearing AND eyesight are both bad.

      As far as the Clinton example, funny, but in reality people are going to be looking at one word at a time. The Clinton bio example would be frequently m

    • Protip: Ctrl-+

      Seriously. Or change the freakin' resolution of your display.

      There, was it that hard? ^^

  • by natehoy (1608657) on Thursday September 17, 2009 @10:35AM (#29453347) Journal

    Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

    Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

    • by CSMatt (1175471)

      CAPTCHAs can be defeated right now by using mechanical turk or social engineering to get humans to solve the CAPTCHAs for the spammers.

    • by slim (1652) <{john} {at} {hartnup.net}> on Thursday September 17, 2009 @10:54AM (#29453519) Homepage

      What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.

      Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.

      • Re: (Score:3, Insightful)

        by Hurricane78 (562437)

        No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

        GP is using faulty logic (circular reasoning I think).

        If ReCaptcha improves OCR algorithms, then not only will spammers have access to them, but so will the effort behind ReCaptcha.
        So the now scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that would still not be scannable by any OCR.

        • Re: (Score:2, Informative)

          by koick (770435)
          In this interview on Wired, Luis von Ahn explains that they do indeed warp it: http://www.youtube.com/watch?v=3PuZ55kyf7E [youtube.com]
        • Re: (Score:3, Insightful)

          by Hays (409837)

          The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?

        • Re: (Score:3, Informative)

          by ChaosDiscord (4913) *

          No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

          Go here [recaptcha.net]. Bounce on the reload button a few times to see some example reCAPTCHA. Tell me with a straight face that they're not warped. Perhaps they're scanning books printed on silly putty? As for obfuscated see the example here [recaptcha.net]. They used to slap a line across each word. They don't appear to be doing so any more, but they used to.

      • Spammers already have a way around captchas - getting Indians to solve them. I turned the flow of spam off my website for about a month by installing a captcha for registration. Then, I get a few enterprising young businessmen from India solving the captchas and spamming the comments by hand. You can't win.
    • Re: (Score:1, Interesting)

      by Anonymous Coward

      If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...

      • by maxume (22995)

        All you have to do is add a level of indirection. Take the reCAPTCHA images and present them to users of your rereCAPTCHA system, and then use the results to solve the reCAPTCHA tests.

        I suppose keeping up with the turnover of the reCAPTCHA might be an issue, but if the problem were valuable enough to solve...

    • In addition to just showing a scanned word, the captcha image is contorted and corrupted. This makes captchas much, much harder to solve compared to standard OCR problems. Improving and perfecting OCR is unlikely to have as much of an adverse impact on captchas as spammers hiring poor folks to solve them.
  • by Thaelon (250687) on Thursday September 17, 2009 @11:00AM (#29453597)

    I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.

    It's not even killing two birds with one stone, it's killing two birds with one of the birds.

  • The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.

  • by AP31R0N (723649)

    Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?

  • I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...

    ...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?

    So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.

  • I just got a correct response from a clearly incorrect answer.
    The image was of Beloved but being difficult I answered 8cloved and got accepted.
    It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.
