
Google Buys reCAPTCHA For Better Book Scanning

Posted by CmdrTaco
from the when-spammers-give-you-lemons dept.
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."
  • Why just words? (Score:4, Insightful)

    by Thanshin (1188877) on Thursday September 17 2009, @10:11AM (#29453095)

    I suppose most people write fast enough to allow sentence captchas already.

  • Re:Imagine! (Score:1, Insightful)

    by Anonymous Coward on Thursday September 17 2009, @10:12AM (#29453103)
    As slow as searching most forums
  • Re:Why just words? (Score:5, Insightful)

    by Canazza (1428553) on Thursday September 17 2009, @10:21AM (#29453199)

    No, they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer typing with one finger... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, and she can't even touch type.
    I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
    Not to mention children: when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

  • Re:WTF Summary (Score:5, Insightful)

    by iamhassi (659463) on Thursday September 17 2009, @10:28AM (#29453263) Journal
    "Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "

    That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.

    I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads [wikipedia.org] they can use that as a captcha.
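
The pairing described in the quote above can be sketched roughly as follows. This is a minimal illustration, not reCAPTCHA's actual code: the `Challenge` type, the `build_challenge` function, and the image IDs are all hypothetical names. It pairs one word the OCR could not read with one control word whose answer is known, and shuffles the order so the user cannot tell which is which.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    image_ids: tuple      # the two word images, in display order
    control_index: int    # position of the word whose answer is known
    control_answer: str   # known transcription of the control word


def build_challenge(unknown_id, control_id, control_answer, rng=random):
    """Pair an unread word with a known control word in random order."""
    pair = [(unknown_id, False), (control_id, True)]
    rng.shuffle(pair)  # hide which of the two words is the control
    control_index = next(i for i, (_, is_control) in enumerate(pair) if is_control)
    image_ids = tuple(image_id for image_id, _ in pair)
    return Challenge(image_ids, control_index, control_answer)
```

Shuffling matters here: if the control word always appeared in the same position, a bot (or a bored human) could answer only that one and type junk for the other.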
  • by slim (1652) <john@hartnup . n et> on Thursday September 17 2009, @10:54AM (#29453519) Homepage

    What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.

    Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.

  • Re:WTF Summary (Score:3, Insightful)

    by Chyeld (713439) <chyeld@@@gmail...com> on Thursday September 17 2009, @11:21AM (#29453815)

    You don't assume.

    For the purposes of captcha, typing one word correctly suffices, as long as it is the right word (the known 'good' word).

    For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.

    So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.
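
A rough sketch of the two checks described above, under stated assumptions: `grade`, `consensus`, and the thresholds `min_votes` and `min_share` are made-up names and values for illustration, not anything reCAPTCHA is known to use. The user passes if the control word matches, and the unknown word is accepted only once enough independent guesses agree.

```python
from collections import Counter


def grade(control_answer, typed_control):
    """The user passes if the known 'good' word was typed correctly."""
    return typed_control.strip().lower() == control_answer.strip().lower()


def consensus(guesses, min_votes=3, min_share=0.5):
    """Return the majority transcription of an unknown word, or None.

    min_votes and min_share are illustrative thresholds (assumptions).
    """
    normalized = [g.strip().lower() for g in guesses if g.strip()]
    if len(normalized) < min_votes:
        return None  # not enough independent answers yet
    word, count = Counter(normalized).most_common(1)[0]
    # Require a strict majority so a few joke answers cannot win.
    return word if count / len(normalized) > min_share else None
```

With these thresholds, four guesses of which three say "upon" would settle on "upon", while three guesses that all disagree would leave the word unresolved and keep it in rotation.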

  • by Hurricane78 (562437) <deleted@@@slashdot...org> on Thursday September 17 2009, @11:50AM (#29454215)

    No, it's not warped and obfuscated. ReCaptcha gives you the word as-is.

    GP is using faulty logic (circular reasoning I think).

    If ReCaptcha improves OCR algorithms, then not only will spammers have access to them, but so will the effort behind ReCaptcha.
    So the now-scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that still can't be read by any OCR.

  • Re:Mod up (Score:3, Insightful)

    by Chabil Ha' (875116) on Thursday September 17 2009, @11:50AM (#29454219)

    Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance of getting it right, whilst increasing the chance of a spam bot getting it wrong.

  • Re:Imagine! (Score:3, Insightful)

    by natehoy (1608657) on Thursday September 17 2009, @12:28PM (#29454787) Journal

    Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.

    Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.

    Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate google with 5 seconds' work.

  • by Hays (409837) on Thursday September 17 2009, @02:09PM (#29456397)

    The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?

  • Re:Imagine! (Score:3, Insightful)

    by SnowZero (92219) on Friday September 18 2009, @12:19AM (#29462467)

    So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...
