Google Buys reCAPTCHA For Better Book Scanning 138
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to ReCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."
Well... (Score:4, Interesting)
Re:WTF Summary (Score:1, Interesting)
As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.
At least, that's how I'd implement it.
Good idea, but how? (Score:1, Interesting)
I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...
I'm real giddy about this (Score:2, Interesting)
Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.
Won't this eventually defeat the purpose? (Score:4, Interesting)
Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?
Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...
Re:maybe they should use CAPTCHAs... (Score:4, Interesting)
Funny you should say that
http://mailhide.recaptcha.net/ [recaptcha.net]
Re:Won't this eventually defeat the purpose? (Score:1, Interesting)
If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...
Re:Mod up (Score:5, Interesting)
I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.
Re:WTF Summary (Score:3, Interesting)
I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.
I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!
Re:WTF Summary (Score:3, Interesting)
Interesting you should say that.
Unfortunately, it won't work - 4chan already ruined it for everyone.
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/