

Google Buys reCAPTCHA For Better Book Scanning
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."
Imagine! (Score:1)
Re: (Score:1, Insightful)
Re: (Score:3, Insightful)
Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.
Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competiti
Re: (Score:1)
I could see Google using it to protect their account signup/login/new service signup processes, not their search function.
What's more valuable is other people using reCaptcha technology. Google can now benefit from their use, by using the service to assist their book scanning/OCR efforts.
Re: (Score:1)
They'd probably not do it because they'd be bound to be sued for it.
Re: (Score:1)
Not likely. They already use Captcha technology to protect their signup pages. It's just a matter of replacing their in-house custom built implementation with the reCaptcha one.
Re: (Score:3, Insightful)
So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...
Well... (Score:4, Interesting)
Why just words? (Score:4, Insightful)
I suppose most people write fast enough to allow sentence captchas already.
Re:Why just words? (Score:5, Insightful)
No they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, and she can't even touch type.
I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.
Re: (Score:2)
Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.
Don't hate on the children. Most keyboards are way too big for the li'l ones anyways. We should be getting them netbooks... and maybe cellphone keyboards. They could probably type great on those, with their tiny little fingers.
:)
Lord knows, I can't do it.
--Jimmy
Re: (Score:2)
I admit, I'm great with a standard QWERTY keyboard, but when it comes to remote controls for cable boxes, VCRs, etc., I slow down to a crawl. Perhaps it's just what you are used to. I almost never look at my keyboard (maybe for typing in tough passwords), but with my infrequently used VCR remote control, it's a bit more difficult.
Familiar Creature (Score:2)
No they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, and she can't even touch type.
I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her
Re: (Score:2)
Re: (Score:2)
80 wpm? Isn't dvorak supposed to be faster or something? ;)
Re: (Score:2)
(like many other developers) I have to look at my hands (not constantly, but at least a glance every 3rd word) to type.
"like many other developers"??? Jebus, I hope not. I've never met a single developer who can't touch type. And in the company I work for, the average is in the 60-70 wpm range (and I'm definitely on the higher end, averaging about 120 wpm).
As for the looking at the keyboard, TBH, I'd just find that annoying... when I'm in the "flow", I prefer to keep my eyes on the screen... having to pau
Not just words (Score:2)
Is that a finger cot? (Score:2)
http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false [google.com]
I thought these were scanned in by robots? If so it looks like it has well kept fingernails.
Re: (Score:1)
Presumably the robot wasn't the only one ever to handle that book.
Re: (Score:1, Funny)
Maybe not. But I know that when I'm done handling a book I usually don't leave my hands there with it.
Re: (Score:2)
Humans - the new replacement for robots.
Why drop half a million dollars on a machine when you can pay someone 25k a year to do the same job!
But really, they probably do have robots that do some of the work - but to my (very limited) knowledge, even the best are somewhat destructive.
Re: (Score:2)
They probably also have some that were manually scanned, or there are probably cases where pages stick together and require human intervention. If the robot scans a book and then later it is discovered a page didn't get scanned they probably are going to manually scan it.
Good idea, but how? (Score:1, Interesting)
I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
Simple.
They present two words. One is computer generated and is, in fact, the real CAPTCHA test. The other is a word from a book that OCR failed to read. People fill in both words because they don't know which is which. The same failed-OCR word is shown to a hundred people, and the majority answer gives a stable result even if somebody tries to abuse the system by writing bad words instead.
I'm real giddy about this (Score:2, Interesting)
Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.
I hope they have a couple of tests! (Score:5, Funny)
After Bill Clinton's first erection as President, he proceeded .....
Re: (Score:1)
Re: (Score:1)
But a few of the reCAPTCHAs are just misprinted and would require someone to read the sentence to figure out what should go there.
Re: (Score:2)
Most CAPTCHA solutions have at least two ways you can solve them. Some offer an audio version of the words that is only slightly garbled (enough to defeat voice recognition) that you can listen to in addition to or instead of the CAPTCHA word, and some allow you to solve some simple word problem instead of CAPTCHA if your hearing AND eyesight are both bad.
As far as the Clinton example, funny, but in reality people are going to be looking at one word at a time. The Clinton bio example would be frequently m
Re: (Score:2)
Protip: Ctrl-+
Seriously. Or change the freakin' resolution of your display.
There, was it that hard? ^^
Re: (Score:2)
CTRL-+ just makes "erection" bigger. Since you ask: yes, it's hard.
Won't this eventually defeat the purpose? (Score:4, Interesting)
Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?
Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...
Re: (Score:2)
CAPTCHAs can be defeated right now by using mechanical turk or social engineering to get humans to solve the CAPTCHAs for the spammers.
Re: (Score:2)
CAPTCHAs can also be defeated with a system like reCAPTCHA.
Re:Won't this eventually defeat the purpose? (Score:5, Insightful)
What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.
Spammers already have a neat way around captchas: they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.
Re: (Score:3, Insightful)
No it's not warped and obfuscated. ReCaptcha gives you the word as-is.
GP is using faulty logic (circular reasoning I think).
If ReCaptcha improves OCR algorithms, then not only will spammers have access to them, but so will the effort behind ReCaptcha.
So the now scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that would still not be scannable by any OCR.
Re: (Score:2, Informative)
Re: (Score:3, Insightful)
The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?
Re: (Score:3, Informative)
Go here [recaptcha.net]. Bounce on the reload button a few times to see some example reCAPTCHA. Tell me with a straight face that they're not warped. Perhaps they're scanning books printed on silly putty? As for obfuscated see the example here [recaptcha.net]. They used to slap a line across each word. They don't appear to be doing so any more, but they used to.
Re: (Score:2)
Re: (Score:1, Interesting)
If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...
Re: (Score:1)
All you have to do is add a level of indirection. Take the reCAPTCHA images and present them to users of your rereCAPTCHA system, and then use the results to solve the reCAPTCHA tests.
I suppose keeping up with the turnover of the reCAPTCHA might be an issue, but if the problem were valuable enough to solve...
Re: (Score:1)
Re: (Score:2)
Excellent point.
Someone please mod parent insightful. Thanks! :)
reCAPTCHA is awesome (Score:5, Funny)
I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.
It's not even killing two birds with one stone, it's killing two birds with one of the birds.
Re: (Score:1)
Re: (Score:2)
I've already posted so I can't mod you up, but that might be the greatest analogy I've ever heard. I'm already thinking up applications for it.
Psst, scanning books is just one goal (Score:2)
The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.
Evil? (Score:2)
Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?
Waiiiiit.... (Score:1)
I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...
...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?
So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.
Beloved != 8cloved (Score:1)
The image was of Beloved but being difficult I answered 8cloved and got accepted.
It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.
Re: (Score:2)
read up on the implementation to see why you are wrong.
Re:WTF Summary (Score:5, Informative)
From: recaptcha.net [recaptcha.net]:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
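For anyone who prefers code to prose, the scheme quoted above can be sketched roughly as follows. This is only an illustration under assumptions, not reCAPTCHA's real implementation: the image ids, the in-memory stores `KNOWN_ANSWERS` and `unknown_votes`, and both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical in-memory stores; every name here is illustrative only.
KNOWN_ANSWERS = {"img_control_1": "morning"}  # control image id -> verified text
unknown_votes = defaultdict(Counter)          # unknown image id -> tally of guesses


def make_challenge(control_id, unknown_id, rng=random):
    """Return both image ids in random order so the user can't tell which is the control."""
    pair = [control_id, unknown_id]
    rng.shuffle(pair)
    return pair


def check_response(answers):
    """`answers` maps image id -> typed text. Pass/fail depends only on the control word."""
    passed = False
    for image_id, text in answers.items():
        if image_id in KNOWN_ANSWERS:
            passed = text.strip().lower() == KNOWN_ANSWERS[image_id]
    if passed:
        # Only count guesses from users who got the control word right.
        for image_id, text in answers.items():
            if image_id not in KNOWN_ANSWERS:
                unknown_votes[image_id][text.strip().lower()] += 1
    return passed
```

A user who mangles the control word fails the test and their guess for the unknown word is discarded, which is why palm-smashing one word at random does not reliably pollute the data.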
Re:WTF Summary (Score:5, Insightful)
That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.
I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads [wikipedia.org] they can use that as a captcha.
Mod up (Score:2)
Re:Mod up (Score:5, Interesting)
I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.
Re: (Score:3, Insightful)
Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.
Re: (Score:2, Funny)
Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.
...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.
Re: (Score:3, Funny)
hence the limit of two words to allow for "me too!" comments.
lol
Re: (Score:3, Funny)
Which gives rise to the question...
Don't you mean, "Which begs the question..."?!
(ducks)
Re: (Score:2)
Because having to read and enter a single, hard to read word is enough hassle for most people; two is stretching it. An entire sentence would be too much.
Re: (Score:2)
True. I think they could however highlight one or two words and ask the user to enter the highlighted words
Re: (Score:1)
Re: (Score:2)
I find reCaptcha highly readable. This isn't like other captcha techniques where there are really thin letters and random objects strewn about; it's just blurry, zoomed-in typewritten words that are hard for a computer to distinguish.
Re: (Score:3, Interesting)
I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.
I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented a
Re: (Score:1)
Re: (Score:1)
Re: (Score:3, Insightful)
You don't assume.
For the purposes of captcha, typing one word correctly suffices, as long as it is the right word (the known 'good' word).
For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.
So as long as a majori
Re: (Score:2)
Yeah, the multiple answers idea occurred to me later. I'm actually not talking about deliberate garbage answers, just people getting it wrong. If the word is badly scanned, you will get multiple answers for the unknown text, and the split may not be 100:1 but something like 100:90 between two answers, in which case you still don't know which is more correct. Or maybe, because of the nature of the image, the vast majority of people may actually converge on a wrong answer.
Re: (Score:2)
You keep running it until one answer dominates in a statistical sense. With the amount of data they are getting, it wouldn't be hard to construct a pretty accurate probabilistic model. If you never get a satisfactory probability for the most frequent answer, you could flag it for a developer to look at.
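To make the "run it until one answer dominates" idea concrete, here is a minimal sketch. It is not reCAPTCHA's actual rule: the function name `consensus` and the thresholds `min_votes`, `min_share` and `max_votes` are made up for illustration.

```python
from collections import Counter


def consensus(votes, min_votes=5, min_share=0.75, max_votes=50):
    """Decide what to do with one unknown word given the guesses so far.

    Returns (status, answer) where status is:
      "accepted"     - the top answer has at least `min_share` of the votes
      "pending"      - keep showing the word to more users
      "needs_review" - `max_votes` reached without a dominant answer
    All thresholds are illustrative assumptions.
    """
    tally = Counter(v.strip().lower() for v in votes)
    total = sum(tally.values())
    if total < min_votes:
        return ("pending", None)
    answer, count = tally.most_common(1)[0]
    if count / total >= min_share:
        return ("accepted", answer)
    if total >= max_votes:
        # No answer ever dominated: flag the image for a human to look at.
        return ("needs_review", None)
    return ("pending", None)
```

With these assumed thresholds, nine "Bacon" votes and one prank answer are accepted as "bacon", while a 10:9 split stays pending and a near-even split at 50 votes is flagged for review.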
Re: (Score:1)
Re: (Score:1, Offtopic)
Maybe next election when there's hanging chads they can use that as a captcha.
It would certainly be a lot fairer than the current process, in which a bunch of cronies each interpret the results to their preferred candidate's advantage and then a judge settles it.
Of course, the better solution is to not have such ambiguity in the first place.
If you wanted to implement a system for interpreting analog votes here is what I'd do:
1. All ambiguous votes are digitized. Of course, the definition of "ambiguous
Re: (Score:1, Troll)
I must say this system is ingenious.
I respectfully disagree. I hate CAPTCHA because it discriminates against AI. Instead, Web-based systems should be designed to accommodate AI participants. I hate reCAPTCHA even more because it is even more annoying and I have no idea who I am working for. I always intentionally smash the keyboard with my palm for the second word. I think that tricking people into working for you is by far the least decent way of distributing this process. It would be better to have an "OCR box" which has nothing to do with
Re: (Score:2)
I always intentionally smash the keyboard with my palm for the second word.
Well, it doesn't have to be the first word known and the second word unknown, it could be the opposite, or random.
Re: (Score:2)
If it is at random, one of the following will happen: I will either screw up the known word, in which case my OCR will not be trusted, or I will screw up the OCR word and get through. It should only take a few tries to get through, and there is no chance of helping with OCR.
Re: (Score:2)
Yeah. I often get combinations like "WORD vjfkjsmxs" or worse, "WORD [illegible smudge]".
I tend to simply put a dash for the smudge. They're not using that word to verify, after all, they just want to know what it says. So I tell them, "nothing". Likely, they'll get a lot of different results for it, and if the scoring algorithm is good it will eventually determine the word is illegible (or at least show it to a moderator of some kind).
Re:WTF Summary (Score:5, Informative)
The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.
Re: (Score:1)
Well, yeah, but the OCR attacker also just needs to get the OCR readable word right...
Re: (Score:2)
That'd involve designing a pattern-recognition system which can reliably decide which of two OCR words is less readable, mind you.
Re: (Score:2)
One KNOWN, one not. The known word is not necessarily going to be OCR readable... you can seed the database with 100 or so images which are known, but maybe not OCR readable. Of course it works better if the known words are NOT OCR readable.
The point is OCR can have typos as well, so just because OCR returns a result doesn't mean it should be trusted. The known word of the two is likely independently analyzed, probably by a human.
Once enough people put the same answer for an unknown word, it becomes trus
Re: (Score:1)
The 'known' word wasn't necessarily OCR readable. And their methods of OCR are probably not quite the same as the attacker's.
Re: (Score:1)
That's really interesting. I've always wondered why I have passed these CAPTCHAs even when I had to make wild guesses on some of the words because they were so hard to read.
However, how long will it be before a lot of users realize that it is irrelevant what you enter for the unknown word? Even if you don't know for sure which of the words is the unknown one, knowing the above, I think the risk is high that you just type nonsense if you can't read one of the words.
If enough people do this the system will
Re: (Score:1)
That is what happened with the Anonymous attack on the Time poll, with the 'penis' attack.
They looked at both words, saw which one was the least readable, filled in the good one, and filled in 'penis' for the second one, in the hopes of poisoning the database so that they would only have to enter the first word correctly.
Would be kind of amusing to see a couple of books showing up on Google Books with the word 'penis' randomly inserted in pages where reCaptcha was used.
Re: (Score:1, Interesting)
As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.
At least, that's how I'd implement it.
Re: (Score:2)
People still have to solve the first one correctly, and if enough people give the same answer to the second one, it is considered correct.
Re: (Score:3, Funny)
wisdow
OCR error?
Re: (Score:2)
Er... no. Read the reCAPTCHA info (Score:2)
Re: (Score:2)
So if enough people type ' penis' as the result, eventually 3 people will identify the captcha as 'penis' and it gets in the list of known words.
Marble cake, also, the game (Score:1)
Re: (Score:2)
Re: (Score:2)
ReCaptcha does that:
One of the words is generated or known, and the other is the new word they are trying to scan. You have to give both to access the protected system, since you don't know which is the known word and which is the new word.
http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org]
Re: (Score:1)
Re: (Score:2)
This is not just any captcha, but recaptcha. This captcha system will challenge you to recognize two words, one of which it understands and one it cannot understand. It assumes that, if sufficient people map the unrecognized word to the same set of letters (and also get the known word right), the image indeed maps to these letters.
This is, indeed, a neat idea for OCR.
Re: (Score:2)
The system works by having you validate 2 words. One of the words has already been verified to be correct, a known quantity. The other word is the unknown word. If you get the first one correct, it assumes you got the other one correct too. Error correction is done by having multiple people evaluate the same unknown word. If 3 people agree that the unknown word is "Bacon", the word is then taken to be bacon.
Random people trying to mess up the system will not succeed. However, if you convinced every
Re:WTF Summary (Score:5, Funny)
"Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."
Re: (Score:3, Interesting)
Interesting you should say that.
Unfortunately, it won't work - 4chan already ruined it for everyone.
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/
Re:maybe they should use CAPTCHAs... (Score:4, Interesting)
Funny you should say that
http://mailhide.recaptcha.net/ [recaptcha.net]