reCAPTCHA Hard At Work, Rescuing Fading Texts

Posted by timothy on Thu Aug 14, 2008 08:05 PM
from the strange-confluence dept.
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)

Related Stories

[+] IT: Carnegie Mellon CAPTCHA Digitization Project Now Underway 119 comments
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie and one of which isn't. If the user correctly deciphers the known word, then the answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
[+] Technology: Next-Generation CAPTCHA Exploits the Semantic Gap 327 comments
captcha_fun writes "Researchers at Penn State have developed a patent-pending image-based CAPTCHA technology for next-generation computer authentication. A user is asked to pass two tests: (1) click the geometric center of an image within a composite image, and (2) annotate an image using a word selected from a list. These images shown to the users have fake colors, textures, and edges, based on a sequence of randomly-generated parameters. Computer vision and recognition algorithms, such as alipr, rely on original colors, textures, and shapes in order to interpret the semantic content of an image. Because of the endowed power of imagination, even without the correct color, texture, and shape information, humans can still pass the tests with ease. Until computers can 'imagine' what is missing from an image, robotic programs will be unable to pass these tests. The system is called IMAGINATION and you can try it out." This sounds promising given how broken current CAPTCHA technology is.
[+] IT: Understanding How CAPTCHA Is Broken 148 comments
An anonymous reader writes "Websense Security Labs explains the spammer Anti-CAPTCHA operations and mass-mailing strategies. Apparently spammers are using a combination of different tactics (proper email accounts, visual social engineering, and fast-flux) that together represent a strategy, explains their resident CAPTCHA expert. It is evident that spammers are working towards defeating anti-spam filters with their tactics."
[+] IT: Fallout From the Fall of CAPTCHAs 413 comments
An anonymous reader recommends Computerworld's look at the rise and fall of CAPTCHAs, and at some of the ways bad guys are leveraging broken CAPTCHAs to ply their evil trade. "CAPTCHA used to be an easy and useful way for Web administrators to authenticate users. Now it's an easy and useful way for malware authors and spammers to do their dirty work. By January 2008, Yahoo Mail's CAPTCHA had been cracked. Gmail was ripped open soon thereafter. Hotmail's top got popped in April. And then things got bad. There are now programs available online (no, we will not tell you where) that automate CAPTCHA attacks. You don't need to have any cracking skills. All you need is a desire to spread spam, make anonymous online attacks against your enemies, propagate malware or, in general, be an online jerk. And it's not just free e-mail sites that can be made to suffer..."
  • Not new (Score:4, Informative)

    by JazzyMusicMan (1012801) on Thursday August 14 2008, @08:07PM (#24609239)
    Ticketmaster and other sites have already been doing this for a while. Go to ticketmaster and search for tickets, you'll see two words. One is known and the other is unknown. If you don't believe me, try to guess which one they know and misspell the other one on purpose (or don't, this is for historic posterity =) )
    • Re: (Score:3, Informative)

      So is the US Patent and Trademark Office, as part of the process of using PAIR [uspto.gov], the Patent Application Information Retrieval system, which lets the public look at information about patent applications that have been published.

    • Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.

      • Re:Not new (Score:5, Funny)

        by grahamd0 (1129971) on Thursday August 14 2008, @08:45PM (#24609587)

        Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.

        That's not fair.

        Plenty of useless adults waste their time on Facebook.

      • Re: (Score:3, Informative)

        Do they really? From what I was able to tell, it's not specified as reCAPTCHA anywhere in the window; having looked at the reCAPTCHA site from a development side I could swear that I read that you needed to give credit if developing a custom style for it. Either I'm remembering wrong, they've got a deal, or FB is undergoing one of the stupidest TOS violations ever.

        • Re:Not new (Score:4, Informative)

          by erbmjw (903229) on Thursday August 14 2008, @09:21PM (#24609911)
          from reCAPTCHA FAQ [recaptcha.net]

          When showing reCAPTCHA to the user, is it possible not to show the reCAPTCHA logo? We allow you to customize the theme of reCAPTCHA with our Client API. You are still required to have text on your website which states that you are using reCAPTCHA, however with our theming API, you are free to do this in a way that blends in to your site.

        • Re: (Score:3, Informative)

          Do they really? From what I was able to tell, it's not specified as reCAPTCHA anywhere in the window; having looked at the reCAPTCHA site from a development side I could swear that I read that you needed to give credit if developing a custom style for it. Either I'm remembering wrong, they've got a deal, or FB is undergoing one of the stupidest TOS violations ever.

          They do give attribution to reCAPTCHA. You have to click on "What's this?"

          This is a standard security test that we use to prevent spammers from creating fake accounts and spamming users. Our captchas are provided by ReCaptcha

      • Re: (Score:3, Insightful)

        Because that's so different than the thousands of useless geeks wasting their time on /.
    • Re: (Score:3, Insightful)

      I would imagine that they use multiple logins to verify one word - it's not like people don't mistype captchas in the first place.

    • Re:Not new (Score:4, Informative)

      by Your Pal Dave (33229) on Thursday August 14 2008, @10:29PM (#24610433)

      Quoting from the NPR story [npr.org] which aired earlier today:

      more than 40,000 Web sites -- including popular ones such as Ticketmaster, Facebook and Craigslist -- are using a new kind of security program called reCAPTCHA.

      • Re: (Score:3, Interesting)

        Quoting from the NPR story [npr.org] which aired earlier today:

        more than 40,000 Web sites -- including popular ones such as Ticketmaster, Facebook and Craigslist -- are using a new kind of security program called reCAPTCHA.

        That's scary. The way reCAPTCHA works allows the reCAPTCHA server to collect the IPs of reCAPTCHA users (along with the reCAPTCHA-enabled website they are using). If many websites are using reCAPTCHA, users can be tracked as they move through the web, from one reCAPTCHA-enabled website to the next.

        The idea is cute, but the implementation is fundamentally broken and a huge breach of privacy.

  • by Anonymous Coward on Thursday August 14 2008, @08:16PM (#24609359)

    I can usually tell which of the two words is from a real old text. With high probability (>90%) I can correctly answer the real CAPTCHA and replace someone's OCR'd word with "penis".

    I've only ever done this maybe ten or twenty times, but it could easily become an automatic part of using the system.

    • I'm sure they send the same unknown word out to multiple people, and wait for a consensus on it.

      Now, if we ALL started entering "penis" for the obvious unknown words.. :)

    • The thing is, they're often actually both from old texts. It's just that one of them has already been verified.

      And TFA states that they do pass every word by multiple people so as to get more accuracy in what they say. I have little doubt that they're well acquainted with people who try spoofing them.

    • by PPH (736903) on Thursday August 14 2008, @10:54PM (#24610609)

      Since they use entries from several users to validate correct translations for OCR'ed text, this probably won't cause them major problems. OTOH, I wonder if they can track the accuracy of each user's inputs and, if it becomes evident that a user is either incompetent or attempting to screw with the system, take appropriate measures.

      When someone's karma starts dropping into the negative range, they should let us know how well this worked out. If anyone can see their posts, that is.
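      A rough sketch of the per-user tracking wondered about above, in Python. Nothing in TFA says reCAPTCHA actually does this; the class name, the thresholds, and the idea of comparing each answer against the eventually accepted transcription are all assumptions made up for illustration.

```python
from collections import defaultdict


class UserAccuracyTracker:
    """Hypothetical bookkeeping: how often a user agrees with the accepted answer."""

    def __init__(self, min_answers=5, min_accuracy=0.5):
        # Both thresholds are invented for this sketch, not real reCAPTCHA settings.
        self.min_answers = min_answers
        self.min_accuracy = min_accuracy
        self.stats = defaultdict(lambda: [0, 0])  # user -> [agreed, total]

    def record(self, user, answer, accepted):
        """Log one answer and whether it matched the accepted transcription."""
        entry = self.stats[user]
        entry[1] += 1
        if answer.strip().lower() == accepted.strip().lower():
            entry[0] += 1

    def accuracy(self, user):
        """Fraction of matching answers, or None if the user has no history."""
        agreed, total = self.stats[user]
        return agreed / total if total else None

    def is_suspect(self, user):
        """Flag users with enough history and a low agreement rate."""
        agreed, total = self.stats[user]
        if total < self.min_answers:
            return False  # too little data to judge
        return agreed / total < self.min_accuracy
```

      What the "appropriate measures" would be (ignoring the user's guesses, serving harder challenges) is left open here, same as in the comment above.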

    • As soon as I heard about this project, I figured there'd be people finding ways to abuse it.

      I can see future generations sitting down for a good read:

      MOBY COCK

      Chapturd One

      Call me LOLOLFAG...

  • Cool possible uses (Score:5, Interesting)

    by Irish_Samurai (224931) on Thursday August 14 2008, @08:19PM (#24609379)

    Man, I would love to see the results if this technique was used for an ontological [google.com] purpose.

    Please type in the word from the choices below that most closely relates to this word: OLD

    HISTORIC
    LIFESPAN

    Interesting shit indeed.

    • by burgundysizzle (1192593) on Thursday August 14 2008, @08:50PM (#24609639)

      Or perhaps SLASHDOT-READER:

      OVERWEIGHT

      GEEK

      SPENDS-TOO-MUCH-TIME-USING-COMPUTERS

      ALL-OF-THE-ABOVE

      I fit into the category ALL-OF-THE-ABOVE. The only generalisation that is missing about slashdotters is the one about girlfriends.

    • is full of hyperbole, dogma, propaganda, and meaningless blatherings.
      • That's kinda the point moron.

        Let me introduce you to the concept of context.

          • The linked page is self-purporting. That's its purpose.

            If you are smart enough to see through it, you are smart enough to discredit it. In turn that makes it an example, not a message.

            The problem with trying to communicate a message of the sort I link to is that the goal is to get you to scream "BULLSHIT!"

            Parts apply and others don't, but they do provoke thought. Thought allows you to discard its catalyst for new ideas, but doesn't require it.

            If you take the link as truth, you miss the point.

      • Re: (Score:3, Informative)

        The point is to see what the populace thinks the relation is.

        If you think google is the end all be all of absolute information then you already fail.

  • by mschuyler (197441) on Thursday August 14 2008, @08:27PM (#24609415) Homepage Journal

    The New York Times is already online from 1851 onwards. The concept is cool, truly, but why not CAPTCHA something not already accomplished? Oh, I know. That was, like, a metaphor, right?

    • I am almost certain that it is not all there in its entirety. There are bits that are not online specifically because of OCR errors. That is going to be true with any large volume of OCRed text.
    • Yeah I was kinda wondering about that too, but from a different perspective... I mean: "So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times."

      What the hell is the problem with people? All text is apparently on a single page from the NY Times in 1908... I mean fuck, stop the press, 'cause it's obviously all redundant shit anyways, just keep redistributing that one page across the world!

  • by Nymz (905908) on Thursday August 14 2008, @08:30PM (#24609449) Journal
    The feature known as FADING was designed to protect copyright works from being pirated by becoming illegible before the work could fall into the public domain.
  • by v1 (525388) on Thursday August 14 2008, @08:41PM (#24609561) Homepage Journal

    A little OT, I know, but is anyone else having a bad time with Gmail's captchas? I've tried signing up several of our customers for Gmail recently and it's becoming really hard to get them right. The "audio" playback used to be the saving grace, but the last two I did sounded like ten people were talking to me all at once with no discernible key voice. (And the last time I succeeded, the string to be entered was spoken in three groups, by three different voices.)

    • Yep, I do the same thing, signing clients up for Google services, and I get their captchas right about once every three or four tries. :-(
  • Image Captchas (Score:4, Informative)

    by pembo13 (770295) on Thursday August 14 2008, @08:48PM (#24609613) Homepage
    I've found that implementing a simple "please choose the name of the item seen below" check eliminates a large amount of spam (all?) but has the problem of not being viable for blind people.
  • Took me a bit to get past the new security measures, But I got a coupon 5 cents off my next shoe purchase.

  • Right about now, I'm wondering what the implications would be for including reCAPTCHA in an open source project (a PHP-based blog I'm working on). Right now the blog is read-only, since I have yet to build my own working CAPTCHA system, and putting up an unprotected reply form is sheer idiocy since it will be a whole five minutes before the spam bots find it. My project is GPLv3, so would including reCAPTCHA cause me some sort of licensing problem?
    • reCAPTCHA should not cause any licensing issues if all you do is link to their site via the "magic four lines of code" or use one of their plugins

      from why reCAPTCHA [recaptcha.net]

      It's Easy. reCAPTCHA is a Web service. As such, adopting it is as simple as adding 4 lines of code on your site. For many applications and programming languages such as Wordpress and PHP we also have easy-to-install plugins available. We generate and check the distorted images, so you don't need to run costly image generation programs.

      Word

    • by corbettw (214229) <corbettw.yahoo@com> on Thursday August 14 2008, @09:35PM (#24610039) Homepage Journal

      There are multiple libraries for reCAPTCHA already published, all under the MIT License. Just see http://code.google.com/p/recaptcha/ [google.com] for a list of them.

  • How are they able to tell if I've accurately solved an unknown? If the word is "Yesterday" and I enter "Fucktard", not only will the society get some very wrong data, but I'll also have passed the CAPTCHA without entering the actual letters.
    • RTFA.

      You get two captchas. One is your standard, let's find out if you're human captcha, where the program knows the answer. The other is the scanned text. It also presents the same scanned text to many people, and then uses the results to figure out which one is the most likely correct result.
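      For anyone who wants that spelled out, here is a toy Python model of the control-word scheme as described in this thread. It is a sketch, not reCAPTCHA's actual code: the class, the word ids, and the agreement threshold of 2 (borrowed from the "two different users must give the same answer" line in the related story) are assumptions.

```python
from collections import Counter, defaultdict

# Assumed threshold: matching answers needed before an unknown word is accepted.
AGREEMENT_THRESHOLD = 2


class TwoWordChallenge:
    """Toy model: one control word (answer known) plus one unknown scanned word."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown word id -> tally of guesses
        self.resolved = {}                 # unknown word id -> accepted transcription

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes; only then record the unknown-word guess."""
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # failed the known word, so the guess is discarded
        if unknown_id not in self.resolved:
            guess = unknown_answer.strip().lower()
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.threshold:
                self.resolved[unknown_id] = guess
        return True


challenge = TwoWordChallenge()
challenge.submit("morning", "morning", "scan-17", "Yesterday")
challenge.submit("morning", "morning", "scan-17", "prank")      # lone joker
challenge.submit("morning", "morning", "scan-17", "yesterday")
print(challenge.resolved)  # {'scan-17': 'yesterday'}
```

      So a joker does pass the CAPTCHA, since only the control word gates entry, but a single bogus answer never reaches the threshold on its own.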

  • It turns out... (Score:3, Informative)

    by symbolset (646467) on Thursday August 14 2008, @11:02PM (#24610689) Journal

    That slashdot's Goatse troll server guy proves useful.

    Note: This is not a troll. One of the guys that offers open web services to slashdot trolls is also responsible for considerable development of CAPTCHA breakage and is an eminent Debian developer. This is why I've said that we should respect his efforts despite the unpleasant side effects. The truly brilliant we should grant exceptions from social behavior because they discover things more proper folk would not.

    • Re: (Score:3, Interesting)

      How is being responsible for CAPTCHA breakage useful?

      Look, just because the guy who more or less invented both trolling and automated trolling is an eminent UNIX guru and textbook author that doesn't mean his trolling on net.suicide was any less disgusting. I was appalled at the people who laughed along with Pike when he revealed that he was behind Bimmler and Shaney. This kind of thing is just not acceptable no matter who you are.

  • by Mumei no koshinuke (1110677) on Thursday August 14 2008, @11:43PM (#24610969)
    When solving these I sometimes find that there's more than one possibility for an illegible word, yet I can't tell which it is without knowing the context.
    For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
    Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
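    The prescreening step could be sketched like this in Python. It is only an illustration of the idea in this comment: the function name and both cutoffs (`min_votes`, `max_top_share`) are made up, and nothing here is taken from how reCAPTCHA actually works.

```python
from collections import Counter


def is_ambiguous(answers, min_votes=4, max_top_share=0.75):
    """Flag a word when no single transcription clearly dominates the votes.

    The cutoffs are arbitrary values chosen for this sketch.
    """
    votes = Counter(a.strip().lower() for a in answers)
    total = sum(votes.values())
    if total < min_votes:
        return False  # not enough answers yet to call it ambiguous
    top_count = votes.most_common(1)[0][1]
    return top_count / total < max_top_share


# A word where users split between "cost" and "cast" gets flagged,
# so it could be re-shown with its full sentence for context.
print(is_ambiguous(["cost", "cast", "cost", "cast"]))  # True
print(is_ambiguous(["cost", "cost", "cost", "cost"]))  # False
```

    Only the flagged words would need the "large" full-sentence captcha, which is why a smaller pool of willing solvers might still be enough.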
  • Why don't they just use whatever software is used by the crackers to bombard us with spam email to go through all of these books at whatever speed they're capable of? If compromised PCs can send tens of thousands of fake emails, why not just set a few up to figure out these words?

    How much worse is this than trusting users to correctly identify the text? I ask because I honestly don't know the success rate of the automated system.
    • Re:RTFA (Score:3, Informative)

      The authors also tested software designed to crack CAPTCHAs against images created using reCAPTCHA, and found that they failed completely. The authors ascribe this to the fact that the letters in scanned images contain distortions that are not the result of a clean mathematical transformation. User response times were also measured, but there were no significant differences between the time it took users to handle traditional systems and that required to use reCAPTCHA.

  • by RJFerret (1279530) on Friday August 15 2008, @01:58AM (#24611685) Homepage

    You can also use reCaptcha for your own email address, and be more willing to provide it "publicly" since they'd have to answer the reCaptcha to get to the mailto... reCaptcha mailhide [recaptcha.net]

    • Re: (Score:3, Funny)

      by Anonymous Coward

      The following security test allows us to validate you are a human and not an automated script.

      please type the following two words in the text box below

      you moron

      ____________ _____________

    • From TFA:

      The software presents one optically unreadable word and one "control" CAPTCHA word. Getting the control word right identifies the user as a human, and the program records his or her response to the unreadable word and adds it to a database.

      So, there is the real CAPTCHA, and another reCAPTCHA.

    • by RedWizzard (192002) on Thursday August 14 2008, @10:18PM (#24610363)

      One FUNDAMENTAL problem with this

      ... is that you didn't RTFA.