
reCAPTCHA Hard At Work, Rescuing Fading Texts (112 comments)

sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)

  • Cool possible uses (Score:5, Interesting)

    by Irish_Samurai ( 224931 ) on Thursday August 14, 2008 @09:19PM (#24609379)

    Man, I would love to see the results if this technique were used for an ontological [google.com] purpose.

    Please type in the word from the choices below that most closely relates to this word: OLD

    HISTORIC
    LIFESPAN

    Interesting shit indeed.

  • by Anonymous Coward on Thursday August 14, 2008 @10:12PM (#24609825)
    I've seen a number of issues with reCaptcha that I don't really know how to handle (i.e. what to enter):
    1. Multiple word strings
    2. Foreign characters
    3. Illegible text
    4. A single word for both entries
    5. Words that look like one thing initially, but are really another when you look closer
  • by PPH ( 736903 ) on Thursday August 14, 2008 @11:54PM (#24610609)

    Since they use entries from several users to validate the transcriptions of OCR'ed text, this probably won't cause them major problems. OTOH, I wonder if they can track the accuracy of each user's inputs and, if it becomes evident that a user is either incompetent or attempting to screw with the system, take appropriate measures.

    When someone's karma starts dropping into the negative range, they should let us know how well this worked out. If anyone can see their posts, that is.
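
    The cross-checking and per-user tracking described in this comment could look roughly like the sketch below. It is not reCAPTCHA's actual implementation; the function names, the (user, image_id, answer) record shape, and the thresholds min_votes and min_share are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if there is no clear majority.

    min_votes and min_share are illustrative thresholds, not real settings.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    return word if count / len(normalized) >= min_share else None


def user_accuracy(submissions, min_votes=3, min_share=0.6):
    """Score each user by how often they match the consensus answer.

    submissions is a list of (user, image_id, answer) tuples (hypothetical shape).
    Images without a consensus are ignored, since there is nothing to score against.
    """
    by_image = defaultdict(list)
    for _user, image_id, answer in submissions:
        by_image[image_id].append(answer)
    agreed = {img: consensus(ans, min_votes, min_share) for img, ans in by_image.items()}

    hits, totals = Counter(), Counter()
    for user, image_id, answer in submissions:
        truth = agreed[image_id]
        if truth is None:
            continue
        totals[user] += 1
        hits[user] += answer.strip().lower() == truth
    return {user: hits[user] / totals[user] for user in totals}


submissions = [
    ("a", "w1", "cost"), ("b", "w1", "cost"), ("c", "w1", "cast"),
    ("a", "w2", "old"), ("b", "w2", "old"), ("c", "w2", "odl"),
]
scores = user_accuracy(submissions)
```

    A user whose score stays low over many images is either careless or deliberately feeding in junk, and their answers could then be given less weight or dropped.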

  • by Mumei no koshinuke ( 1110677 ) on Friday August 15, 2008 @12:43AM (#24610969)
    When solving these I sometimes find that there's more than one possibility for an illegible word, yet I can't tell which it is without knowing the context.
    For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
    Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
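
    The prescreening step suggested here can be sketched as follows, assuming the single-word captcha has already collected several answers per image. The names is_ambiguous and route, the image ids, and the min_share threshold are hypothetical, and the "large" full-sentence captcha itself is not modelled.

```python
from collections import Counter


def is_ambiguous(answers, min_share=0.3):
    """True if the runner-up answer has at least min_share of the votes."""
    counts = Counter(a.strip().lower() for a in answers if a.strip())
    top_two = counts.most_common(2)
    if len(top_two) < 2:
        return False
    total = sum(counts.values())
    return top_two[1][1] / total >= min_share


def route(answers_by_image, min_share=0.3):
    """Split images into resolved words and ones needing the full sentence."""
    resolved, needs_context = {}, []
    for image_id, answers in answers_by_image.items():
        cleaned = [a.strip().lower() for a in answers if a.strip()]
        if not cleaned:
            continue  # nothing collected yet for this image
        if is_ambiguous(cleaned, min_share):
            needs_context.append(image_id)
        else:
            resolved[image_id] = Counter(cleaned).most_common(1)[0][0]
    return resolved, needs_context


resolved, needs_context = route({
    "img1": ["cost", "cast", "cost", "cast"],  # people genuinely disagree
    "img2": ["the", "the", "the", "teh"],      # one typo, clear majority
})
```

    Only the images in needs_context would be shown with their surrounding sentence, so the larger captcha is needed for a small fraction of words.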
  • Interesting field (Score:2, Interesting)

    by Anonymous Coward on Friday August 15, 2008 @04:03AM (#24612005)

    My company is working on digitizing a large volume of old text (19th century government documents). There are a number of problems unique to old text:
    - OCR breaks down due to archaic letter shapes, smudging, letter damage and paper deterioration.
    - We evaluated OCR versus having the entire text retyped by Indians, and ended up going with the Indians. The only way to get sufficient accuracy (>99%) was to have everything done twice and do a comparison.
    - Even then, the typed text has to be checked using both automated and manual processes. The text is highly structured, which makes automatic checks possible, but we can't catch everything that way. Then again, the checks necessary for our text are more extensive than for an old newspaper.
    - For old texts, your average spelling checker is useless. You end up adding loads of words to the dictionary.
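
    The "done twice and compared" step from the list above can be sketched with Python's standard difflib module. This is only an illustration of the idea, not the poster's actual tooling; double_key_diff and the sample sentences are made up.

```python
import difflib


def double_key_diff(first, second):
    """Compare two independent typings of the same text, word by word.

    Returns a list of (word_index_in_first, text_in_first, text_in_second)
    for every place where the two typists disagree.
    """
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((i1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return disagreements


typist_one = "the cost of the survey was paid"
typist_two = "the cast of the survey was paid"
flagged = double_key_diff(typist_one, typist_two)
```

    Each flagged span would go to a human checker, which keeps the manual work limited to the places where the two typings differ. Errors that both typists make identically are not caught this way, which is why further checks are still needed.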

    ReCAPTCHA solves one of these problems (text entry), but I suspect a fair amount of work remains. E.g. sometimes you need context to decipher a word correctly.

  • Re:Not new (Score:3, Interesting)

    by Random Walk ( 252043 ) on Friday August 15, 2008 @05:18AM (#24612273)

    Quoting from the NPR story [npr.org] which aired earlier today:

    more than 40,000 Web sites -- including popular ones such as Ticketmaster, Facebook and Craigslist -- are using a new kind of security program called reCAPTCHA.

    That's scary. The way reCAPTCHA works allows the reCAPTCHA server to collect the IPs of reCAPTCHA users (along with the reCAPTCHA-enabled website they are using). If many websites are using reCAPTCHA, it becomes possible to track users as they move through the web, from one reCAPTCHA-enabled website to the next.

    The idea is cute, but the implementation is fundamentally broken and a huge breach of privacy.

  • Re:It turns out... (Score:3, Interesting)

    by argent ( 18001 ) <peter@slashdot . ... t a r o nga.com> on Friday August 15, 2008 @04:00PM (#24620467) Homepage Journal

    How is being responsible for CAPTCHA breakage useful?

    Look, just because the guy who more or less invented both trolling and automated trolling is an eminent UNIX guru and textbook author, that doesn't mean his trolling on net.suicide was any less disgusting. I was appalled at the people who laughed along with Pike when he revealed that he was behind Bimmler and Shaney. This kind of thing is just not acceptable no matter who you are.

