Open Source | AI | Software | The Internet

The Open-Source Software Saving the Internet From AI Bot Scrapers (404media.co) 27

An anonymous reader quotes a report from 404 Media: For someone who says she is fighting AI bot scrapers just in her free time, Xe Iaso seems to be putting up an impressive fight. Since she launched it in January, Anubis, a program "designed to help protect the small internet from the endless storm of requests that flood in from AI companies," has been downloaded nearly 200,000 times, and is being used by notable organizations including GNOME, the popular open-source desktop environment for Linux; FFmpeg, the open-source software project for handling video and other media; and UNESCO, the United Nations organization for education, science, and culture. [...]

"Anubis is an uncaptcha," Iaso explains on her site. "It uses features of your browser to automate a lot of the work that a CAPTCHA would, and right now the main implementation is by having it run a bunch of cryptographic math with JavaScript to prove that you can run JavaScript in a way that can be validated on the server." Essentially, Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot. One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do. This check is invisible to the user, and most browsers since 2022 are able to complete this test. In theory, bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge. This way, Anubis creates a computational cost that is prohibitively expensive for AI scrapers that are hitting millions and millions of sites, but marginal for an individual user who is just using the internet like a human.

Anubis is free, open source, lightweight, can be self-hosted, and can be implemented almost anywhere. It also appears to be a pretty good solution for what we've repeatedly reported is a widespread problem across the internet, which helps explain its popularity. But Iaso is still putting a lot of work into improving it and adding features. She told me she's working on a non-cryptographic challenge so it taxes users' CPUs less, and is also thinking about a version that doesn't require JavaScript, which some privacy-minded users disable in their browsers. The biggest challenge in developing Anubis, Iaso said, is finding the balance. "The balance between figuring out how to block things without people being blocked, without affecting too many people with false positives," she said. "And also making sure that the people running the bots can't figure out what pattern they're hitting, while also letting people that are caught in the web be able to figure out what pattern they're hitting, so that they can contact the organization and get help. So that's like, you know, the standard, impossible scenario."


Comments Filter:
  • by OverlordQ ( 264228 ) on Monday July 07, 2025 @08:15PM (#65504162) Journal

    Hashcash was thought up back in 1997 for combatting spam.

  • If we'd had this in 1997, maybe we could have prevented the Google bot and its ilk from crawling the web for their "search engines".
  • by Anonymous Coward

    bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge

    Not so much if the scraping can be done in a distributed fashion. By "infecting" a large group of systems, one can distribute the computational load of any "proof of work" check, in much the same way that machines are recruited by "evil" JavaScript to do some bitcoin mining, for example.

    also thinking about a version that doesn't require JavaScript

    Good. Because the root of many attacks is running JavaScript.

    • Not so much if the scraping can be done in a distributed fashion...

      Exactly. Now that this has been widely publicized, it will continue to work for another 2-3 months, tops, and then the bot swarm managers will simply escalate.

      • by Anonymous Coward

        okay but any source of compute would work, it's not like crypto mining is being done exclusively with illicit zombies

        yeah they could "just" 2x or 10x or 100x the number of boxes they're throwing at the scrape project, comp for whatever speedbump anubis is supposed to add, but that comes at a cost whether you're running your own or renting zombies

    sounds to me like this takes a concerted effort though (like how screeching about boycotts means fuckall without massive buy-in). scrapers don't care if one website

  • by vbdasc ( 146051 ) on Monday July 07, 2025 @11:55PM (#65504438)

    Bye-bye Wayback Machine... It was an honor knowing you.

    • Bye-bye Wayback Machine... It was an honor knowing you.

      This could probably be fixed; it's early and I haven't had my coffee yet, but perhaps "legit" sites like IA could have client certs whitelisted by the anti-botware...?

      Unfortunately I think this is necessary enshittification, in response to the reckless botmasters doing stupid shit like scraping the same content 10x per second and slashdotting small servers, running up hosting costs, etc. As another poster said, though, it seems it's just a matter of time before the scrapers figure out how to outsource the
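
      One way the client-cert idea could look in practice, as a purely hypothetical sketch (this is not a documented Anubis feature, and the names below are invented): HTTP middleware that lets requests bearing an allowlisted TLS client certificate skip the proof of work, while everyone else gets the challenge.

      // Hypothetical Go middleware: requests presenting an allowlisted TLS client
      // certificate (e.g. a known archive crawler) bypass the anti-bot challenge.
      // Assumes the server's tls.Config requests client certificates.
      package allowlist

      import "net/http"

      var allowedCrawlerCNs = map[string]bool{
          "crawler.archive.example": true, // made-up CN for an archiving service
      }

      func bypassForKnownCrawlers(withChallenge, direct http.Handler) http.Handler {
          return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
              if r.TLS != nil {
                  for _, cert := range r.TLS.PeerCertificates {
                      if allowedCrawlerCNs[cert.Subject.CommonName] {
                          direct.ServeHTTP(w, r) // trusted crawler: no proof of work
                          return
                      }
                  }
              }
              withChallenge.ServeHTTP(w, r) // everyone else solves the challenge first
          })
      }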

  • by simlox ( 6576120 ) on Tuesday July 08, 2025 @01:38AM (#65504534)
    Couldn't they do a useful calculation?
    • No, that would be against the ethos set out by bitcoin and all that "crypto" crap. Doing something useless feels subtle and elite, in a way we poor fucks cannot understand.

      If whoever the fuck designed that had an eye for actual engineering, there are a thousand better ways to distribute a ledger (in a way that cannot be subverted by someone NOT controlling the entire internet) than doing pointless calculations.

      BTW, surely adding javascript to a webcrawler is hard to impossible, and wasting some cycles (in a

    • You need a problem which is difficult to compute, with predictable difficulty, where the answer can be quickly verified. Few real-world problems meet all of those criteria. Most scientific calculations have to be completely re-run to verify the result, which would make it easy to DoS the server by submitting bogus answers.
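
      That fast-verify property is what hash-based challenges are built around: the server re-computes a single hash per submitted answer, no matter how long the client searched for it. A hedged continuation of the solver sketch near the top of the story (hypothetical names, not Anubis's actual validator):

      // Cheap server-side check: one SHA-256 per submission, so flooding the
      // server with bogus answers costs it almost nothing. Reuses the
      // leadingZeroBits helper from the solver sketch above.
      func verify(challenge []byte, nonce uint64, difficulty int) bool {
          buf := make([]byte, len(challenge)+8)
          copy(buf, challenge)
          binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
          return leadingZeroBits(sha256.Sum256(buf)) >= difficulty
      }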

  • The Github (Score:5, Informative)

    by bill_mcgonigle ( 4333 ) * on Tuesday July 08, 2025 @05:26AM (#65504726) Homepage Journal

    What you were looking for instead of a paywall:

    https://github.com/TecharoHQ/a... [github.com]

    404 Media is misleading - it should be called 402 Media.

  • ...but we probably need a Beware of Dog sign on the fence.

    Anubis is a brilliant response to the rising tide of AI-powered crawlers chewing through the small web like termites through a paperback. It's basically what robots.txt always wanted to be when it grew up—a gatekeeper that actually enforces the rules.

    When a browser hits a site protected by Anubis (I love the reference -- what is the weight of a bot scraper's soul, indeed?), it’s handed a lightweight JavaScript proof-of-work challenge—s

  • How much does this imply it would cost to scrape the web indiscriminately?
    If it's only a few billion it may be within reach of large AI companies, granting them the exclusive privilege of scraping the web.

    What about scraping a subset of the internet, such as a single website or a small group of sites?
    Would this still be cost prohibitive?
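
    A purely illustrative back-of-envelope, with every number made up rather than measured: if each challenge burns roughly one CPU-second and a broad crawl hits on the order of a billion protected pages, that is about 10^9 CPU-seconds, or roughly 280,000 CPU-hours; at a few cents per cloud CPU-hour, a full pass lands in the tens of thousands of dollars, recurring with every re-crawl and rising with whatever difficulty site operators set. Depending on those choices the real figure could be orders of magnitude higher or lower. A single site or a small cluster of sites involves only thousands of challenges, so the added cost there is effectively noise.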

  • The workaround for this will be live in 5... 4... 3... 2...
