Open Source | AI | Software | The Internet

The Open-Source Software Saving the Internet From AI Bot Scrapers (404media.co)

An anonymous reader quotes a report from 404 Media: For someone who says she is fighting AI bot scrapers just in her free time, Xe Iaso seems to be putting up an impressive fight. Since she launched it in January, Anubis, a program "designed to help protect the small internet from the endless storm of requests that flood in from AI companies," has been downloaded nearly 200,000 times, and is being used by notable organizations including GNOME, the popular open-source desktop environment for Linux; FFmpeg, the open-source software project for handling video and other media; and UNESCO, the United Nations organization for education, science, and culture. [...]

"Anubis is an uncaptcha," Iaso explains on her site. "It uses features of your browser to automate a lot of the work that a CAPTCHA would, and right now the main implementation is by having it run a bunch of cryptographic math with JavaScript to prove that you can run JavaScript in a way that can be validated on the server." Essentially, Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot. One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do. This check is invisible to the user, and most browsers since 2022 are able to complete this test. In theory, bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge. This way, Anubis creates a computational cost that is prohibitively expensive for AI scrapers that are hitting millions and millions of sites, but marginal for an individual user who is just using the internet like a human.

Anubis is free, open source, lightweight, can be self-hosted, and can be implemented almost anywhere. It also appears to be a pretty good solution for what we've repeatedly reported is a widespread problem across the internet, which helps explain its popularity. But Iaso is still putting a lot of work into improving it and adding features. She told me she's working on a non-cryptographic challenge so it taxes users' CPUs less, and is also thinking about a version that doesn't require JavaScript, which some privacy-minded users disable in their browsers. The biggest challenge in developing Anubis, Iaso said, is finding the balance. "The balance between figuring out how to block things without people being blocked, without affecting too many people with false positives," she said. "And also making sure that the people running the bots can't figure out what pattern they're hitting, while also letting people that are caught in the web be able to figure out what pattern they're hitting, so that they can contact the organization and get help. So that's like, you know, the standard, impossible scenario."

  • by OverlordQ ( 264228 ) on Monday July 07, 2025 @08:15PM (#65504162) Journal

    Hashcash was thought up back in 1997 for combatting spam.
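
    For reference, a hashcash v1 stamp is a single string of the form "ver:bits:date:resource:ext:rand:counter"; the sender increments the counter until the SHA-1 of the whole stamp has the advertised number of leading zero bits, and the receiver checks it with one hash. A sketch of that check (TypeScript on Node; the format follows the published spec, the code itself is illustrative):

      import { createHash } from "node:crypto";

      // Verify a hashcash v1 stamp, e.g. "1:20:060408:example.com::a5f1...:0f3a".
      // One SHA-1 call checks what took the sender ~2^bits attempts to mint.
      function hashcashValid(stamp: string): boolean {
        const bits = parseInt(stamp.split(":")[1], 10);
        const digest = createHash("sha1").update(stamp).digest();
        let zeros = 0;
        for (const byte of digest) {
          if (byte === 0) { zeros += 8; continue; }
          zeros += Math.clz32(byte) - 24; // leading zero bits within this byte
          break;
        }
        return zeros >= bits;
      }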

    • by Brain-Fu ( 1274756 ) on Monday July 07, 2025 @10:41PM (#65504360) Homepage Journal

      So what? Apparently no one else thought to use this solution for this problem until Xe Iaso came along.

      I, for one, think this is awesome, and I am happy she made it. Maybe there will be an arms race. Well, in that case, I am glad she is on our side.

      My hat is off to you, Xe Iaso.

      • by keyvin ( 2788865 )
        They are not the first. Just the first to get coverage. It's always who you know - never forget that.
      • Re: (Score:1, Funny)

        by Anonymous Coward

        Whilst it sticks one in the eye of the AI scrapers, it also means user browsers need to consume a significant amount of energy. Not only is that not great if you're on batteries, but it's also an environmental concern.

        For you or me with our 5 monthly visitors, that's not really a big deal. But for a big site with millions of humans viewing it, that's a significant environmental cost right there - and it's a cost devolved to those users, rather than to the website owner.

        In some sense, if they were actually mining bitcoins or something, then at least the environmental cost would have some sort of useful return.

        • if they were actually mining bitcoins or something, then at least the environmental cost would have some sort of useful return.

          Oh man, why didn't that post get +5 Funny?

      • Apparently no one else thought to use this solution for this problem until Xe Iaso came along.

        I seem to remember a service called Coinhive that offered a script to make the viewer's device mine the cryptocurrency Monero in the background. I forget if it had an option to hide the article until a particular amount was mined. (Coinhive shut down when too many intruders started installing its script on other people's websites.)

  • If we had this in 1997, maybe we could have prevented the Google bot and its ilk from crawling the web for their "search engines".
  • by Anonymous Coward

    bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge

    Not so much if the scraping can be done in a distributed fashion. By "infecting" a large group of systems, one can distribute the computational load of any "proof of work" verification, in much the same way machines are recruited by "evil" JavaScript to do Bitcoin mining, for example.

    also thinking about a version that doesn't require JavaScript

    Good. Because the root of many attacks is running JavaScript.

    • Not so much if the scraping can be done in a distributed fashion...

      Exactly. Now that this has been widely publicized, it will continue to work for another 2-3 months, tops, and then the bot swarm managers will simply escalate.

      • by Anonymous Coward

        okay but any source of compute would work, it's not like crypto mining is being done exclusively with illicit zombies

        yeah they could "just" 2x or 10x or 100x the number of boxes they're throwing at the scrape project, comp for whatever speedbump anubis is supposed to add, but that comes at a cost whether you're running your own or renting zombies

        sounds to me like this takes a concerted effort though (like how screeching about boycotts means fuckall without massive buy in) scrapers don't care if one website

  • by vbdasc ( 146051 ) on Monday July 07, 2025 @11:55PM (#65504438)

    Bye-bye Wayback Machine... It was an honor knowing you.

    • Bye-bye Wayback Machine... It was an honor knowing you.

      This could probably be fixed. It's early and I haven't had my coffee yet, but perhaps "legit" sites like IA could have client certs whitelisted by the anti-botware (rough sketch below)...?

      Unfortunately I think this is necessary enshittification, in response to the reckless botmasters doing stupid shit like scraping the same content 10x per second and slashdotting small servers, running up hosting costs, etc. As another poster said, though, it seems it's just a matter of time before the scrapers figure out how to outsource the
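
      The client-cert idea might look roughly like this: a hypothetical Node front door in front of the real site, not an existing Anubis feature. The paths and the fingerprint are placeholders.

        import { readFileSync } from "node:fs";
        import { createServer } from "node:https";
        import type { TLSSocket } from "node:tls";

        // Hypothetical allowlist of client-certificate fingerprints that
        // "legit" crawlers like the Internet Archive could publish.
        const ALLOWED = new Set(["AA:BB:CC:..."]); // placeholder fingerprint

        const server = createServer(
          {
            key: readFileSync("server.key"),  // placeholder cert material
            cert: readFileSync("server.crt"),
            requestCert: true,                // ask the client for a certificate
            rejectUnauthorized: false,        // strangers still get the challenge
          },
          (req, res) => {
            const peer = (req.socket as TLSSocket).getPeerCertificate();
            if (peer.fingerprint256 && ALLOWED.has(peer.fingerprint256)) {
              res.end("trusted crawler: proof-of-work skipped");
            } else {
              res.end("unknown client: serve the proof-of-work challenge");
            }
          }
        );
        server.listen(8443);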

  • by simlox ( 6576120 ) on Tuesday July 08, 2025 @01:38AM (#65504534)
    Couldn't they do a useful calculation?
    • No, that would be against the ethos set out by bitcoin and all that "crypto" crap. Doing something useless just feels subtle and elite, in a way we poor fucks cannot understand.

      If whoever the fuck designed that had an eye for actual engineering, there are a thousand better ways to distribute a ledger (in a way that cannot be subverted by someone NOT controlling the entire internet) than doing pointless calculations.

      BTW, surely adding javascript to a webcrawler is hard to impossible, and wasting some cycles (in a

    • You need a problem which is difficult to compute, with predictable difficulty, where the answer can be quickly verified. Few real-world problems meet all of those criteria. Most scientific calculations have to be completely re-run to verify the result, which would make it easy to DoS the server by submitting bogus answers.
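
      That asymmetry is easy to show in code. Assuming the leading-zero-hex-digits scheme sketched earlier in the summary (an illustration, not Anubis's actual interface), the server-side check is a single hash call:

        import { createHash } from "node:crypto";

        // One SHA-256 call verifies what took the client an average of
        // 16^difficulty attempts to find. The names here are illustrative.
        function verify(challenge: string, nonce: number, difficulty: number): boolean {
          const hex = createHash("sha256").update(challenge + nonce).digest("hex");
          return hex.startsWith("0".repeat(difficulty));
        }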

  • The Github (Score:5, Informative)

    by bill_mcgonigle ( 4333 ) * on Tuesday July 08, 2025 @05:26AM (#65504726) Homepage Journal

    What you were looking for instead of a paywall:

    https://github.com/TecharoHQ/a... [github.com]

    404 Media is misleading - it should be called 402 Media.

  • ...but we probably need a Beware of Dog sign on the fence.

    Anubis is a brilliant response to the rising tide of AI-powered crawlers chewing through the small web like termites through a paperback. It's basically what robots.txt always wanted to be when it grew up—a gatekeeper that actually enforces the rules.

    When a browser hits a site protected by Anubis (I love the reference -- what is the weight of a bot scraper's soul, indeed?), it’s handed a lightweight JavaScript proof-of-work challenge—solve this trivial SHA-256 puzzle before proceeding. It’s transparent to the average user, introduces no visible friction, and thwarts most scraping bots that don’t want to spend CPU cycles for every page request. There’s no crypto mining, no wallet enrichment, no WASM blobs firing up your GPU. Just a small, ephemeral hash puzzle. In terms of defense, it’s elegant, open-source, and way less annoying than CAPTCHA hell.

    But here’s the catch—and where we need to tread carefully: this defense mechanism is invisible. Most users won’t know their machine is doing extra work unless they’re monitoring CPU spikes or poking around in dev tools. You and I may keep a wary eye on about:processes or chrome://performance, but most users don't. The impact is minimal, sure—but the principle of transparency still matters. While Anubis' current stealth is likely an intentional design choice to avoid tipping off bot developers, the lack of consent sets a tricky precedent.

    We're asking users to donate a sliver of compute power as proof of humanity—and most don't even know the request is being made. That might be fine today, with a good-faith actor at the helm. But it sets a precedent: client-side compute as silent gatekeeping. Without some basic transparency, that opens the door for less ethical implementations— aggressive fingerprinting scripts, or bot deterrents with more teeth than sense.

    So, how can we improve this? Anubis is a fantastic tool, but I think we can strengthen it by baking in the principle of informed consent. The goal should be to make the challenge inspectable for those who care, without adding friction for those who don't.

    How about an HTTP header? Anubis could send a simple, standardized header (e.g., X-Anubis-Challenge: active). This is invisible to the average user but allows browsers and extensions to detect the proof-of-work. A user could then install an extension that adds a small icon to the address bar, much like extensions do for password managers or ad blocking. This empowers the user to see what's happening and trust the process without interrupting it. (A rough sketch of the extension side follows these options.)

    Or an opt-in badge? For site owners who prioritize transparency, Anubis could offer an optional, self-hosting badge or banner that discloses the use of a proof-of-work system, linking to a page that explains why it's necessary.

    Or even a console message? The easiest, though least impactful, option is a simple console log message. It's a clear signal to developers (but also to bot makers, so yeah, a double-edged sword, at best).
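
    The header idea in particular is cheap to prototype. Here's a sketch of the extension side, assuming the hypothetical X-Anubis-Challenge header proposed above (Manifest V3, webRequest plus host permissions; nothing here is a header Anubis actually sends today):

      declare const chrome: any; // provided by the browser at runtime

      // Background service worker for a hypothetical consent-surfacing extension.
      chrome.webRequest.onHeadersReceived.addListener(
        (details: any) => {
          const flagged = details.responseHeaders?.some(
            (h: any) => h.name.toLowerCase() === "x-anubis-challenge"
          );
          if (flagged && details.tabId >= 0) {
            // Show a small badge so the user can see proof-of-work is running.
            chrome.action.setBadgeText({ tabId: details.tabId, text: "PoW" });
          }
        },
        { urls: ["<all_urls>"], types: ["main_frame"] },
        ["responseHeaders"]
      );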

    Anubis gives the small web a fighting chance in the bot-scraper arms race. By embracing a standard for inspectability, it can not only win the technical battle but also set a healthy precedent for the future of the web. Let's normalize silent client-side work only when we also normalize consent and transparency.

    • it’s handed a lightweight JavaScript proof-of-work challenge—solve this trivial SHA-256 puzzle before proceeding. [...] There’s no crypto mining, no wallet enrichment

      Yet. Because Anubis is free software, and because its proof of work uses SHA-256, the same hash function as Bitcoin's, someone could modify Anubis to tie the puzzle to the Bitcoin block that a mining pool is working on.

      no WASM blobs firing up your GPU

      Until someone writes a browser extension to offload solving the hashcash to WebGPU.

      Most users won’t know their machine is doing extra work unless they’re monitoring CPU spikes or poking around in dev tools.

      Laptops tend to have an always-on CPU spike monitor: the exhaust fan. So do phones and tablets: they get warm. So do older, less expensive, or small-form-factor desktop computers: they ge

      • Your post is exactly the kind of slashvertisement that doesn't deserve reading. It’s a thread hijack—pure and simple—to run up the install counter on a half-baked browser extension (and yes, I checked the GitHub page: it’s crap).

        If you had something meaningful to say about Anubis, protocol-level consent, or invisible compute boundaries, you could’ve engaged with any of that. Instead, you offered a sales pitch wrapped in a concern-trolling sandwich. GFY.

        • I happened to be aware of the existence of an extension made by someone else that offers domain-level opt-in consent to run script in a particular web browser. I cited the extension's title and author and deliberately left out any URL. I thought that would have been adequate to imply lack of conflict of interest. A user has implied to me that it is not. What means of citing a source would have been adequate?

  • How much does this imply it would cost to scrape the web indiscriminately?
    If it's only a few billion, it may be within reach of large AI companies, granting them an effectively exclusive privilege to scrape the web.

    What about scraping a subset of the internet, such as a single website or a small group of sites?
    Would this still be cost prohibitive?
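
    On the first question, here's a back-of-envelope with loudly assumed inputs (note an Anubis-style challenge is typically paid once per site and then amortized via a cookie, not once per page, so this is a rough one-pass figure, not a law):

      // All inputs are assumptions for illustration, not measurements.
      const solves = 1e8;             // assume ~100M protected sites/challenges
      const cpuSecondsPerSolve = 2;   // assume ~2 s of CPU per challenge
      const dollarsPerCpuHour = 0.05; // rough cloud vCPU price
      const cost = (solves * cpuSecondsPerSolve / 3600) * dollarsPerCpuHour;
      console.log(`~$${Math.round(cost).toLocaleString()}`); // ~$2,778

    Under these made-up numbers a one-time pass is cheap for a large company; the deterrent bites hardest on scrapers that re-request the same pages millions of times, which is the behavior other commenters describe.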

  • The workaround for this will be live in 5... 4... 3... 2...
  • Aren't they getting shut out?

    • by tepples ( 727027 )

      Most User-Agent strings that don't contain "bot" or "Mozilla/" were in Anubis's allowlist last I checked.
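
      Restated as a rule, that observation would look something like this (an assumption about the defaults based on the parent's report, not Anubis's actual policy code):

        // Challenge anything claiming to be a browser ("Mozilla/") or a bot;
        // let everything else through. Assumed default, per the parent post.
        function defaultAction(userAgent: string): "ALLOW" | "CHALLENGE" {
          return /bot|Mozilla\//i.test(userAgent) ? "CHALLENGE" : "ALLOW";
        }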

"They that can give up essential liberty to obtain a little temporary saftey deserve neither liberty not saftey." -- Benjamin Franklin, 1759

Working...