



The Open-Source Software Saving the Internet From AI Bot Scrapers (404media.co)
An anonymous reader quotes a report from 404 Media: For someone who says she is fighting AI bot scrapers just in her free time, Xe Iaso seems to be putting up an impressive fight. Since she launched it in January, Anubis, a program "designed to help protect the small internet from the endless storm of requests that flood in from AI companies," has been downloaded nearly 200,000 times, and is being used by notable organizations including GNOME, the popular open-source desktop environment for Linux; FFmpeg, the open-source software project for handling video and other media; and UNESCO, the United Nations organization for education, science, and culture. [...]
"Anubis is an uncaptcha," Iaso explains on her site. "It uses features of your browser to automate a lot of the work that a CAPTCHA would, and right now the main implementation is by having it run a bunch of cryptographic math with JavaScript to prove that you can run JavaScript in a way that can be validated on the server." Essentially, Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot. One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do. This check is invisible to the user, and most browsers since 2022 are able to complete this test. In theory, bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge. This way, Anubis creates a computational cost that is prohibitively expensive for AI scrapers that are hitting millions and millions of sites, but marginal for an individual user who is just using the internet like a human.
Anubis is free, open source, lightweight, can be self-hosted, and can be implemented almost anywhere. It also appears to be a pretty good solution for what we've repeatedly reported is a widespread problem across the internet, which helps explain its popularity. But Iaso is still putting a lot of work into improving it and adding features. She told me she's working on a non-cryptographic challenge that taxes users' CPUs less, and is also thinking about a version that doesn't require JavaScript, which some privacy-minded users disable in their browsers. The biggest challenge in developing Anubis, Iaso said, is finding the balance. "The balance between figuring out how to block things without people being blocked, without affecting too many people with false positives," she said. "And also making sure that the people running the bots can't figure out what pattern they're hitting, while also letting people that are caught in the web be able to figure out what pattern they're hitting, so that they can contact the organization and get help. So that's like, you know, the standard, impossible scenario."
"Anubis is an uncaptcha," Iaso explains on her site. "It uses features of your browser to automate a lot of the work that a CAPTCHA would, and right now the main implementation is by having it run a bunch of cryptographic math with JavaScript to prove that you can run JavaScript in a way that can be validated on the server." Essentially, Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot. One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do. This check is invisible to the user, and most browsers since 2022 are able to complete this test. In theory, bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge. This way, Anubis creates a computational cost that is prohibitively expensive for AI scrapers that are hitting millions and millions of sites, but marginal for an individual user who is just using the internet like a human.
Anubis is free, open source, lightweight, can be self-hosted, and can be implemented almost anywhere. It also appears to be a pretty good solution for what we've repeatedly reported is a widespread problem across the internet, which helps explain its popularity. But Iaso is still putting a lot of work into improving it and adding features. She told me she's working on a non cryptographic challenge so it taxes users' CPUs less, and also thinking about a version that doesn't require JavaScript, which some privacy-minded disable in their browsers. The biggest challenge in developing Anubis, Iaso said, is finding the balance. "The balance between figuring out how to block things without people being blocked, without affecting too many people with false positives," she said. "And also making sure that the people running the bots can't figure out what pattern they're hitting, while also letting people that are caught in the web be able to figure out what pattern they're hitting, so that they can contact the organization and get help. So that's like, you know, the standard, impossible scenario."
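What the summary describes is a hashcash-style proof of work: the visitor's browser grinds through nonces until a hash of a server-issued challenge meets a difficulty target, and the server can then confirm the answer with a single hash. A minimal sketch of the solving side in Go (Anubis itself is a Go program, though its real challenge runs as JavaScript in the browser; the challenge string and difficulty below are invented for illustration):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a SHA-256 digest.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b != 0 {
			return n + bits.LeadingZeros8(b)
		}
		n += 8
	}
	return n
}

// solve grinds nonces until sha256(challenge || nonce) has at least
// `difficulty` leading zero bits. Expected work is ~2^difficulty hashes,
// so each extra bit of difficulty doubles the client's cost.
func solve(challenge []byte, difficulty int) uint64 {
	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
		if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
			return nonce
		}
	}
}

func main() {
	// Invented values: a real deployment would issue a unique challenge
	// per visitor and tune difficulty to stay imperceptible for humans.
	nonce := solve([]byte("example-challenge"), 16) // ~65k hashes on average
	fmt.Println("found nonce:", nonce)
}
```

At this kind of difficulty the work is milliseconds for one human page view, but multiplied across millions of fetches it becomes a real cost for a scraping operation, which is the asymmetry the article is describing.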
Everything old is new again (Score:5, Informative)
Hashcash was thought up back in 1997 for combating spam.
Re:Everything old is new again (Score:5, Insightful)
So what? Apparently no one else thought to use this solution for this problem until Xe Iaso came along.
I, for one, think this is awesome, and I am happy she made it. Maybe there will be an arms race. Well, in that case, I am glad she is on our side.
My hat is off to you, Xe Iaso.
Re: (Score:1)
Re: (Score:2)
Voldemort is a software developer?
Re: (Score:2)
if they were actually mining bitcoins or something, then at least the environmental cost would have some sort of useful return.
Oh man, why didn't that post get +5 Funny?
If only we had this in 1997 (Score:1)
Re:The internet is officially dead to me now (Score:4, Interesting)
The summary went on to say that the developer is working on a new mechanism that doesn't use JavaScript to help such users. Work in progress.
Re: (Score:1)
The internet will be fine, maybe the web will be dead for you but we won't miss you AC.
Distributed bots (Score:1)
bot scrapers could pretend to be users with browsers as well, but the additional computational cost of doing so on the scale of scraping the entire internet would be huge
Not so much if the scraping can be done in a distributed fashion. By "infecting" a large group of systems, one can spread the computational load of solving any "proof of work" challenge, in much the same way machines are recruited by "evil" JavaScript to do some bitcoin mining, for example.
also thinking about a version that doesn't require JavaScript
Good. Because the root of many attacks is running JavaScript.
Re: (Score:1)
Not so much if the scraping can be done in a distributed fashion...
Exactly. Now that this has been widely publicized, it will continue to work for another 2-3 months, tops, and then the bot swarm managers will simply escalate.
Re: (Score:1)
okay but any source of compute would work, it's not like crypto mining is being done exclusively with illicit zombies
yeah they could "just" 2x or 10x or 100x the number of boxes they're throwing at the scrape project, comp for whatever speedbump anubis is supposed to add, but that comes at a cost whether you're running your own or renting zombies
sounds to me like this takes a concerted effort though (like how screeching about boycotts means fuckall without massive buy in) scrapers don't care if one website
Re: (Score:3)
If some site was running a JS bitcoin miner in my browser without my knowledge or consent I would be pretty angry about that.
How do we know you are even human? Please select all the pictures with bicycles.
Re: (Score:3)
Re: (Score:2)
If some site was running a JS bitcoin miner in my browser without my knowledge or consent I would be pretty angry about that.
Yeah, I block all javascript by default for exactly this reason; it's a major malware exposure. I have a very short whitelist of sites that I'm willing to trust with that. I've already encountered Anubis a few times and, when I do, I just close the tab and move on. I'm not enabling javascript for random sites.
More Web enshittification (Score:3)
Bye-bye Wayback Machine... It was an honor knowing you.
Re: (Score:2)
Bye-bye Wayback Machine... It was an honor knowing you.
This could probably be fixed; it's early and I haven't had my coffee yet, but perhaps "legit" sites like IA could have client certs whitelisted by the anti-botware (rough sketch below)...?
Unfortunately I think this is necessary enshittification, in response to the reckless botmasters doing stupid shit like scraping the same content 10x per second and slashdotting small servers, running up hosting costs, etc. As another poster said, though, it seems it's just a matter of time before the scrapers figure out how to outsource the
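On the client-cert idea above: here is one hedged way it could look, a minimal Go sketch where crawlers presenting a certificate signed by a hypothetical trusted CA skip the challenge while everyone else gets the proof-of-work page. None of this is Anubis's actual mechanism; the file names and handler logic are made up for illustration.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func serveContent(w http.ResponseWriter, r *http.Request)   { w.Write([]byte("content\n")) }
func serveChallenge(w http.ResponseWriter, r *http.Request) { w.Write([]byte("challenge\n")) }

func main() {
	// Hypothetical CA bundle covering known-good crawlers (e.g. IA).
	caPEM, err := os.ReadFile("trusted-crawlers-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs: pool,
			// Ask for a cert but don't require one: ordinary visitors
			// still get the JS challenge, certified crawlers skip it.
			ClientAuth: tls.VerifyClientCertIfGiven,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
				serveContent(w, r) // verified crawler: no challenge
				return
			}
			serveChallenge(w, r) // everyone else: proof of work first
		}),
	}
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```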
Add to the power cost of AI (Score:3)
Re: (Score:2)
No, that would be against the ethos set out by bitcoin and all that "crypto" crap. Doing something useless apparently feels subtle and elite, in a way we poor fucks cannot understand.
If whoever the fuck designed that had an eye for actual engineering, there are a thousand better ways to distribute a ledger (in a way that cannot be subverted by someone NOT controlling the entire internet) than doing pointless calculations.
BTW, surely adding javascript to a webcrawler is hard to impossible, and wasting some cycles (in a
Re: (Score:2)
You need a problem which is difficult to compute, with predictable difficulty, where the answer can be quickly verified. Few real-world problems meet all of those criteria. Most scientific calculations have to be completely re-run to verify the result, which would make it easy to DoS the server by submitting bogus answers.
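Hash-preimage puzzles are one of the few things that do tick all three boxes, which is presumably why hashcash keeps getting reinvented: the server re-checks a claimed answer with one hash call, so bogus submissions are nearly free to reject. A self-contained sketch of the verification side (illustrative names, not Anubis's actual API; companion to the solver sketched under the summary above):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a SHA-256 digest.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b != 0 {
			return n + bits.LeadingZeros8(b)
		}
		n += 8
	}
	return n
}

// verify re-checks a claimed nonce with a single hash: O(1) for the
// server versus ~2^difficulty expected hashes for whoever solved it.
// That asymmetry is what keeps bogus answers from becoming a DoS vector.
func verify(challenge []byte, nonce uint64, difficulty int) bool {
	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)
	binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
	return leadingZeroBits(sha256.Sum256(buf)) >= difficulty
}

func main() {
	// A bogus answer costs the server exactly one SHA-256 to throw away.
	fmt.Println(verify([]byte("example-challenge"), 12345, 16))
}
```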
The Github (Score:5, Informative)
What you were looking for instead of a paywall:
https://github.com/TecharoHQ/a... [github.com]
404 Media is misleading - it should be called 402 Media.
Anubis: A Robots.txt With Teeth... (Score:2)
...but we probably need a Beware of Dog sign on the fence.
Anubis is a brilliant response to the rising tide of AI-powered crawlers chewing through the small web like termites through a paperback. It's basically what robots.txt always wanted to be when it grew up—a gatekeeper that actually enforces the rules.
When a browser hits a site protected by Anubis (I love the reference -- what is the weight of a bot scraper's soul, indeed?) it’s handed a lightweight JavaScript proof-of-work challenge—s
Cost prohibitive? (Score:1)
How much does this imply it would cost to scrape the web indiscriminately?
If it's only a few billion it may be within reach of large AI companies, granting them the exclusive privilege to web scraping.
What about scraping a subset of the internet, such as a single website or a small group of sites?
Would this still be cost prohibitive?
Wait for it (Score:2)