Wikimedia Drowning in AI Bot Traffic as Crawlers Consume 65% of Resources

Web crawlers collecting training data for AI models are overwhelming Wikipedia's infrastructure, with bot traffic growing exponentially since early 2024, the Wikimedia Foundation says. Data released April 1 shows bandwidth used for multimedia content has surged 50% since January, driven primarily by automated programs scraping Wikimedia Commons' 144 million openly licensed media files.

This unprecedented traffic is causing operational challenges for the non-profit. When Jimmy Carter died in December 2024, his Wikipedia page received 2.8 million views in a day, while a 1.5-hour video of his 1980 presidential debate caused network traffic to double, resulting in slow page loads for some users.

Analysis shows 65% of the foundation's most resource-intensive traffic comes from bots, despite bots accounting for only 35% of total pageviews. The foundation's Site Reliability team now routinely blocks overwhelming crawler traffic to prevent service disruptions. "Our content is free, our infrastructure is not," the foundation said, announcing plans to establish sustainable boundaries for automated content consumption.
Comments:
  • by rsilvergun ( 571051 ) on Friday April 04, 2025 @09:06PM (#65282733)
    I think AI is the worst thing the human race has ever come up with. It devours everything and it spits out garbage, with the idea being that we're all going to be forced to use it. I'm sure the technology has its uses in scientific fields, but for the general consumer there is nothing but downsides. It's the end stage of the enshittification of the internet, maybe even of the whole kit and caboodle of our civilization.

    And we can do absolutely nothing to stop it, because you're not allowed to question the unending growth of corporate profits. I mean, you are, just not in any way that would be effective at curtailing them. You can do all the stories you want about evil corporations, but if you ever actually get anywhere or get any traction on reining them in, you're going to find yourself in a bad way.
    • by DrXym ( 126579 )

      AI will get worse as more content becomes AI. Imagine an AI ingesting today's news from websites which are using AI to generate news.

      • It's not just news: Wikipedia isn't very good as a source on anything slightly controversial. The articles are written by a single person or a small group, and they fail to write without bias. Personal opinions are injected everywhere, which, according to Wikipedia, is fine as long as you can find some source that says it. If you don't believe me, read an article on something you know a lot about and see how Wikipedia butchers it.
        • It's not just news: Wikipedia isn't very good as a source on anything slightly controversial. The articles are written by a single person or a small group, and they fail to write without bias. Personal opinions are injected everywhere, which, according to Wikipedia, is fine as long as you can find some source that says it. If you don't believe me, read an article on something you know a lot about and see how Wikipedia butchers it.

          If you find such a butchered article I recommend you edit it to make it better.

        • Wikipedia isn't very good as a source on anything slightly controversial.

          I find Wikipedia to be one of the best sources of information on controversial topics ... at least in a relative sense. Good sources for controversial topics are hard to find anywhere. Wikipedia articles tend to be more comprehensive and broad across different viewpoints, relatively speaking.

  • by dohzer ( 867770 ) on Friday April 04, 2025 @09:18PM (#65282747)

    Great. I'm about to get 400 extra popups and emails from Jimmy Wales asking me to donate to Wikipedia now, aren't I?

  • by Big Hairy Gorilla ( 9839972 ) on Friday April 04, 2025 @09:36PM (#65282779)
    To all you chaps who argue there is no difference between you reading a book and OpenAI scraping everything:

    The speed and scale at which AI information harvesting is done is the difference between you reading a book and applying the knowledge, and AI renting that knowledge back to you for profit. Wikipedia subsidizes info harvesting, so they will have to close the door or go out of biz.
    • Yes, yet another example of the "free market" and its capability of "self-regulating".

      The invisible hand showing you the invisible finger before shoving it up your ass.

      Paid for by your tax money.

    • Yes, so they have automated something. At one point a man walked up with a screwdriver to screw in a car wheel; now a machine does it far faster, but fundamentally it's the same procedure.

      Man has managed to automate something yet again, that he may sit on his arse one extra hour per day and work less, for he enjoys sitting on his arse more than working in dangerous factories. I can't blame him.

      • by martin-boundary ( 547041 ) on Saturday April 05, 2025 @03:39AM (#65283055)
        For thousands of years, man took a small boat and went to fish in the ocean to feed his family. Now mega trawlers rake the ocean floor with nets that catch everything swimming for miles around the ship.

        For thousands of years, fish populations have existed and been caught by humans. Now, fish populations are going extinct because the trawlers are fishing faster than humans ever did.

        Speed has consequences. Don't shit in the kitchen. If you do, you'll be blamed.

        • Very good analogy.

          Your industrial trawling analogy is easy to understand and an apt description of what is going on. And probably an apt description of what will happen. Servers cost money. At the point where 80% of your traffic is industrial trawlers, it's obviously time to shut the door on free use. It's fair to say that is not the intended audience or use case. The scale of harvesting is killing the resource.
      • screwdriver to screw in a car wheel

        You don't know something as simple as how car wheels are attached to cars. Why should anyone trust your blithering about AI?

    • For sure, nobody has ever had to deal with a double digit percentage increase of web traffic before.

      • by jsonn ( 792303 )
        The current AI crawlers are indistinguishable from a DDOS. That's very different from natural human traffic.
        • Your UID is low enough that you should remember something called the Slashdot effect.
          • by jsonn ( 792303 )
            Slashdot during its glory days had something like 3-4 million regular visitors. Let's say those read one article over a time frame of 6 hours. That's about 140 requests per second. With a 200 Mbit/s connection, that still leaves roughly 200 KB per request. Both can be handled even from a semi-well-connected home server. Experience might be degraded, but for a well-written website it's manageable. The problem with AI crawlers is that they are written to be aggressive without any regard for the source system. Most of them ignore robots.txt entirely.
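
A quick back-of-envelope check of the arithmetic above, as a minimal Python sketch; the visitor count, reading window, and link speed are the commenter's assumed figures, not measurements:

```python
# Rough sanity check of the "Slashdot effect" numbers quoted above.
visitors = 3_000_000            # assumed regular visitors (commenter's figure)
window_s = 6 * 3600             # assumed reading window: 6 hours
link_bits_per_s = 200_000_000   # assumed uplink: 200 Mbit/s

requests_per_s = visitors / window_s                      # ~139 requests/s
bytes_per_request = (link_bits_per_s / 8) / requests_per_s

print(f"{requests_per_s:.0f} requests/s")
print(f"{bytes_per_request / 1000:.0f} KB available per request")   # ~180 KB
```
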
  • Sell hard drives full of the most-requested-by-bot traffic "at cost."

    I say "at cost" to avoid the possible scandal/volunteer-boycott of "Wikimedia making money off of content."

    Alternatively, set up a deal with a content-delivery network that would charge a fee to the actual bot-masters to cover its costs, with Wikimedia gaining nothing but a reduced traffic load in exchange.

    Commercial AI-bot-masters would very likely be willing to pay a reasonable fee to avoid the Wikimedia-imposed throttling and blocks.

    • It looks like the Foundation has this covered with their database-dump mechanism and mirrors run by outside volunteers.

    • If the content is in a database (MariaDB), it can't be an insane price to offer a replica database just for the AI robots, with a better-than-scraping API, a "when we get to it" SLA, and a minor port charge at the DC of Wikipedia's choice. I honestly think that outside of the hard sciences, Wikipedia data is not worth scraping. If, at the end of all this, AI outputs a list of editors that are themselves bots, that would be a shame.
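
For reference, the Foundation already publishes bulk database dumps, which is roughly what this sub-thread is asking for. A minimal sketch of fetching one (the URL follows the public dumps.wikimedia.org layout, and requests is a third-party library):

```python
import requests

# Stream an articles dump to disk instead of crawling pages one by one.
# URL pattern assumed from the public dumps.wikimedia.org layout.
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
            out.write(chunk)
```
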
  • by HotNeedleOfInquiry ( 598897 ) on Friday April 04, 2025 @10:04PM (#65282813)
    Why would you crawl the online copy of Wikipedia when you can download an image of it and crawl it locally?
    • For the same reason that makes you think brute force will help you build "AI".

    • by az-saguaro ( 1231754 ) on Saturday April 05, 2025 @03:55AM (#65283063)

      I was going to post an identical remark, then at the last minute, I saw yours.

      Exactly.
      Why not just download the whole site once, then analyze offline?

      About 2 weeks ago, this story was posted on Slashdot:
      Meta's Llama AI Models Hit 1 Billion Downloads, Zuckerberg Says
      https://tech.slashdot.org/stor... [slashdot.org]

      It didn't add up for me, but there was this reply by zurmikopa:
      https://tech.slashdot.org/comm... [slashdot.org]

      Many of the popular tools and workflows have a bad habit of downloading a model quite frequently, even if they have been downloaded before. This is especially the case when it is being distributed over several GPUs and each only downloads a portion of the model.
      This is combined with countless models uploaded to huggingface which are fine-tunes, quantizations, etc, that he probably counts in those totals.
      Some ML test workflows download the model each time a new commit is made to their frameworks.
      I would buy [believe] a billion downloads of all that put together, though nowhere near a billion unique downloaders.

      Makes you wonder if all the crawling and scraping is being done with any thought of efficiency, minimum redundancy and duplication, etc. These companies are spending giga-bucks and giga-watts, but maybe much of it is repetitive waste.

      Anyone have any thoughts?

      • by AmiMoJo ( 196126 )

        Because the people creating these AIs are morons who simply have their bots follow every link blindly. Suck it all up, no filtering for quality.

      • They don't just want a single snapshot. Wikipedia is a constantly-changing site. They want continuous updates of everything. Doing that in bulk every day wouldn't really change the load much, might even make it worse.

    • by jsonn ( 792303 )
      Because those AI crawlers are written by asocial assholes and bandwidth is cheaper than CPU time.
      • by clovis ( 4684 )

        Because those AI crawlers are written by asocial assholes and bandwidth is cheaper than CPU time.

        It's their mantra: "Shittify things and move fast"

    • How would that help anything? These crawlers aren't just getting a snapshot of what's on Wikipedia and calling it a day. They continuously crawl for updated pages. If they instead downloaded the image, they would re-download the image over and over, causing the exact same strain on the infrastructure.

      • I'd argue that if the data is used for AI learning, there is no need for it to be more than a couple of weeks old. I'd also argue that, given the general shittification of Wikipedia, images 10 years old would be more useful than the current data. Alternately, with some programming work, the crawler could look at the Wikipedia page change log and only download pages changed since the last crawl (see the sketch below); this would work in real time or when updating locally from a new image. But then what do I know. I'm just a cynical old fart.
        • Many of these crawlers were hastily thrown together by startup companies with one or two programmers, hoping to cash in quickly on the AI hype train. They aren't putting *nearly* that much thought and care into their work. They'll fix the inefficiencies later...in the sweet by and by when they have time because they are all millionaires.
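
The change-log approach suggested above maps onto the public MediaWiki Action API. A minimal sketch, assuming the standard recent-changes endpoint and parameters (requests is a third-party library, and the timestamp, single-page query, and User-Agent string are illustrative choices):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "incremental-fetch-sketch/0.1 (contact: example@example.org)"}

def changed_titles(since_iso: str) -> set[str]:
    """Return titles edited since the given ISO timestamp (first API page only, for brevity)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcend": since_iso,          # list changes from now back to this timestamp
        "rcprop": "title|timestamp",
        "rclimit": "500",
        "format": "json",
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return {rc["title"] for rc in resp.json()["query"]["recentchanges"]}

# Re-fetch only pages that actually changed since the last pass,
# instead of hammering every article again.
titles = changed_titles("2025-04-04T00:00:00Z")
print(f"{len(titles)} pages changed; fetch just these.")
```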

  • This makes me wonder how big the revolution will be once they find some way to perform similar training with far less input. Some kind of new revolutionary model that can achieve the same while reading far less. It has to be doable, because right now artificial neural networks need vastly more input to achieve far weaker reasoning skills than humans get from their inputs, so maybe it's possible; though to be fair, humans also need far more time to process their input to make it useful.

    Would be interesting.

  • Why crawl? I have a local copy of Wikipedia running in Kiwix [kiwix.org]. There must be torrents out there with the same export, though the stupid AI people would probably not seed.
    • by Samare ( 2779329 )

      That's all well and good for the text content. But the multimedia content is much bigger: about 600 TiB as of right now. https://commons.wikimedia.org/... [wikimedia.org]
      The problem is that each bot has to download everything.

      • If only we had protocols for bulk downloads that scale according to the content's popularity...

      • You should do more/better research. The 105GB download includes media, albeit at reduced resolution. Text-only is a smaller file.

        But yeah: torrent/mirror/shipping SSDs would be massively more efficient for everyone involved.
  • by PPH ( 736903 ) on Friday April 04, 2025 @11:44PM (#65282901)

    They could just ... learn a lesson from the time Wikipedia was being vandalized. Direct all detected AI bots to the article about chickens [theregister.com].

    • by Frobnicator ( 565869 ) on Saturday April 05, 2025 @12:45AM (#65282937) Journal
      Yup. They are overdue for poisoning bot requests. Block the hosting domains, return errors and black holes, and feed them the same message every time about how they can get a copy of the databases at cost. This is not a new problem; companies have detected and killed bot traffic for decades now.
      • Don't block; that just makes the bot herders come back from a new set of IPs with a slightly different method to ward off detection. And instead of a single article, give them the endless content they so crave, be it AI-generated to match current fads or dissociated-press text generated the way we used to do it in the previous millennium.

        • Ha, endless random made-up content, I like it. :) I was going to suggest that when they detect a bot they just start feeding it from /dev/zero until the bot breaks the connection... but I like your idea better, as it sort of poisons the bot.
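
For anyone who hasn't seen it, the "dissociated press" trick mentioned above is essentially a word-level Markov chain. A minimal sketch; the corpus path is hypothetical and the chain order and output length are arbitrary choices:

```python
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    """Map each `order`-word prefix to the words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain: dict, length: int = 50) -> str:
    """Generate plausible-looking nonsense by walking the chain."""
    prefix = random.choice(list(chain))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

seed = open("some_corpus.txt").read()   # any text corpus; path is hypothetical
print(babble(build_chain(seed)))
```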

  • Not only do they have to deal with people editing their pages, but now bots are reading their beloved pages as well.

  • Perhaps someone should tell those crawler builders that they can just download the whole of Wikipedia as a zip file?

  • Wow, if only there were a torrent, an RSS feed of updates, or something. Maybe donations would help? This is not targeted at anyone capable of knowing anything about the modern web. So many solutions.

    What about actual websites that don't have the resources to manage this problem? What about the fact that this is just democratization of bots and special interests modifying Wikipedia?

    I could go on, but I'm having trouble caring.

  • Redirecting to a torrent wouldn't work, but there must be an intelligent way to change incentives and redistribute the load?

  • I mean, just block those IP addresses temporarily and serve them plausible garbage. It should be feasible to detect them.

  • by DrXym ( 126579 ) on Saturday April 05, 2025 @06:36AM (#65283165)

    If a particular IP range is abusing the server, then apply filters that degrade the image, e.g. scaling, skewing, saturating, applying noise/lines, or eroding or smearing chunks of the image. Or serve the wrong image entirely, or the image blended with another image. Wikimedia could have an in-house AI pretrained to mangle images in the way that is most effective at poisoning other AIs.

    Also worth pursuing legal options, like stating that AI scrapers who access the data are bound by ethical clauses, and throwing in some booby-trap articles & images which prove whether an AI has ingested the content.

    • by jsonn ( 792303 )
      Sadly, current AI crawlers behave more or less like a DDOS. You can start by blocking all access from AWS and China, but even that only gets rid of some of the worst offenders.
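
A minimal sketch of the image-degradation idea from the comment above, using Pillow; the specific filters and parameters are arbitrary choices, and deciding which requests deserve a degraded image is out of scope here:

```python
from PIL import Image, ImageFilter

def degrade(path: str, out_path: str) -> None:
    """Blur, downscale, and add noise so the file is near-useless as training data
    while still superficially looking like a normal image response."""
    img = Image.open(path).convert("RGB")
    w, h = img.size

    # Throw away detail: downscale hard, then upscale back.
    img = img.resize((max(1, w // 8), max(1, h // 8))).resize((w, h))

    # Smear what's left.
    img = img.filter(ImageFilter.GaussianBlur(radius=4))

    # Blend in Gaussian noise.
    noise = Image.effect_noise((w, h), sigma=64).convert("RGB")
    img = Image.blend(img, noise, alpha=0.3)

    img.save(out_path, quality=30)

degrade("original.jpg", "poisoned.jpg")   # file names are hypothetical
```
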
  • by marcobat ( 1178909 ) on Saturday April 05, 2025 @10:46AM (#65283367)

    This is a real issue. I manage about 100 websites with a lot of well-written (not by me, so it actually is well written :) interdisciplinary content for a public university.

    We are having the very same issue, and we basically have no budget to improve our infrastructure (which was subpar to begin with). We get waves of what looks like a denial-of-service attack, but is actually a concerted scraping of our sites from entire subnets, sometimes located in the US, sometimes in some random country in the world. I have changed the robots.txt file to allow only search indexing, but it is completely ignored, so I'm constantly adding IP subnets to the block list of my firewall so that the actual people who would like to read the content of our sites are able to reach them.

    I feel the pain that Wikimedia must feel.
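
For admins in the same situation, a minimal sketch of the triage described above: tallying requests per /24 from an access log to pick subnets for the firewall block list. The log path is a placeholder, and the log format is assumed to start with the client IP, as in common nginx/Apache configurations:

```python
from collections import Counter
from ipaddress import ip_address, ip_network

LOG_PATH = "/var/log/nginx/access.log"   # placeholder path

counts = Counter()
with open(LOG_PATH) as fh:
    for line in fh:
        parts = line.split(maxsplit=1)
        if not parts:
            continue
        try:
            addr = ip_address(parts[0])
        except ValueError:
            continue                      # line doesn't start with an IP
        if addr.version == 4:
            counts[ip_network(f"{addr}/24", strict=False)] += 1

# The heaviest /24s are candidates for the firewall block list.
for net, n in counts.most_common(10):
    print(f"{net}\t{n} requests")
```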

  • I think it's not a whole lot. It's still great that, amazingly, a third of Wikipedia's bandwidth is used by people who are actually reading articles.

  • Isn't the entire content (as of a particular time) available as a .tgz file? They should be downloading and reading that, rather than killing the servers with individual page requests.
