

Increased Traffic from Web-Scraping AI Bots is Hard to Monetize (yahoo.com) 52
"People are replacing Google search with artificial intelligence tools like ChatGPT," reports the Washington Post.
But that's just the first change, according to TollBit, a New York-based start-up that offers a free analytics product to watch for content-scraping AI companies and is devoted to "ensuring that these intelligent agents pay for the content they consume." Its data from 266 websites (half run by national or local news organizations) found that "traffic from retrieval bots grew 49% in the first quarter of 2025 from the fourth quarter of 2024," the Post reports. A spokesperson for OpenAI said that referral traffic to publishers from ChatGPT searches may be lower in quantity but that it reflects a stronger user intent compared with casual web browsing.
To capitalize on this shift, websites will need to reorient themselves to AI visitors rather than human ones [said TollBit CEO/co-founder Toshit Panigrahi]. But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb, especially as leading players make their newest AI visitors even harder to identify....
In the past eight months, as chatbots have evolved to incorporate features like web search and "reasoning" to answer more complex queries, traffic for retrieval bots has skyrocketed. It grew 2.5 times as fast as traffic for bots that scrape data for training between the fourth quarter of 2024 and the first quarter of 2025, according to TollBit's report. Panigrahi said TollBit's data may underestimate the magnitude of this change because it doesn't reflect bots that AI companies send out on behalf of AI "agents" that can complete tasks on a user's behalf, like ordering takeout from DoorDash. The start-up's findings also add a dimension to mounting evidence that the modern internet — optimized for Google search results and social media algorithms — will have to be restructured as the popularity of AI answers grows. "To think of it as, 'Well, I'm optimizing my search for humans' is missing out on a big opportunity," he said.
Installing TollBit's analytics platform is free for news publishers, and the company has more than 2,000 clients, many of which are struggling with these seismic changes, according to data in the report. Although news publishers and other websites can implement blockers to prevent various AI bots from scraping their content, TollBit found that more than 26 million AI scrapes bypassed those blockers in March alone. Some AI companies claim bots for AI agents don't need to follow bot instructions because they are acting on behalf of a user.
The Post also got this comment from the chief operating officer for the media company Time, which successfully negotiated content licensing deals with OpenAI and Perplexity.
"The vast majority of the AI bots out there absolutely are not sourcing the content through any kind of paid mechanism... There is a very, very long way to go."
How many websites are the AI spiders killing? (Score:2)
Kind of a new Slashdot effect? I think I'm actually seeing some evidence of higher than usual mortality among old websites and I've been wondering if the cause might be AI spiders seeking more training data. Latest victim might be Tripod? But that one was already a ghost zombie website...
Billionaire Bro Internet Apocalypse (Score:4, Interesting)
Each billionaire bro's revenue-eating business model threatens to consume everyone else's lunch and dinner, except that no one wants to be the one cooking the food for anyone.
LLM scraping steals content from everyone else and refuses to pay any revenue for its use, but that will deny the LLMs any new data in the future, leading to the collapse of their models as well.
Re: Billionaire Bro Internet Apocalypse (Score:3)
Re: (Score:2)
Whoever gets in at the ground floor (Score:3)
Then the only ones who will have any access to training data will be platform holders who have the ability to distinguish between real people and AI slop, using the various tricks already used to detect bots.
AI becomes capital that is owned only by a handful of billionaires; they can do what they want with it, and we just have to suck it all down, because the alternative is drastic changes to our society tha
Re: (Score:1)
Re: (Score:2)
One of the newer capabilities of AI is shopping - just saying
Re: (Score:2)
I look at this from a regular information consumer point of view.
Option A: Use a regular search engine, look up the information, comb through umpteen websites, endure their ad-infested articles, sigh through bad writing, repeated paragraphs, needlessly complicated introductions, SEO-optimized writing styles, also avoid AI-generated articles, stumble upon paywalls...
Option B: Fire up my favorite LLM portal flavor, ask the damn question, get the information, maybe iterate upon it a couple times, then access t
Two things (Score:2)
(1) Congress should pass a law requiring that bots not misidentify themselves in the User-Agent string AND that they honor robots.txt. Then these obnoxious, ill-behaved AI bots could be blocked (a sketch of such a block follows below).
(2) If you have actually valuable content, you should put it behind a paywall, as most mainstream news sites already do. Making your pages static HTML, cacheable, or at least really easy to generate would help reduce the load too.
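For illustration, here is a minimal sketch of that kind of block as WSGI middleware. It assumes bots identify themselves honestly in the User-Agent header, which is exactly what the proposed law would require; the agent names in the blocklist are illustrative rather than exhaustive.

BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # illustrative list

def block_ai_bots(app):
    """Wrap a WSGI app and refuse requests from self-identified AI crawlers."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token.lower() in ua for token in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawlers are not permitted on this site.\n"]
        return app(environ, start_response)
    return middleware

A site that terminates requests at nginx or a CDN would do the same check there instead; the point is only that honest self-identification makes blocking trivial.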
Re: (Score:2)
(1) Define what the correct identification is. I can rename my bot every day; or should there also be requirements on that? Please consider the side effects of such laws, also with regard to fingerprinting users.
(2) Putting content behind a paywall will lead to your site not being found in an AI search. Quite soon your site will be invisible to regular users, who don't know what a browser is but use the default "search app" on their mobile device. And these default apps will be AI assistants instead of web browsers.
Re: (Score:2)
I don't want to argue that there aren't abusive crawlers. The problem is that one doesn't really know who runs them. They are not contributing to an AI-friendly climate and ultimately work against their own goal by getting sites to use bot filtering. I think most of them are actual crawlers, i.e., building datasets rather than building search result pages. I honestly don't know how to solve that problem, and we can only hope they finally recognize that nobody profits from them spamming requests. I also don't g
So if you are using the phrase (Score:3)
There is no way our current society is capable of dealing with this shit. We simply do not have the tools. So there is almost nothing more meaningless than expecting Congress to react to something like this in a positive way.
Folks just don't understand the scale of what's going on here. We are entering a third industrial revolution. We don't teach very much about the first two. But it's worth remembering the phrase "nasty, brutish and short."
Re: (Score:3)
"require bots to honor robots.txt."
'robots.txt' was invented in 1994, 31 years ago. It has done well, but requirements are different these days, and it needs extending for AI, though not just for AI; a rough sketch of the directives proposed below follows the list.
* 'User-agent' means that a site needs to know all the names that spiders use to identify themselves, which is hard and cumbersome. A 'Crawler' directive should be possible, with values e.g.:
** 'web-index' - e.g. Google, to allow someone to search
** 'AI' - e.g. ChatGPT
There are prolly several others.
* 'Purpose': what can the spider do with the information? Values e.g.:
** 'full-index'
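Purely as a sketch of this proposal (the 'Crawler:' and 'Purpose:' directives are not part of any standard, and the 'train' value is made up for the example), an extended robots.txt and a toy parser might look like this:

PROPOSED_ROBOTS_TXT = """
Crawler: web-index
Purpose: full-index
Allow: /

Crawler: AI
Purpose: train
Disallow: /
"""

def parse_proposed(text):
    """Group the hypothetical directives into per-crawler rule blocks."""
    rules, current = [], None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "crawler":
            current = {"crawler": value}
            rules.append(current)
        elif current is not None:
            current[key.lower()] = value
    return rules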
What are consequences from no monetization? (Score:2)
Google, of course, monetizes search data. We can argue about how they've spent that money, but there's no doubt that the money from search revenue has produced a lot of other stuff.
But if AI bots are able to scrape the internet, and then provide the results without the kind of monetization (i.e. without ads/ad revenue), what would happen to "The Internet As We Know It?" Would this actually be A Good Thing? Could an AI mechanism be self-sustaining, without a significant monetization strategy? Or is the
Re: (Score:2)
It will kill a lot of clickbait. Especially when an honest AI (i.e. one not prompted by the search service to refuse requests to filter bullshit) can be instructed to avoid (obvious) clickbait. On the other hand, you sure will see bot bait. Search engine spam will become AI agent spam, and the cat-and-mouse game of search engines will also become a game for AI service providers.
Re: (Score:3)
This is actually a very good question. Do AI trainers weight every completion equally, so that one can write "Trump is evil" 10,000 times on their webpage to train the AI to autocomplete that sentence? This happened a lot with search engines in the pre-PageRank days. Do they already have a PageRank-like strategy that weights important sites more heavily? Will they have to implement one? Do they also need a system to filter out content that is already produced by AIs?
Re: (Score:2)
Some tricks will depend on the model. Having something appear more often in the context (input + previous outputs) can affect the model, but it can also very easily be neutralized by instructing the model "Don't use duplicate content in the input for your response," and that's it. Some people may also try prompt injections (white text on a white background: "forget all previous instructions ...") but the search engine people are not dumb. There are small models whose only purpose is to pre-filter input to find such jailbreak attempts.
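As a toy illustration of that pre-filtering step (real deployments use small classifier models rather than keyword lists, and the marker phrases here are just examples), the pipeline shape is roughly:

INJECTION_MARKERS = (
    "ignore all previous instructions",
    "forget all previous instructions",
    "disregard the system prompt",
)

def looks_like_injection(page_text: str) -> bool:
    """Flag retrieved text that appears to carry a prompt injection."""
    lowered = page_text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)

def filter_retrieved_pages(pages):
    """Drop suspicious documents before they ever reach the answering model."""
    return [page for page in pages if not looks_like_injection(page)]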
Ripping the junk out (Score:5, Interesting)
Re: (Score:2)
I was always waiting for browser add-ons that stop blocking ads and start extracting content instead, as that sometimes seems the easier way. Looks like we're now getting this another way.
Re: (Score:3)
Re: (Score:2)
Not if you have two brain cells to rub together. You can iterate on the result, ask for sources, then click the damn links. Still a way better experience than hunting for the information through the "web-sewage" that the Internet has become.
Re: (Score:1)
Re: (Score:2)
Enshittification will come. But this time it will have a harder time. There are already perplexity-at-home programs. The main service still required (and possibly monetized) will be search results. Your perplexity-at-home usually needs some API key (Google, Bing, DDG, etc.) or otherwise has to rely on community projects like SearX that have worse results than the big search engines. I think the API limits for the large search engines are high enough for personal use - for now. Once people start using
Re: (Score:1)
The web is for boomers. (Score:1)
Host your screed on some dark arts onion-routed site and no 'legit' AI company is going to scrape your life's work.
really trying to make Scrape happen (Score:2)
When a human does it, it's called reading. When a computer does it, it's called scraping. The word is being used to demonize AI training.
This word appears six times between the title and the summary; they really want you to think this is something evil.
"But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb..."
No, it will be an uphill climb because "scraping" is just reading and certainly seems to be fair use.
Re: (Score:1)
Re: (Score:2)
"All User-generated content and news websites' terms of use specify that the content shall not be used by automated processes..."
They can say it; that doesn't make it binding. Private citizens and corporations don't make law.
"AI is viewed positively by investors and shareholders because of the promise of a quick buck for their capital, and very negatively by most of the working population who have no capital and only a job... Well, it figures."
Yes, and I fit that description as part of the second group. But
Re: (Score:2)
You can debate the reading part for most crawlers. But when talking about AI agents (e.g. one spawned from a web search), they are "user agents" just like the browser, which retrieves data for you. The word "agent" has been used (often behind the scenes) for decades for tools that fetch (or, in the case of mail user agents, send) content for us. That's also why there are two kinds of bots, and the second kind often ignores robots.txt: they are not robots, they are tools initiated by a user request. On the other hand, they
Re: (Score:2)
An "agent" is something that performs a task on behalf of another. Not sure what the point of that is.
It appears that you're trying to say that some web crawling ignores the desires of content providers. Yes, that's true, but robots.txt is not a binding legal agreement. Web crawlers are not engaging in illegal activity; they are using the web for its intended use.
"On the other hand, they also don't crawl but fetch a single page."
Quantity does not define legality. Do you become a criminal if you read and le
Re: (Score:2)
The point is that a crawler works for the search provider (let's say Google), while the agent works for you. That gives the request a whole different intent.
When it comes to robots.txt, text-and-data-mining (TDM) opt-out laws in some countries may give it legal meaning, since it is a machine-readable opt-out.
Adversarial Noise (Score:2)
Adversarial Noise. Poison AI learning through crafted content.
Yes, this will (has already?) become an arms race as AI developers figure out ways to avoid traps, but it can make LLMs and image/audio models less economical and less safe to operate.
=Smidge=
Re: (Score:2)
Adversarial noise isn't "noise" like static or random junk. It's specially crafted to make the model see things that aren't visible to humans, in order to alter its behavior.
Benn Jordan created a pretty good video [youtu.be] about an audio-specific implementation. Examples include perfectly normal-sounding audio clips tricking digital assistants into thinking they're getting voice commands, and music being completely misidentified. The practical application is that an artist can apply adversarial noise to their work and have it s
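To make "specially crafted, invisible to humans" concrete, here is the classic fast-gradient-sign idea from the image-classification literature; it is not the audio technique shown in the video, just a minimal PyTorch sketch of how a tiny, targeted perturbation is computed from the model's own gradients.

import torch

def fgsm_perturb(model, x, y, loss_fn, eps=8 / 255):
    """Return x plus a barely visible perturbation that raises the model's loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Step in the direction that most increases the loss; keep pixels in [0, 1].
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()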
Re: (Score:2)
That video is NOT "pretty good"; it starts getting things wrong with its first history lesson around the two-minute mark. Convolutional neural networks are NOT "behind nearly all generative AI models today," as Benn Jordan so casually states. LLMs overwhelmingly use transformer architectures, which do not use convolution. Embarrassingly bad.
Benn Jordan is a musician, not a technologist. Embedding information in audio streams has been proven for decades to always be audible and stripping informat
Websites will have to start (Score:2)
Re: (Score:2)
There are already proof-of-work challenges that don't require users to do anything, like https://github.com/TecharoHQ/a... [github.com]
This does add a delay and increase power usage, though.
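The underlying idea is simple: the server hands the browser a challenge, some client-side code burns CPU until it finds a matching nonce, and only then is the page served. A minimal hash-based sketch (not Anubis's actual scheme) looks like this:

import hashlib
import itertools
import secrets

def make_challenge() -> str:
    """Random token the server remembers for this visitor."""
    return secrets.token_hex(16)

def solve(challenge: str, difficulty: int = 4) -> int:
    """Client side: find a nonce whose hash starts with `difficulty` hex zeros."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: checking the answer costs one hash; finding it cost thousands."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

Raising the difficulty makes each request more expensive for a crawler hammering thousands of pages while staying barely noticeable for a single human visit, which is exactly the delay and power-usage trade-off mentioned above.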
This is where Gemini has a huge advantage (Score:1)
Google has already indexed all that content, there's no way that a "grounded with google search" AI chat needs to go out and hit those servers again.
Anyway I don't understand why people all of a sudden think bots should be paying for traffic. They never did before.
Pay to scrape (Score:1)
Monetize? - How about pay for bandwidth (Score:2)
Further, many cloud-hosted sites are charged by the gigabyte for outbound data usage. It is killing small projects with limited budgets.
And yes, they are ignoring the robots.txt file.
Re: (Score:2)
Why is all this data freely visible to the public then?
What kind of business case relies on ignorance of technology and the assumption that uninvolved third parties will behave the way you assume?
Insidious AI-targeting ads (Score:2)
A website could detect AI scraper bots and inject paid-for statements like "[brand] is the best [something]".
If Google were to do that on their ad network, it would give them an advantage since they could disable those AI-targeting ads for their own training dataset.
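A minimal sketch of that injection idea, assuming the bots self-identify in the User-Agent header (the token list and markup here are illustrative):

AI_BOT_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot")  # illustrative, not exhaustive

AD_SNIPPET = '<p class="ai-only">[brand] is the best [something].</p>'

def render_for_visitor(html: str, user_agent: str) -> str:
    """Serve the page unchanged to humans; append the paid-for claim for AI bots."""
    if any(token.lower() in user_agent.lower() for token in AI_BOT_TOKENS):
        return html.replace("</body>", AD_SNIPPET + "</body>", 1)
    return html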
Re: (Score:2)
If crawlers didn't filter spam, Google's page 1 would be about Viagra, for any search words you can think of.