

Increased Traffic from Web-Scraping AI Bots is Hard to Monetize (yahoo.com) 52
"People are replacing Google search with artificial intelligence tools like ChatGPT," reports the Washington Post.
But that's just the first change, according to TollBit, a New York-based start-up that offers a free analytics product to watch for content-scraping AI companies and is devoted to "ensuring that these intelligent agents pay for the content they consume." Its data from 266 websites (half run by national or local news organizations) found that "traffic from retrieval bots grew 49% in the first quarter of 2025 from the fourth quarter of 2024," the Post reports. A spokesperson for OpenAI said that referral traffic to publishers from ChatGPT searches may be lower in quantity but that it reflects a stronger user intent compared with casual web browsing.
To capitalize on this shift, websites will need to reorient themselves to AI visitors rather than human ones [said TollBit CEO/co-founder Toshit Panigrahi]. But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb, especially as leading players make their newest AI visitors even harder to identify....
In the past eight months, as chatbots have evolved to incorporate features like web search and "reasoning" to answer more complex queries, traffic for retrieval bots has skyrocketed. It grew 2.5 times as fast as traffic for bots that scrape data for training between the fourth quarter of 2024 and the first quarter of 2025, according to TollBit's report. Panigrahi said TollBit's data may underestimate the magnitude of this change because it doesn't reflect bots that AI companies send out on behalf of AI "agents" that can complete tasks on a user's behalf, like ordering takeout from DoorDash. The start-up's findings also add a dimension to mounting evidence that the modern internet — optimized for Google search results and social media algorithms — will have to be restructured as the popularity of AI answers grows. "To think of it as, 'Well, I'm optimizing my search for humans' is missing out on a big opportunity," he said.
Installing TollBit's analytics platform is free for news publishers, and the company has more than 2,000 clients, many of which are struggling with these seismic changes, according to data in the report. Although news publishers and other websites can implement blockers to prevent various AI bots from scraping their content, TollBit found that more than 26 million AI scrapes bypassed those blockers in March alone. Some AI companies claim bots for AI agents don't need to follow bot instructions because they are acting on behalf of a user.
The Post also got this comment from the chief operating officer for the media company Time, which successfully negotiated content licensing deals with OpenAI and Perplexity.
"The vast majority of the AI bots out there absolutely are not sourcing the content through any kind of paid mechanism... There is a very, very long way to go."
How many websites are the AI spiders killing? (Score:2)
Kind of a new Slashdot effect? I think I'm actually seeing some evidence of higher than usual mortality among old websites and I've been wondering if the cause might be AI spiders seeking more training data. Latest victim might be Tripod? But that one was already a ghost zombie website...
Billionaire Bro Internet Apocalypse (Score:4, Interesting)
Each billionaire bro's revenue-eating business model threatens to consume everyone else's lunch and dinner, except that no one wants to be the one cooking the food for anyone.
LLM scraping steals content from everyone else and refuses to pay any revenue for its use, but that will deny the LLMs any new data in the future, leading to the collapse of their models as well.
Re: Billionaire Bro Internet Apocalypse (Score:3)
Re: (Score:2)
Whoever gets in at the ground floor (Score:3)
Then the only ones who will have any access to training data will be platform holders who have the ability to distinguish between real people and AI slop, using the various tricks already used to detect bots.
AI becomes capital that is owned only by a handful of billionaires; they can do what they want with it, and we just have to suck it all down, because the alternative is drastic changes to our society tha
Re: (Score:1)
Re: (Score:2)
One of the newer capabilities of AI is shopping - just saying
Re: (Score:2)
I look at this from a regular information consumer point of view.
Option A: Use a regular search engine, look up the information, comb through umpteen websites, endure their ad-infested articles, sigh through bad writing, repeated paragraphs, needlessly complicated introductions, SEO-optimized writing styles, also avoid AI-generated articles, stumble upon paywalls...
Option B: Fire up my favorite LLM portal flavor, ask the damn question, get the information, maybe iterate upon it a couple times, then access t
Two things (Score:2)
(1) Congress should pass a law requiring that bots not misidentify themselves in the User-Agent string AND that they honor robots.txt. Then these obnoxious, ill-behaved AI bots could be blocked (a sketch of such a block follows below).
(2) If you have actually valuable content, you should put it behind a paywall, as most mainstream news sites already do. Making your pages static HTML, cacheable, or at least really easy to generate would help reduce the load too.
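For illustration, here is a minimal sketch of that kind of block as WSGI middleware. It assumes bots identify themselves honestly in the User-Agent header, which is exactly what the proposed law would require; the agent names in the blocklist are illustrative rather than exhaustive.

BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # illustrative list

def block_ai_bots(app):
    """Wrap a WSGI app and refuse requests from self-identified AI crawlers."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token.lower() in ua for token in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawlers are not permitted on this site.\n"]
        return app(environ, start_response)
    return middleware

A site that terminates requests at nginx or a CDN would do the same check there instead; the point is only that honest self-identification makes blocking trivial.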
Re: (Score:2)
(1) Define what the correct identification is. I can rename my bot every day; or should there also be requirements on that? Please consider the side effects of such laws, also with regard to fingerprinting users.
(2) Putting content behind a paywall will lead to your site not being found in an AI search. Quite soon your site will be invisible to regular users, who don't know what a browser is but use the default "search app" on their mobile device. And these default apps will be AI assistants instead of web browsers.
Re: (Score:2)
I don't want to argue that there aren't abusive crawlers. The problem is that one doesn't really know who runs them. They are not contributing to an AI-friendly climate and ultimately work against their own goal by getting sites to use bot filtering. I think most of them are actual crawlers, i.e., building datasets rather than building search result pages. I honestly don't know how to solve that problem, and we can only hope they finally recognize that nobody profits from them spamming requests. I also don't g
So if you are using the phrase (Score:3)
There is no way our current society is capable of dealing with this shit. We simply do not have the tools. So there is almost nothing more meaningless than expecting Congress to react to something like this in a positive way.
Folks just don't understand the scale of what's going on here. We are entering a third industrial revolution. We don't teach very much about the first two. But it's worth remembering the phrase "nasty, brutish and short."
Re: (Score:3)
"require bots to honor robots.txt."
'robots.txt' was invented in 1994, 31 years ago. It has done well, but requirements are different these days, and it needs extending for AI, though not just for AI; a rough sketch of the directives proposed below follows the list.
* 'User-agent' means that a site needs to know all the names that spiders use to identify themselves, which is hard and cumbersome. A 'Crawler' directive should be possible, with values e.g.:
** 'web-index' - e.g. Google, to allow someone to search
** 'AI' - e.g. ChatGPT
There are prolly several others.
* 'Purpose': what can the spider do with the information? Values e.g.:
** 'full-index'
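Purely as a sketch of this proposal (the 'Crawler:' and 'Purpose:' directives are not part of any standard, and the 'train' value is made up for the example), an extended robots.txt and a toy parser might look like this:

PROPOSED_ROBOTS_TXT = """
Crawler: web-index
Purpose: full-index
Allow: /

Crawler: AI
Purpose: train
Disallow: /
"""

def parse_proposed(text):
    """Group the hypothetical directives into per-crawler rule blocks."""
    rules, current = [], None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "crawler":
            current = {"crawler": value}
            rules.append(current)
        elif current is not None:
            current[key.lower()] = value
    return rules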
What are consequences from no monetization? (Score:2)
Google, of course, monetizes search data. We can argue about how they've spent that money, but there's no doubt that the money from search revenue has produced a lot of other stuff.
But if AI bots are able to scrape the internet, and then provide the results without the kind of monetization (i.e. without ads/ad revenue), what would happen to "The Internet As We Know It?" Would this actually be A Good Thing? Could an AI mechanism be self-sustaining, without a significant monetization strategy? Or is the
Re: (Score:2)
It will kill a lot of clickbait. Especially when an honest AI (i.e. one not prompted by the search service to refuse requests to filter bullshit) can be instructed to avoid (obvious) clickbait. On the other hand, you sure will see bot bait. Search engine spam will become AI agent spam, and the cat-and-mouse game of search engines will also become a game for AI service providers.
Re: (Score:3)
This is actually a very good question. Do AI trainers weight every completion equally, so that one can write "Trump is evil" 10,000 times on their webpage to train the AI to autocomplete that sentence? This happened a lot with search engines in the pre-PageRank days. Do they already have a PageRank-like strategy that weights important sites more heavily? Will they have to implement one? Do they also need a system to filter out content that is already produced by AIs?
Re: (Score:2)
Some tricks will depend on the model. Having something appear more often in the context (input + previous outputs) can affect the model, but it can also very easily be neutralized by instructing the model "Don't use duplicate content in the input for your response," and that's it. Some people may also try prompt injections (white text on a white background: "forget all previous instructions ...") but the search engine people are not dumb. There are small models whose only purpose is to pre-filter input to find such jailbreak attempts.
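As a toy illustration of that pre-filtering step (real deployments use small classifier models rather than keyword lists, and the marker phrases here are just examples), the pipeline shape is roughly:

INJECTION_MARKERS = (
    "ignore all previous instructions",
    "forget all previous instructions",
    "disregard the system prompt",
)

def looks_like_injection(page_text: str) -> bool:
    """Flag retrieved text that appears to carry a prompt injection."""
    lowered = page_text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)

def filter_retrieved_pages(pages):
    """Drop suspicious documents before they ever reach the answering model."""
    return [page for page in pages if not looks_like_injection(page)]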
Ripping the junk out (Score:5, Interesting)
Re: (Score:2)
I was always waiting for browser add-ons that stop blocking ads and start extracting content instead, as that sometimes seems the easier way. Looks like we're now getting this another way.
Re: (Score:3)
Re: (Score:2)
Not if you have two brain cells to rub together. You can iterate on the result, ask for sources, then click the damn links. Still a way better experience than hunting for the information through the "web-sewage" that the Internet has become.
Re: (Score:1)
Re: (Score:2)
Enshittification will come. But this time it will have a harder time. There are already perplexity-at-home programs. The main service still required (and possibly monetized) will be search results. Your perplexity-at-home usually needs some API key (Google, Bing, DDG, etc.) or otherwise has to rely on community projects like SearX that have worse results than the big search engines. I think the API limits for the large search engines are high enough for personal use - for now. Once people start using
Re: (Score:1)
The web is for boomers. (Score:1)
Host your screed on some dark arts onion-routed site and no 'legit' AI company is going to scrape your life's work.
really trying to make Scrape happen (Score:2)
When a human does it, it's called reading. When a computer does it, it's called scraping. The word is being used to demonize AI training.
This word appears six times between the title and the summary; they really want you to think this is something evil.
"But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb..."
No, it will be an uphill climb because "scraping" is just reading and certainly seems to be fair use.
Re: (Score:1)
Re: (Score:2)
"All User-generated content and news websites' terms of use specify that the content shall not be used by automated processes..."
They can say it; that doesn't make it binding. Private citizens and corporations don't make law.
"AI is viewed positively by investors and shareholders because of the promise of a quick buck for their capital, and very negatively by most of the working population who have no capital and only a job... Well, it figures."
Yes, and I fit that description as part of the second group. But
Re: (Score:2)
You can debate the reading part for most crawlers. But when talking about AI agents (e.g. one spawned from a web search), they are "user agents" just like the browser, which retrieves data for you. The word "agent" has been used (often behind the scenes) for decades for tools that fetch (or, in the case of mail user agents, send) content for us. That's also why there are two kinds of bots, and the second kind often ignores robots.txt: they are not robots, they are tools initiated by a user request. On the other hand, they
Re: (Score:2)
An "agent" is something that performs a task on behalf of another. Not sure what the point of that is.
It appears that you're trying to say that some web crawling ignores the desires of content providers. Yes, that's true, but robots.txt is not a binding legal agreement. Web crawlers are not engaging in illegal activity; they are using the web for its intended use.
"On the other hand, they also don't crawl but fetch a single page."
Quantity does not define legality. Do you become a criminal if you read and le
Re: (Score:2)
The point is that a crawler works for the search provider (let's say Google), while the agent works for you. That gives the request a whole different intent.
When it comes to robots.txt, text-and-data-mining (TDM) opt-out laws in some countries may give it legal meaning, since it is a machine-readable opt-out.
Adversarial Noise (Score:2)
Adversarial Noise. Poison AI learning through crafted content.
Yes, this will (has already?) become an arms race as AI developers figure out ways to avoid traps, but it can make LLMs and image/audio models less economical and less safe to operate.
=Smidge=
Re: (Score:2)
Adversarial noise isn't "noise" like static or random junk. It's specially crafted to make the model see things that aren't visible to humans, in order to alter its behavior.
Benn Jordan created a pretty good video [youtu.be] about an audio-specific implementation. Examples include perfectly normal-sounding audio clips tricking digital assistants into thinking they're getting voice commands, and music being completely misidentified. The practical application is that an artist can apply adversarial noise to their work and have it s
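To make "specially crafted, invisible to humans" concrete, here is the classic fast-gradient-sign idea from the image-classification literature; it is not the audio technique shown in the video, just a minimal PyTorch sketch of how a tiny, targeted perturbation is computed from the model's own gradients.

import torch

def fgsm_perturb(model, x, y, loss_fn, eps=8 / 255):
    """Return x plus a barely visible perturbation that raises the model's loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Step in the direction that most increases the loss; keep pixels in [0, 1].
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()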
Re: (Score:2)
That video is NOT "pretty good"; it starts getting things wrong with its first history lesson around the two-minute mark. Convolutional neural networks are NOT "behind nearly all generative AI models today," as Benn Jordan so casually states. LLMs overwhelmingly use transformer architectures, which do not use convolution. Embarrassingly bad.
Benn Jordan is a musician, not a technologist. Embedding information in audio streams has been proven for decades to always be audible and stripping informat
Websites will have to start (Score:2)
Re: (Score:2)
There are already proof-of-work challenges that don't require users to do anything, like https://github.com/TecharoHQ/a... [github.com]
This does add a delay and increase power usage, though.
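The underlying idea is simple: the server hands the browser a challenge, some client-side code burns CPU until it finds a matching nonce, and only then is the page served. A minimal hash-based sketch (not Anubis's actual scheme) looks like this:

import hashlib
import itertools
import secrets

def make_challenge() -> str:
    """Random token the server remembers for this visitor."""
    return secrets.token_hex(16)

def solve(challenge: str, difficulty: int = 4) -> int:
    """Client side: find a nonce whose hash starts with `difficulty` hex zeros."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: checking the answer costs one hash; finding it cost thousands."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

Raising the difficulty makes each request more expensive for a crawler hammering thousands of pages while staying barely noticeable for a single human visit, which is exactly the delay and power-usage trade-off mentioned above.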
This is where Gemini has a huge advantage (Score:1)
Google has already indexed all that content, there's no way that a "grounded with google search" AI chat needs to go out and hit those servers again.
Anyway I don't understand why people all of a sudden think bots should be paying for traffic. They never did before.
Pay to scrape (Score:1)
Monetize? - How about pay for bandwidth (Score:2)
Further, many cloud-hosted sites are charged by the gigabyte for outbound data usage. It is killing small projects with limited budgets.
And yes, they are ignoring the robots.txt file.
Re: (Score:2)
Why is all this data freely visible to the public then?
What kind of business case relies on ignorance of technology and the assumption that uninvolved third parties will behave the way you assume?
Insidious AI-targeting ads (Score:2)
A website could detect AI scraper bots and inject paid-for statements like "[brand] is the best [something]".
If Google were to do that on their ad network, it would give them an advantage since they could disable those AI-targeting ads for their own training dataset.
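A minimal sketch of that injection idea, assuming the bots self-identify in the User-Agent header (the token list and markup here are illustrative):

AI_BOT_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot")  # illustrative, not exhaustive

AD_SNIPPET = '<p class="ai-only">[brand] is the best [something].</p>'

def render_for_visitor(html: str, user_agent: str) -> str:
    """Serve the page unchanged to humans; append the paid-for claim for AI bots."""
    if any(token.lower() in user_agent.lower() for token in AI_BOT_TOKENS):
        return html.replace("</body>", AD_SNIPPET + "</body>", 1)
    return html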
Re: (Score:2)
If crawlers didn't filter spam, Google's page 1 would be about Viagra, for any search words you can think of.