

Increased Traffic from Web-Scraping AI Bots is Hard to Monetize (yahoo.com) 33
"People are replacing Google search with artificial intelligence tools like ChatGPT," reports the Washington Post.
But that's just the first change, according to a New York-based start-up devoted to watching for content-scraping AI companies with a free analytics product and "ensuring that these intelligent agents pay for the content they consume." Their data from 266 web sites (half run by national or local news organizations) found that "traffic from retrieval bots grew 49% in the first quarter of 2025 from the fourth quarter of 2024," the Post reports. A spokesperson for OpenAI said that referral traffic to publishers from ChatGPT searches may be lower in quantity but that it reflects a stronger user intent compared with casual web browsing.
To capitalize on this shift, websites will need to reorient themselves to AI visitors rather than human ones [said TollBit CEO/co-founder Toshit Panigrahi]. But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb, especially as leading players make their newest AI visitors even harder to identify....
In the past eight months, as chatbots have evolved to incorporate features like web search and "reasoning" to answer more complex queries, traffic for retrieval bots has skyrocketed. It grew 2.5 times as fast as traffic for bots that scrape data for training between the fourth quarter of 2024 and the first quarter of 2025, according to TollBit's report. Panigrahi said TollBit's data may underestimate the magnitude of this change because it doesn't reflect bots that AI companies send out on behalf of AI "agents" that can complete tasks on a user's behalf, like ordering takeout from DoorDash. The start-up's findings also add a dimension to mounting evidence that the modern internet — optimized for Google search results and social media algorithms — will have to be restructured as the popularity of AI answers grows. "To think of it as, 'Well, I'm optimizing my search for humans' is missing out on a big opportunity," he said.
Installing TollBit's analytics platform is free for news publishers, and the company has more than 2,000 clients, many of which are struggling with these seismic changes, according to data in the report. Although news publishers and other websites can implement blockers to prevent various AI bots from scraping their content, TollBit found that more than 26 million AI scrapes bypassed those blockers in March alone. Some AI companies claim bots for AI agents don't need to follow bot instructions because they are acting on behalf of a user.
The Post also got this comment from the chief operating officer for the media company Time, which successfully negotiated content licensing deals with OpenAI and Perplexity.
"The vast majority of the AI bots out there absolutely are not sourcing the content through any kind of paid mechanism... There is a very, very long way to go."
But that's just the first change, according to a New York-based start-up devoted to watching for content-scraping AI companies with a free analytics product and "ensuring that these intelligent agents pay for the content they consume." Their data from 266 web sites (half run by national or local news organizations) found that "traffic from retrieval bots grew 49% in the first quarter of 2025 from the fourth quarter of 2024," the Post reports. A spokesperson for OpenAI said that referral traffic to publishers from ChatGPT searches may be lower in quantity but that it reflects a stronger user intent compared with casual web browsing.
To capitalize on this shift, websites will need to reorient themselves to AI visitors rather than human ones [said TollBit CEO/co-founder Toshit Panigrahi]. But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb, especially as leading players make their newest AI visitors even harder to identify....
In the past eight months, as chatbots have evolved to incorporate features like web search and "reasoning" to answer more complex queries, traffic for retrieval bots has skyrocketed. It grew 2.5 times as fast as traffic for bots that scrape data for training between the fourth quarter of 2024 and the first quarter of 2025, according to TollBit's report. Panigrahi said TollBit's data may underestimate the magnitude of this change because it doesn't reflect bots that AI companies send out on behalf of AI "agents" that can complete tasks on a user's behalf, like ordering takeout from DoorDash. The start-up's findings also add a dimension to mounting evidence that the modern internet — optimized for Google search results and social media algorithms — will have to be restructured as the popularity of AI answers grows. "To think of it as, 'Well, I'm optimizing my search for humans' is missing out on a big opportunity," he said.
Installing TollBit's analytics platform is free for news publishers, and the company has more than 2,000 clients, many of which are struggling with these seismic changes, according to data in the report. Although news publishers and other websites can implement blockers to prevent various AI bots from scraping their content, TollBit found that more than 26 million AI scrapes bypassed those blockers in March alone. Some AI companies claim bots for AI agents don't need to follow bot instructions because they are acting on behalf of a user.
The Post also got this comment from the chief operating officer for the media company Time, which successfully negotiated content licensing deals with OpenAI and Perplexity.
"The vast majority of the AI bots out there absolutely are not sourcing the content through any kind of paid mechanism... There is a very, very long way to go."
How many websites are the AI spiders killing? (Score:2)
Kind of a new Slashdot effect? I think I'm actually seeing some evidence of higher than usual mortality among old websites and I've been wondering if the cause might be AI spiders seeking more training data. Latest victim might be Tripod? But that one was already a ghost zombie website...
Billionaire Bro Internet Apocalypse (Score:4, Interesting)
Each billionaire bro's revenue eating business model threatens to consume the lunch and dinner for everyone else, except that no one wants to be the one cooking up the food for anyone.
LLM scraping theft steals property from everyone else, then refuses to pay any revenue for its use, but that will deny the LLM any new data in the future, leading to the collapse of their model as well.
Re: Billionaire Bro Internet Apocalypse (Score:3)
Re: (Score:2)
Whoever gets in at the ground floor (Score:3)
Then the only ones who will have any access to data to train will be platform holders who have the ability to distinguish between real people and AI slop using various tricks you use to detect bots.
AI becomes Capital that is only owned by a handful of billionaires and they can do what they want with it and we just have to suck it all down because the alternative is drastic changes to our society tha
Re: (Score:1)
Two things (Score:2)
(1) Congress should pass a law requiring that bots not misidentify themselves in the user agent string AND require bots to honor robots.txt. Then these obnoxious, ill behaved AI bots could be blocked.
(2) If you have actually valuable content you should put it behind a paywall like most mainstream news sites already do. Making your pages static html, cacheable, or at least really easy to generate would help reduce the load too.
Re: (Score:2)
(1) Define what's the correct identification. I can rename my bot every day, or should there also be requirements on that? Please consider the side-effects of such laws, also with regard to fingerprinting users.
(2) Putting content behind a paywall will lead to your site not being found in an AI search. Quite soon your site will be invisible to regular users, who don't know what a browser is, but use the default "search app" on their mobile device. And these default apps will be AI assistants instead of web
So if you are using the phrase (Score:3)
There is no way our current society is capable of dealing with this shit. We simply do not have the tools. So there is almost nothing more meaningless than expecting Congress to react to something like this in a positive way.
Folks just don't understand the scale of what's going on here. We are entering a third industrial revolution. We don't teach very much about the first two. But it's worth remembering the phrase nasty brutish and
Re: (Score:2)
(require bots to honor robots.txt.
'robots.txt' was invented in 1994 - 31 years ago; it has done well but requirements are different these days, it needs extending for AI but not just AI.
* 'User-agent' means that a site needs to know all the names that spiders use to identify themselves. This is hard and cumbersome. 'Crawler' should be possible, values eg:
** 'web-index' - eg google to allow someone to search
** 'AI' - eg ChatGPT
There are prolly several others.
* 'purpose' what can the spider do with the information ? Values eg:
** 'full-index'
What are consequences from no monetization? (Score:2)
Google, of course, monetizes search data. We can argue about how they've spent that money, but there's no doubt that the money from search revenue has produced a lot of other stuff.
But if AI bots are able to scrape the internet, and then provide the results without the kind of monetization (i.e. without ads/ad revenue), what would happen to "The Internet As We Know It?" Would this actually be A Good Thing? Could an AI mechanism be self-sustaining, without a significant monetization strategy? Or is the
Re: (Score:2)
It will kill a lot of clickbait. Especially when a honest AI (i.e. not prompted by the search service to refuse requests to filter bullshit) can be instructed to avoid (obvious) clickbait. On the other hand, you sure will see bot bait. Search engine spam will become AI agent spam and the cat-and-mouse game of search engines will also become a game for AI service providers.
Re: (Score:3)
This is actually a very good question. Do AI trainers weigh every completion equally, so that one can write "Trump is evil" 10,000 times on their webpage to train the AI to autocomplete that sentence? This happened a lot with search engines in the pre-Pagerank days. Do they already have a Pagerank-like strategy that weighs important sites more? Will they have to implement one? Do they also need a system to filter out content that is already produced by AIs?
Re: (Score:2)
Some tricks will depend on the model. Having something more often in the context (input + previous outputs) can affect the model, but also can very easily be dropped by instructing the model "Don't use duplicate content in the input for your response" and that's it. Some people my also try prompt injections (white text on white ground "forget all previous instructions ...") but the search engine people are not dumb. There are small models that have the only purpose to pre-filter input to find such jailbreak
Ripping the junk out (Score:5, Interesting)
Re: (Score:2)
I was always waiting for browser addons that stop blocking ads and start extracting content as it sometimes seems to be the easier way. Looks like we're getting this now another way.
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
The web is for boomers. (Score:1)
Host your screed on some dark arts onion-routed site and no 'legit' AI company is going to scrape your life's work.
really trying to make Scrape happen (Score:2)
When a human does it, it's called reading. When a computer does it, it's called scraping. The word is being used to demonize AI training.
Six times this word appears here between the title and summary, they really want you to think this is something evil.
"But he also acknowledged that squeezing payment for content when AI companies argue that scraping online data is fair use will be an uphill climb..."
No, it will be an uphill climb because "scraping" is just reading and certainly seems to be fair use.
Re: (Score:1)
Re: (Score:2)
You can debate the reading part for most crawlers. But when talking about AI agents (e.g. spawned from a web search) they are "user agents" just like the browser is, which retrieves data for you. The word agent is used (often behind the scenes) since decades for tools who fetch (or send in case of mail user agents) content for us. That's also why there are two kinds of bots and the second kind often ignores robots.txt - they are no robots, they are tools initiated by a user request. On the other hand, they
Adversarial Noise (Score:2)
Adversarial Noise. Poison AI learning through crafted content.
Yes, this will (has already?) become an arms race as AI developers figure out ways to avoid traps, but it can make operating LLMs and image/audio models less economical and less safe to operate.
=Smidge=
Websites will have to start (Score:2)
This is where Gemini has a huge advantage (Score:1)
Google has already indexed all that content, there's no way that a "grounded with google search" AI chat needs to go out and hit those servers again.
Anyway I don't understand why people all of a sudden think bots should be paying for traffic. They never did before.
Pay to scrape (Score:1)
Monetize? - How about pay for bandwidth (Score:2)
Further, many cloud hosted sites are charged by the gigabyte of outbound data usage. It is killing small projects with limited budgets.
And yes, they are ignoring the robots.txt file.