Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm (yahoo.com)
Multiple AI companies are ignoring robots.txt files meant to block the scraping of web content for generative AI systems, reports Reuters, citing a warning sent to publishers by content licensing startup TollBit.
TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers' websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content... It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate "numerous" AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.
"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."
The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers): "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry."
Reuters also notes another threat facing news sites: Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
Yah ... (Score:2, Troll)
If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
I'm sure this is nothing more than an unfortunate coincidence that Google will fix .... eventually ... rest assured, we're working on it ... any day now ...
Re:Yah ... (Score:1)
Indeed. This is just publishers whining that they aren't getting something for nothing.
Re:Yah ... (Score:4, Insightful)
If their content is worth nothing then why are they being scraped?
Re: Yah ... (Score:2)
Tulip bulbs.
Re:Yah ... (Score:0)
How about that. I guess even a blind squirrel gets a nut once in a while.
Re:Yah ... (Score:2)
Doesn't matter either way. Shit gets trained into the AI on Slashdot time.
In other words, in six months to a year the AI will know about it, and it will likely spit out just as many dupes!
Re:Yah ... (Score:1)
Multiple AI Companies Ignore Robots.Txt Files
I would be shocked if there is anyone who DOESN'T ignore the Robots.Txt file
Re:Yah ... (Score:1)
It gets hits on my webserver. It's clearly not being ignored by everyone.
Re:Yah ... (Score:2)
There's an important difference between retrieving that file and complying with its contents.
Re:Yah ... (Score:1)
True, but it's technically not being "ignored" if it's at least being fetched, even if it's not being actually obeyed...
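The distinction can even be checked mechanically. Python's standard library ships a robots.txt parser, so a log processor can flag agents that fetched the file and then requested paths it disallows. A minimal sketch, with invented rules, agent names, and log entries:

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt (rules invented for illustration)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Invented (user-agent, path) pairs standing in for access-log entries
log = [
    ("ExampleAIBot", "/robots.txt"),
    ("ExampleAIBot", "/private/article.html"),  # disallowed
    ("PoliteBot", "/robots.txt"),
    ("PoliteBot", "/public/index.html"),        # allowed
]

# An agent that fetched robots.txt but then requested a disallowed
# path retrieved the file without complying with it
violators = {agent for agent, path in log
             if path != "/robots.txt" and not rp.can_fetch(agent, path)}
print(sorted(violators))
```

The fetch of `/robots.txt` itself is excluded from the check, so the set contains exactly the agents that pulled the file and ignored it.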
Re:Yah ... (Score:2)
They might request robots.txt to find the interesting stuff you're trying to hide; it would be trivial for you to test that instead of just saying they ask for it. Who knows? Maybe I could test it myself on my own web servers, then write a clickbait article, then profit!
Re:Yah ... (Score:2)
Why would they change this? Appearing on google search results is a privilege, not a right.
And that's why Google needs to be regulated as a monopoly.
Re:Yah ... (Score:2)
It cuts both ways. If Google didn't have their content the users would go elsewhere for it. The value is mutual.
Re:Yah ... (Score:2)
Google doesn't have their content.
Re:Yah ... (Score:2)
Corruption and classism: it's inevitable that the greedy will cross every boundary, because greed is insatiable.
Classism is the real problem; it creates the corruption that is destroying our society.
Re: Yah ... Doom on Repeat! (Score:0)
The search engine providers were not stewards of quality data: they did not provide a search index for users to peruse that was free of bias and spam and that adhered to common digital etiquette. That gen 1.0 lesson was easy to learn. It mutated into a corrupted dataset within a decade.
Looking at many of the software dataset query providers ("AI" in the vernacular), they do not have the integrity or reputation to rank even as stewards of knowledge, facts, or language. They may be vying for leadership in high-performance compute clusters and other technology utilizing large amounts of unverified data, but they in no way represent humanity or nuanced language expertise in the 21st century.
Knowing what you want simplifies a lot with technology, while waiting for certain areas to mature. In many cases, a dictionary and an expandable (modular) encyclopedia are plenty for an LLM. All data has been vetted and verified, while continuing on a trajectory that has proven successful for centuries in developed cultures (that explains the culture-less war).
I also think that we'll see companies/open source orgs specialize in data-sorting applications and other specific features of SDQPs and, in a reverse Apple-esque play, fork services and features with narrower goals, processes, and implementations, rather than offering the market "the kitchen sink" of mediocre software. (Do one thing and do it well is what masterclass curricula are made from.)
With many technologies like USB4 and Thunderbolt 4, I think we'll start to see "cascading computing" in the market, connecting the PC to specialized platforms (an SBC optimized for dictionary/encyclopedia LLMs, for example). Nerds and geeks used 10 Gb networking, with consumers going 1 Gb to 2.5 Gb to 5 Gb, where USB4/TB4 does a theoretical 40 Gbps.
https://en.m.wikipedia.org/wik... [wikipedia.org]
https://en.m.wikipedia.org/wik... [wikipedia.org]
With cascading computing platforms, we may also see TCP/IP (Internet 1.0) as a single SBC on the edge, facilitating data transfers to a cascading computer cluster, allowing "indirect access" to hostile networks while limiting exposure for the other PCs. Innovation from 2010-2020 has laid the foundation for many new architectures and functionality in PC designs, evolving LANs into something even more secure (no direct TCP/IP links with cascading computing).
I will gladly purchase LLM dictionaries and modular/expandable encyclopedias (coding add-ons, etc., making LLMs a la carte, built with smaller models that are manageable) and even purchase an SBC capable of advanced data sorting to link to my main PC or cluster with a 40 Gbps link, rather than trying to do everything on (1) computer (who only uses (1) computer in 2024?). That evolution even changes operating system usage (stay on Windows 10 if you choose), now that it does not have a direct link to Internet 1.0. This addresses eWaste and disposable pseudo-innovation where manufacturers were unable to. And much of the software will run for decades on trusted clusters/networks with only feature updates being the priority of devs {speculation}. The synergy with innovation is much stronger in trusted/safe environments. Many were raised on the Information Superhighway in that environment (sub-1-billion e-population).
Add in network diversification (new digital infrastructure), and digital-slum data mining, broken-English SEO, and other inferior tactics go the way of the dodo bird. IMHO, cascading computing does more for productivity, evolved security/privacy, and performance enhancements than any kitchen-sink software dataset query provider (that uses slave labor to groom inferior data) ever could.
Those that avoid an inferior digital umbilical cord and don't allow their kids to have that noose tied around their neck will tell a different story. My kids can use the lo
Just a contract (Score:5, Insightful)
robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact all companies) go for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.
Re:Just a contract (Score:5, Informative)
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/... [google.com]
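That distinction can be made concrete with a short fragment (the path is invented): a robots.txt rule limits crawling, while indexing is controlled in the page itself.

```
# robots.txt -- asks crawlers not to fetch these URLs
User-agent: *
Disallow: /drafts/
```

To keep a crawlable page out of search results, the page would instead carry `<meta name="robots" content="noindex">` (or an `X-Robots-Tag: noindex` response header), per that same Google documentation.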
Re:Just a contract (Score:2)
Yes, this was decided after the California DMV used their robots.txt to keep their site from being indexed.
Google's reasoning was that it would become a pretty bad search engine if it couldn't index the DMV site.
Re:Just a contract (Score:2)
robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact all companies) go for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.
robots.txt is just a passive request, really. And the AI companies are apparently declining that request.
Personally, I think they should respect it, but yeah, if you are relying on everybody respecting robots.txt, then you are smoking something ...
Re:Just a contract (Score:2)
Accessing content made publicly available and learning from it is not a violation of anything. So you can multiply however you like. It's still multiplication by zero, so it equals zero.
Re:Just a contract (Score:2)
If the content they scrape has no value then why are they scraping it?
Re:Just a contract (Score:2)
>If the content they scrape has no value
Who made this claim, and why are you replying to me with this stupid assertion?
Re:Just a contract (Score:0)
Try to remember that he is just as stupid, if not more stupid, than you are.
He's posted this same thing multiple times. He's very proud of that "insight" and wants to make sure as many people see it as possible.
Re:Just a contract (Score:0)
Go and defeat your strawman yourself.
Re:Just a contract (Score:1)
I knock down strawmen here almost every day, thanks. I take pleasure in it.
Re:Just a contract (Score:2)
robots.txt is just a good-will contract between the client and a web server.
robots.txt is, in effect, a machine-readable/parsable copyright notice. So if an AI company scrapes pages in contravention of what is said in a robots.txt, then it should be liable to be sued for breach of copyright. They cannot pretend that because they did not read it they can ignore it; imagine what a judge would say if you reproduced a book or piece of music but said that you did not read the book's or CD's copyright notice.
But AI companies are rich, with expensive lawyers, and will fight test cases tooth & nail to stop a precedent that they do not want: remember that money can buy you justice.
Re:Just a contract (Score:2)
A copyright notice tells you who holds the copyright. The robots.txt doesn't mention copyrights or even author names at all.
My access control (Score:2)
A copyright, a license agreement, and an attorney. Not that that is practical for an individual, with the exception of those in a jurisdiction with a small claims court. But even a moderately sized business should be able to enforce its rights in a civil court.
So be it (Score:2)
One way to protest this is to add instructions or other content meant only for the AI on your site. This could be used for commercial purposes too, if the returned message can be modified.
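One way to sketch that idea: branch on the request's User-Agent and hand known AI crawlers a notice instead of the page. The crawler names and the message below are illustrative, and header-based detection is trivially spoofable:

```python
# Illustrative AI-crawler names; a real deployment would maintain
# and update this list (and note that User-Agent can be spoofed)
AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot")

def body_for(user_agent: str) -> str:
    """Return alternate content when the request looks like an AI crawler."""
    if any(bot in user_agent for bot in AI_CRAWLERS):
        return "This content is licensable; terms at /licensing"
    return "<html>the regular page for human visitors</html>"

print(body_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

The same branch could just as easily serve instructions, a fee schedule, or anything else the site operator wants only the bots to see.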
selfishness (Score:2)
Re:selfishness (Score:2)
Exactly. Well, from a supposedly "good" company, like Google was 20 years ago (ah, for the days of "Don't Be Evil"...), I would expect them to honor it, but anyone with bad intent would probably use it as a shopping list of where to start indexing.
Otherwise, it needs to be dealt with in config or code (require authentication of some kind, only allow internal LAN/VPN connections, etc.)
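The "internal LAN only" variant is simple to express. In practice it belongs in the web server or firewall config, but a sketch of the check itself (addresses invented):

```python
import ipaddress

def is_internal(client_ip: str) -> bool:
    """True when the client address falls in a private (RFC 1918-style) range."""
    return ipaddress.ip_address(client_ip).is_private

print(is_internal("192.168.1.10"))  # True: private LAN range
print(is_internal("8.8.8.8"))       # False: public address
```

Note that `is_private` also covers loopback, link-local, and documentation ranges, which is usually what you want for a "not from the public internet" gate.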
Killing the deal (Score:4, Interesting)
The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors. If you scrape the web and "summarize" the content and nobody ever visits the web site, you don't uphold your end of the deal and the deal will end. And if copyright legislation doesn't come down on that behavior, the open web will cease to be. Nobody will get access to anything that isn't advertising or propaganda in itself without signing a contract that excludes any non-personal use of the content, even summarizing and other "fair uses".
Re:Killing the deal (Score:4, Insightful)
That worked...for a while. The problem is the ad networks, being the capitalists they are, took the "neutral" approach of "whatever you pay for". This resulted in legitimate businesses being used for the illicit distribution of malware.
The next problem is that ad networks did nothing. They knew they were serving malicious ads; they knew they were selling to bad actors; but they knew they had legal protection and continued to willingly sell malicious adspace under the guise of "we're too big to check".
So now come ad-blockers. It was one thing when they were just annoying; but it's another when there's actual risk of getting hacked. It didn't go over well when the local newspaper infected 2500 local readers from a bad ad. Did they blame the ad? The paper did. Know who the readers blamed? They blamed the newspaper. "You should have taken more responsibility," is what they screamed as they were canceling subscriptions. The same for a local TV station's website when their ad network was serving malicious ads. They could point the finger all they wanted...but everyone was pointing it at the station.
That's the other problem: no one places the blame where it should be placed. Rather than blame the ad networks with no morals, they blame the website operators.
So now we have the ad-blocker wars; and to combat that...more anti-adblock stuff.
The fact is...ad revenue isn't enough anymore. The lack of privacy laws and no oversight on any of this has meant the biggest export is American user data; sold by American companies, to the highest bidder. They don't care about us...we're just a product to profit off of.
Re:Killing the deal (Score:3)
Facebook has a version of that problem now. Every single "Sponsored" article leads straight to malware. Reporting it to Facebook as fraud, which these pages are, gets a response of "we see no violation of our guidelines".
Re:Killing the deal (Score:2)
Yeah...when they're being paid to display it it's never a guideline and there's no concern for users.
I'm waiting for someone to finally get a judge to apply the elimination of those rules and start holding them responsible in civil court. They don't have immunity in civil court anymore over that; it was killed so they could arrest the Backpage guys.
Re: Killing the deal (Score:0)
No, it does NOT make you the customer. It just makes you another dumbfuck sucker.
They sell your info anyway. But when you're paying, they've got more to sell, they've got your real name, your real address, your bank info.
Doesn't matter if you spent 10s of thousands, the car companies are harvesting and selling your info too. You paid the phone company for a subscription, they sold your location data. You might think "I'll just subscribe to somebody with a good privacy policy" - but you're a fucking fool if you think that. NONE of them have good privacy policies.
At least when it's free you can block ads, you can give them fake info if you have to register. But the moment you pay, they've fucking got you.
Fucking paytard.
Re: Killing the deal (Score:2)
If I pay $15 per month to a company, that company is getting a solid $180 per year from me. Sure, they could try to sell my data, but if they do it too sloppily and cause me trouble, I get pissed off, cancel my subscription, and talk about it online where their bad behavior gets broadcast to a thousand or a million other subscribers. That’s real-life economic damage that can rack up REALLY FAST for a company. Far smarter for them to hold my info close, keep me happy as a customer, and harvest that EZ $15 every month, year after year.
Even a cheap subscription means that the company is HIGHLY motivated to keep my info (mostly) locked down. It’s not a perfect system. Nothing is.
But, by all means, keep using free-tier services on Android and convince yourself that you're screwing the man and you're better than all the sheeple.
Re:Killing the deal (Score:3)
The websites are a proxy for the malware. They signed a contract with someone serving malware to their innocent users. They are responsible for delivering malware. If they had used a legit ad company that filters out shitty malware ads the users would not have been impacted.
Here's who gets to blame who:
Web site users get to blame the web site
Web site has to take that responsibility for infecting their users while also getting to blame the ad network
The ad network has to take that responsibility while blaming no one, because they can't blame the criminals they cut deals with and did nothing to filter the malware.
When I worked at a content company, we ran a check against all ads we'd never seen before on first hit and periodically rechecked. We took responsibility for our users' safety. No sympathy for the newspaper if they didn't bother to protect their users but took the malware ad money.
Re: Killing the deal (Score:3)
I never said ad based content models worked or are a good thing. Only pointing out that the content distributors who use shitty ad networks are fully responsible for the malware they deliver to their visitors. Someone said the readers unfairly blamed the online newspapers. Not so. The readers appropriately blamed the newspapers for delivering malware to their browsers.
Re:Killing the deal (Score:1)
When was that the deal, outside of your imagination? Even Google itself clearly states in writing that this is not the deal, in the opening lines of its description of what robots.txt does:
https://developers.google.com/... [google.com]
"A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page."
Re:Killing the deal (Score:3)
Read it once again ;)
The comment isn't about google and robots.txt, it is about "AI" outfits scraping content and then offering it as their own "AI" creation.
Or at least it appears so to me.
Re:Killing the deal (Score:2)
The complaint raised in the topic is specifically about robots.txt:
>Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm
Re:Killing the deal (Score:2)
The robots exclusion standard (aka robots.txt) is a red herring [wikipedia.org]. It is a gentlemen's agreement [wikipedia.org], originally intended to indicate to crawlers which parts of a web site are unsuitable for them, where crawling would produce nothing but useless burden for the server and the crawler alike. For example, you wouldn't want a crawler to keep requesting URLs from a procedurally generated infinite tree of documents. Whether it has expanded beyond that and can be legally binding under certain circumstances is left as an exercise to the legally inclined nerd.
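That original use is easy to picture as a couple of lines in the file itself (the path is made up):

```
# robots.txt -- keep crawlers out of an endless generated tree
User-agent: *
Disallow: /calendar/
```

A calendar that generates a page for every date, forever, is the classic case: nothing there is worth indexing, and a naive crawler would never stop requesting it.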
That said, the issue here isn't the robots exclusion standard, but an implicit understanding that let search engines access web sites and use their content in a way that benefits both the web sites and the search engines. If the search engines, and other AI companies, break that understanding, the web will end up behind the mother of all "paywalls", where the paywall is a mandatory contract that prohibits any non-personal use of the content. This isn't robots.txt, not even robots.txt on steroids. This is "403 Forbidden" territory for anyone who doesn't authenticate as someone who has agreed to a legally binding contract not to divulge or otherwise use the information except for purely personal purposes. Login or fuck off. Nobody wants to feed AI companies their content: They give nothing in return. All you will be able to access freely will be ads and propaganda.
Re:Killing the deal (Score:2)
>an implicit understanding
Isn't the entire argument here that there is in fact no such thing because content producers specifically assert that such agreement doesn't exist and AI companies and google are doing things they don't want google to do, but that they have no desire to actually prevent by gating their content from the public?
Re:Killing the deal (Score:2)
We are in a time of transition. Obviously most web sites do not want to drop out of Google right now, because that still is a sizeable amount of their traffic. However, once people understand "googling" to mean asking a chatbot and getting answers not from the web sites but immediately from the bot, the traffic will dry up and exposing the content to Google and other AI companies won't benefit web sites anymore, and that's when web sites will not only drop from the search results pages, that nobody looks at anymore, but also from the publicly accessible world wide web. Requiring a login from a legally identifiable person will be the only way to keep the content from being stolen by AI companies, unless copyright law prohibits the training of AIs with copyrighted data without license and compensation.
Re:Killing the deal (Score:2)
I strongly disagree on emotionally loaded, objectively and factually incorrect wording such as "content being stolen" when all that's being done is learning from data that certain people chose to share with the public.
But the rest of the analysis is mostly in line with my thinking.
So I suspect the sole point of difference between us is that you see it as a moral good to call any person learning from the content of another person a thief (someone who steals), as you do in the post above. Whereas I look at the entirety of human history and see learning from the content of others as the single strongest driver of human learning and progress, and see limits placed on that as at best extremely negative for society, and at worst probably the single worst crime against humanity that one could commit.
Re:Killing the deal (Score:2)
You're missing the point, because it doesn't matter at all whether you think "stolen" is the right word. Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that. If you agree to a contract that forbids you from training an AI model (and all other uses that aren't purely personal), because you won't be able to access anything without doing so, you are legally bound by that contract. Copyright law doesn't come into it anymore at that point. If you prefer a web where you have to legally identify yourself everywhere, so that your agreement to that contract can be verified and you can be held accountable for violations, then do keep that "it's not stealing, our programs are just learning like any human would, hurr durr" stance. AI companies will not be able to keep scraping the web. The only question is whether they take down the open web while trying or if they can be stopped from building their empires on stolen information some other way.
Re:Killing the deal (Score:2)
Emperor has no clothes, and public has every right to see that emperor has no clothes, and learn whatever lessons it chooses to learn from it.
Only a tyrannical emperor makes people avert their eyes and block learning from that public display he himself chose to put on.
Once the emperor chose to put himself up on display in whatever way he chose to do it, no further permits from the emperor are necessary for looking at him, and learning from his visage. And any demands for such permits to be required are so extremely tyrannical, that we have actual folk tales warning us against it.
Re:Killing the deal (Score:2)
Greedy idiots ruining things for everyone by shirking conventions is the oldest tale in the book.
Re:Killing the deal (Score:2)
Indeed. They should stop pretending that they have a right to dictate whether people can look at and learn from content they themselves made public.
Re:Killing the deal (Score:2)
You're an LLM, aren't you?
Re:Killing the deal (Score:2)
It is the current popular way to run away from an argument you lost among the terminally online, isn't it?
Before it was "you're a bot", and before that "you're a nazi". My reaction remains the same.
Run away little girl, run away!
Re:Killing the deal (Score:2)
It's a plausible explanation why you keep forgetting all context. You may just be an idiot though.
Re:Killing the deal (Score:2)
Projection on your part is very real.
Re:Killing the deal (Score:1)
>Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that.
Then: Don't. Put. Shit. On. The. Free. Web.
Period.
It's not rocket surgery here.
You aren't allowed to make a piece of artwork, put it publicly on a billboard in the middle of town, and then say Bob, Jerry, and Sue aren't allowed to look at it, but everyone else in the world can. Public access doesn't work like that.
Don't like it? Go private.
You can make all the rules you want if you show your stuff off in private. "But I can't get free customers then!". Too fucking bad, you bitched when your shit was getting you free customers too, just because you couldn't get ALL of the free customers. Now you can deal with getting even fewer.
Re:Killing the deal (Score:2)
Let's ignore for a moment that you completely ignored the rest of the discussion and naturally missed the point, that letting AI companies get away with delivering content that others created will result in the end of the openly accessible web for everyone, not just these companies, because these parasites and their shills don't take no for an answer. But even with that caveat, you're still wrong. The web isn't a billboard that anyone can look at. It's servers delivering content to individual clients, and you can very well decide Bob, Jerry and Sue aren't allowed to look at the content but everyone else is. Whether robots.txt has the legal power to do that or not is not clear-cut, but there are certainly technical measures you can take, and Bob, Jerry or Sue can go pound sand if you do.
Re:Killing the deal (Score:2)
No you can't decide to host on a non-login site and pick and choose who can view the content. That's the whole damn reason some sites hide forums and user content behind logins.
Well, you CAN, if you know the particular IP addresses you won't respond to. Or you can, I don't know, Implement a login scheme with terms of service.
The "problem" with a login scheme is... you don't get a free lunch. You don't even get random visitors to ATTEMPT to serve ads to. Most of them will see your jank-assed attempt to "lock down your content" and just go to one of your competitors, someone who DGAF about who scrapes shit so long as they continue to get traffic instead of stagnating.
Shitty internet "companies" can't have it both ways. Period.
Re:Killing the deal (Score:0)
> The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors.
There was never such a deal.
People started writing crawlers, and then people proposed robots.txt to keep crawlers from taking down websites with too few resources.
Nobody made a deal "You can index, because you link back to us" or anything like that. This just never happened.
Of course, some people wanting clicks may see it as such an exchange, but that's their wish and not a contract they have.
The "contract" is: "When you put something online without restrictions, people, bots, and everything else may fetch it."
Re:Killing the deal (Score:2)
The reason I wrote "deal" and not "contract" is that it's a convention or mutually implied understanding, not a piece of paper or an oral agreement sealed with a handshake or somesuch. If the wording confuses you, call it a balance of benefits. Web authors let crawlers access their sites. They do not have to do that. The balance of benefits is being shaken up by crawlers that train AI models. These AI companies then effectively provide the content without anyone having to visit the site where it originated. The point is that the web sites will not keep "putting something online without restrictions". If copyright law doesn't handle this, then the only way to exclude the parasitic AI companies is to make "people, bots and everything else" agree to contracts before letting them connect. If that's the web you want...
Re:Killing the deal (Score:0)
Fair. But still it was/is a de-facto deal and not a contract. You can't sue Google if they don't rank you high enough and they can't sue you if you present their bot an error 403 even though the robots.txt would allow access, just as you cannot sue them when they ignore robots.txt.
If you watch your logs closely, you can see Google coming back with an iPhone user agent (probably others as well) without accessing robots.txt after the official crawler was there. I suppose (in good faith) that they check whether you present the Google bot different content than normal users.
Anyway, excluding AI bots will only further the Google, Meta, and possibly OpenAI monopolies. They use the content crawled for the search engine, the social network, and bought content, while the competition doesn't have anything like this. Look at OpenAI making big deals with the publishers. Do you think someone building an open source model can afford to buy content like this? Of course they use crawlers. I do not exclude AI crawlers anywhere I do not exclude search engine bots, because I do not want to see a Google/Meta/OpenAI monopoly on AI.
Re: Killing the deal (Score:2)
The UN Human Rights Council might disagree with you.
virtually invisible on the web (Score:2)
Re:virtually invisible on the web (Score:2)
Maybe you and your site are the AI.
Copyright Infringement (Score:2)
Take a page out of Nintendo's book: lawyer up and file a C&D, a DMCA notice, and everything else you can for every page they scrape. If they are ignoring your intellectual property rights and policies, then it's unauthorized access. I mean, the FBI literally just listed SABnzbd as a pirate site... clearly the standard for infringement is low if the FBI isn't even making correct arguments in court.
Re:Copyright Infringement (Score:2)
Large sites like YouTube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.
Right now small community sites don't have the expertise or the manpower to manually check access logs and trace where the spiders are coming from, or find the contact details to send C&D and DMCA claims. So they do nothing, and that's if they even realize what is happening.
This is a great problem for the next generation of open source engineers to work on. Ideally, some community group should build a set of free web analytics tools that can be installed via cPanel etc. The software should identify spiders, figure out who they are and what they are doing, and list all the possible DMCA violations. It should also make it easy to create properly formatted DMCA requests so that they can be sent in one click (by a human; never automate sending requests). There should be a proper record of the requests in case the owner of the spider doesn't respond in a reasonable timeframe; that record could also serve as evidence for later, stronger steps.
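The spider-identification piece of such a tool could start very small. A sketch of the idea, assuming Apache/nginx combined log format (the regex and the keyword heuristic are assumptions; a real tool would also do reverse-DNS checks and match hits against robots.txt rules):

```python
# Minimal sketch: parse combined-format access logs and tally hits per
# user agent that looks like a crawler, as a first pass at "identify
# spiders and figure out what they are doing".
import re
from collections import Counter

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider", "scraper")  # heuristic, not exhaustive

def tally_bots(lines):
    """Return a Counter of hit counts per bot-looking user agent."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and any(h in m.group("agent").lower() for h in BOT_HINTS):
            hits[m.group("agent")] += 1
    return hits
```

From there, cross-referencing the flagged agents with the site's content inventory is what would turn raw counts into DMCA-ready evidence.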
Some 20 years ago the community got together to combat the spam problem, which has similar characteristics. There were globally curated blacklists, SMTP server access greylisting, automatically updating keyword filters, etc. Lots of ideas, lots of people sharing what worked or didn't, giving power back to the little guy.
Re:Copyright Infringement (Score:2)
What the community needs is a way to easily and automatically do the C&D + DMCA submissions.
Large sites like YouTube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.
No way that will be abused by Big Corp lawyers and trolls, right?
Oh wait! It already happens with some DMCA requests made to Youtube by various outsourced "rights management" firms and trolls.
https://www.eff.org/deeplinks/... [eff.org] https://www.businessinsider.co... [businessinsider.com] https://arstechnica.com/tech-p... [arstechnica.com] https://www.thefader.com/2022/... [thefader.com] https://sirtaptap.com/articles... [sirtaptap.com]
Re:Copyright Infringement (Score:0)
The DMCA *requires* you provide a URL reference to the material in question.
Copyright law also requires you to reference the protected work along with the smallest sliver of evidence it was distributed or publicly performed.
When it comes to search results, this is simple and easy to do; note, though, that Google already honors DMCA takedown requests for search results. You won't be able to counter the billions of examples of them honoring such requests, including your own, to show they are ignoring them.
AI bots are quite a different story. Or at least most companies doing this are.
Their goal is specifically to prevent a verbatim "copy/paste" of the data they scrape.
When they succeed, there is no distributed work for you to reference. Copyright law in its current form is not broken.
Yes, there are some that do verbatim copy, and sure, go after them!
But these are still the exception and not the rule.
Scraping data that you do not then distribute isn't illegal. Nor is this really the topic at hand (it is a different topic, for sure, but not one robots.txt will affect).
This topic is completely about scraping that data in the first place after being asked not to.
This isn't a crime, this is being an ass, and you don't deal with assholes doing legal things the same way you deal with anyone doing illegal things.
Re:Copyright Infringement (Score:2)
No court has yet ruled on whether AI-scraped data run through training programs is a copyright violation or transformative.
Re:Copyright Infringement (Score:0)
Every verbatim copy is like a "bug" caused by incomplete training.
The models are far smaller than the training data, and when they are trained correctly there is no reason they should contain an original: the space could be used more efficiently for logic that helps the model understand things, or to store abstract knowledge, instead of verbatim copies.
Does anyone follow robots.txt orders? (Score:2)
It's extra work to pull and read those files, and doing so slows the crawler even if the rules are ignored. It's far simpler to skip them and pad your "metrics" for the amount of material you've scanned, even when the robots.txt warns you that a section isn't reliable or even stable.
Re:Does anyone follow robots.txt orders? (Score:1)
Like I said above, it gets hits on my webserver. Not everyone ignores them.
Re:Does anyone follow robots.txt orders? (Score:2)
As a list of target URLs to scrape?
Do the files your robots.txt protect ever get grabbed?
Re:Does anyone follow robots.txt orders? (Score:0)
Here in reality, the overwhelming majority of crawlers respect robots.txt. While it's true that corporations range from amoral to pure evil, it's usually in their best interest to honor robots.txt. (Do a quick search for 'spider trap', for example.) Only a complete moron would try to use it as an access restriction! What the hell do you think it's for anyway?
Besides, the few crawlers that don't honor robots.txt will very quickly get blocked. This is trivial to automate, as anyone with an IQ above room temperature figured out in the '90s. (Crawlers wreaked havoc in the early days of the web.) It's honestly surprising that some scumbag AI corporations willfully ignore it. I'm going to guess that's due more to incompetence than malice.
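The automation really is trivial. A sketch, using Python's standard `urllib.robotparser` (the disallow rules are placeholders, and the actual ban action via iptables or fail2ban is left as a stub):

```python
# Sketch: read your own robots.txt rules, then flag any client that
# requests a path its user agent was told not to fetch. The firewall
# hook that actually bans the IP is out of scope here.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /trap/",
])

def should_ban(user_agent: str, path: str) -> bool:
    """A client fetching a disallowed path has ignored robots.txt."""
    return not rules.can_fetch(user_agent, path)
```

Wire `should_ban` into your access-log pipeline and anything hitting a disallowed path gets handed to the blocker.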
Re:Does anyone follow robots.txt orders? (Score:2)
At some point sufficient incompetence is a form of malice.
This stuff is old and anyone capable of writing a web crawler will know about robots.txt.
Re:Does anyone follow robots.txt orders? (Score:1)
Keep in mind that the primary purpose of robots.txt is to provide a list of primary URLs to crawl, as a shortcut for the crawler to get to the stuff that's relevant to index. Yes, it can also be used to advise on what not to fetch, but ostensibly it's in an ethical web spider's best interest to parse and obey this file, as it will save time and omit unnecessary chaff from the indexed data.
Re:Does anyone follow robots.txt orders? (Score:1)
(Full disclosure: there's actually nothing on my website but the robots.txt and the index.html, so I haven't actually measured spider obedience. I just know the file is being downloaded, and frequently by things with words like "crawler" or "spider" in the user-agent string.)
Re:Does anyone follow robots.txt orders? (Score:2)
It's a mixed bag. I've seen some well behaved and others dive right in to fake test directories that are just there to see who is bad.
Occam's Razor (Score:2)
Occam's razor would suggest that these companies simply never thought to look for or use robots.txt. It is designed to tell web crawlers what to index for search engines, and there's a good chance these companies never thought to leverage it, or didn't feel it applied to what they were doing. They should have, of course, but I feel there is some wiggle room to give them the benefit of the doubt in this case.
Not to mention at the end of the day, this is a text file anyone can ignore and skip past if they want, and it doesn't take a genius to figure this out. People are gonna scrape stuff you don't want them to and you have to be prepared for when, not if, that happens.
Re:Occam's Razor (Score:0)
I think a person building a crawler knows about robots.txt. Also many frameworks for building crawlers bring support for robots.txt by default.
But I think it is reasonable to think that the people crawling content for AI may not think their use-case is the same as the search engine use-case.
Robots.txt (Score:0)
We should start to use WeWillSueYouRobots.txt
Yawn (Score:5, Interesting)
Remember back in the late 2000s when companies were all about "Reinventing Search" (of the WWW)? It turned out most of them were trying to get juicier results than Google by ignoring robots.txt. They were not actually better, and they irritated a lot of people when they ended up recursing indefinitely down programmatically generated websites whose robots.txt specifically said "don't go here."
It's not news that ignoring robots.txt gets you access to more content on the web. It's also not news that this is usually not going to get you any better content.
Yet another bunch of tech bros have decided they can succeed by ignoring all of the rules, laws, social conventions, and lessons of the past, because they're the superior, innovative people. Instead they will just burn money until they run out, then go around and start another company and get some more money without ever generating anything useful or profitable.
Re:Yawn (Score:2)
It's not news that ignoring robots.txt gets you access to more content on the web.
It's also not news that this is usually not going to get you any better content.
Even Google ignores robots.txt. They made that decision after the California DMV (Department of Motor Vehicles) blocked them with its robots.txt.
And frankly, I can't blame Google.
If you don't want your content to be accessed by everyone, don't put it up on the public internet.
Badly written bots are a separate issue.
technical suggestion (Score:3)
Why would you expect a technical suggestion to work?
It's almost as is (Score:4, Insightful)
Gigantic quasi-monopolies don't respect anyone or any laws, or bother to behave with any sort of decency anymore, since they've made themselves untouchable and decency does nothing for their shareholders anyway.
I don't think they even bother to pretend to show restraint anymore. Like with the AI stuff infringing copyright on an unprecedented scale: they basically just went "Yeah, that's how it goes now. You can't stop us. Suck it up." It's quite staggering.
Re:It's almost as is (Score:2)
Explain how AI infringes copyright?
Re:It's almost as is (Score:2)
That's one hell of a rock you must have been living under...
Re:It's almost as is (Score:2)
I don't think they even bother to pretend to show restraint anymore.
I think Microsoft's Recall feature demonstrates your point very clearly. No shame at all.
Poison the results. (Score:2)
If an AI is crawling the site, create one page containing purely random text with a selection of random links, and make the page reachable via any arbitrary URL pointing at any imaginable, purely illusory subdomain.
The AI will harvest however many pages it is set to (possibly all of them), each page diluting and corrupting the AI's neural net.
The AI developers don't give a damn about quality, only the illusion of quality, so will never actually stop and look. But a large enough phantom site should seriously impair AIs relying on random scans.
You can then even advertise your phantom site to authors. All they have to do is get their hosting provider to add a few lines to the web server config to transparently redirect AI requests to it. The AIs then plunder your nonsense pages rather than ebooks and sample chapters.
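The page-generation side of such a phantom site is a few lines of code. A toy sketch (the word list and the `/phantom/` path scheme are made up for illustration; seeding the generator per URL keeps each fake page stable across visits, so the site looks real to a revisiting crawler):

```python
# Toy "phantom page" generator: random text plus links to more equally
# fake pages under the same handler, deterministic per seed/URL.
import random

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def phantom_page(seed: int, n_words: int = 200, n_links: int = 5) -> str:
    rng = random.Random(seed)  # same seed -> same page on every visit
    body = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = "".join(
        f'<a href="/phantom/{rng.randrange(10**9)}">more</a>'
        for _ in range(n_links)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Hook it up to a catch-all route in any web framework and every generated link leads to another generated page, indefinitely.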
We need .. (Score:2)
How to fix (Score:3)
You can use fail2ban to block rude web scrapers. Put a hidden link into your web pages that people would not see, but bots would. Include that link in robots.txt. When anyone hits that link, fail2ban will automatically block them based on the rule you implement.
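Concretely, the setup might look like the fragment below. The trap path `/trap/`, the filter name `bot-trap`, and the log path are placeholders; the hidden link would be something like `<a href="/trap/" style="display:none"></a>` in your pages, with `Disallow: /trap/` in robots.txt so only clients ignoring it ever hit the URL.

```ini
; /etc/fail2ban/filter.d/bot-trap.conf  (hypothetical filter name)
[Definition]
; Match any client requesting the hidden trap URL
failregex = ^<HOST> .* "GET /trap/

; /etc/fail2ban/jail.local  (excerpt)
[bot-trap]
enabled  = true
port     = http,https
filter   = bot-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With `maxretry = 1`, a single hit on the trap is enough to earn the ban.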
Re:How to fix (Score:2)
You beat me to the punch. I've done similar things in the past to trap bad bots. My next favorite tool was for use against email harvesters. I generated page after page of fake email addresses for them to collect. Same idea, though. Hidden link on a page, not in the robots.txt though, as the point wasn't to block them but to poison their well.
Linkedin, Microsoft and OpenAI (Score:2)
It's strange that, given Microsoft's involvement in both LinkedIn and OpenAI, Microsoft prevents OpenAI from accessing LinkedIn.
Why does it matter (Score:2)
More expected than surprising (Score:2)
There are two interwined issues here (Score:2)
Oh, yes, please scrape my website (Score:2)
No, I won't sue. I'll file criminal charges for theft against the CEOs. They get to go to JAIL.