Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm (yahoo.com)

Multiple AI companies are ignoring robots.txt files meant to block the scraping of web content for generative AI systems, reports Reuters, citing a warning sent to publishers by content licensing startup TollBit. TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers' websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content... It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate "numerous" AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.

"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."


The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers): "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry."

Reuters also notes another threat facing news sites: Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
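(For what it's worth, Google also publishes a separate robots.txt token, "Google-Extended", documented as opting a site out of training Google's Gemini models without affecting Search inclusion. But as the article notes, the AI search summaries are fed by the ordinary Googlebot crawl, so the only robots.txt rule that keeps content out of them also removes the site from search results. A sketch of the two directives, for illustration:

    # Opts out of Gemini model training; Search inclusion is unaffected
    User-agent: Google-Extended
    Disallow: /

    # The crawl that feeds both Search and its AI summaries;
    # blocking it means vanishing from search results
    User-agent: Googlebot
    Disallow: /

Blocking Google-Extended alone, per the article's framing, would not keep content out of the summaries.)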
Comments Filter:
  • Yah ... (Score:2, Troll)

    by Savage-Rabbit ( 308260 ) on Sunday June 23, 2024 @06:42AM (#64570703)

    If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.

    I'm sure this is nothing more than an unfortunate coincidence that Google will fix .... eventually ... rest assured, we're working on it ... any day now ...

    • by aldousd666 ( 640240 ) on Sunday June 23, 2024 @07:02AM (#64570757) Journal
      Naw, they're just effectively 'raising the price' on being indexed in Google. For the low fee of 'not a damn penny' + allowing Google to train an AI on the summary of your article, you too can have millions of users find your site when they go to search Google.
    • by Anonymous Coward on Sunday June 23, 2024 @07:05AM (#64570769)

      Multiple AI Companies Ignore Robots.Txt Files

      I would be shocked if there is anyone who DOESN'T ignore the Robots.Txt file

    • by Luckyo ( 1726890 ) on Sunday June 23, 2024 @08:36AM (#64570975)

      Why would they change this? Appearing in Google search results is a privilege, not a right.

      And that's why Google needs to be regulated as a monopoly.

    • by 2TecTom ( 311314 ) on Sunday June 23, 2024 @09:18AM (#64571067) Homepage Journal

      Corruption and classism: it's inevitable that the greedy will cross every boundary, because greed is insatiable.

      Classism is the real problem; it's what creates the corruption destroying our society.

      • by aSplash0fDerp ( 6329282 ) on Sunday June 23, 2024 @10:44AM (#64571307)
        Those that don't learn from history.... Are doomed to repeat it.

        The search engine providers were not stewards of quality data: they failed to provide a search index for users to peruse that was free of bias and spam and that adhered to common digital etiquette. That gen 1.0 lesson was easy to learn. It mutated into a corrupted dataset within a decade.

        Looking at many of the software dataset query providers ("AI", in the slang), they do not have the integrity or reputation to even rank as stewards of knowledge, facts, or language. They may be vying to lead in high-performance compute clusters and other technology utilizing large amounts of unverified data, but they in no way represent humanity or nuanced language expertise in the 21st century.

        Knowing what you want simplifies a lot with technology, while waiting for certain areas to mature. In many cases, a dictionary and an expandable (modular) encyclopedia are plenty for an LLM. All data has been vetted and verified, while continuing on a trajectory that has proven successful for centuries in developed cultures (that explains the culture-less war).

        I also think that we'll see companies/open-source orgs specialize in data-sorting applications and other specific features of SDQPs and, in a reverse Apple-esque play, fork services and features with narrower goals, processes, and implementations, rather than offering the market "the kitchen sink" of mediocre software. (Doing one thing and doing it well is what masterclass curricula are made from.)

        With many technologies like USB4 and Thunderbolt 4, I think we'll start to see "cascading computing" in the market, connecting the PC to specialized platforms (an SBC optimized for dictionary/encyclopedia LLMs, for example). Nerds and geeks used 10 Gb networking, with consumers going 1 Gb to 2.5 Gb to 5 Gb, where USB4/TB4 does a theoretical 40 Gbps.

        https://en.m.wikipedia.org/wik... [wikipedia.org]

        https://en.m.wikipedia.org/wik... [wikipedia.org]

        With cascading computing platforms, we may also see TCP/IP (Internet 1.0) as a single SBC on the edge, facilitating data transfers to a cascading computer cluster and allowing "indirect access" to hostile networks while limiting exposure for the other PCs. Innovation from 2010-2020 has laid the foundation for many new architectures and functionality in PC designs, and for evolving LANs into something even more secure (no direct TCP/IP links with cascading computing).

        I will gladly purchase LLM dictionaries and modular/expandable encyclopedias (coding add-ons, etc., making LLMs a la carte, built from smaller models that are manageable) and even purchase an SBC capable of advanced data sorting to link to my main PC or cluster over a 40 Gbps link, rather than trying to do everything on (1) computer (who only uses (1) computer in 2024?). That evolution even changes operating system usage (stay on Windows 10 if you choose), now that it does not have a direct link to Internet 1.0. This addresses eWaste and disposable pseudo-innovation where manufacturers were unable to. And much of the software will run for decades on trusted clusters/networks with only feature updates being the priority of devs {speculation}. The synergy with innovation is much stronger in trusted/safe environments. Many were raised on the Information Superhighway in that environment (sub-1-billion e-population).

        Add in network diversification (new digital infrastructure), and digital-slum data mining, broken-English SEO, and other inferior tactics go the way of the dodo bird. IMHO, cascading computing does more for productivity, evolved security/privacy, and performance than any kitchen-sink software dataset query provider (that uses slave labor to groom inferior data) ever could.

        Those that avoid an inferior digital umbilical cord and don't allow their kids to have that noose tied around their neck will tell a different story. My kids can use the lo
  • Just a contract (Score:5, Insightful)

    by devslash0 ( 4203435 ) on Sunday June 23, 2024 @06:53AM (#64570727)

    robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact, all companies) go for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.
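    A minimal sketch of what "actual access control" can look like server-side, assuming a Flask app; the route and the token store are hypothetical:

        from flask import Flask, request, abort

        app = Flask(__name__)
        SUBSCRIBER_TOKENS = {"token-issued-to-a-paying-subscriber"}  # hypothetical store

        @app.route("/articles/<slug>")
        def article(slug):
            # str.removeprefix requires Python 3.9+
            token = request.headers.get("Authorization", "").removeprefix("Bearer ")
            if token not in SUBSCRIBER_TOKENS:
                abort(403)  # enforced by the server, not by the client's good will
            return f"full text of {slug}"

    Unlike robots.txt, the 403 here doesn't depend on the crawler choosing to cooperate.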

  • by jovius ( 974690 ) on Sunday June 23, 2024 @07:04AM (#64570763)

    One way to protest this is to add instructions or other content for only the AI on your site. This could be used for commercial purposes too, if the return message can be modified.
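    One rough sketch of that idea, assuming Flask: serve a different payload when the request's User-Agent matches a published AI-crawler token. The token list is illustrative, and User-Agent strings are trivially faked, so this only catches honest crawlers:

        from flask import Flask, request

        app = Flask(__name__)
        AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")  # published crawler UA names

        @app.route("/essay")
        def essay():
            ua = request.headers.get("User-Agent", "")
            if any(token in ua for token in AI_BOT_TOKENS):
                # the AI-only payload: instructions, licensing terms, or junk
                return "This content is available for licensing; contact the publisher."
            return "the real essay, for human readers"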

  • by fluffernutter ( 1411889 ) on Sunday June 23, 2024 @07:07AM (#64570771)
    Look around at most of the people in the world. Why would anyone expect anyone else to voluntarily follow the robots.txt?
    • by i.r.id10t ( 595143 ) on Sunday June 23, 2024 @08:06AM (#64570909)

      Exactly. Well, from a supposedly "good" company, like Google was 20 years ago (ah, for the days of "Don't Be Evil"...), I would expect them to honor it, but anyone with bad intent would probably use it as a shopping list of where to start indexing.

      Otherwise, it needs to be dealt with in config or code (require authentication of some kind, only allow internal LAN/VPN connections, etc.).

  • Killing the deal (Score:4, Interesting)

    by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @07:22AM (#64570807)

    The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors. If you scrape the web and "summarize" the content and nobody ever visits the web site, you don't uphold your end of the deal and the deal will end. And if copyright legislation doesn't come down on that behavior, the open web will cease to be. Nobody will get access to anything that isn't advertising or propaganda in itself without signing a contract that excludes any non-personal use of the content, even summarizing and other "fair uses".

    • by DewDude ( 537374 ) on Sunday June 23, 2024 @07:52AM (#64570873) Homepage

      That worked... for a while. The problem is the ad networks, being the capitalists they are, took the "neutral" approach of "whatever you pay for". This resulted in legitimate businesses being used for the illicit distribution of malware.

      The next problem is that the ad networks did nothing. They knew they were serving malicious ads; they knew they were selling to bad actors; but they knew they had legal protection, so they continued to willingly sell malicious adspace under the guise of "we're too big to check".

      So now come ad-blockers. It was one thing when they were just annoying; but it's another when there's actual risk of getting hacked. It didn't go over well when the local newspaper infected 2500 local readers from a bad ad. Did they blame the ad? The paper did. Know who the readers blamed? They blamed the newspaper. "You should have taken more responsibility," is what they screamed as they were canceling subscriptions. The same for a local TV station's website when their ad network was serving malicious ads. They could point the finger all they wanted...but everyone was pointing it at the station.

      That's the other problem: no one places the blame where it should be placed. Rather than blame the ad networks with no morals, they blame the website operators.

      So now we have the ad-blocker wars; and to combat that...more anti-adblock stuff.

      The fact is... ad revenue isn't enough anymore. The lack of privacy laws and of any oversight has meant that the biggest export is American user data, sold by American companies to the highest bidder. They don't care about us... we're just a product to profit off of.

      • by Antique Geekmeister ( 740220 ) on Sunday June 23, 2024 @08:02AM (#64570899)

        Facebook has a version of that problem now. Every single "Sponsored" article leads straight to malware. Reporting it to Facebook as fraud, which these pages are, gets a response of "we see no violation of our guidelines".

        • by DewDude ( 537374 ) on Sunday June 23, 2024 @08:07AM (#64570915) Homepage

          Yeah... when they're being paid to display it, it's never a guideline violation, and there's no concern for users.

          I'm waiting for someone to finally get a judge to apply the elimination of those immunity rules and start holding them responsible in civil court. They don't have immunity in civil court over that anymore; it was killed so they could arrest the Backpage guys.

      • by hdyoung ( 5182939 ) on Sunday June 23, 2024 @08:35AM (#64570971)
        This. The entire ad-driven internet business model is inevitably driven to maximum exploitation and providing the minimum service required to keep people engaged, just due to simple economic and mathematical considerations. You want something that's more focused on the user? You PAY A SUBSCRIPTION. That makes you the customer, not the cattle, and the entire experience flips to something MUCH nicer.
        • by Anonymous Coward on Sunday June 23, 2024 @02:26PM (#64571851)

          No, it does NOT make you the customer. It just makes you another dumbfuck sucker.

          They sell your info anyway. But when you're paying, they've got more to sell, they've got your real name, your real address, your bank info.

          Doesn't matter if you spent tens of thousands; the car companies are harvesting and selling your info too. You paid the phone company for a subscription; they sold your location data. You might think "I'll just subscribe to somebody with a good privacy policy" - but you're a fucking fool if you think that. NONE of them have good privacy policies.

          At least when it's free you can block ads, you can give them fake info if you have to register. But the moment you pay, they've fucking got you.

          Fucking paytard.

          • by hdyoung ( 5182939 ) on Sunday June 23, 2024 @03:01PM (#64571927)
            You’re just straight-up wrong, because of fundamental economics. My personal data is worth around a few bucks if sold legitimately, from one company to another. And, once a legit company has my ID info, they have no need to buy it a second time. Ongoing internet browsing data is different, but probably only goes for pennies a pop. Illegal info, like banking info and full “steal a person’s identity” level stuff can run around a hundred bucks, based on my quick research.

            If I pay $15 per month to a company, that company is getting a solid $180 per year from me. Sure, they could try to sell my data, but if they do it too sloppily and cause me trouble, I get pissed off, cancel my subscription, and talk about it online where their bad behavior gets broadcast to a thousand or a million other subscribers. That’s real-life economic damage that can rack up REALLY FAST for a company. Far smarter for them to hold my info close, keep me happy as a customer, and harvest that EZ $15 every month, year after year.

            Even a cheap subscription means that the company is HIGHLY motivated to keep my info (mostly) locked down. It’s not a perfect system. Nothing is.

            But, by all means, keep using free-tier services on Android and convince yourself that you're screwing the man and you're better than all the sheeple.
      • by iAmWaySmarterThanYou ( 10095012 ) on Sunday June 23, 2024 @09:58AM (#64571185)

        The websites are a proxy for the malware. They signed a contract with someone serving malware to their innocent users. They are responsible for delivering malware. If they had used a legit ad company that filters out shitty malware ads the users would not have been impacted.

        Here's who gets to blame who:
        Web site users get to blame the web site
        Web site has to take that responsibility for infecting their users while also getting to blame the ad network
        The ad network has to take that responsibility while blaming no one, because they can't blame the criminals they cut deals with and whose malware they did nothing to filter.

        When I worked at a content company, we ran a check against all ads we'd never seen before on first hit and periodically rechecked. We took responsibility for our users' safety. No sympathy for the newspaper if they didn't bother to protect their users but took the malware ad money.

    • by Luckyo ( 1726890 ) on Sunday June 23, 2024 @08:40AM (#64570983)

      When was that the deal, outside of your imagination? Even Google itself clearly states in writing that this is not the deal, in the opening lines of its description of what robots.txt does:

      https://developers.google.com/... [google.com]

      "A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page."

      • by Mr. Dollar Ton ( 5495648 ) on Sunday June 23, 2024 @09:25AM (#64571089)

        Read it once again ;)

        The comment isn't about google and robots.txt, it is about "AI" outfits scraping content and then offering it as their own "AI" creation.

        Or at least it appears so to me.

      • by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @09:49AM (#64571161)

        The robots exclusion standard (aka robots.txt) is a red herring [wikipedia.org]. It is a gentlemen's agreement [wikipedia.org], originally intended to indicate to crawlers which parts of a web site are unsuitable for them, where crawling would produce nothing but useless burden for the server and the crawler alike. For example, you wouldn't want a crawler to keep requesting URLs from a procedurally generated infinite tree of documents. Whether it has expanded beyond that and can be legally binding under certain circumstances is left as an exercise to the legally inclined nerd.

        That said, the issue here isn't the robots exclusion standard, but an implicit understanding that let search engines access web sites and use their content in a way that benefits both the web sites and the search engines. If the search engines, and other AI companies, break that understanding, the web will end up behind the mother of all "paywalls", where the paywall is a mandatory contract that prohibits any non-personal use of the content. This isn't robots.txt, not even robots.txt on steroids. This is "403 Forbidden" territory for anyone who doesn't authenticate as someone who has agreed to a legally binding contract not to divulge or otherwise use the information except for purely personal purposes. Login or fuck off. Nobody wants to feed AI companies their content: They give nothing in return. All you will be able to access freely will be ads and propaganda.

        • by Luckyo ( 1726890 ) on Sunday June 23, 2024 @10:16AM (#64571241)

          >an implicit understanding

          Isn't the entire argument here that there is in fact no such thing? Content producers specifically assert that no such agreement exists, and that AI companies and Google are doing things the producers don't want, yet the producers have no desire to actually prevent any of it by gating their content from the public.

          • by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @10:27AM (#64571275)

            We are in a time of transition. Obviously most web sites do not want to drop out of Google right now, because Google still drives a sizeable amount of their traffic. However, once people understand "googling" to mean asking a chatbot and getting answers not from the web sites but immediately from the bot, the traffic will dry up, and exposing content to Google and other AI companies won't benefit web sites anymore. That's when web sites will drop not only from the search results pages that nobody looks at anymore, but also from the publicly accessible world wide web. Requiring a login from a legally identifiable person will be the only way to keep the content from being stolen by AI companies, unless copyright law prohibits the training of AIs with copyrighted data without license and compensation.

            • by Luckyo ( 1726890 ) on Sunday June 23, 2024 @10:35AM (#64571285)

              I strongly disagree with emotionally loaded, objectively and factually incorrect wording such as "content being stolen" when all that's being done is learning from data that certain people chose to share with the public.

              But the rest of the analysis is mostly in line with my thinking.

              So I suspect the sole point of difference we have is that you see it as a moral good to call any person learning from the content of another person a thief (someone who steals), as you do in the post above. Whereas I look at the whole of human history and see learning from the content of others as the single strongest driver of human learning and progress, and I see limits placed on that as at best extremely negative for society, and at worst probably the single worst crime against humanity that one could commit.

              • by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @10:50AM (#64571325)

                You're missing the point, because it doesn't matter at all whether you think "stolen" is the right word. Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that. If you agree to a contract that forbids you from training an AI model (and all other uses that aren't purely personal), because you won't be able to access anything without doing so, you are legally bound by that contract. Copyright law doesn't come into it anymore at that point. If you prefer a web where you have to legally identify yourself everywhere, so that your agreement to that contract can be verified and you can be held accountable for violations, then do keep that "it's not stealing, our programs are just learning like any human would, hurr durr" stance. AI companies will not be able to keep scraping the web. The only question is whether they take down the open web while trying or if they can be stopped from building their empires on stolen information some other way.

                • by Luckyo ( 1726890 ) on Sunday June 23, 2024 @11:21AM (#64571401)

                  Emperor has no clothes, and public has every right to see that emperor has no clothes, and learn whatever lessons it chooses to learn from it.

                  Only a tyrannical emperor makes people avert their eyes and block learning from that public display he himself chose to put on.

                  Once the emperor chose to put himself up on display in whatever way he chose to do it, no further permits from the emperor are necessary for looking at him, and learning from his visage. And any demands for such permits to be required are so extremely tyrannical, that we have actual folk tales warning us against it.

                • by chmod a+x mojo ( 965286 ) on Sunday June 23, 2024 @02:39PM (#64571901)

                  >Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that.

                  Then: Don't. Put. Shit. On. The. Free. Web.

                  Period.

                  It's not rocket surgery here.

                  You aren't allowed to make a piece of artwork, put it publicly on a billboard in the middle of town, and then say Bob, Jerry, and Sue aren't allowed to look at it, but everyone else in the world can. Public access doesn't work like that.

                  Don't like it? Go private.

                  You can make all the rules you want if you show your stuff off in private. "But I can't get free customers then!". Too fucking bad, you bitched when your shit was getting you free customers too, just because you couldn't get ALL of the free customers. Now you can deal with getting even fewer.

                  • by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @03:25PM (#64571983)

                    Let's ignore for a moment that you completely ignored the rest of the discussion and naturally missed the point: letting AI companies get away with delivering content that others created will result in the end of the openly accessible web for everyone, not just for these companies, because these parasites and their shills don't take no for an answer. But even with that caveat, you're still wrong. The web isn't a billboard that anyone can look at. It's servers delivering content to individual clients, and you can very well decide Bob, Jerry and Sue aren't allowed to look at the content but everyone else is. Whether robots.txt has the legal power to do that or not is not clear-cut, but there are certainly technical measures you can take, and Bob, Jerry or Sue can go pound sand if you do.

                    • by chmod a+x mojo ( 965286 ) on Sunday June 23, 2024 @04:33PM (#64572081)

                      No, you can't decide to host on a non-login site and pick and choose who can view the content. That's the whole damn reason some sites hide forums and user content behind logins.

                      Well, you CAN, if you know the particular IP addresses you won't respond to. Or you can, I don't know, implement a login scheme with terms of service.

                      The "problem" with a login scheme is.... you don't get a free lunch. You don't even get random visitors to ATTEMPT to try to serve ads to. Most of them will see your jank assed attempt to "lock down your content" and just go to one of your competitors who is someone else who DGAF about who scrapes shit so long as they continue to get traffic instead of stagnating.

                      Shitty internet "companies" can't have it both ways. Period.

    • by Anonymous Coward on Sunday June 23, 2024 @12:18PM (#64571575)

      > The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors.
      There was never such a deal.

      People started writing crawlers, and then people proposed robots.txt to keep crawlers from taking down websites with too few resources.
      Nobody made a deal "You can index, because you link back to us" or anything like that. This just never happened.

      Of course, some people wanting clicks may see it as such an exchange, but that's their wish, not a contract they have.
      The "contract" is: "When you put something online without restrictions, people, bots, and everything else may fetch it."

      • by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @12:35PM (#64571631)

        The reason I wrote "deal" and not "contract" is that it's a convention or mutually implied understanding, not a piece of paper or an oral agreement sealed with a handshake or somesuch. If the wording confuses you, call it a balance of benefits. Web authors let crawlers access their sites. They do not have to do that. The balance of benefits is being shaken up by crawlers that train AI models. These AI companies then effectively provide the content without anyone having to visit the site where it originated. The point is that the web sites will not keep "putting something online without restrictions". If copyright law doesn't handle this, then the only way to exclude the parasitic AI companies is to make "people, bots and everything else" agree to contracts before letting them connect. If that's the web you want...

        • by Anonymous Coward on Sunday June 23, 2024 @01:49PM (#64571773)

          Fair. But still it was/is a de-facto deal and not a contract. You can't sue Google if they don't rank you high enough and they can't sue you if you present their bot an error 403 even though the robots.txt would allow access, just as you cannot sue them when they ignore robots.txt.

          If you watch your logs closely, you can see Google coming back with an iPhone user agent (probably others as well) without accessing robots.txt after the official crawler was there. I suppose (in good faith) that they check whether you present the Google bot with different content than normal users see.

          Anyway, excluding AI bots will only further the Google, Meta, and possibly OpenAI monopolies. They use content crawled for the search engine, the social network, and bought content, while the competition doesn't have anything like this. Look at OpenAI making big deals with the publishers. Do you think someone building an open-source model can afford to buy content like this? Of course they use crawlers. I do not exclude AI crawlers anywhere that I do not exclude search engine bots, because I do not want to see a Google/Meta/OpenAI monopoly on AI.

  • by MicroSlut ( 2478760 ) on Sunday June 23, 2024 @07:30AM (#64570827)
    Ignoring the basic fundamental rules we all use to keep us from poking each other in the eyes on the playground can get you in trouble. I use robots.txt to hide junk content on a few dozen websites, including fake usernames and passwords, fake company names and addresses, and collections of images designed to make hackers question reality. I apologize in advance for using any names and passwords that may be real, like Howard T Duck \ pJV@%mzD*2. Some of the content I create makes me question reality. If they are trying to get AI to understand human nature this way, we are all doomed, because human nature is what AI will eventually understand.
  • by DewDude ( 537374 ) on Sunday June 23, 2024 @07:45AM (#64570855) Homepage

    Take a page out of Nintendo's book: lawyer up and file a C&D, a DMCA notice, and everything else you can for every page they scrape. If they are ignoring your intellectual property rights and policies, then it's unauthorized access. I mean, the FBI literally just listed SABnzbd as a pirate site... clearly the standard for infringement is low if the FBI isn't even making correct arguments in court.

    • by martin-boundary ( 547041 ) on Sunday June 23, 2024 @08:32AM (#64570963)
      What the community needs is a way to easily and automatically do the C&D + DMCA submissions.

      Large sites like Youtube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.

      Right now small community sites don't have the expertise or the manpower to manually check access logs and trace where the spiders are coming from, or find the contact details to send C&D and DMCA claims. So they do nothing, and that's if they even realize what is happening.

      This is a great problem for the next generation of open source engineers to work on. Ideally, some community group should build a bunch of web analytics (for free) that can be installed on cPanel etc. The software should identify spiders, figure out who they are and what they are doing, and list all the possible DMCA violations. The software should also make it easy to create properly formatted DMCA requests so that they can be sent in one click (by a human; never automate sending requests). There should be a proper record of the requests in case the owner of the spider doesn't respond in a reasonable timeframe. In that case, the record could also act as proof for later, more defensive steps. (A starting sketch of the spider-identification piece follows below.)

      Some 20 years ago the community got together to combat the spam problem, which has similar characteristics. There were globally curated blacklists, SMTP server access greylisting, automatically updating keyword filters, etc. Lots of ideas, lots of people sharing what worked or didn't, giving power back to the little guy.
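      As a starting point for the spider-identification piece described above, here is a minimal scan that tallies suspected AI-crawler hits from a combined-format access log; the log path and the user-agent tokens are assumptions, not a curated database:

          from collections import Counter

          AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # illustrative
          LOG = "/var/log/nginx/access.log"  # hypothetical path, combined log format

          hits = Counter()
          with open(LOG) as f:
              for line in f:
                  parts = line.split('"')
                  if len(parts) < 6:
                      continue  # not a combined-format line
                  ip = parts[0].split()[0]  # first field is the client address
                  ua = parts[5]             # sixth quoted field is the User-Agent
                  for token in AI_BOT_TOKENS:
                      if token in ua:
                          hits[(token, ip)] += 1

          for (token, ip), n in hits.most_common(20):
              print(f"{token:12} {ip:16} {n} requests")

      From there, matching the hits against paths disallowed in robots.txt would flag the candidates worth a closer look.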

    • by Anonymous Coward on Sunday June 23, 2024 @09:35AM (#64571117)

      The DMCA *requires* you to provide a URL reference to the material in question.
      Copyright law also requires you to reference the protected work, along with at least the smallest sliver of evidence that it was distributed or publicly performed.

      When it comes to search results, this is simple and easy to do; however, Google already honors DMCA takedown requests for search results. You won't be able to counter the billions of examples of them honoring such requests, including your own, to show they are ignoring them.

      AI bots are quite a different story. Or at least most companies doing this are.

      Their goal is very specifically to prevent a verbatim "copy/paste" of the data they scrape.
      When they succeed, there is no distributed work for you to reference. Copyright law in its current form is not broken.

      Yes, there are some that do verbatim copy, and sure, go after them!
      But these are still the exception and not the rule.

      Scraping data that you do not then distribute isn't illegal. Nor is it really the topic at hand (it is a different topic for sure, but not one robots.txt will affect).

      This topic is completely about scraping that data in the first place after being asked not to.
      This isn't a crime, this is being an ass, and you don't deal with assholes doing legal things the same way you deal with anyone doing illegal things.

  • It's extra work to pull and read those files, and doing so slows the crawler even if the directives are ultimately ignored. It's far simpler to skip them entirely and build up your "metrics" for the amount of material you've scanned, even when the robots.txt warns you that the content isn't reliable or even stable.

  • by The MAZZTer ( 911996 ) <megazztNO@SPAMgmail.com> on Sunday June 23, 2024 @08:01AM (#64570895) Homepage

    Occam's razor would suggest that these companies simply never thought to look for or use robots.txt. It is designed to inform web crawlers what to index for search engines, and I feel there's a good chance these companies never thought to leverage it, or didn't feel it was applicable to what they were doing. They should have, of course, but I feel there is some wiggle room there to give them the benefit of the doubt in this case.

    Not to mention at the end of the day, this is a text file anyone can ignore and skip past if they want, and it doesn't take a genius to figure this out. People are gonna scrape stuff you don't want them to and you have to be prepared for when, not if, that happens.

    • by Anonymous Coward on Sunday June 23, 2024 @02:24PM (#64571839)

      I think a person building a crawler knows about robots.txt. Also many frameworks for building crawlers bring support for robots.txt by default.
      But I think it is reasonable to think that the people crawling content for AI may not think their use-case is the same as the search engine use-case.

  • by Anonymous Coward on Sunday June 23, 2024 @08:27AM (#64570947)

    We should start to use WeWillSueYouRobots.txt

  • Yawn (Score:5, Interesting)

    by nicolaiplum ( 169077 ) on Sunday June 23, 2024 @08:32AM (#64570965)

    Remember back in the late 2000s when companies were all about "Reinventing Search" (of the WWW)? It turned out most of them were trying to get juicier results than Google by ignoring robots.txt. They were not actually better, and they did irritate a lot of people when they ended up recursing indefinitely down programmatically generated websites whose robots.txt specifically said "don't go here".

    It's not news that ignoring robots.txt gets you access to more content on the web. It's also not news that this is usually not going to get you any better content.

    Yet another bunch of tech bros are deciding they can succeed by ignoring all the rules, laws, social conventions, and lessons of the past, because they're the superior, innovative people. Instead they will just burn money until they run out, then go around and start another company and get some more money, without ever generating anything useful or profitable.

    • by stephanruby ( 542433 ) on Monday June 24, 2024 @04:43AM (#64572901)

      It's not news that ignoring robots.txt gets you access to more content on the web.

      It's also not news that this is usually not going to get you any better content.

      Even Google ignores robots.txt. They made that decision after the California DMV (Department of Motor Vehicles) blocked them with its robots.txt.

      And frankly, I can't blame Google.

      If you don't want your content to be accessed by everyone, don't put it up on the public internet.

      Badly written bots are a separate issue.

  • by awwshit ( 6214476 ) on Sunday June 23, 2024 @08:46AM (#64571001)

    Why would you expect a technical suggestion to work?

  • It's almost as if (Score:4, Insightful)

    by Rosco P. Coltrane ( 209368 ) on Sunday June 23, 2024 @09:36AM (#64571121)

    Gigantic quasi-monopolies don't respect anyone or any laws, or bother to behave with any sort of decency anymore, since they made themselves untouchable and it does nothing for their shareholders anyway.

    I don't think they even bother to pretend to show restraint anymore. Like with the AI stuff infringing copyright on an unprecedented scale: they basically just went "Yeah, that's how it goes now. You can't stop us. Suck it up." It's quite staggering.

  • If an AI is crawling the site, create one page that contains purely random text with a selection of random links, but have the page reachable via any arbitrary URL pointing to any imaginable, purely illusory subdomain.

    The AI will harvest however many pages it is set to (possibly all of them), each page diluting and corrupting the AI's neural net.

    The AI developers don't give a damn about quality, only the illusion of quality, so they will never actually stop and look. But a large enough phantom site should seriously impair AIs relying on random scans.

    You can then even advertise your phantom site to authors. All they have to do is get the site hosting them to add a few lines to the Web server config to transparently redirect AI requests to it. The AIs then plunder your nonsense pages rather than ebooks and sample chapters.
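    A minimal sketch of such a phantom site, assuming Flask: a catch-all route that answers any path with fresh random text and more links deeper into the maze (the wordlist path is an assumption):

        import random
        from flask import Flask

        app = Flask(__name__)
        WORDS = open("/usr/share/dict/words").read().split()  # assumes a system wordlist

        @app.route("/", defaults={"path": ""})
        @app.route("/<path:path>")
        def maze(path):
            para = " ".join(random.choices(WORDS, k=200))  # random filler text
            links = " ".join(
                f'<a href="/{random.choice(WORDS)}/{random.choice(WORDS)}">more</a>'
                for _ in range(10)
            )  # every link resolves to another generated page
            return f"<html><body><p>{para}</p><p>{links}</p></body></html>"

    Since every URL under the host resolves, a crawler that ignores robots.txt can wander it indefinitely.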

  • by PPH ( 736903 ) on Sunday June 23, 2024 @10:58AM (#64571331)

    .. a comprehensive and up to date list of all AI web crawlers and their IPs. Just redirect all of them to Encyclopedia Dramatica.

  • by rossz ( 67331 ) <ogre AT geekbiker DOT net> on Sunday June 23, 2024 @11:41AM (#64571463) Journal

    You can use fail2ban to block rude web scrapers. Put a hidden link into your web pages that people would not see, but bots would. Include that link in robots.txt. When anyone hits that link, fail2ban will automatically block them based on the rule you implement.
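    A hedged sketch of that setup; the trap path, filter name, and log path are all assumptions, and only the file layout follows fail2ban's standard convention:

        <!-- in every page: a link humans won't see -->
        <a href="/bot-trap/" style="display:none" rel="nofollow"></a>

        # robots.txt: well-behaved crawlers are told to stay out
        User-agent: *
        Disallow: /bot-trap/

        # /etc/fail2ban/filter.d/bot-trap.conf
        [Definition]
        failregex = ^<HOST> .*"GET /bot-trap/

        # /etc/fail2ban/jail.d/bot-trap.local
        [bot-trap]
        enabled  = true
        filter   = bot-trap
        logpath  = /var/log/nginx/access.log
        port     = http,https
        maxretry = 1
        bantime  = 86400

    Anything that follows the hidden link despite the Disallow gets banned on its first hit.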

    • by Wokan ( 14062 ) on Sunday June 23, 2024 @11:54AM (#64571517) Journal

      You beat me to the punch. I've done similar things in the past to trap bad bots. My next favorite tool was for use against email harvesters. I generated page after page of fake email addresses for them to collect. Same idea, though. Hidden link on a page, not in the robots.txt though, as the point wasn't to block them but to poison their well.

  • by ironicsky ( 569792 ) on Sunday June 23, 2024 @05:12PM (#64572141) Journal

    It's strange, given Microsoft's involvement in both LinkedIn and OpenAI, that Microsoft prevents OpenAI from accessing LinkedIn.

  • by wakeboarder ( 2695839 ) on Sunday June 23, 2024 @05:46PM (#64572177)
    A Real Intelligence (RI) doesn't have to follow robots.txt; can't AI be the same?
  • by zkiwi34 ( 974563 ) on Sunday June 23, 2024 @07:47PM (#64572377)
    But then, it just goes to show that when hoovering up screeds of cash is a possibility, well, politeness (robots.txt) and copyright mean less than nothing. Perhaps the most insulting thing is being expected to pay to use a system that has almost certainly pillaged your organisation's data.
  • by Arrogant-Bastard ( 141720 ) on Monday June 24, 2024 @05:28AM (#64572933)
    Every automated agent should honor robots.txt, and that includes not just access to content but also rate-limiting. I've observed multiple AI/LLM companies' spiders not only ignoring it for content, but also sending hundreds of requests per second. I've also observed them faking the User-Agent, rotating between different originating addresses (including those in various clouds), and constructing never-existed URLs in an attempt to find possible unlinked content.

    The first issue is that this scraping for AI/LLMs should never be done without prior, confirmed permission. We shouldn't have to "opt out": they should have to opt in, by asking for permission and only accessing sites that grant it. Unlike spidering for search engines, which returns value to sites, AI/LLM companies return nothing AND are actively working toward making the sites they scrape irrelevant and obsolete. They're parasites.

    The second issue is that the behavior of their spiders is abusive. Even WITH permission, no spider should exhibit these behaviors, since they damage the ability of the web servers to actually provide services. These attacks may be actionable under the CFAA, and I know of two companies whose attorneys are investigating that.

    The root cause of both of these is the same: the sociopaths running these companies don't care who they hurt or what they damage, as long as they can grab headlines, make grandiose (and laughable) claims, and keep the VC money flowing. They're in it for the money and power and fame, and screw everyone else: artists, writers, bloggers, all the wonderful small quirky sites that make the web useful and fun, the sysadmins who keep all this stuff working. They have the EXACT same mindset as spammers and the same fundamental lack of personal and professional ethics.

    We're going to have to go after them with (a) criminal prosecutions for violations of the CFAA, (b) civil actions for IP violations, and (c) technical countermeasures to either deny them access and/or deliberately pollute their results with garbage. Because asking them to behave like civilized, decent people clearly won't work: we're going to make them, or punish them harshly if they don't.
  • No, I won't sue. I'll file criminal charges for theft against the CEOs. They get to go to JAIL.

"I shall expect a chemical cure for psychopathic behavior by 10 A.M. tomorrow, or I'll have your guts for spaghetti." -- a comic panel by Cotham

Working...