
Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm (yahoo.com)

Multiple AI companies are ignoring Robots.txt files meant to block the scraping of web content for generative AI systems, reports Reuters — citing a warning sent to publishers by content licensing startup TollBit. TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers' websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content... It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate "numerous" AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.

"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."


The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers). "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry."

Reuters also notes another threat facing news sites: Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
  • Yah ... (Score:2, Troll)

    If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.

    I'm sure this is nothing more than an unfortunate coincidence that Google will fix .... eventually ... rest assured, we're working on it ... any day now ...

    • Naw, they're just effectively 'raising the price' on being indexed in google. For the low fee of 'not a damn penny' + allowing google to train an AI on the summary of your article, you too can have millions of users find your site when they go to search google.
    • by Anonymous Coward

      Multiple AI Companies Ignore Robots.Txt Files

      I would be shocked if there is anyone who DOESN'T ignore the Robots.Txt file

      • It gets hits on my webserver. It's clearly not being ignored by everyone.

        • by Entrope ( 68843 )

          There's an important difference between retrieving that file and complying with its contents.

          • True, but it's technically not being "ignored" if it's at least being fetched, even if it's not being actually obeyed...

            • by ls671 ( 1122017 )

They might ask for robots.txt precisely to find the interesting stuff you're trying to hide. It would be trivial for you to test this instead of just noting that they request the file. Who knows? Maybe I could test it myself on my own web servers, then write a clickbait article, then profit!

    • by Luckyo ( 1726890 )

      Why would they change this? Appearing on google search results is a privilege, not a right.

      And that's why Google needs to be regulated as a monopoly.

    • by 2TecTom ( 311314 )

      corruption and classism, it's inevitable that the greedy will cross every boundary because greed is insatiable

      classism is the real problem which creates the corruption destroying our society

  • Just a contract (Score:5, Insightful)

    by devslash0 ( 4203435 ) on Sunday June 23, 2024 @06:53AM (#64570727)

robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact all companies) go for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.

    • Re:Just a contract (Score:5, Informative)

      by buck-yar ( 164658 ) on Sunday June 23, 2024 @06:56AM (#64570741)
      From Google:

      A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.

      https://developers.google.com/... [google.com]
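Google's description maps directly onto Python's stdlib `urllib.robotparser`. A minimal sketch — the rules and bot name here are made up for illustration, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration -- not any real site's policy.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch() answers the question a compliant crawler must ask first:
print(rp.can_fetch("MyBot", "https://example.com/page.html"))          # True
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
```

Note that nothing enforces the answer — the crawler is free to ignore it, which is exactly the complaint in the summary.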

Yes, this was decided after the California DMV used its robots.txt to keep its site from being indexed.

Google's reasoning was that it would become a pretty bad search engine if it couldn't index the DMV site.

    • robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact all companies) go for profit and hardly ever show any good will, why would you expect them to obey by rules outlined in robots.txt? If you want access control over your content, implement actual access control.

      robots.txt is just a passive request, really. And the AI companies are apparently declining that request.

      Personally, I think they should respect it, but yeah, if you are relying on everybody respecting robots.txt, then you are smoking something ...

    • I think the lawyers would *love* to be able to tell the judge that an AI company is deliberately *not* showing goodwill when they unreasonably scrape the data from other people's servers. Deliberately ignoring ROBOTS.TXT is a nice way of proving the lack of goodwill. Should at least double the monetary damages.
    • robots.txt is just a good-will contract between the client and a web server.

robots.txt is, in effect, a machine-readable, parsable copyright notice. So if an AI company scrapes pages in contravention of what is said in a robots.txt, then it should be liable to be sued for breach of copyright. They cannot pretend that because they did not read it they may ignore it; imagine what a judge would say if you reproduced a book or piece of music but said that you had not read the book's copyright notice or the CD's.

      But AI companies are rich with expensive lawyers and will fight test

      • by allo ( 1728082 )

        A copyright notice tells who's got the copyright. The robots.txt doesn't talk about copyrights or even author names at all.

A copyright, a license agreement, and an attorney. Not that that's practical for an individual, with the exception of those in a jurisdiction with a small-claims court. But even a moderately sized business should be able to enforce its rights in civil court.

  • One way to protest this is to add instructions or other content for only the AI on your site. This could be used for commercial purposes too, if the return message can be modified.

  • Look around at most of the people in the world. Why would anyone expect anyone else to voluntarily follow the robots.txt?
    • Exactly. Well, supposedly a "good" company like Google was 20 years ago (ah, for the days of Do No Evil...) I would expect them to honor it but anyone with a "bad" intent would probably use it as a shopping list of where to start indexing.

      Otherwise, it needs to be dealt with in config or code (require authentication of some kind, only allow internal LAN/VPN connections, etc)

  • Killing the deal (Score:4, Interesting)

    by TheNameOfNick ( 7286618 ) on Sunday June 23, 2024 @07:22AM (#64570807)

    The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors. If you scrape the web and "summarize" the content and nobody ever visits the web site, you don't uphold your end of the deal and the deal will end. And if copyright legislation doesn't come down on that behavior, the open web will cease to be. Nobody will get access to anything that isn't advertising or propaganda in itself without signing a contract that excludes any non-personal use of the content, even summarizing and other "fair uses".

    • by DewDude ( 537374 ) on Sunday June 23, 2024 @07:52AM (#64570873) Homepage

That worked...for a while. The problem is the ad networks, being the capitalists they are, took the "neutral" approach of "whatever you pay for". This resulted in legitimate businesses being used for the illicit distribution of malware.

      The next problem is that ad networks did nothing. They knew they were serving malicious ads; they knew they were selling to bad actors; but they knew they had legal protection and continued to willingly sell malicious adspace under the guise of "we're too big to check".

      So now come ad-blockers. It was one thing when they were just annoying; but it's another when there's actual risk of getting hacked. It didn't go over well when the local newspaper infected 2500 local readers from a bad ad. Did they blame the ad? The paper did. Know who the readers blamed? They blamed the newspaper. "You should have taken more responsibility," is what they screamed as they were canceling subscriptions. The same for a local TV station's website when their ad network was serving malicious ads. They could point the finger all they wanted...but everyone was pointing it at the station.

      That's the other problem; no one places the blame where the blame should be placed. Rather than blame the adnetworks with no morals; they blame the website operators.

      So now we have the ad-blocker wars; and to combat that...more anti-adblock stuff.

      The fact is...ad revenue isn't enough anymore. The lack of privacy laws and no oversight on any of this has meant the biggest export is American user data; sold by American companies, to the highest bidder. They don't care about us...we're just a product to profit off of.

Facebook has a version of that problem now. Every single "Sponsored" article leads straight to malware. Reporting it to Facebook as fraud, which these pages are, gets a response of "we see no violation of our guidelines".

        • by DewDude ( 537374 )

          Yeah...when they're being paid to display it it's never a guideline and there's no concern for users.

          I'm waiting for someone to finally get a judge to start using the elimination of rules and start holding them responsible in civil court. They don't have immunity in civil court anymore over that; it was killed so they could arrest backpage guys.

This. The entire ad-driven internet business model is inevitably driven to maximum exploitation and providing the minimum service required to keep people engaged. Just due to simple economic and mathematical considerations. You want something that's more focused on the user? You PAY A SUBSCRIPTION. That makes you the customer, not the cattle, and the entire experience flips to something MUCH nicer.
      • The websites are a proxy for the malware. They signed a contract with someone serving malware to their innocent users. They are responsible for delivering malware. If they had used a legit ad company that filters out shitty malware ads the users would not have been impacted.

        Here's who gets to blame who:
        Web site users get to blame the web site
        Web site has to take that responsibility for infecting their users while also getting to blame the ad network
        The ad network has to take that responsibility while bla

    • by Luckyo ( 1726890 )

When was that the deal, outside of your imagination? Even Google itself clearly states in writing that this is not the deal, in the opening lines of its description of what robots.txt does:

      https://developers.google.com/... [google.com]

      "A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or passw

      • Read it once again ;)

        The comment isn't about google and robots.txt, it is about "AI" outfits scraping content and then offering it as their own "AI" creation.

        Or at least it appears so to me.

        • by Luckyo ( 1726890 )

          The complaint raised in the topic is specifically about robots.txt:

          >Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm

      • The robots exclusion standard (aka robots.txt) is a red herring [wikipedia.org]. It is a gentlemen's agreement [wikipedia.org], originally intended to indicate to crawlers which parts of a web site are unsuitable for them, where crawling would produce nothing but useless burden for the server and the crawler alike. For example, you wouldn't want a crawler to keep requesting URLs from a procedurally generated infinite tree of documents. Whether it has expanded beyond that and can be legally binding under certain circumstances is left as an

        • by Luckyo ( 1726890 )

          >an implicit understanding

          Isn't the entire argument here that there is in fact no such thing because content producers specifically assert that such agreement doesn't exist and AI companies and google are doing things they don't want google to do, but that they have no desire to actually prevent by gating their content from the public?

          • We are in a time of transition. Obviously most web sites do not want to drop out of Google right now, because that still is a sizeable amount of their traffic. However, once people understand "googling" to mean asking a chatbot and getting answers not from the web sites but immediately from the bot, the traffic will dry up and exposing the content to Google and other AI companies won't benefit web sites anymore, and that's when web sites will not only drop from the search results pages, that nobody looks at

            • by Luckyo ( 1726890 )

              I strongly disagree on emotionally loaded, objectively and factually incorrect wording such as "content being stolen" when all that's being done is learning from data that certain people chose to share with the public.

              But the rest of the analysis is mostly in line with my thinking.

              So I suspect the sole point of differential we have is that you see it a moral good to call any person learning from content of another person a thief (someone who steals) as you do in the post above. Whereas I look at entire huma

              • You're missing the point, because it doesn't matter at all whether you think "stolen" is the right word. Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that. If you agree to a contract that forbids you from training an AI model (and all other uses that aren't purely personal), because you won't be able to access anything w

                • by Luckyo ( 1726890 )

                  Emperor has no clothes, and public has every right to see that emperor has no clothes, and learn whatever lessons it chooses to learn from it.

                  Only a tyrannical emperor makes people avert their eyes and block learning from that public display he himself chose to put on.

                  Once the emperor chose to put himself up on display in whatever way he chose to do it, no further permits from the emperor are necessary for looking at him, and learning from his visage. And any demands for such permits to be required are so e

                  • Greedy idiots ruining things for everyone by shirking conventions is the oldest tale in the book.

                    • by Luckyo ( 1726890 )

                      Indeed. They should stop pretending that they have a right to dictate if people can look at and learn from with content they themselves made public.

                    • You're an LLM, aren't you?

                    • by Luckyo ( 1726890 )

                      It is the current popular way to run away from argument you lost among the terminally online, isn't it?

                      Before it was "you're a bot", and before that "you're a nazi". My reaction remains the same.

                      Run away little girl, run away!

                    • It's a plausible explanation why you keep forgetting all context. You may just be an idiot though.

                    • by Luckyo ( 1726890 )

                      Projection on your part is very real.

                • >Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that.

                  Then: Don't. Put. Shit. On. The. Free. Web.

                  Period.

                  It's not rocket surgery here.

                  You aren't allowed to make a piece of artwork, put it publicly on a billboard in the middle of town, and then say Bob, Jerry, and Sue aren't allowed to look at it, but everyone else in the w

                  • Let's ignore for a moment that you completely ignored the rest of the discussion and naturally missed the point, that letting AI companies get away with delivering content that others created will result in the end of the openly accessible web for everyone, not just these companies, because these parasites and their shills don't take no for an answer. But even with that caveat, you're still wrong. The web isn't a billboard that anyone can look at. It's servers delivering content to individual clients, and y

                    • No you can't decide to host on a non-login site and pick and choose who can view the content. That's the whole damn reason some sites hide forums and user content behind logins.

                      Well, you CAN, if you know the particular IP addresses you won't respond to. Or you can, I don't know, Implement a login scheme with terms of service.

                      The "problem" with a login scheme is.... you don't get a free lunch. You don't even get random visitors to ATTEMPT to try to serve ads to. Most of them will see your jank assed attempt

  • Ignoring the basic fundamental rules we all use to keep us from poking each other in the eyes on the playground can get you in trouble. I use Robots.txt to hide junk content on a few dozen websites, including fake username and passwords, fake company names and addresses, and collections of images designed to make hackers question reality. I apologize in advance for using any names and passwords that may be real, like Howard T Duck \ pJV@%mzD*2. Some of the content I create makes me question reality. If
  • Take a page out of Nintendo's book; you lawyer up and file a C&D, DMCA, and everything you can for every page they scrape. If they are ignoring your intellectual property rights and policies; then it's unauthorized access. I mean the FBI literally just listed SABnzbd as a pirate site...clearly the standard for infringement is low if the FBI isn't even making correct arguments in court.

    • What the community needs is a way to easily and automatically do the C&D + DMCA submissions.

      Large sites like Youtube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.

      Right now small community sites don't have the expertise or the manpower to manually check access logs and trace where the spiders are coming from, or find the contact details to send C&D and DMCA claims. So they do nothing, and that's

It's extra work to pull and read those files, and doing so slows the crawler even if the rules are ignored. It's far simpler to just skip them and build up your "metrics" for the amount of material you've scanned, even when the robots.txt warns you that the content isn't reliable or even stable.

    • Like I said above, it gets hits on my webserver. Not everyone ignores them.

      • As a list of target URLs to scrape?

        Do the files your robots.txt protect ever get grabbed?

Keep in mind that the primary purpose of robots.txt is to provide a list of primary URLs to crawl, as a shortcut for the crawler to get to the stuff that's relevant to index. Yes, it can also be used to advise on what not to fetch, but ostensibly it's in an ethical web spider's best interests to parse and obey this file, as it will save them time and omit unnecessary chaff from the indexed data.

(Full disclosure: there's actually nothing on my website but the robots.txt and the index.html, so I haven't actually tested spider obedience figures. I just know it's actually being downloaded, and frequently by stuff with words like "crawler" or "spider" in the agent string.)

Occam's razor would suggest that these companies simply never thought to look for or use robots.txt. It is designed to inform web crawlers what to index for search engines, and I feel there's a good chance these companies never thought to leverage it, or didn't feel it was applicable to what they were doing. They should have, of course, but I feel there is some wiggle room there to give them the benefit of the doubt in this case.

    Not to mention at the end of the day, this is a text file anyone can ignore and

  • Yawn (Score:5, Interesting)

    by nicolaiplum ( 169077 ) on Sunday June 23, 2024 @08:32AM (#64570965)

    Remember back in the late 2000s when companies were all about "Reinventing Search" (of the WWW)? It turned out most of them were trying to get juicier results than Google by ignoring robots.txt so they were not actually better and did irritate a lot of people when they ended up recursing indefinitely down programmatically-generated websites whose robots.txt specifically said "don't go here".

    It's not news that ignoring robots.txt gets you access to more content on the web. It's also not news that this is usually not going to get you any better content.

Yet another bunch of tech bros are deciding they can succeed by ignoring all the rules, laws, social conventions, and lessons of the past because they're the superior, innovative people. Instead they will just burn money until they run out, then go around and start another company and get some more money without ever generating anything useful or profitable.

    • It's not news that ignoring robots.txt gets you access to more content on the web.

      It's also not news that this is usually not going to get you any better content.

Even Google ignores the robots.txt. They made that decision after the California DMV (Department of Motor Vehicles) blocked them with their robots.txt.

      And frankly, I can't blame Google.

      If you don't want your content to be accessed by everyone, don't put it up on the public internet.

      Badly written bots are a separate issue.

  • by awwshit ( 6214476 ) on Sunday June 23, 2024 @08:46AM (#64571001)

    Why would you expect a technical suggestion to work?

  • It's almost as is (Score:4, Insightful)

    by Rosco P. Coltrane ( 209368 ) on Sunday June 23, 2024 @09:36AM (#64571121)

    Gigantic quasi-monopolies don't respect anyone or any laws, or bother to behave with any sort of decency anymore, since they made themselves untouchable and it does nothing for their shareholders anyway.

    I don't think they even bother to pretend to show restraint anymore. Like with the AI stuff infringing copyright on an unprecedented scale: they basically just went "Yeah, that's how it goes now. You can't stop us. Suck it up." It's quite staggering.

  • If an AI is crawling the site, create one page that contains purely random text with a selection of random links, but have the page reachable via any arbitrary URL pointing to any imaginable purely illusiary subdomain.

    The AI will harvest however many pages it is set to (possibly all of them), each page diluting and corrupting the AI's neural net.

    The AI developers don't give a damn about quality, only the illusion of quality, so will never actually stop and look. But a large enough phantom site should seriou
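The "phantom site" idea above can be sketched with nothing but the Python standard library. Everything here — the word list, the link scheme, the port — is hypothetical illustration, not a hardened tarpit:

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "quantum", "widget", "synergy", "gazebo"]

def fake_page(path: str) -> str:
    """Procedurally generate a page of filler text plus links deeper in."""
    rng = random.Random(path)  # seed on the path: same URL -> same page, so it looks stable
    text = " ".join(rng.choice(WORDS) for _ in range(200))
    links = "".join(
        f'<a href="{path.rstrip("/")}/{rng.randrange(10**6)}">more</a> '
        for _ in range(5)
    )
    return f"<html><body><p>{text}</p>{links}</body></html>"

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer *every* URL with generated filler, spawning more links each time.
        body = fake_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), Tarpit).serve_forever()
```

Since every page links to five more nonexistent pages, a crawler that ignores robots.txt can descend forever; a compliant one never enters.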

.. a comprehensive and up-to-date list of all AI web crawlers and their IPs. Just redirect all of them to Encyclopedia Dramatica.

  • by rossz ( 67331 ) <ogre@NosPAm.geekbiker.net> on Sunday June 23, 2024 @11:41AM (#64571463) Journal

    You can use fail2ban to block rude web scrapers. Put a hidden link into your web pages that people would not see, but bots would. Include that link in robots.txt. When anyone hits that link, fail2ban will automatically block them based on the rule you implement.
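The honeypot idea is roughly: robots.txt disallows a trap path, a hidden link points at it, and any client that fetches it anyway gets its IP banned. A minimal Python sketch of the log-scanning half — the `/trap/` path and the common-log-format lines are assumptions, and in practice fail2ban's own filter regexes do this job:

```python
import re

TRAP = "/trap/"  # assumed honeypot path, listed as Disallow in robots.txt
# Common-log-format line: ip - - [date] "GET /path HTTP/1.1" status size
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)')

def ips_to_ban(log_lines):
    """Return the set of client IPs that requested anything under TRAP."""
    banned = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(TRAP):
            banned.add(m.group(1))
    return banned

logs = [
    '203.0.113.7 - - [23/Jun/2024:08:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '198.51.100.9 - - [23/Jun/2024:08:00:01 +0000] "GET /trap/secret HTTP/1.1" 200 128',
]
print(ips_to_ban(logs))  # {'198.51.100.9'}
```

A compliant crawler never requests `/trap/`, so only robots.txt violators trip the ban.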

    • by Wokan ( 14062 )

      You beat me to the punch. I've done similar things in the past to trap bad bots. My next favorite tool was for use against email harvesters. I generated page after page of fake email addresses for them to collect. Same idea, though. Hidden link on a page, not in the robots.txt though, as the point wasn't to block them but to poison their well.

It's strange that, given Microsoft's involvement in both LinkedIn and OpenAI, Microsoft prevents OpenAI from accessing LinkedIn.

A Real Intelligence (RI) doesn't have to follow robots.txt; can't AI be the same?
  • But then, it just goes to show that when hoovering up screeds of cash is a possibility, well, politeness (robots.txt) and copyright mean less than nothing. Perhaps the most insulting thing is being expected to pay to use a system that has almost certainly pillaged your organisation's data.
  • Every automated agent should honor robots.txt, and that includes not just access to content, but rate-limiting. I've observed multiple AI/LLM companies' spiders not only ignoring it for content, but also sending hundreds of requests per second. I've also observed them faking the User Agent, rotating between different originated addresses (including those in various clouds), and constructing never-existed URLs in an attempt to find possible unlinked content. The first issue is that this scraping for AI/LLMs
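The behavior being asked for here — honoring both Disallow rules and a rate limit — is a few lines with Python's stdlib `urllib.robotparser`; the bot name, rules, and URLs below are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser

AGENT = "PoliteBot"  # placeholder user-agent

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
])

def polite_fetch(urls):
    delay = rp.crawl_delay(AGENT) or 0
    fetched = []
    for url in urls:
        if not rp.can_fetch(AGENT, url):
            continue  # honor Disallow instead of scraping anyway
        fetched.append(url)   # a real crawler would issue the HTTP GET here
        time.sleep(delay)     # honor Crawl-delay: no hundreds of requests/sec
    return fetched

print(polite_fetch([
    "https://example.com/a",
    "https://example.com/private/b",
]))  # ['https://example.com/a']
```

That this is trivial to implement is rather the point: the misbehavior described above is a choice, not a technical limitation.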
  • No, I won't sue. I'll file criminal charges for theft against the CEOs. They get to go to JAIL.
