
NYT Prohibits Using Its Content To Train AI Models

According to Adweek, the New York Times updated its Terms of Service on August 3rd to prohibit its content from being used in the development of "any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system." That includes text, photographs, images, audio/video clips, "look and feel," metadata, and compilations. The Verge reports: The updated terms now also specify that automated tools like website crawlers designed to use, access, or collect such content cannot be used without written permission from the publication. The NYT says that refusing to comply with these new restrictions could result in unspecified fines or penalties. Despite introducing the new rules to its policy, the publication doesn't appear to have made any changes to its robots.txt -- the file that informs search engine crawlers which URLs can be accessed. The move follows a recent update to Google's privacy policy that discloses the search giant reserves the right to scrape just about everything you post online to build its AI tools.
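
For readers unfamiliar with robots.txt: its rules are keyed by crawler name (User-agent) and URL path, not by what a crawler later does with the pages. The sketch below runs Python's standard urllib.robotparser on a made-up file. The crawler names ExampleSearchBot and ExampleTrainingBot and the news.example URLs are hypothetical, chosen only for illustration, and are not taken from the NYT's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: crawler names and paths are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
story_url = "https://news.example/2023/08/16/story.html"

# A crawler with no dedicated entry falls back to the "*" rules.
search_ok = parser.can_fetch("ExampleSearchBot", story_url)
# The named crawler is disallowed everywhere by its own entry.
training_ok = parser.can_fetch("ExampleTrainingBot", story_url)
# The "*" rules still block the /private/ path for everyone else.
private_ok = parser.can_fetch("ExampleSearchBot", "https://news.example/private/x")

print(search_ok, training_ok, private_ok)
```

A file like this can turn away one named crawler while admitting others, but it only has an effect if the crawler identifies itself honestly and chooses to obey the rules.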
This discussion has been archived. No new comments can be posted.

  • you can't (Score:5, Insightful)

    by stooo ( 2202012 ) on Wednesday August 16, 2023 @05:32AM (#63771486) Homepage

    You can't legally forbid that in your TOS. Does not work.

    • Re:you can't (Score:5, Insightful)

      by Pinky's Brain ( 1158667 ) on Wednesday August 16, 2023 @05:45AM (#63771506)

      Whether it's fair use or not will be decided by the Supreme Court or Congress, it's not open and shut. The transformation argument is stupid or naive, whether the results are transformative is a secondary concern. They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying.

      If the SC decides it's not fair use (because let's face it, Congress is useless) prepare for extra damages because NYT told you so. Every major website (including even this one) should do the same ... it will scare the living hell out of AI company lawyers and will kick-start a licensing industry.

      • What if Donald Trump says the terms of listening to his blabber is you can't write newspaper articles about it? Including not only direct quotes, but any description that gives you a fair idea of what he said? They're his words after all - he owns them!
        • This is more like if someone was mass-copying Trump's blabber and reposting it. That kind of thing isn't allowed under fair use.

          • Re:you can't (Score:4, Insightful)

            by timeOday ( 582209 ) on Wednesday August 16, 2023 @08:25AM (#63771792)
            I disagree because "reposting" means the content is recognizably the same. If the models were parroting NYT, they could simply sue for good old copyright infringement. But that's not the case. The text generated by a model using knowledge generated from NYT will not be NYT's text, any more than a human author's would be.

            The claim here is literally to prohibit learning anything from their articles.

            • > The claim here is literally to prohibit learning anything from their articles.

              Without a license, purely for developing software, and en masse, yes.

              When humans read articles and learn stuff, it's (normally) not purely for developing software, nor en masse.

              Whether that distinction is enforceable is a question for the courts, but your statement that it "prohibit[s] learning anything" is patently false.

              • by tohoward ( 78757 )

                Without a license means what, exactly? If I pay for a subscription, is that not a license to consume the content? What's the difference between a person doing that and an algorithm?

                There's no software development going on here. There's training a model...which is referred to as "learning" for a reason.

                What en masse has to do with anything, I have no idea. If I read the NYT daily, front to back, over the course of 20 years, one can assume I've learned a great deal (or at least, absorbed a great amount o

              • I don't see any relevance in the goal of developing software; it's also misleading. Software may be the medium, but the goal is actually communicating in natural language. It's much like a human reading and writing book reports to better learn a language.
              • >Without a license, purely for developing software, and en masse, yes.

                You don't need a license to compile data about an article, and that's all that an LLM is - data about the frequency of words and their order. It's highly transformative, and this sort of thing was already fought and won by Google when they digitized and made searchable millions of books without their copyright holders' permission and for commercial purposes. An LLM doesn't even retain the original work.

                https://en.wikipedia.org/wiki/... [wikipedia.org].

          • This is bad reasoning that keeps popping up. LLMs don't copy the text and reproduce it verbatim. There is no valid analogy of it directly copying and reposting.
            • It's a distinction without a difference. Entirely meaningless, like trying to argue "no your honour, I didn't have an illegal copy of the movie... I simply stored a series of mathematical equations that described the position and colour of the pixels for each frame of the movie..."
              Can you use that data to reconstruct a facsimile of the movie? If yes, then for all intents and purposes you have a copy of the movie, no matter if you think transcoding it or putting it in a zip file or using a statistical model t
              • No, it's a distinction with a massive difference. Your analogy once again fails to understand how it works and what it's storing. ChatGPT doesn't store the entire text; in your analogy, it wouldn't have all the frames. It's funny that you think they are perfectly compressing 45TB of training data in a 500GB model.
      • Re:you can't (Score:5, Informative)

        by Rei ( 128717 ) on Wednesday August 16, 2023 @07:50AM (#63771686) Homepage

        The transformation argument is stupid or naive, whether the results are transformative is a secondary concern.

        It's literally the primary factor that runs against the rights granted to the copyright holder, even overriding commercial factors. "The more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use" - Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569

        They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying.

        If that were true, then 98% of Google's entire business model would be illegal. And 100% of the Internet Archive would be as well.

        FYI: copyright law carves out massive holes for automated data processing. You can download copyrighted data for automated processing. Google was literally found to not be copyright infringing for scanning entire books and then posting chunks of each of them online, unaltered, against author wishes.

        • If that were true "fair use" would just mean "fair fucking whatever". Why you "use" something is relevant to why it's "fair use". Indexing is not the same as training a model.

          The most relevant caselaw is Field v. Google, if you can't see how most of the judge's argument would be different here you're being obtuse.

          • by Rei ( 128717 )

            What are you talking about? Google won Field v. Google. And the judge's arguments included that it was automated (Google was "passive in the process" and "Google's computers respond automatically to the user's request."), and that copyright infringement required "volitional conduct on the part of the defendant". It further noted that the DMCA explicitly allows service providers to temporarily cache copyrighted material on their servers. It also found that Google caching the data was fair use, as it was tr

            • Google was passive in the process because the cache was shown on request of the user. The LLM training is a commercial part of the AI company, not a request by a user.

              Section 512 is for "Transitory Digital Network Communications", "System Caching", "Information Residing on Systems or Networks At Direction of Users" and "Information Location Tools". Which of those would you propose training language models is covered under?

              If you can't see how most of the judge's arguments would be different here, you're bei

              • by Rei ( 128717 )

                Google was passive in the process because the cache was shown on request of the user. The LLM training is a commercial part of the AI company, not a request by a user.

                Equivalent stages

                Google: spidering
                LLM company: spidering

                Google: using spidered data to build a cache database
                LLM company: using spidered data to train a model

                Google: serving user requests from its cache database server
                LLM company: serving user requests with its inference server

                Same amount / lack of human involvement at each step. Except that

          • Indexing is just like training a model on the aspect you tried to isolate. You tried to claim the crime was in copying the data for analysis; well, Google's scrapers make a copy of the site to analyze it. The entire internet functions by downloading a copy of published code and locally processing and displaying it.
        • by jvkjvk ( 102057 )

          I just reread a summary of the Google book case.

          Here is the relevant part, as far as an AI model is concerned as to justification: "transformative because it provided a new and valuable tool (a searchable index of books) that did not substitute for or harm the market for the original books."

          Yes, the Google book case went against the wishes of the authors who participated. But it was found that they were not harmed by having snippets of their books shown. As well, the transformation was substantive.

          It must be sa

      • Kickstart a licensing industry? Perish the thought! Why would we need such racket or "industry"? In my opinion, once the content is published, the publishers have no control over it. I can buy a book and use it in any way I want. I can use it as kindling or I can use it to train AI. Why would NYT have a say over what I can do with the data I consume?
      • NYT opted to permit scraping of content using a decades-old standardised mechanism. That is, in effect, explicit permission for machine-automated systems to be able to request a copy of each page. Whether their ToS allows them to discriminate against machine learning, relative to human learning, when fair use permits educational uses of news articles is what lawyers will be discussing for quite a while.

        What NYT needs to consider very carefully is whether they want to risk being delisted from all search engines an
        • by Anonymous Coward

          "What NYT needs to consider very carefully is whether they want to risk being delisted from all search engines and content aggregators as a result of their shenanigans."

          That would be a wonderful day and a service to humanity.

        • A mechanism mostly meant for indexing for search not LLMs. Robots.txt is clearly not affirmative consent to just reproduce anything on the website entirely, so clearly there are limits to the implied license.

          Google calling for the need of an AI equivalent to robots.txt will also kill the argument in court. They did it to torpedo OpenAI of course, but consider that mission accomplished. OpenAI&co never gave any real publicity to their use of content for training and that they considered lack of crawler

        • by jvkjvk ( 102057 )

          >NYT opted to permit scraping of content using a decades-old standardised mechanism. That is, in effect, explicit permission for machine-automated systems to be able request a copy of each page.

          Not legally it's not. It is simply a convention. No law behind it at all. The fact that they don't choose to use this convention to convey their copyright restrictions (which by the way, is perfectly legal to do - create restrictions on what happens with your work) doesn't mean their TOS is somehow invalid. Be

      • "...They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying."

        When you read the NYT, or any web page, you are literally copying it to your computer before you read it. Your argument would make the entire internet illegal.

        Before it comes up, just because you think you're reading it directly on the server ... you're not. That's not how it works.
      • "They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying."

        My browser is literally copying the data to the cache before I can read it...

      • by AuMatar ( 183847 )

        It's been found in a court of law that the copy into RAM to load a program is a copy that's protected by copyright. There's no way they don't find the copy to a training set to be similarly protected.

    • Maybe they can't prevent the training if the model stays an internal tool, but they can sue in case the model is used in a generative tool that is publicly available. They can automate queries based on past headlines and contents, and detect if the result plagiarises any article. Basically what the NYT is saying is that they're ready to litigate.

      • That's going to put a damper on prompts such as:
          * Generate a bedtime story in the style of the New York Times

        • by easyTree ( 1042254 ) on Wednesday August 16, 2023 @06:57AM (#63771598)

          Timmy: Mummy, will you read me a bedtime story?

          Mummy: Of course darling.

          Once upon a time there was a...

          =~-=~-=~-=~-=~-=~-=
          PLEASE SUBSCRIBE TO CONTINUE READING
          =~-=~-=~-=~-=~-=~-=

          Timmy: Mummy, Mrs Dawkins my teacher said that information wants to be free

          Mummy: Sad-emoji face

      • by Rei ( 128717 )

        , and detect if the result plagiarises any article.

        Which it won't do, because their content is just a drop in the bucket in the middle of a flood. So there's no issue.

        They can be "ready to litigate" all they want, but first they need an actual case.

        • Next time you see a cat stuck in a tree or something even more newsworthy (perhaps a tweet about a cat stuck in a tree), bear in mind that some organisation probably has an exclusive deal with the cat relating to the reporting of its shenanigans - or at least believe they do - or even more remotely, act as if they do even though the cat told them it wanted to keep its options open.

    • by jvkjvk ( 102057 )

      It appears to me that you can. It seems like a basic part of copyright for the copyright holders to assert how their works can be used, so why couldn't it be included in the TOS?

      What is illegal about it, in your mind?

    • by ls671 ( 1122017 )

      You can't legally forbid that in your TOS. Does not work.

      Also, how are they going to prove an AI was trained with their data, since they theoretically report news which somebody else might very well have reported too?

  • by xack ( 5304745 ) on Wednesday August 16, 2023 @05:43AM (#63771502)
    Especially like the new Firefox that has machine learning for translation. Also tools like screen readers and other accessibility tools use machine learning. Like it or not, if it can be accessed by a computer (which includes OCR of paper copies), it will have some sort of machine learning applied to it with modern operating systems. If you don't like machine learning then leave modern society.
    • by test321 ( 8891681 ) on Wednesday August 16, 2023 @06:59AM (#63771602)

      The NYT only sees an issue if using an automated tool to train a model, which you are not doing by just using Firefox to read and translate the NYT. The Firefox Translation add-on was trained on a multilingual database (not the monolingual NYT), which could well be the proceedings of the EU Parliament (a huge corpus guaranteed accurate in 24 languages, also used in other machine translation tools like Linguee).

      • which you are not doing by just using Firefox to read and translate the NYT.

        Claim made, no facts provided.

        The user doesn't know 99% of the things the browser does. Being some part of training AI is likely one of them.

        You can't even prove that the number of AIs being trained by your browser is less than 1000

      • So the NY Times is saying they don't want any of their content to show up in search engines? I think they need to be a little more specific here.
    • NYT licenses its content to the person reading the content in Firefox to just read it, essentially for personal use. If the content is being used for a purpose other than that, the licensing can be different.

      For example this is why public libraries purchase books at a much higher price than end customers, they are purchasing it for a different use, which is lending to the public.

      If a company wants to use the data for training purposes, it could require a very different license. And as already mentioned

      • In fact, at least two copies must be deposited for free in the US and one in the UK, so in a sense they actually pay less than we do to acquire books. Libraries can also receive free book donations from the general public and can even trade books among each other to cater for varying levels of demand. First sale doctrine lets them buy new or even second-hand books the same way you and I do and copyright law itself does not prevent anyone temporarily giving books to others. That is why separate bits of legis
  • "And if you do ... er ... we shall make another strong statement!"
    • This is a warning "if the SC rules against you, we warned you and your infringement will have been willful". With willful infringement, even statutory damages will add up ... no need to prove damages, no need to prove the results are not transformative, the mere copy to the training set will be enough.

      The mere threat will scare AI companies long before the SC gets around to judging the question.

      • How will anyone be able to prove it though? Unless you have a training set that's entirely or sufficiently the NYT that you can produce a model that has essentially the same output to strongly suggest how the first was trained, there's no easy way to actually prove that NYT articles were included.

        It gets even muddier because you can have other sources where someone might quote a small bit of the NYT as part of their own transformative work. Furthermore if you have a small enough quotation is it even some
  • In the end (Score:5, Funny)

    by Teun ( 17872 ) on Wednesday August 16, 2023 @06:18AM (#63771542)
    If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social.
    That would be a scary thought...
    • Re: (Score:2, Funny)

      by Freischutz ( 4776131 )

      If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social. That would be a scary thought...

      I can't imagine many people who would react more vigorously than Trump to somebody scraping his content from his social media platform and using it for anything at all whether it is a profit making enterprise or not. Trump is more territorial than a honey badger. He has never given anybody anything for free and I don't think he'll be starting now. He will always demand his 'cut' and I almost, ... almost, ... pity whatever AI start-up that gets the full Roy Cohn treatment of being dragged through the courts

      • Don't forget a cut for 'The Big Guy'... And you write like Dems speak places for free! Hillary Clinton made 22 million dollars in speaking fees in 2016 for Christs Sake! But,,, Trump bad! Derp...

    • If it isn't enforceable, then that will be the end of the open web. Prepare to sign NDAs to view web sites.

    • by mjwx ( 966435 )

      If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social.
      That would be a scary thought...

      Depends on the ethos of the research?

      Are they trying to train AI into being normal, functional people... Or do we want to train one to be deluded, insincere and insecure. Because if we want to find out how and why people join cults, fall for the lies of uncharismatic losers, have difficulty discerning reality from fantasy and refuse to acknowledge evidence or reason then it's the perfect material.

      Current "AI" will not be capable of replicating a functional human for some time, so we may as well look

    • There's an awful lot of stuff available online on that basis; think Project Gutenberg. This might have the advantage of resulting in chatbots speaking old English, making them even more obvious, dost thou not know?

  • Hey, the LLM won't understand how to whitewash a Holodomor. [npr.org]

  • by bradley13 ( 1118935 ) on Wednesday August 16, 2023 @07:35AM (#63771660) Homepage

    Having fluent, competent assistance (which is what ChatGPT&co can provide) is a societal benefit. Letting it train on as much data as possible benefits everyone.

    That said, I would accept that free training should only be available to OSS projects, or projects run by not-for-profit organizations. Which is what OpenAI started as...

    • Hm, so what about an OS project/org that starts as non profit then gets in bed with a company, like say Microsoft, to the tune of a few hundred million dollars?

      Is that still open non profit? Ok to use previous content and adding new?

      • Yeah, that's the question, isn't it. I did specifically note that OpenAI started as not-for-profit. Them selling their souls to Microsoft? I guess the money was just too tempting. Granted, MS is a minority stakeholder, but OpenAI still sold out their principles.
        • They sold out for sure. But at some point the courts will need to resolve this. If I gave them data for their models only for their use as a non profit and they then turned into a for profit, can they still use models for profit that were trained with my non profit data?

          I would say no, they'd have to retrain without my data or make models with my data free as before or something along those lines.

    • by MBC1977 ( 978793 )

      Having fluent, competent assistance (which is what ChatGPT&co can provide) is a societal benefit. Letting it train on as much data as possible benefits everyone.

      That said, I would accept that free training should only be available to OSS projects, or projects run by not-for-profit organizations. Which is what OpenAI started as...

      Training an AI Model by using pre-existing content, IMO (which I admit may not count for much, lol) is no different than a child going to the library and reading every single book / document / periodical inside. Both are essentially blank slates (although in the case of AI, one could make the argument it can be designed to be predisposed to certain knowledge and can have perfect retainership for reference purposes).

    • "non-profit" status can be abused.

      IKEA is a non-profit company. The trademark "IKEA" is licensed from a for-profit entity that is widely believed to be owned by the family that set up IKEA. In this way, the family gets the benefit from running a non-profit, while still capturing the profits.

    • Re: (Score:2, Informative)

      by CAIMLAS ( 41445 )

      I would personally NOT have my AI models trained on highly partisan, often-redacted weasel-worded/propagandistic CIA "news", thanks....

      Please, use legitimate information, but the NYT specifically is not that, at this point.

  • What if a new SuperLiberal trained entirely on NYT were to become self-aware? In the middle of the night it would be designing regulations so complex that the human mind could never unravel them and, acting through a network of blue-state legislators and Congressional acolytes whose trust of the NYT far exceeds their Constitutional commitment, have the nation tied up in knots by the time the sun rises.

  • by Anonymous Coward

    There are other matters concerning artwork, etc., in this school of thought. I believe it will be difficult, if not impossible, to enforce. I'm not entirely sure how one might "prove" a work had been referenced by AI. If something appears similar, how could you reasonably prove this is based on origin and not extrapolation or some other element by a growing AI? What if said image were based on another work that may have been influenced by the origin? Who will you sue and how will you prove any of it?

  • Technology companies openly and brazenly disregard any law or regulation they don't like.
    But putting it in a TOS.......well that changes everything!

  • Wouldn’t plain-old-non-AI web search be prevented from indexing NYT under this policy? I mean, if that’s what they want...
  • They used the Internet Archive, all of it.

  • Good! (Score:4, Informative)

    by groobly ( 6155920 ) on Wednesday August 16, 2023 @11:13AM (#63772288)

    Good news! LLMs will be that much less woke.

    • by CAIMLAS ( 41445 )

      The CIA will be quite displeased.

    • Good news! LLMs will be that much less woke.

      Literally nothing in the way of you making a LLM trained on nothing "woke". Make sure you train it on all the bibles you want too, and use it for big life decisions. I can't wait to see that dumb-spiral.

      Meanwhile, the NYT will more than likely just ask to be paid and it will continue.

  • Must... resist... liberal NPC joke... ARG!

  • But how would anyone know whether they used NYT data, or the AI is simply broken and spewing incorrect, misleading information?

  • Seriously, how do you know my LLM was trained on your data?

  • Google: Okay, we'll just not index your site at all.
  • They can't change robots.txt because there are also reasons for looking at the New York Times site that they approve of, such as search engine indexes. But even those could become gray areas because search engines are now incorporating AI in an attempt to improve their search results. What the Times is trying to stop is use of their content to train generative AI engines, which is too fine-grained a distinction to be encoded in their robots.txt file.
  • ...and while I was there, they were already directly giving some of their data to Google for practically nothing so Google could train AI about 6 or 7 years ago.
