AI News Technology

OpenAI Claims NYT Tricked ChatGPT Into Copying Its Articles 166

Emilia David reports via The Verge: OpenAI has publicly responded to a copyright lawsuit by The New York Times, calling the case "without merit" and saying it still hoped for a partnership with the media outlet. In a blog post, OpenAI said the Times "is not telling the full story." It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles. "Even when using such prompts, our models don't typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts," OpenAI said.

OpenAI claims it's attempted to reduce regurgitation from its large language models and that the Times refused to share examples of this reproduction before filing the lawsuit. It said the verbatim examples "appear to be from year-old articles that have proliferated on multiple third-party websites." The company did admit that it took down a ChatGPT feature, called Browse, that unintentionally reproduced content. However, the company maintained its long-standing position that in order for AI models to learn and solve new problems, they need access to "the enormous aggregate of human knowledge." It reiterated that while it respects the legal right to own copyrighted works -- and has offered opt-outs to training data inclusion -- it believes training AI models with data from the internet falls under fair use rules that allow for repurposing copyrighted works. The company announced website owners could start blocking its web crawlers from accessing their data in August 2023, nearly a year after it launched ChatGPT.
OpenAI still hopes to form a "constructive partnership with The New York Times and respect its long history," the company said.
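The crawler opt-out mentioned above works through the standard robots.txt mechanism; a site that wants to keep its content out of training can add rules like the following sketch (GPTBot is the user agent OpenAI documented for its training crawler; the example assumes the site blocks it entirely):

```text
# robots.txt -- disallow OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /
```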

Last month, OpenAI struck an unprecedented deal with Politico parent company Axel Springer, allowing ChatGPT to summarize news stories from Politico and Business Insider.
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • OpenAI shouldn't bring humanity into this, when your main goal is profit.
  • by StormReaver ( 59959 ) on Monday January 08, 2024 @08:45PM (#64142645)

    The linguistic gymnastics employed here is astounding. It makes the Wookie defense seem almost plausible.

    • Re: (Score:2, Interesting)

      by AmiMoJo ( 196126 )

      It's been independently reproduced too.

Image generators are just as bad. Someone asked Midjourney for "man in robes with laser sword" and out popped Luke Skywalker.

Image generators are just as bad. Someone asked Midjourney for "man in robes with laser sword" and out popped Luke Skywalker.

        Do you have a citation or link to the image? Was Luke Skywalker barefoot?

      • Impossible, every treckie knows Worfenstein's favourite pocket knife is called a "light saber", not "laser sword".

      • by Rei ( 128717 ) on Tuesday January 09, 2024 @08:11AM (#64143601) Homepage

        Searching for this reference only turns up your Slashdot post [google.com].

        That said, given that Luke Skywalker is everywhere on the internet, why shouldn't it know what he looks like? What sort of gating mechanism are you proposing that says "learn some things that are everywhere verbatim" (flags, the Mona Lisa, etc) but not others (Luke Skywalker, in this case), and what about, say, corporate logos, should it know them? Furthermore, at what point will you aim your fire at the user for deliberately attempting to misuse a tool to violate copyright? You're talking like the AI is a person. The law disagrees that AIs are people. The user is a person. And if they're deliberately trying to use the tool to violate copyright, why isn't that on them? One can draw Snoopy in Photoshop in just a couple minutes - is that Adobe's violation? Yeah, AI tools are faster and better, regardless of what the user is trying to do - does that somehow change the equation of who is deliberately trying to violate copyright in this case?

        • Maybe because of the commercial use? In your examples, the user isn't pursuing a commercial application. It isn't against the law for one to draw Snoopy, it is against the law to sell the drawing or claim originality. Something like that anyway. A lawyer could probably make a better contribution than me on this point.
          • by jp10558 ( 748604 )

I'm not a lawyer, but I'm pretty sure it actually is against US copyright law to draw Snoopy. How do you think people violated copyrights before computers?

Copyright was created as a response to the printing press. So yes, copyright existed before computers, but only automated, large-scale, easy copying was enough of a threat to prompt a law stopping such copies.

              Before that, people used to spend months creating a single copy of a single book.

        • by jp10558 ( 748604 )

It's precedent that Napster - a tool people used to violate copyright (while the software was never implied to be a person) - was held responsible for the copyright violations due to some... well, I'd call it interesting logic. So I don't think this is as much of a slam dunk as you think it is.

    • by znrt ( 2424692 )

      the last paragraph of openai's statement is particularly telling.

nyt will strike a deal, once they come down off their high horse and after a few antics to cover up the sleazy trolling that this lawsuit was.

      • You mean this sentence:

        the [OpenAI] company maintained its long-standing position that in order for AI models to learn and solve new problems, they need access to "the enormous aggregate of human knowledge."

        Sounds like a confession to me.

        It is also a stupid statement, "AI models" don't "solve new problems", they regurgitate old data into patterns that "look" the same to the model.

        • by Rei ( 128717 )

          Good AI models outscore the vast majority of humans in creativity tasks, and they're used every day to solve "novel" problems, particularly in programming.

And FYI, YOU are a predictive engine. That's literally how human learning works. Your brain makes constant predictions about what your senses will experience; the error between the prediction and the reality then propagates backwards through the network, including through the deeper, more abstract / conceptual layers.

          You cannot make good predictions

          • Good AI models outscore the vast majority of humans in creativity tasks,

            Bullshit. AI models regurgitate the human creativity that was fed into them.

            and they're used every day to solve "novel" problems, particularly in programming

            LOL, aptly placed quote marks. "AI" hasn't done anything that isn't a regurgitation of fed data.

            That's literally how human learning works.

            How would you know :)

            AIs perform well because they have a good conceptual model of the world underlying what they're predicting.

Quite the opposite, "AI" is a matrix of coefficients that compute a value by multiplying them against some input data. It doesn't understand what a conceptual model is any more than you understand what "AI" is.

      • If NYT gets a big enough pound of flesh to satisfy them, the next class action from the rest of the world will butcher them.

The Supreme Court saying it's fair use is their only hope; otherwise it's all over.

    • by quantaman ( 517394 ) on Tuesday January 09, 2024 @12:54AM (#64143141)

      The linguistic gymnastics employed here is astounding. It makes the Wookie defense seem almost plausible.

      I find their response fairly straightforward.

      There were two ways ChatGPT could reproduce articles:

1) You could ask it to scrape the URL and reproduce the article in real time. OpenAI took down this feature fairly quickly when the copyright issue became apparent, so it's probably not a big deal.

      2) ChatGPT could regurgitate the memorized article.

      #2 is the interesting claim since it gets right to the core of LLMs. Here OpenAI made two main claims.

a) OpenAI has safeguards to prevent ChatGPT from regurgitating content; they're not perfect, but they're much more than the NYTimes suit suggested.

      b) The articles in question were already widely reproduced on the web.

I think there are still a couple of important questions.

      First, the existence of the articles on the web suggests that OpenAI didn't need to scrape the NYTimes to get the data, but it leaves open the possibility that they did.

      Second, OpenAI is trying to frame this as a debate about whether ChatGPT spits out copyrighted content and if their good-faith efforts are sufficient to protect them from legal liability. NYTimes wants it to be a debate about whether they're allowed to use the NYTimes IP to train their model in the first place.

      • by martin-boundary ( 547041 ) on Tuesday January 09, 2024 @03:38AM (#64143345)
        The existence of articles on the web is not a defense at all. Here's how it works:

        1) The NYT (employees) created the article => The NYT has the copyright for the article => The NYT can license to its customers the right to display the article in limited circumstances => Everyone who gets the article from them is a licensee of the NYT, so they get some limited personal rights, but do not get the right to republish or sublicense the article further to anyone else.

2) OpenAI scraped the article data illegally from the web: just because it was accessible to their spider does not give them the right to copy it to their own disks, but they did it anyway. At best, the NYT licensee who made the article available unsecured committed an offence by not following the license terms. The spider took advantage, which it had no right to do (e.g. proceeds of crime).

        TL;DR writing spiders is hard, and it's always better to NOT download a file if there's any doubt at all about the licensing terms and legal owner.

        • It's legal to scrape the web for training, despite what you or the NYT might think. What's not legal is if it spits out the articles verbatim without being prompted with the text of the articles. OpenAI claims that the NYT was using big pieces of their articles as prompts, and the NYT isn't telling what prompts they use, so at this point it's a he-said/he-said between two disreputable characters. (The NYT is continually publishing articles that downplay genocide, and murder of journalists no less [hyperallergic.com], just like

        • by Keick ( 252453 )

          just because it was accessible to their spider does not give them the right to copy it to their own disks

          Don't they though? How is this different from time-shifting a TV show with my TIVO (yes I'm that old).

          I thought copyright was more about protection against me re-publishing or rebroadcasting my time-shifted TV show back to the general public.

I see ChatGPT as more of generating derivative works based on its training data - more akin to taking 30-second clips of a bunch of TV shows and stitching them back into a new but similar TV show, which is perfectly legal already.

      • by Zangief ( 461457 )

        I don't think it matters whether the NYT tried to prompt the AI to reproduce the text verbatim and that this wouldn't happen normally; the problem is that the AI is capable of doing this _at_all_

        GPT4 somehow has in its interior encoded large parts of the archive of the NYTimes and it is using them to generate more text. It's like if I created a program to create-your-own-super-hero, and it included parts of the Spiderman costume that I collage together into new pieces; it does not matter that Marvel had to

        • I don't think it matters whether the NYT tried to prompt the AI to reproduce the text verbatim and that this wouldn't happen normally; the problem is that the AI is capable of doing this _at_all_

          GPT4 somehow has in its interior encoded large parts of the archive of the NYTimes and it is using them to generate more text.

I can recite copyrighted works as well; it's only an issue if I recite them in certain contexts.

          It's like I said at the end. The NYTimes agrees with your belief that OpenAI is in trouble if those memorized texts exist in ChatGPT at all. OpenAI thinks they're alright if they make a fairly effective good faith effort to stop ChatGPT from reciting that copyrighted content.

          It's like if I created a program to create-your-own-super-hero, and it included parts of the Spiderman costume that I collage together into new pieces;

          I'm not sure that's a good metaphor for a few reasons:

          a) In your example you're obviously just trying to save yourself effort of making your

          • by tlhIngan ( 30335 )

            It's like I said at the end. The NYTimes agrees with your belief that OpenAI is in trouble if those memorized texts exist in ChatGPT at all. OpenAI thinks they're alright if they make a fairly effective good faith effort to stop ChatGPT from reciting that copyrighted content.

            Yes, but that's basically closing the barn door after the horse has run out. LLMs like ChatGPT respond oddly.

Like how having it repeat "poem" causes it to suddenly spit out copyrighted text verbatim, this seems like a fruitless task

            • It's like I said at the end. The NYTimes agrees with your belief that OpenAI is in trouble if those memorized texts exist in ChatGPT at all. OpenAI thinks they're alright if they make a fairly effective good faith effort to stop ChatGPT from reciting that copyrighted content.

              Yes, but that's basically closing the barn door after the horse has run out. LLMs like ChatGPT respond oddly.

Like how having it repeat "poem" causes it to suddenly spit out copyrighted text verbatim, this seems like a fruitless task - are you going to try to close every loophole that results in it spitting out copyrighted text as they come up? Because that's going to be impossible - that's like saying Windows will be secure if you keep patching security holes as they come up. That just means someone else will find another prompt that will cause it to do something weird, and who knows what happens then.

              So what?

              Is occasionally tricking the LLM into spewing out copyrighted data really a source of damage to the IP holders? Because without harm there's no grounds for a lawsuit.

  • by NomDeAlias ( 10449224 ) on Monday January 08, 2024 @09:57PM (#64142797)
    I don't get the logic here. Just because they made many attempts doesn't disqualify the result. If they fed them the articles themselves on the other hand that would be a different story.
    • Technically, if they fed them the articles, it would be the *same* stories. :-)

Possibly, and that would seem like a logical defense to the accusations. What I don't understand as a defense is their trying many times with different prompts to coax out verbatim articles. So what if they didn't get it the first time and got it on the 15th or 200th? What does the number of attempts matter?
    • by Rei ( 128717 )

      How many times do you have to try to draw Daffy Duck in photoshop before it's you who's trying to violate copyright, not Adobe?

      • It's not a violation to simply draw a character. So the answer would be infinite whether you use photoshop or a pencil.
  • by Opportunist ( 166417 ) on Monday January 08, 2024 @10:04PM (#64142821)

    If it's that easy to trick ChatGPT into breaking the law, maybe it should not be allowed in public until it can be made certain that it doesn't.

    • by XanC ( 644172 )

And of course the point is to demonstrate that it is storing NYT articles. If it merely quit spewing them back out, that would make it harder to prove, but not any less true.

      • by znrt ( 2424692 )

storing nyt articles is no violation of copyright, not even if you could prove that they got them from the nyt. "publishing" it would be.

        • by XanC ( 644172 )

          I don't believe that's correct. The copy is being made when it is stored. "Publishing" it would certainly also be, as would creating a derivative work, as would publishing that derivative work.

          • I don't believe that's correct. The copy is being made when it is stored. "Publishing" it would certainly also be, as would creating a derivative work, as would publishing that derivative work.

            I agree, under US copyright law the rights holder has an exclusive right over the production of "fixed" copies of their copyrighted works even if they are held privately and never distributed or performed.

Separately, there seem to be two obvious questions in this regard.

            1. Is training a neural network on NYT data producing a copy? The answer is clearly no. The neural network is clearly transformative and as a result rights holders have no rights over the transformative work.

2. Is the AI spitting out copyrighted material in response to user prompts a copyright issue?

            • by XanC ( 644172 ) on Monday January 08, 2024 @11:46PM (#64142993)

              The point the NYT is making is that the neural net was able to spit out large chunks of their article verbatim, indicating that training DOES produce a copy.

              • The point is the NYT used large chunks of their articles in the prompts. Yes, if you feed a big part of the article in as a series of tokens, you will have a good chance of getting it out again.

              • The point the NYT is making is that the neural net was able to spit out large chunks of their article verbatim, indicating that training DOES produce a copy.

                Copyright is not a grant of authority over underlying knowledge and ideas. For example while a phone book is copyrighted the knowledge it contains is not. I can OCR all the numbers in the phone book into a computer database and the copyright holder can't do shit about it.

                Google's search index containing petabytes of copyrighted material has already been adjudicated as a transformative work. It does not seem credible to assume neural networks would be considered differently under law. If anything the ca

            • Is training a neural network on NYT data producing a copy? The answer is clearly no. The neural network is clearly transformative and as a result rights holders have no rights over the transformative work.

              The court may try to divide the resulting work into parts based on how much is from the original work, and how much is from the derived work. (See for example the abstraction filtration comparison test). NYT can't claim to own the entire derived work, but they can claim to own the parts that were from the original work.

              2. Is the AI spitting out copyrighted material in response to user prompts a copyright issue?

              If a copy doesn't exist in some way in the AI's database, then it couldn't reproduce it. Therefore a copy exists (in some way) in the AI's database. Whether it is an interactive conversation

A lossy copy is encoded in the model that can sometimes be reversed back to portions of identical text. But a human mind can do that too. They aren't accused of a violation unless they use that memory to recreate the original work verbatim, or a very direct derivative. Human memory is much more limited in fidelity, but there has to be some threshold level of fidelity that marks the limit of what is acceptable - whether it's a human or a computer doing it.

There is no law that says, "Copying is ok if you pretend the computer is like a human mind." Human brains aren't the same as computers when it comes to copyright.
                • No. A computer is not human. A human rule does not apply to a thing. You're making a category error.
            • If it doesn't reproduce the original text, there is probably no copyright issue.

However, copyright law treats a "transformative work" as a derivative work, and creating one is a right reserved to the copyright holder and the people they choose to license to. You could say that optioning a book and making a movie is transformative - sometimes greatly so. But you have to pay for that.

You are clearly wrong about 1. It's not clear-cut: take a PNG file, convert it to JPEG.

              You: JPEG is lossy, therefore it's clearly not a copy. It's clearly transformative and the PNG file's copyright owner has no rights over the JPEG image.

              Everyone else: The JPEG is a deliberate copy of the PNG, only in a different representation.

            • by dpille ( 547949 )
              The neural network is clearly transformative.

Only if you're not a lawyer. Those of us who are recognize that you have to transform the work, not point to some random, potentially unrelated output and say "tah dah! Transformative!"

              Besides, you appear overly confident there's no intervening copy between NYT server output to what it thinks is a browser and actual OpenAI entry into the black box. I don't see how you do that without ChatGPT figuratively sucking on the fire hose, which nobody seems to have conten
    • by znrt ( 2424692 )

      If it's that easy to trick ChatGPT into breaking the law, maybe it should not be allowed in public until it can be made certain that it doesn't.

      it isn't breaking the law if it is giving the nyt morons the exact answer they specifically engineered the prompt for, that's the whole point. pun intended: that's not fair use :D

      there could be a discussion if some random query produced that output.

      otoh, as i pointed out when this story first appeared here a few days ago with some examples, the text reproduced verbatim was years old and replicated all over the internet, rendering the nyt claims that it was protected and valuable content moot. the whole clai

      • the text reproduced verbatim was years old and replicated all over the internet, rendering the nyt claims that it was protected and valuable content moot.

        Rampant piracy doesn't invalidate a copyright. But one thing it does do for a large language model is heavily weight that phrasing as it enters the model from multiple sources.
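The weighting effect described above can be illustrated with a toy frequency model (a sketch only -- real LLMs are not bigram counters, and the names here are made up for illustration): text replicated many times across a corpus dominates the statistics and becomes the easiest thing for the model to emit verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count word-bigram frequencies from a flat token list."""
    model = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        model[a][b] += 1
    return model

def most_likely_next(model, word):
    """Return the highest-frequency continuation of `word`."""
    return model[word].most_common(1)[0][0]

# A phrase seen once vs. the same phrase replicated many times:
rare = "the cat sat on the mat".split()
viral = "to be or not to be".split() * 50  # heavily duplicated text

model = train_bigram(rare + viral)

# The duplicated phrase dominates: after "to", the model predicts "be".
print(most_likely_next(model, "to"))  # -> be
```

The duplicated phrase accumulates 100 counts for the "to be" transition, so the toy model reproduces it deterministically -- the same basic dynamic by which widely pirated article text gets reinforced in a real training corpus.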

    • If it's that easy to trick ChatGPT into breaking the law, maybe it should not be allowed in public until it can be made certain that it doesn't.

      Should this apply to hammers as well?

      • It does, my LLM-powered hammer's input is legally limited to its built-in accelerometer, gravity sensor and whatever touch-sensing stuff it has wrapped around the handle.

  • by SmaryJerry ( 2759091 ) on Monday January 08, 2024 @10:44PM (#64142879)
None of the prompts in the suit filed by the NYT work at all for completing sentences when I try them. I don't know what version of ChatGPT they were on - maybe the Bing web-based one, which literally does browse the web for you?
It's supposed to be unable to regurgitate anything other than small, particularly famous quotes.

    • I don't think that's true. If it were, standard web search engines wouldn't pass this test. They index literally all of the text of NYT and other publications, and then show you snippets long enough for you to know whether you got the right link. Would they "be able to" regurgitate the entire text? Certainly they could, they have that information.

      • The design of a search engine doesn't tell you much about the design of a language model. It might say something about what is allowed legally, but that's different too because websites want (and give permission via robots.txt) search engines to index them and benefit symbiotically, but as AI training data they gain nothing.

        • This is one of many news stories about news sites who think traditional search engines are violating their copyrights. https://www.reuters.com/articl... [reuters.com] Apparently, they don't all feel the relationship is "symbiotic."

          The point is, you suggested that AI should be "unable" to regurgitate content. If it indexes the content, it can regurgitate it. That point is not different from a traditional search engine.

          OpenAI and others are really just fancy search engines that pretty-print the results. I think they just n

          • That isn't about websites that didn't want to be indexed by search engines -- basically a death sentence. Getting un-indexed would not make them happy at all. What they wanted is some of the search engine's money, or for the search engine's summary to be sufficiently uninformative that people follow the link to their website. Conversely very few sites would go into full-blown panic if they get removed from an AI training dataset.

            And no, originally search engines were unable to regurgitate websites they inde

            • Memorization is not proof of overtraining, it's just proof of memorization. They were correlated in older ANNs but less so now.

      • If it were, standard web search engines wouldn't pass this test.

        Ok, so why don't you create a "prompt" for a search engine that reproduces a long article verbatim?

        Search engines typically become "unable" to reproduce a long article by simply limiting the output length. If chatGPT were doing that, nobody would be complaining here, or in the courts.
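That output-length cap is simple to implement; a hypothetical snippet function (illustrative names only, not any real engine's API) might look like:

```python
def snippet(document: str, query: str, max_words: int = 30) -> str:
    """Return a short excerpt around the first query match, capped at
    max_words -- the output-length limit that keeps an index that holds
    the full text from ever serving a whole article back."""
    words = document.split()
    lowered = [w.lower() for w in words]
    q = query.lower().split()
    start = 0
    # Find the first position where the query words appear in sequence.
    for i in range(len(lowered) - len(q) + 1):
        if lowered[i:i + len(q)] == q:
            start = max(0, i - max_words // 2)
            break
    excerpt = words[start:start + max_words]
    suffix = " ..." if start + max_words < len(words) else ""
    return " ".join(excerpt) + suffix
```

The point of the sketch is that the index necessarily contains the full text; only the deliberate truncation at output time makes the engine "unable" to reproduce it.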

    • In this case, "famous" means the articles were already copied all over the Internet, which does affect the weights on that text phrasing.

    • by Rei ( 128717 )

      What's your standard for famous? Things that are seen once on the internet, no. But copy the same thing around the internet enough, and it'll learn it - the more common it is, the better it'll learn it.

If you read a poem once, even while trying to memorize it, you're not going to be able to cite it verbatim. But if you keep encountering the same poem and trying to memorize it, eventually you're going to learn it verbatim.

  • Don’t bring humanity into & pretend like this is for our benefit, when your main aim is profit
    • by gweihir ( 88907 )

      Indeed. Dishonest assholes, the lot of them. I hope they burn for what they clearly did.

  • ‘OpenAI said the Times "is not telling the full story." It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles.’

    If ChatGPT can be tricked into regurgitating original content then ChatGPT does indeed copy original content.
If it _sometimes_ delivers articles without those having been part of the queries, then OpenAI is guilty as hell. At this time, they are just trying to confuse the issue, because they _know_ what they did was deeply wrong and they essentially stole most of their training data.
