
OpenAI's Motion to Dismiss Copyright Claims Rejected by Judge (arstechnica.com) 92

Is OpenAI's ChatGPT violating copyrights? The New York Times sued OpenAI in December 2023, and Ars Technica summarizes OpenAI's response: the New York Times (or NYT) "should have known that ChatGPT was being trained on its articles... partly because of the newspaper's own reporting..."

OpenAI pointed to a single November 2020 article, where the NYT reported that OpenAI was analyzing a trillion words on the Internet.

But on Friday, U.S. district judge Sidney Stein disagreed, denying OpenAI's motion to dismiss the NYT's copyright claims partly based on one NYT journalist's reporting. In his opinion, Stein confirmed that it's OpenAI's burden to prove that the NYT knew that ChatGPT would potentially violate its copyrights two years prior to its release in November 2022... And OpenAI's other argument — that it was "common knowledge" that ChatGPT was trained on NYT articles in 2020 based on other reporting — also failed for similar reasons...

OpenAI may still be able to prove through discovery that the NYT knew that ChatGPT would have infringing outputs in 2020, Stein said. But at this early stage, dismissal is not appropriate, the judge concluded. The same logic follows in a related case from The Daily News, Stein ruled. Davida Brook, co-lead counsel for the NYT, suggested in a statement to Ars that the NYT counts Friday's ruling as a win. "We appreciate Judge Stein's careful consideration of these issues," Brook said. "As the opinion indicates, all of our copyright claims will continue against Microsoft and OpenAI for their widespread theft of millions of The Times's works, and we look forward to continuing to pursue them."

The New York Times is also arguing that OpenAI contributes to ChatGPT users' infringement of its articles, and OpenAI lost its bid to dismiss that claim, too. The NYT argued that by training AI models on NYT works and training ChatGPT to deliver certain outputs, without the NYT's consent, OpenAI should be liable for users who manipulate ChatGPT to regurgitate content in order to skirt the NYT's paywalls... At this stage, Stein said that the NYT has "plausibly" alleged contributory infringement, showing, through more than 100 pages of ChatGPT outputs and media reports demonstrating that ChatGPT could regurgitate portions of paywalled news articles, that OpenAI "possessed constructive, if not actual, knowledge of end-user infringement." Perhaps more troubling to OpenAI, the judge noted that "The Times even informed defendants 'that their tools infringed its copyrighted works,' supporting the inference that defendants possessed actual knowledge of infringement by end users."


Comments Filter:
  • It seems to me like AI is just sort of 'ingesting' content, internalizing it, and building its world view based on it... Just like a person would. No word for word copying is going on (otherwise the model would be many many terabytes)... So IMHO this should be dismissed, plain and simple.

    • That's the issue. These large language models are useless without stealing other people's data and copyrighted work. All it is is a glorified search engine profiting off others.
      • If I read the *same* material, freely available on the web, and use it to form some type of world-view or intellectually enrich myself, and then use that information to start a business, for example, is this somehow different? I'm not convinced this is the case.
        • Re: (Score:2, Informative)

          by Anonymous Coward
          Exactly. They're essentially summarizing information that these companies have made available to the public. Search engines have done this almost since the beginnings of search engines, and everyone here defended this practice, vigorously (see the stuff from Australia where they didn't like Google summarizing their articles in search results).
        • You are not making commercial use of their content. If you start a competing business using their content, you might well be sued.
          • NYT is off the mark here. Who needs their 10 year old news today? Nobody. It's useless content, with only historical or reference value. Why are they attacking OpenAI for useless content the model only partially regurgitates? Isn't it easier to infringe directly? I mean why would I generate Harry Potter from ChatGPT when it will be slow, expensive and imprecise, while copying is fast, free and exact? It just shows LLMs are the *worst* copyright infringement method ever invented.
      • That's the issue. These large language models are useless without stealing other people's data and copyrighted work. All it is is a glorified search engine profiting off others.

        I just stole your post by looking at it. Sue me.

      • by rocket rancher ( 447670 ) <themovingfinger@gmail.com> on Saturday April 05, 2025 @07:04PM (#65284037)

        That's the issue. These large language models are useless without stealing other people's data and copyrighted work. All it is is a glorified search engine profiting off others.

        You are either a lame-ass troll, or a software engineer who just got replaced by an LLM. I'm going with the former. If it's the latter, just grow up already, and find a new career. Calling it stealing doesn’t make it so. That may work as clickbait, but it fails as analysis. Courts have a well-defined process for determining whether use of copyrighted material qualifies as infringement—and that process includes doctrines like fair use, transformative use, and de minimis use. You can scream theft all day, but until a judge agrees with you, all you’ve got is a talking point that tattoos "Troll here" on your forehead.

        And no, LLMs are not “glorified search engines.” That’s not even wrong—and I get that trolls like you aren't interested in being right, but it misunderstands both what search engines do and how large language models work. LLMs are generative systems that create statistically probable outputs based on token prediction across massive, contextually learned patterns. They don’t fetch—they synthesize. That’s a critical difference, and if you’re going to rage-troll about the technology, you should at least try to describe it accurately.
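To make the "token prediction" point concrete, here's a toy sketch. It's a bigram counter, nothing like a real transformer, and the corpus is invented, but the principle is the same: learn which token tends to follow which, then generate by repeatedly emitting a probable next token rather than fetching stored documents.

```python
from collections import defaultdict, Counter

# Tiny invented corpus; real models train on trillions of tokens.
corpus = "the cat sat on the mat and the cat slept".split()

# Count next-token frequencies for each token (a bigram "model").
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start, n):
    """Greedily emit the most probable next token, n times."""
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out

print(generate("the", 4))  # → ['the', 'cat', 'sat', 'on', 'the']
```

The generated sequence is statistically shaped by the corpus without being a lookup of any stored passage, which is the distinction the comment is drawing.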

        If you want to argue that LLMs raise real ethical or legal issues, fine—we’re all ears. But if you show up with vague accusations, tech illiteracy, and zero nuance, don’t expect the rest of us to mistake your trollish drivel for a point. There are a lot of serious conversations happening in this thread. Maybe try contributing to one. *plonk*.

    • OpenAI's defense at this point is geared around minimizing damages. They have already lost on infringement, and they know it. The Times has a very strong case, including willful infringement, so the only question is what the penalties will be. All that's happening now is going through the motions to the guilty verdict. The Times will have to screw up royally in order to lose.

      • If they lose, some lawyers are going to write to the owners of every registered work under the sun to start the largest class action in history. With statutory damages for wilful infringement times millions, they'd be bankrupt if the truth comes out. How many people is Sam willing to kill to prevent the truth about the training set from getting out?

        I think fair use is their only chance.

        • I don't think they need to defend much on this front; copyright is dead, it just doesn't realize it yet. There are two choices here: 1. either protect expression, while LLMs can generate different-enough expression unimpeded, in which case copyright can't be protected anymore, or 2. protect abstractions, styles and facts so LLMs can't use them in any form, but in this case human creators are also going to be barred from the same, which will tank creativity. No way around it, the problem is that now LLMs can quic
          • Copyright protects reproduction, like the reproduction into the training set. What the LLM contains and produces is entirely beside the point ... start with the low-hanging fruit: statutory damages for copies of registered works into the training set. That alone can bankrupt OpenAI.

            They pirated every text in the world, same as Meta. Even with assassinations, I don't think they can cover that up.

      • You are treating a denied motion to dismiss like it’s the closing statement at trial. It is not. The judge didn’t rule that OpenAI infringed copyright—he ruled that the allegations, if proven, are strong enough to warrant discovery and trial. That is a big deal, yes—but it is not a verdict. Pretending otherwise oversimplifies what’s actually happening and turns complex legal proceedings into fanfiction.

        OpenAI's defense at this point is geared around minimizing damages.

        Not even close. OpenAI is still actively defending the core claim that train

      • This is a case of regurgitation that used to happen in the era where LLM developers didn't deduplicate their training sets well enough. Have you seen any other regurgitation suits more recently? No? Because it doesn't happen. In fact it only happens if you are entrapping the model with an exact paragraph from the target material as seed. So you already need to have access to the material to be able to make an LLM regurgitate it, and it only happens like 1 in 100 times, and it's usually imprecise.
    • Set booby traps as a form of watermarking.

      Companies will start deliberately seeding articles with fake news, grammatical oddities, made up words and other forms of digital subterfuge. Much like dictionaries and mapmakers used to insert phantom content to detect copying.

      Then when bots scrape your content, you can show the judge the fingerprinting you inserted.

      You may inadvertently invent a whole new vocabulary but once you've draffered the April sneggleklergen, you're past the point of no return.
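A sketch of how that fingerprinting could work in practice. Everything here is made up for illustration (the secret key, the "sneggle" prefix, the detection helper): derive a unique nonsense word per article from a keyed hash, seed it into the text, and later scan model output for it.

```python
import hashlib

def make_canary(article_id, secret="hypothetical-key"):
    # Derive a unique, innocuous-looking nonsense word per article
    # from a keyed hash, so canaries can't be guessed and stripped.
    digest = hashlib.sha256(f"{secret}:{article_id}".encode()).hexdigest()
    return f"sneggle{digest[:8]}"

# Publish the article with its canary embedded somewhere in the body.
article = "Local man wins pie contest. " + make_canary("story-42")

def leaked(model_output, article_id):
    # If the canary shows up in generated text, the article was scraped.
    return make_canary(article_id) in model_output

print(leaked(article, "story-42"))        # canary round-trips: True
print(leaked("unrelated text", "story-42"))  # no canary: False
```

Like the dictionary and map "phantom entries" the comment mentions, the evidence is probabilistic but cheap to produce at scale.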

      • Or, if it does work, AI is just going to collapse on its own. That's because, as AI takes over the internet, AIs will begin to train on AI.

        That's a problem that was already seen in advance and a lot of work and effort has already gone into solving it and a lot of work and effort will continue to go into solving it.

        So you don't need to poison your content. If they can't find a workable solution to the problem you're raising, AI will collapse on its own, and if they can, then you're wasting your time poisoning it.
        • by tlhIngan ( 30335 )

          That's why services like CloudFlare send AI bots down a maze of twisty passages, all filled with AI-generated slop. Normal users aren't likely to come across links to those traps except by accident, because they're going to be things like white-on-white text, or links embedded in tiny fonts or on punctuation.

          The goal is that services which blindly scrape meet their AI demise by internalizing AI-generated slop. The other stuff can be excluded by robots.txt. The AI

      • by xevioso ( 598654 )

        This is happening now.

        You can go to recipe websites and find stupid instructions like, "Cook the chicken at 368 degrees for 20 minutes". No one would ever actually use that as real instructions for chicken, because it doesn't matter if it's 368 degrees or 350 for that amount of time. That number is inserted so the folks making the recipe websites can tell when someone has copied their recipes.

    • That is what the lawsuit is about: what uses are permitted under copyright law and what are not. Since no one had heard of LLMs when the law was written, this is open to debate. Since, as far as I know, none of us are copyright lawyers, our opinions are basically of no value.
      • Since, as far as I know, none of us are copyright lawyers, our opinions are basically of no value.

        Our opinions have value because we, as a society, define the laws that we all abide under. This also means we have a stake in this decision, because we are directly impacted by it and its potential to limit our ability to compete globally / domestically.

        • We are entitled to opinions about what the law ought to be. What the law is, is another matter. Whether LLMs should be allowed to use other people's data is a matter of opinion, whether it is legal is a matter for lawyers, judges, and juries.
          • Laws that fail to account for the needs of all are not laws, but tools wielded against the public for little benefit. Such tools have no place in a democracy. If you want to justify living under the reign of tools, there are plenty of countries around the world that would be a better fit for you.
            • It is called the rule of law, and it is all that protects us from the arbitrary acts of the powerful. The laws may be wrong or out of date, but anarchy or arbitrary totalitarianism are the alternatives. If you do not like the laws elect someone who will change them. That is what the MAGAs have done. Now they get to live with the consequences, that they did not understand. Ignorance is the great enemy of democracy.
    • This is why, as posted by martin-boundary 8 hours ago, on the thread about how Wikipedia is serving 80% of the hits on the site to bots:

      For thousands of years, man took a small boat and went to fish in the ocean to feed his family. Now mega trawlers rake the ocean floor with nets that catch everything swimming for miles around the ship.

      For thousands of years, fish populations have existed and been caught by humans. Now, fish populations are going extinct because the trawlers are fishing faster than humans did.
    • by OrangeTide ( 124937 ) on Saturday April 05, 2025 @12:11PM (#65283457) Homepage Journal

      It's copying because it cannot be transformative: for it to have a new purpose and meaning would imply that so-called AI systems have the ability to express purpose or meaning.

      Data go into computer, data come out of computer. Copyright still holds.

      Laypeople are far too quick to anthropomorphize an algorithm. It's not like you or I reading Harry Potter, then deciding that it would be fun to write about a wizard's school for cats. Sure, maybe derivative, but very rarely is anything in art cut from whole cloth.

      • You’ve taken a philosophical objection—“AI can’t express meaning”—and tried to turn it into a legal slam dunk. But copyright law is not that simple, and courts do not require sentience to evaluate whether a use is transformative or infringing. You’re confusing how you feel about AI with how the law actually works. More to the point—you claim an AI cannot express purpose or meaning, and therefore cannot be transformative. That may sound deep, but it collapses u

      • Prompt: "Read a Harry Potter, then write about a wizard's school for cats."

        The LLM might use data from Harry Potter under that prompt, but it might not. Removing the Harry Potter bit only makes it more uncertain. Just because an LLM could use data from Harry Potter doesn't mean that the LLM's output is Harry Potter. Nor does it guarantee that Harry Potter's copyright would legally apply.

        People may be quick to anthropomorphize things, but just as many others are quick to declare something as cut and dry
        • I'd recommend suing the AI service's owner if you find a few matching words in the output with your copyrighted work. Then have them prove that the data they ingested didn't get accidentally used in the LLM's output. Since it's a civil case, you pretty much just have to show that a business was materially harmed by what is probably a copyright violation. As it did ingest copyrighted material without permission, and the defendant has no way of knowing if the copyrighted content was used as a basis for the gen

    • more than 100 pages of examples of ChatGPT outputs and media reports showing that ChatGPT could regurgitate portions of paywalled news articles

      You should at least TRY to get to the end of the summary before you make a fool of yourself in public.

    • The content still exists on their servers in order to be transformed into whatever the AI creates and they're not allowed to do that under the law without licensing it. Copyright is just that. Your right to make a copy. And they are absolutely making a copy when they ingest the data.

      I don't think it matters. The courts tend to side with whoever has the most money and whoever can make the most money and in this case the AI companies have an unlimited capacity to make money here and unlimited amounts of ve
      • Let me get this straight. Your argument boils down to: “Ingesting is copying, copying is illegal, therefore case closed,” followed by a shrug and a rant about how money always wins. That is not a legal position. That is a trollish tantrum trying to look like cynicism. The court walked in with a ruling that will echo through every AI model, license agreement, and copyright claim for years to come. If you want to troll at the bumper-sticker level, that is definitely your lane—but do not mis

    • You can't call it a 'world view' if it exists in a complete vacuum away from the world. Nor can you call a direct calculation on only the internet a 'world view'. Having a world view involves experiencing the five senses of all the things happening around you.
    • It seems to me like AI is just sort of 'ingesting' content, internalizing it, and building its world view based on it... Just like a person would. No word for word copying is going on (otherwise the model would be many many terabytes)... So IMHO this should be dismissed, plain and simple.

      You’re absolutely right in spirit—these models do internalize data and abstract patterns from it in a way that feels eerily human. That’s the fascinating part.

      But the legal system isn’t just asking how it learns, it’s asking what it can regurgitate, and under what circumstances. The NYT's case is not claiming that the entire model is a giant database of news articles—it’s alleging that, under the right conditions, ChatGPT can reproduce near-verbatim excerpts from th

    • "No word for word copying is going on"

      You don't know that. It can't keep a record of everything it's read, but the models are largely black boxes. There's really no way to know if it's memorizing long passages.

      The only way to test this would be to see if it generated identical passages. And these tools have indeed done so. It actually doesn't even matter if it's actually copying word for word or whether it's parallel construction. It's still infringement if it's the same words.

    • Yes, the training set is 100 to 1000x the size of the model. Even if they wanted, they could not encode the full training set into the model. What the model retains is just a residue of it.
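Back-of-the-envelope arithmetic for that ratio, with hypothetical round numbers (not any vendor's actual figures): a model cannot losslessly store a training set much larger than its own parameter count allows.

```python
# All figures are illustrative assumptions, not real model specs.
params = 70e9            # hypothetical 70-billion-parameter model
bits_per_param = 16      # fp16 weights
model_bits = params * bits_per_param

train_tokens = 10e12     # hypothetical 10-trillion-token training set
bits_per_token = 16      # roughly 2 bytes of text per token

ratio = (train_tokens * bits_per_token) / model_bits
print(f"training data is ~{ratio:.0f}x the model's raw storage capacity")
```

With these numbers the ratio comes out around 143x, squarely in the 100 to 1000x range the comment cites, which is why only a residue of the training set can be retained.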
    • by Bongo ( 13261 )

      If material could never be reproduced (reading and remembering) then the material would be worthless to everyone. But if it could always be reproduced with no benefit to its creators, then they could not feed themselves and survive. Where to set the balance is full of detail and difficulty.

      LLMs may well need their own special rules. For example, I for one gave up my O'Reilly subscription because now I can quickly look up the basics of some tech thing from an LLM -- so someh

    • Yet computers don't remember stuff but rather store it on disk. Keeping a copy. And that's the issue. AI cannot function well, if at all, without copying everything it can get hold of.
  • OpenAI should be liable for users who manipulate ChatGPT to regurgitate content in order to skirt the NYT's paywalls ...

    LLMs are notoriously bad at verbatim retrieval, and the notion that someone would use ChatGPT or whatever to read the NYT is the stupidest thing I'm likely to read on an unusually stupid news day.

    The New York Times is a drop, or at most a bucket, in the ocean of training material used to generate a vast soup of vectors. This is profoundly transformative: it's not a .zip file of NYT articles being published, it's the combined influence of a myriad of sources--including the New York Times. If this isn't fa

    • by StormReaver ( 59959 ) on Saturday April 05, 2025 @11:45AM (#65283433)

      Google won, and if there's any consistency at all then the LLM trainers will win too.

      Google Library Project and LLM trainers aren't even remotely similar, so it would be incredibly inconsistent for OpenAI to prevail. At the very least, Library Project shows only a snippet of a book. It then points users to legitimate purchasing options rather than charging users for access to material for which Google has no legal rights. This does not infringe on the rights-holders' ability to monetize their rights. OpenAI, on the other hand, copies the entirety of such material, then directly charges for access to it. This deprives the rights-holders of their ability to control/monetize their creations.

      OpenAI is the largest, most blatant copyright infringer ever created. If OpenAI were to prevail, it would destroy anyone's ability to make a living from any creative endeavor that can exist digitally.

      • OpenAI, on the other hand, copies the entirety of such material, then directly charges for access to it.

        What part of "LLMs are bad at verbatim regurgitation" do you not understand? Do you think OpenAI or Alibaba or whoever discovered a 1000:1 lossless compression scheme?

        You sound like a lot of people I know who have strong opinions on the subject but little to no experience actually using LLMs.

    • The issue here is that the original content still gets stored in some form in order to be used to create the new content.

      Copyright doesn't cover what goes into a human brain. But as soon as you start reading bits and bytes and then copying those bits and bytes, you've triggered copyright. If the law is applied as written, then AI isn't legal.

      I do not expect the law to be applied as written though. There's so much money to be made in AI and judges tend to side with whoever's got the most money. Like th
      • If the case depended on what the law says, and what it means, you would probably be right, because it's the judge's job to interpret the law. However, unless the plaintiff's attorneys are incompetent, this case is going to hinge on the facts, and the jury is the trier of fact. It doesn't matter what the judge thinks, or how he (she) would rule given the chance; the only thing that matters is what the jury decides. And, as this is a civil suit, the standard is not the famous "beyond a reasonable doubt," b
        • You're sort of right, but the judge exerts considerable influence on the outcome through decisions on admissibility of evidence and a mass of other procedural rulings.
          • True. However, the judge has to be careful not to be too openly biased, lest he leave himself open to an appeal on the grounds of bias.
      • The issue here is that the original content still gets stored in some form in order to be used to create the new content.

        Nope, that's not alleged in the NYT lawsuit, or I can't find it in their filing despite a thorough search just now. Their beef is with the training, alleged verbatim retrieval, associated search stuff etc.

        Look up "transformative" fair use. Your post reveals a fundamental misunderstanding of how copyright is applied and of the specific issues at hand in this suit.

        • Transformative fair use might apply to the finished product. But it doesn't apply to training. If I am a corporation selling my employees' services to others and I copy my competitors' training material to train my workforce, I have infringed.

          Individuals are covered by fair use for copying things for learning, commercial entities are not.

          • Transformative fair use might apply to the finished product. But it doesn't apply to training.

            [citation missing]

            You won't find that citation, either, because neither statute nor case law has caught up with the technology. Also, your analogy is misapplied because "training" isn't really the same thing in those two cases.

            • Training is not transformative. It's simply copying the data from a website to a database for future processing.

              • Now you're saying two different things:

                1: Training is not transformative, which is patently untrue.
                2: Copying the data is infringing, which is complete nonsense, because otherwise neither the web nor search engines as we know them would be possible.

                You really need to read up on fair use and modern copyright case law.
  • This is no different from a student educating themselves by reading publicly available stuff.
    There is a bigger issue here, the future of copyright and the value of content.
    As content, whether text, images or video becomes effortless to create, its value will drop to zero.
    Some things, like professional journalism, still require great effort to create. This is expensive.
    In the past, the costs were paid by advertising, but bots don't read ads.
    Expect the end of publicly available quality journalism. Expect subscr

  • Some people think AI's work is transformative. Some think it's just mindless copying and passing on, no matter how much mixing and creativity there is. Never the twain shall meet, because of emotional investments on both sides on topics as esoteric as whether machines, however intelligent, can be said to think. There is no common ground on that topic, so let's bypass it. If you think AIs violate copyright every time they read an article, what remedy do you want? LLMs cannot exist without doing that and cannot p
  • If OpenAI hacked their way into the NYT, then they are liable. If they just trained on publicly available content? That needs to be declared legal.

    Yes, I know there are a lot of amateurish, ill-behaved bots out there. That's a different problem altogether. The point is: material made freely available on the internet is free to read: for humans, aliens, or AIs.

  • It's really simple. We have the technology.
    Let AI read everything.
    If you charge for AI, then you share the revenue you make off the knowledge you learned.
    If you charge $10 for a response that includes information learned from 3 books written by John Doe, John Doe gets a % of what you charge.
    Maybe your CEO doesn't get paid $100M a month as a result, I'm ok with that.
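The proposal above could be sketched as a simple proportional split. The 30% attribution weight below is invented, and the hard part the sketch glosses over is real: attributing an LLM answer to specific sources is an open research problem.

```python
def split_revenue(price, attributions):
    """attributions: {author: weight}, where each weight is the share
    of the answer traced to that author's works. Whatever isn't
    attributed stays with the AI provider."""
    payouts = {author: price * w for author, w in attributions.items()}
    provider = price - sum(payouts.values())
    return payouts, provider

# A $10 response where 30% is traced to John Doe's three books.
payouts, provider = split_revenue(10.00, {"John Doe": 0.30})
print(payouts, provider)  # John Doe gets $3, the provider keeps the rest
```

The mechanism itself is trivial; everything hinges on how the attribution weights would ever be computed and audited.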

  • by rocket rancher ( 447670 ) <themovingfinger@gmail.com> on Saturday April 05, 2025 @03:37PM (#65283747)

    I just plowed through the full 47-page ruling in New York Times v. OpenAI, and the Ars Technica summary leaves out some of the most important bits.

    Yes, the judge let key claims move forward—including contributory infringement—but Ars barely mentions the most striking part: the court rejected OpenAI’s “substantial noninfringing use” defense, calling it a “straw man.” The judge made clear that ChatGPT’s ongoing relationship with users means OpenAI could still be liable if its models regurgitate copyrighted material. This is a big deal. It shows courts may not treat LLMs like neutral tools—and that alone could reshape how AI output liability works.

    Also missing: while the NYT’s “hot news” and DMCA claims were tossed, similar DMCA claims from other plaintiffs in the consolidated cases survived. That nuance matters. As a co-defendant in this case, Microsoft got out clean this time, but the ruling invites a deeper discussion on whether Big Tech partners are merely embedding AI—or helping build bullet-proof copyright infringement engines.

    This ruling is a canary in the coal mine for Meta and others. If courts follow this logic, arguments about “fair use” and “general-purpose tools” may not be enough to avoid discovery—or liability. The AI legal landscape just got a lot more real.

    • That's not how the AI does things; it just uses that info and makes its own version.
    • by allo ( 1728082 )

      There are two sides to generative AI:

      1) Training. Is someone allowed to train on the NYT data?
      2) Output. Are they allowed to produce output too similar to the original data?

      1) will still take a few court cases and it is dangerous to dismiss fair use as it would affect a lot of other uses than AI training.

      2) is not that complicated, I would think. Let's take ChatGPT as a black box and not care if AI is in there or not. If it now produces content that infringes copyright, it should not matter what's inside the box.

  • The "art" is terrible. If someone gave me a birthday card drawn by AI I would punch them in the face because the art is so awful. It reproduces 3 art styles in 1 picture resulting in garish colours, nonsense images and out of context objects. Because there is not underlying artistic reason for the objects we just end up the uncanny valley feeling of undigestible food like substance that reminds you of food .
  • Would someone please create a scraper for the NYT, the WSJ, the economist, and the rest of the paywalled stuff, compare it to the AP and Reuters, and just publish it for free? I'm too lazy to get the script working yet, but the copyright nonsense is about to be bullshit.
  • This is the only suit that focuses on LLM outputs as being infringing. Most other suits focus on inputs: using data in training models. I think this demonstrates that LLMs normally do not infringe copyrights in their outputs, which is a big blow to copyright defenders. If regurgitation were more common, we would see plenty of suits.
  • This does not affect the merits of the case.
    OpenAI tried to weasel out of it with technicalities (NYT could have known we were crawling because we've talked about such things in the past) and the judge told them they won't get out that easily.
