AI | The Media | The Courts

The New York Times Sues OpenAI and Microsoft Over AI Use of Copyrighted Work (nytimes.com)

The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies. From a report: The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit [PDF], filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for "billions of dollars in statutory and actual damages" related to the "unlawful copying and use of The Times's uniquely valuable works." It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times. The lawsuit could test the emerging legal contours of generative A.I. technologies -- so called for the text, images and other content they can create after learning from large data sets -- and could carry major implications for the news industry. The Times is among a small number of outlets that have built successful business models from online journalism, but dozens of newspapers and magazines have been hobbled by readers' migration to the internet.


Comments:
  • to get totally objective and unbiased coverage with expert analysis by real experts and not just the reporters' friends, roommates, and fuckbuddies.

  • Don't post it on the Intertoobz.

    While we can strut around and bray "Our Stuff is copyright, so eyez only!", that fails to understand that once it is on the internet, people can print it out, use it for rule 34 stuff, and so on. It's almost like putting top secret stuff on the internet, giving out the URL, and demanding that no one look at it.

    • by Entrope ( 68843 )

      Also don't put things in libraries. Anyone can just walk in and use their phone to take pictures, or use the copying machines.

      Also don't put things on newsstands; anyone walking by might see the headlines and pictures and such.

      Maybe more than one fix is needed, but boy, they're so easy it's a wonder that no one has done them before!

      • There actually are rules regarding the use of photocopiers in libraries. In the 90s, I remember posted signs warning patrons that it is illegal to photocopy entire books. Ultimately it is up to publishers to enforce those rules. They often look the other way, until they don't.
        • by Entrope ( 68843 )

          There ackchyually are rules about scraping whole web sites and using them for other purposes, too, and they're usually found at the bottom of the web page as well as in the robots.txt file. OpenAI, MetaFace, Alphabet etc. just ignored those rules, and now they're being sued over it.

          • Do you have a citation for that? Not to be a dick, but it seems like an area of law that is very unsettled - and I legitly don't know, were there robots.txt files that were ignored?
            • by Entrope ( 68843 )

              The Terms of Service [nytimes.com] linked from the bottom of NYT articles prohibit (without prior written permission) using a spider to crawl their site and also, with an exception for search indexing, any attempt to "cache or archive the Content". LLM training would violate those terms.

              I would like to see a judge's reaction if a lawyer made the argument "my client didn't bother checking for terms of use, they just assumed that robots.txt completely described the limitations". I do not think it would go well for that lawyer.

              • Even if robots.txt were configured correctly, OpenAI only recently started respecting it. And by respecting it, I mean that instead of scraping your site themselves, they buy the data from a third party.
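
                For reference, a minimal sketch of what "configured correctly" could look like: GPTBot is the user agent OpenAI documents for its crawler, and the blanket Disallow below is purely illustrative. Note that robots.txt is advisory only, so nothing technically stops a crawler that chooses to ignore it.

                    User-agent: GPTBot
                    Disallow: /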

                The Terms of Service [nytimes.com] linked from the bottom of NYT articles prohibit (without prior written permission) using a spider to crawl their site and also, with an exception for search indexing, any attempt to "cache or archive the Content". LLM training would violate those terms.

                I would like to see a judge's reaction if a lawyer made the argument "my client didn't bother checking for terms of use, they just assumed that robots.txt completely described the limitations". I do not think it would go well for that lawyer.

                Outfits can say what they like, but lots of spiders crawl NYT - and everyone else, BTW - every day.

                The AI people are just getting sued because they are a handy target.

            • Do you have a citation for that? Not to be a dick, but it seems like an area of law that is very unsettled - and I legitly don't know, were there robots.txt files that were ignored?

              It is very unsettled, because if it's there, it can be copied and used by others. And there are these weird gray areas like web searches.

              As well, does this mean that if I read several articles and create something new from them, I have violated copyright? The point is, if John Doe is interviewed and quoted as part of a copyrighted article, am I prohibited from writing in, say, a blog what John Doe said? It's unsettled as hell. And my point has always been: put it on the web, and it isn't yours any more. Copying it is

              • If you read an article and write something new about what you learned, you haven't violated copyright. If you read an article, put an unauthorized copy of the article in your research notebook, and write something new, then you have violated copyright. Not when you wrote something new, but before that, when you put a copy in your notebook. And training an AI starts by making a local copy of the training data. So developers may have violated copyright well before the training was complete.
                • by catprog ( 849688 )

                  If I open a webpage I have to make a local copy. Does this mean just reading it violates copyright?

      • Also don't put things in libraries. Anyone can just walk in and use their phone to take pictures, or use the copying machines.

        Also don't put things on newsstands; anyone walking by might see the headlines and pictures and such.

        Maybe more than one fix is needed, but boy, they're so easy it's a wonder that no one has done them before!

        Certainly copying machines are a long-used way of getting information to use. But there is a bit of a difference. At the library, or at work, there's a hella lot of copying taking place. I do it myself for research notes. Some of the open literature is not to be removed from the library.

        We can, however, make a case that internet bots taking portions of articles to build web search indexes of copyrighted words is a de facto violation of copyright, and that no more web searches should be allowed.

        The web

  • If the Times doesn't like OpenAI training their AIs on the Times, I wonder how they'd feel about training them on the Post? I expect it would make for a more entertaining AI, anyway.

    • by Jerrry ( 43027 )

      This would be like the publishers of textbooks suing the users of the books for learning from them and using the knowledge learned in the pursuit of their jobs.

      • No, this would be like publishers of textbooks suing people who photocopied their textbooks instead of buying them, and used the knowledge learned in the pursuit of their jobs.
  • They will probably award themselves the Pulitzer for this next year.

  • Seriously, "billions of dollars in statutory and actual damages related to the unlawful copying and use of The Times's uniquely valuable works"? I just looked and the last valuation of the NYT is $7.75B. Furthermore, I wouldn't say that the outcome of this suit will "carry major implications for the news industry", unless the NYT manage to extort some sort of ongoing payment from MS and OpenAI. Newsprint was already on the decline well before AI made the stage, AI is not going to be what kills it, and a
    • Seriously, "billions of dollars in statutory and actual damages related to the unlawful copying and use of The Times's uniquely valuable works"?

      Those are OpenAI's numbers; they think they are worth $100 billion [yahoo.com].

  • by xack ( 5304745 ) on Wednesday December 27, 2023 @11:08AM (#64109097)
    Who thinks that every single word of their content should be paywalled off and DRMed forever. They are still stuck in the print mindset, which is ironic since in the 1970s they made a video [youtube.com] about migrating to modern typesetting techniques. This will just lead to "AI pirates" who will have a superior product to that of those who have to follow the law.
    • I agree with your criticism of the Times, but at the same time I think it would be hilarious if all the best AI systems were pirate systems, and being forced to follow the stupid copyright laws screwed Microsoft. In situations where everyone is a bad guy, sometimes you just root for chaos.

    • Which does beg the question (in the modern use of the phrase): if the content is paywalled, then how were they not remunerated for the computer reading it? Or were they silly enough to allow web crawlers an exception from the paywall? That would be on them.

    • by AmiMoJo ( 196126 )

      I had a look at their lawsuit and they actually seem to have some good points. There are some screenshots here: https://x.com/jason_kint/statu... [x.com]

      The first one shows ChatGPT reproducing their work word-for-word. It hasn't learned to be like the NYT; it has just copied their work wholesale.

      They also note that different sources seem to be weighted differently as training data, and the NYT is one of the most valuable. That's Microsoft admitting that the NYT content is valuable to it, and that it was selected carefully.

      • by znrt ( 2424692 )

        I had a look at their lawsuit and they actually seem to have some good points. There are some screenshots here: https://x.com/jason_kint/statu... [x.com]

        for what it's worth, that literal quote is all over the internet, and chatgpt didn't even exist when it was written. so even if they prove that the text is an actual bot response (i don't know this very important fact, but it doesn't seem to matter to this twitter opiner or his followers) it could have come from anywhere, and i very much doubt that the nyt maintains articles from 2020 online and behind the paywall.

        so this "proof" could actually backfire for failing to police the content they are complaining abou

        • by AmiMoJo ( 196126 )

          Well, their very large submission has over 100 other examples to choose from.

          • by znrt ( 2424692 )

            tbh, i was just shitposting to some extent. what do i know.

            the verbatim reproduction is indeed concerning, and surprising to me. that's not how it is supposed to work, so what is happening here? i don't think a court is the best tool to find out, but then my intuition is that indeed openai have been reckless and the nyt has smelled money and is obviously exploiting copyright law in the most toxic way they possibly can, which ... was all to be expected, and we get to watch!

  • by bradley13 ( 1118935 ) on Wednesday December 27, 2023 @11:21AM (#64109125) Homepage

    If you make something readable, guess what, it may get read. Does it really matter if a human or an AI does the reading? Both may well learn something, both may even recite parts of what they read.

    This is legally muddy, but I hope the courts come down on the side of fair use. If the New York Times doesn't want people reading their articles, they shouldn't publish them.

    • by tlhIngan ( 30335 )

      If you make something readable, guess what, it may get read. Does it really matter if a human or an AI does the reading? Both may well learn something, both may even recite parts of what they read.

      This is legally muddy, but I hope the courts come down on the side of fair use. If the New York Times doesn't want people reading their articles, they shouldn't publish them.

      I read the Linux source code. I write an OS kernel. Is that software I wrote under the GPL because I read the Linux source code?

      Now, I have a

      • Microsoft can then run the source code of, say, Paint.NET or Notepad++ through an AI and ship New Paint or New Notepad with Windows as a closed-source binary.

        I'm sure there are plenty of Software as a Service (SaaS) vendors who will instantly run plenty of AGPL frameworks through an AI so they can use them without obeying the terms of the AGPL, because fair use is fair use.

        If you insist copyright doesn't apply to works used for LLM training, it doesn't apply when an LLM trains on free/open source software either. And without copyright, those FOSS licenses are useless. If the code is free, the GPL isn't required.

        Modality of reproduction is irrelevant. Even if, by freak chance, you independently develop something that happens to be copyright-protected by someone else ("innocent infringement"), that doesn't save you.

        What matters is whether the resulting work is deemed by a court of law to be derivative of someone else's copyright.

    • Isn't there a substantial amount of case law regarding what constitutes a "derived work"? Why should AI be treated any differently?
  • by WaffleMonster ( 969671 ) on Wednesday December 27, 2023 @11:47AM (#64109187)

    NYT isn't being serious. I don't think they even intend to win; they just want to add to the public noise crying over having their business models upended by technology.

    They know full well copyright does not extend to the underlying information. The fact that you did the work to gather facts, whether it was paying a real journalist to do real journalism or compiling a book filled with all kinds of interesting phone numbers, is tough shit. Copyright only protects fixed works. It does not control access to or use of the underlying information.

    • As for the facts you say are being overlooked here: the New York Times doesn't present facts, they present narratives.

    • If the NYT articles are in the training database, that's already a copyright violation. They might have a fair use defense, but the content doesn't need to actually be in the chat bot.
    • They're not looking to protect the underlying information but the articles/literary works that have been scraped indiscriminately, and that's a good thing. As much as I don't really enjoy monolithic corporations, the way "AI" models are done currently is atrocious for human culture and privacy.
  • by blackomegax ( 807080 ) on Wednesday December 27, 2023 @12:19PM (#64109271) Journal
    I guess we should also "destroy" any human journalist that trained themselves on NYT's previous works. Because that's how LLMs train themselves. They see, learn, adapt, and create. Like humans do.
  • How come a company can claim copyright on news? I mean, anything happening might generate a report, just as we read in this very post from /. . Does that mean this post is also copyrighted news? And does that mean that every single news article in the entire world is subject to being copyrighted - or worse, copyright infringed? That will make things look very, very bad in no time.

    • Are you legitimately curious about this? News isn't copyrighted. The words used to express the news are. Always have been. You can't copyright the fact that there was a terrorist attack, but you can copyright the phrases used to describe it to readers.

      Does that mean this post is also copyrighted news?

      Yes, and Slashdot could find themselves in legal trouble if they copied the entire story into the summary, as it would breach fair use.

      That will make things look very, very bad in no time.

      Nope, it would literally look the way it looks now.

      • Yep --- you got it exactly right. Slobbering big media/JapeChat excuse-makers get shuffled off to Buffalo.
  • The argument that "we have stolen so many people's work that the result is indistinguishable" is threadbare.
  • How long before ... teachers start suing writers they taught in school for using correct grammar or something else that they taught them? What about the authors (or their estates) of books we had to read in English classes in school? When will they start suing because we use the writing styles used in those books? How long before Greece starts suing everyone who uses Pythagoras' theorem? When will I be sued for using words found in the O.E.D.? Reductio ad absurdum.
    • by Holi ( 250190 )

      Your comment shows a complete lack of understanding as to what copyright is and how it works. I suggest you do some research on the subject.

    • Copyright ... and for that matter patents, explicitly set aside a narrow swatch of human IP. Not I, but the legals have been defining and defending those swatches since ROMAN times. Silly issues of suing your teacher don't make-the-hunt. JapeChat does or does-not violate NYT copyright; it depends on current case-law. What's certain is that the huge JapeChat players played fast-and-loose with those swatches, and judgement day has come 'round.
