AI Books

OpenAI Accused of Training GPT-4o on Unlicensed O'Reilly Books (techcrunch.com)

A new paper from the AI Disclosures Project claims OpenAI likely trained its GPT-4o model on paywalled O'Reilly Media books without a licensing agreement. The nonprofit organization, co-founded by O'Reilly Media CEO Tim O'Reilly himself, used a method called DE-COP to detect copyrighted content in language model training data.

Researchers analyzed 13,962 paragraph excerpts from 34 O'Reilly books, finding that GPT-4o "recognized" significantly more paywalled content than older models like GPT-3.5 Turbo. The technique, also known as a "membership inference attack," tests whether a model can reliably distinguish human-authored texts from paraphrased versions.
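
To make the idea concrete, here is a toy sketch of the kind of multiple-choice test the summary describes: a chooser is shown one verbatim passage mixed in with paraphrases, and its hit rate is compared with chance. This is an illustration under stated assumptions, not the paper's implementation. The names `quiz_accuracy`, `ITEMS`, and `recognizer` are hypothetical, the passages are made up, and `recognizer` is a stand-in for a model that has already seen the originals.

```python
import random
from typing import Callable, Sequence


def quiz_accuracy(
    items: Sequence[tuple[str, Sequence[str]]],
    choose: Callable[[Sequence[str]], int],
    seed: int = 0,
) -> float:
    """Return the fraction of quizzes in which `choose` picks the verbatim passage.

    Each item is (original_passage, paraphrases). `choose` receives the shuffled
    options and returns the index of the one it believes is the original.
    """
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # hide the position of the verbatim text
        if options[choose(options)] == original:
            correct += 1
    return correct / len(items)


# Made-up passages for illustration only (assumption: two paraphrases per item).
ITEMS = [
    ("A closure captures variables from its enclosing scope.",
     ["Closures hold on to variables of the scope around them.",
      "Variables in the outer scope are kept by a closure."]),
    ("An index speeds up reads at the cost of slower writes.",
     ["Indexes make reading faster but writing slower.",
      "Reads get quicker with an index, while writes slow down."]),
]

# Chance level for a chooser with no prior knowledge: one in (1 + paraphrases).
CHANCE = 1 / (1 + len(ITEMS[0][1]))

# Hypothetical stand-in for a model that memorized the originals.
_seen = {original for original, _ in ITEMS}


def recognizer(options: Sequence[str]) -> int:
    return next(i for i, text in enumerate(options) if text in _seen)


# By construction the recognizer always finds the original, so this is 1.0.
accuracy = quiz_accuracy(ITEMS, recognizer)
```

In this sketch, a hit rate well above `CHANCE` across many passages is the signal that the chooser has seen the text before; a rate near `CHANCE` suggests it has not.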

"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors, who include O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat.


  • by TheMiddleRoad ( 1153113 ) on Wednesday April 02, 2025 @01:16AM (#65275481)

    People post stuff all over the internet, including from O'Reilly. It's probably hard not to suck up copyrighted info if you're not super careful, and these AI scumsuckers most certainly aren't.

    • Could have asked Suchir Balaji if he still lived.

      I used to say you can always find someone with an axe to grind, but I didn't anticipate they'd be suicided.

    • by Bahbus ( 1180627 )

      Also, the way it searches the internet now means that it could scrape copyrighted information without knowing it's copyrighted. The only way for it to know would be to purposely train it to recognize and then ignore the copyrighted work(s).

      All these various authors and publishers who have complained about AI being trained on their works are complete morons.

  • "Likely" (Score:4, Interesting)

    by eclectro ( 227083 ) on Wednesday April 02, 2025 @01:32AM (#65275501)

    Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.

    What then?? Cue the end of "software licensing"??

    • Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.

      Sure, and I've got a bridge to sell you - cheap!

    • Google needed a fair use ruling for that, OpenAI doesn't have one yet.

    • More likely: O'Reilly submitted portions of the books to GPT-3.5 for testing which were then incorporated into the 4.0 release. It now recognizes the passages because O'Reilly gave them to the AI.

  • O'Reilly still makes books?

  • Like everyone else, OpenAI developers Trained on Unlicensed O'Reilly Books.

    So what? Until copyright terms are a fair 5 years, pirate on!
    • But unlike many others, they have been caught red-handed and have plenty of money to sue for.
    • by wed128 ( 722152 )
      Why is 5 years fair? Because that's how long you want to wait? 5 years is short enough that everyone would just wait it out, and the publishing industry would pretty much cease to exist. 20 years seems more fair. And there's the problem, nobody can agree on what's "fair".
      • 5 years is enough to encourage people to produce while also stimulating creativity, the WHOLE POINT OF COPYRIGHT.

          > And there's the problem, nobody can agree on what's "fair".

        No.
        The problem is that Walt Disney and the copyright cartel decided what the law should be.
      • by allo ( 1728082 )

        It's enough time to make the same profit of the book as someone in another job would make during that time.

      • by Bahbus ( 1180627 )

        The publishing industry is garbage and should be destroyed.

    • by Luckyo ( 1726890 )

      Training on material used is free. Always has been free. It's a fundamental right required to progress all society, as all training is done based on previous works of humanity. We're incremental as beings and as society, and that's an inherent feature of our biological reality. We cannot avoid this.

      That means that we literally cannot generate a sort of "copyright" style monopoly for training and learning from materials. We can only ban copying materials themselves. Banning training and learning from ma

  • by SeaFox ( 739806 ) on Wednesday April 02, 2025 @02:57AM (#65275565)

    They're looking at Facebook and how much trouble they are not getting in for doing it, and realized it's open season now for companies to ignore copyright law if AI is involved.

  • If we cannot educate AIs on educational texts then what is the point?

  • Former Fox News personality

  • That's what AI is, or has turned out to be.
  • Sorry, I see no difference between a human reader and an AI reader. Neither copies the book verbatim into their head. Both can recall interesting phrases and ways of speaking. AI will eventually bring us the age of knowledge abundance, and with that, we'll have no more need for so-called "intellectual 'property.'"
  • ... my dead tree copy to read? OCR is trivial for today's AIs and it's my copy of the book.

  • Big Joe: The Sherman is broken down and nothing going to move that Tiger out of the square.
    Crapgame: Then make a DEAL!
    Big Joe: What kind of deal?
    Crapgame: A DEAL, deal! Maybe the guy's a Republican. "Business is business," right?

  • We know that the GPT models were trained on several datasets containing unlicensed books. It is highly unlikely that someone hacked a paywall; it is much more likely that the books were included in one of the datasets of unlicensed books. And these datasets are huge. If we got a news story for every publisher that found a work of theirs in there, Slashdot would report nothing but what GPT is trained on.

    The question is whether they were allowed to train on unlicensed data. If not, there is no need for O'reil

  • Maybe, with that many of his publishing empire's books training chatbots, *he* might be willing to do the *correct* thing: file criminal charges against the CEOs for receiving stolen goods.
