OpenAI Accused of Training GPT-4o on Unlicensed O'Reilly Books (techcrunch.com) 49

Posted by msmash on Tuesday April 01, 2025 @11:45PM from the secret-sauce dept.

A new paper [PDF] from the AI Disclosures Project claims OpenAI likely trained its GPT-4o model on paywalled O'Reilly Media books without a licensing agreement. The nonprofit organization, co-founded by O'Reilly Media CEO Tim O'Reilly himself, used a method called DE-COP to detect copyrighted content in language model training data.

Researchers analyzed 13,962 paragraph excerpts from 34 O'Reilly books, finding that GPT-4o "recognized" significantly more paywalled content than older models like GPT-3.5 Turbo. The technique, a form of "membership inference attack," tests whether a model can reliably distinguish human-authored texts from paraphrased versions.
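
The test described above can be pictured as a multiple-choice quiz: the verbatim passage is shuffled in among paraphrases, and the model is asked which one is the original. The Python below is a minimal sketch of that idea only, not the paper's actual procedure. The names `build_quiz`, `guess_rate`, and the `choose` callback (which stands in for a call to a language model) are hypothetical.

```python
import random
from typing import Callable, Sequence


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the verbatim passage among its paraphrases.

    Returns the shuffled options and the index of the original.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               choose: Callable[[list[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim passage.

    `choose` is a stand-in (assumption) for asking a model which option
    is the original; it receives the options and returns an index.
    """
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)
```

With three paraphrases per passage, a model with no prior exposure would be expected to score near the 25% chance level; a rate well above that on paywalled text, compared against text published after the training cutoff, is the kind of signal the summary describes.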

"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors, who include O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat.

They were just as likely read from pirated copies (Score:5, Insightful)

by TheMiddleRoad ( 1153113 ) writes: on Wednesday April 02, 2025 @12:16AM (#65275481)

People post stuff all over the internet, including from O'Reilly. It's probably hard not to suck up copyrighted info if you're not super careful, and these AI scumsuckers most certainly aren't.

- Re: (Score:2)
  
  by Pinky's Brain ( 1158667 ) writes:
  
  Could have asked Suchir Balaji if he still lived.
  I used to say you can always find someone with an axe to grind, but I didn't anticipate they'd be suicided.
- Re: (Score:2)
  
  by Bahbus ( 1180627 ) writes:
  
  Also, the way it searches the internet now means that it could scrape copyrighted information without knowing it's copyrighted. The only way for it to know would be to purposely train it to recognize and then ignore the copyrighted work(s).
  All these various authors and publishers who have complained about AI being trained on their works are complete morons.
  - - Re: (Score:2)
      
      by Bahbus ( 1180627 ) writes:
      
      Incorrect. These comments are not copyrighted. In general, most reddit posts/comments are not copyrighted (unless they contain original creative works). Disclaimers don't mean anything.
      And even if it were correct, there is still no way to have the AI parse that correctly without first giving the AI the full work to compare against. Imagine you find a book in the forest, but you don't know the author, publisher, or anything other than the story. Then imagine you aren't allowed to go to the library to see if
      - Re: (Score:2)
        
        by Bahbus ( 1180627 ) writes:
        
        Reddit posts are not inherently creative works. A single reddit post with just the word "penis" wouldn't count as a creative work. Your words and comments are not automatically creative works.
"Likely" (Score:4, Interesting)

by eclectro ( 227083 ) writes: on Wednesday April 02, 2025 @12:32AM (#65275501)

Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.
What then?? Cue the end of "software licensing"??

- Re: (Score:2)
  
  by 93 Escort Wagon ( 326346 ) writes:
  
  Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.
  Sure, and I've got a bridge to sell you - cheap!
- Re: (Score:2)
  
  by Pinky's Brain ( 1158667 ) writes:
  
  Google needed a fair use ruling for that, OpenAI doesn't have one yet.
- Re: (Score:1)
  
  by Githyanki ( 4092025 ) writes:
  
  More likely: O'Reilly submitted portions of the books to GPT-3.5 for testing which were then incorporated into the 4.0 release. It now recognizes the passages because O'Reilly gave it to the AI.
- - Re: "Likely" (Score:2)
    
    by eclectro ( 227083 ) writes:
    
    No, I have not seen that. I have seen it cite sentences and parts with an answer, but it's not distributed a full copy to my device. I imagine a human with a good memory or a research paper making a similar citation and listing it in a bibliography, which clearly would be within the definition of traditional fair use.
Who knew? (Score:2)

by kamapuaa ( 555446 ) writes:

O'Reilly still makes books?
- Re: Who knew? (Score:2)
  
  by zawarski ( 1381571 ) writes:
  
  Still have a couple of those animal cover books on my shelf, next to https://www.amazon.com/Magic-G... [amazon.com] and https://a.co/d/5AtaSqX [a.co]
openai developers... (Score:2)

by greytree ( 7124971 ) writes:

Like everyone else, Openai developers Trained on Unlicensed O'Reilly Books.

So what? Until copyright terms are a fair 5 years, pirate on!
- Re: (Score:2)
  
  by fph il quozientatore ( 971015 ) writes:
  
  But unlike many others, they have been caught red-handed and have plenty of money to sue for.
- Re: (Score:1)
  
  by wed128 ( 722152 ) writes:
  
  Why is 5 years fair? Because that's how long you want to wait? 5 years is short enough that everyone would just wait it out, and the publishing industry would pretty much cease to exist. 20 years seems more fair. And there's the problem, nobody can agree on what's "fair".
  - Re: (Score:2)
    
    by greytree ( 7124971 ) writes:
    
    5 years is enough to encourage people to produce while also stimulating creativity, the WHOLE POINT OF COPYRIGHT.
    
    > And there's the problem, nobody can agree on what's "fair".
    
    No.
    The problem is that Walt Disney and the copyright cartel decided what the law should be.
  - Re: (Score:2)
    
    by allo ( 1728082 ) writes:
    
    It's enough time to make the same profit of the book as someone in another job would make during that time.
  - Re: (Score:2)
    
    by Bahbus ( 1180627 ) writes:
    
    The publishing industry is garbage and should be destroyed.
- Re: (Score:2)
  
  by Luckyo ( 1726890 ) writes:
  
  Training on material used is free. Always has been free. It's a fundamental right required to progress all society, as all training is done based on previous works of humanity. We're incremental as beings and as society, and that's an inherent feature of our biological reality. We cannot avoid this.
  That means that we literally cannot generate a sort of "copyright" style monopoly for training and learning from materials. We can only ban copying materials themselves. Banning training and learning from ma
  - - Re: (Score:2)
      
      by Luckyo ( 1726890 ) writes:
      
      Can you point me to an example of having to pay a licensing fee to someone making learning materials for being allowed to learn?
      Note: not copying said materials, or receiving a copy of them, or transferring a copy of them, etc. It needs to be an example for payment for actual learning, rather than copyright.
      - Re: (Score:2)
        
        by Luckyo ( 1726890 ) writes:
        
        So not a single example. Not even one?
        I wonder why. Could it be that I was right as usual?
        
        Re: (Score:2)
        
        by Luckyo ( 1726890 ) writes:
        
        Your claim is indeed ridiculous. Glad we have an understanding.
Why should they worry? (Score:5, Insightful)

by SeaFox ( 739806 ) writes: on Wednesday April 02, 2025 @01:57AM (#65275565)

They're looking at Facebook and how much trouble they are not getting in for doing it, and realized it's open season now for companies to ignore copyright law if AI is involved.

Define Education (Score:2)

by polyp2000 ( 444682 ) writes:

If we cannot educate AIs on educational texts then what is the point?
- - Re: (Score:1, Insightful)
    
    by wed128 ( 722152 ) writes:
    
    Education isn't free. How about they pay 50 billion for each book, since they will *sell* the contents 50 billion times in a row?
    
    FTFY
    - Re: (Score:1)
      
      by innocent_white_lamb ( 151825 ) writes:
      
      "Education isn't free."
      Pardon me?
      I'm well into my sixth decade; I've been reading and learning all of my life. Taught myself programming and how to play the piano and chess, how to shingle a roof and fix a toilet,... well I could go on but what's the point?
      I used to haul stacks of books home from the library. Now I just read stuff on my computer.
      But I suspect I have a more rounded education than many people who attended modern universities.
      Education can indeed be free, more so now than at any time in prev
  - Re: (Score:2)
    
    by Bahbus ( 1180627 ) writes:
    
    Actually, it is.
Is that Bill O'Reilly ? (Score:1)

by rossdee ( 243626 ) writes:

Former Fox News personality
- Re: (Score:2)
  
  by Targon ( 17348 ) writes:
  
  No, this is https://www.oreilly.com/ [oreilly.com]. I remember having a dozen of their books on my shelf. Here's one source:
  https://openlibrary.org/publis... [openlibrary.org]
Greatest data theft of all time... (Score:1)

by zkiwi34 ( 974563 ) writes:

That's what AI is, or has turned out to be.
Intellectual monopoly (Score:2)

by bool2 ( 1782642 ) writes:

Sorry, I see no difference between a human reader and an AI reader. Neither copies the book verbatim into their head. Both can recall interesting phrases and ways of speaking. AI will eventually bring us the age of knowledge abundance, and with that, we'll have no more need for so-called "intellectual 'property.'"
What if I just give the AI ... (Score:2)

by Qbertino ( 265505 ) writes:

... my dead tree copy to read? OCR is trivial for today's AIs and it's my copy of the book.
Kelly's Heroes - 1970 (Score:1)

by lasermike026 ( 528051 ) writes:

Big Joe: The Sherman is broken down and nothing going to move that Tiger out of the square.
Crapgame: Then make a DEAL!
Big Joe: What kind of deal?
Crapgame: A DEAL, deal! Maybe the guy's a Republican. "Business is business," right?
Will everyone do this now? (Score:2)

by allo ( 1728082 ) writes:

We know that the GPT models were trained on several datasets containing unlicensed books. It is highly unlikely that someone hacked a paywall; it is much more likely that the books were included in one of the datasets of unlicensed books. And these datasets are huge. If we got a news story for every publisher that found a work of theirs in there, Slashdot would report nothing but what GPT is trained on.
The question is whether they were allowed to train on unlicensed data. If not, there is no need for O'reil
I should write O'Reilly (Score:2)

by whitroth ( 9367 ) writes:

Maybe, with that many of his publishing empire's books training chatbots, *he* might be willing to do the *correct* thing: file criminal charges against the CEOs for receiving stolen goods.
