

OpenAI Accused of Training GPT-4o on Unlicensed O'Reilly Books (techcrunch.com)
A new paper [PDF] from the AI Disclosures Project claims OpenAI likely trained its GPT-4o model on paywalled O'Reilly Media books without a licensing agreement. The nonprofit organization, co-founded by O'Reilly Media CEO Tim O'Reilly himself, used a method called DE-COP to detect copyrighted content in language model training data.
Researchers analyzed 13,962 paragraph excerpts from 34 O'Reilly books, finding that GPT-4o "recognized" significantly more paywalled content than older models like GPT-3.5 Turbo. The technique, also known as a "membership inference attack," tests whether a model can reliably distinguish human-authored texts from paraphrased versions.
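To make the summary's description concrete, here is a minimal sketch of a multiple-choice recognition test in the spirit of what is described above: each verbatim excerpt is mixed with paraphrases, and the score is how often the model picks the verbatim one, compared with chance. This is not the paper's code. The names `recognition_rate` and `make_memorizer` are hypothetical, three paraphrases per excerpt is an assumption, and the "model" is a stand-in function rather than a real API call.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

# One quiz item: (verbatim excerpt, list of paraphrases). Hypothetical layout.
Item = Tuple[str, List[str]]


def recognition_rate(items: Sequence[Item],
                     choose: Callable[[List[str]], int],
                     seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt."""
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)          # hide the position of the verbatim text
        picked = choose(options)      # index of the option the model selects
        if options[picked] == original:
            hits += 1
    return hits / len(items)


def make_memorizer(seen: Set[str]) -> Callable[[List[str]], int]:
    """Stand-in 'model' that recognizes any text it has seen before."""
    def choose(options: List[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: falls back to the first option
    return choose


items = [
    ("verbatim passage A", ["A reworded 1", "A reworded 2", "A reworded 3"]),
    ("verbatim passage B", ["B reworded 1", "B reworded 2", "B reworded 3"]),
]
chance = 1 / 4  # one verbatim option among four
trained_on_books = make_memorizer({text for text, _ in items})
rate = recognition_rate(items, trained_on_books)
```

A rate well above `chance` on paywalled excerpts, and close to `chance` on excerpts published after the training cutoff, is the kind of gap such a test looks for. A real study would also need many more items and a significance test.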
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors, which include O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat.
They were just as likely read from pirated copies (Score:5, Insightful)
People post stuff all over the internet, including from O'Reilly. It's probably hard not to suck up copyrighted info if you're not super careful, and these AI scumsuckers most certainly aren't.
Re: (Score:2)
Could have asked Suchir Balaji if he still lived.
I used to say you can always find someone with an axe to grind, but I didn't anticipate they'd be suicided.
Re: (Score:2)
Also, the way it searches the internet now means that it could scrape copyrighted information without knowing it's copyrighted. The only way for it to know would be to purposely train it to recognize and then ignore the copyrighted work(s).
All these various authors and publishers who have complained about AI being trained on their works are complete morons.
Re: (Score:2)
Incorrect. These comments are not copyrighted. In general, most reddit posts/comments are not copyrighted (unless they contain original creative works). Disclaimers don't mean anything.
And even if it were correct, there is still no way to have the AI parse that correctly without first giving the AI the full work to compare against. Imagine you find a book in the forest, but you don't know the author, publisher, or anything other than the story. Then imagine you aren't allowed to go to the library to see if
Re: (Score:2)
Reddit posts are not inherently creative works. A single reddit post with just the word "penis" wouldn't count as a creative work. Your words and comments are not automatically creative works.
"Likely" (Score:4, Interesting)
Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.
What then?? Cue the end of "software licensing"??
Re: (Score:2)
Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.
Sure, and I've got a bridge to sell you - cheap!
Re: (Score:2)
Google needed a fair use ruling for that, OpenAI doesn't have one yet.
Re: (Score:1)
More likely: O'Reilly submitted portions of the books to GPT-3.5 for testing which were then incorporated into the 4.0 release. It now recognizes the passages because O'Reilly gave them to the AI.
Re: "Likely" (Score:2)
No, I have not seen that. I have seen it cite sentences and parts with an answer, but it's not distributed a full copy to my device. I imagine a human with a good memory or a research paper making a similar citation and listing it in a bibliography, which clearly would be within the definition of traditional fair use.
Who knew? (Score:2)
O'Reilly still makes books?
Re: Who knew? (Score:2)
openai developers... (Score:2)
So what? Until copyright terms are a fair 5 years, pirate on!
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
> And there's the problem, nobody can agree on what's "fair".
No.
The problem is that Walt Disney and the copyright cartel decided what the law should be.
Re: (Score:2)
It's enough time to make the same profit from the book as someone in another job would make during that time.
Re: (Score:2)
The publishing industry is garbage and should be destroyed.
Re: (Score:2)
Training on material used is free. Always has been free. It's a fundamental right required to progress all society, as all training is done based on previous works of humanity. We're incremental as beings and as society, and that's an inherent feature of our biological reality. We cannot avoid this.
That means that we literally cannot generate a sort of "copyright" style monopoly for training and learning from materials. We can only ban copying materials themselves. Banning training and learning from ma
Re: (Score:2)
Can you point me to an example of having to pay a licensing fee to someone making learning materials for being allowed to learn?
Note: not copying said materials, or receiving a copy of them, or transferring a copy of them, etc. It needs to be an example for payment for actual learning, rather than copyright.
Why should they worry? (Score:5, Insightful)
They're looking at Facebook and how much trouble they are not getting in for doing it, and realized it's open season now for companies to ignore copyright law if AI is involved.
Define Education (Score:2)
If we cannot educate AIs on educational texts then what is the point?
Re: (Score:1, Insightful)
FTFY
Re: (Score:1)
"Education isn't free."
Pardon me?
I'm well into my sixth decade; I've been reading and learning all of my life. Taught myself programming and how to play the piano and chess, how to shingle a roof and fix a toilet,... well I could go on but what's the point?
I used to haul stacks of books home from the library. Now I just read stuff on my computer.
But I suspect I have a more rounded education than many people who attended modern universities.
Education can indeed be free, more so now than at any time in prev
Re: (Score:2)
Actually, it is.
Is that Bill O'Reilly? (Score:1)
Former Fox News personality
Re: (Score:2)
No, this is https://www.oreilly.com/ [oreilly.com]. I remember having a dozen of their books on my shelf. Here's one source:
https://openlibrary.org/publis... [openlibrary.org]
Greatest data theft of all time... (Score:1)
Intellectual monopoly (Score:2)
What if I just give the AI ... (Score:2)
... my dead tree copy to read? OCR is trivial for today's AIs and it's my copy of the book.
Kelly's Heroes - 1970 (Score:1)
Big Joe: The Sherman is broken down and nothing going to move that Tiger out of the square.
Crapgame: Then make a DEAL!
Big Joe: What kind of deal?
Crapgame: A DEAL, deal! Maybe the guy's a Republican. "Business is business," right?
Will everyone do this now? (Score:2)
We know that the GPT models were trained on several datasets containing unlicensed books. It is highly unlikely that someone hacked a paywall; it is much more likely that the books were included in one of the datasets of unlicensed books. And these datasets are huge. If we got a news story for every publisher that found a work of theirs in there, Slashdot would report nothing but what GPT is trained on.
The question is whether they were allowed to train on unlicensed data. If not, there is no need for O'reil
I should write O'Reilly (Score:2)
Maybe, with that many of his publishing empire's books training chatbots, *he* might be willing to do the *correct* thing: file criminal charges against the CEOs for receiving stolen goods.