NYT Prohibits Using Its Content To Train AI Models
According to Adweek, the New York Times updated its Terms of Service on August 3rd to prohibit its content from being used in the development of "any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system." That includes text, photographs, images, audio/video clips, "look and feel," metadata, and compilations. The Verge reports: The updated terms now also specify that automated tools like website crawlers designed to use, access, or collect such content cannot be used without written permission from the publication. The NYT says that refusing to comply with these new restrictions could result in unspecified fines or penalties. Despite introducing the new rules to its policy, the publication doesn't appear to have made any changes to its robots.txt -- the file that informs search engine crawlers which URLs can be accessed. The move follows a recent update to Google's privacy policy that discloses the search giant reserves the right to scrape just about everything you post online to build its AI tools.
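For readers unfamiliar with the robots.txt mechanism the summary mentions, here is a minimal Python sketch of how a well-behaved crawler consults it. The robots.txt contents, the `ExampleAIBot` and `ExampleSearchBot` user-agent names, and the example.com URL are all made up for illustration; this is not the NYT's actual file. Compliance is voluntary: the file only states a preference, and nothing in the protocol stops a crawler from ignoring it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL, used only to exercise the rules above.
url = "https://example.com/2023/08/some-article.html"
ai_allowed = is_allowed("ExampleAIBot", url)
search_allowed = is_allowed("ExampleSearchBot", url)
print(ai_allowed, search_allowed)
```

A site that wanted its Terms of Service and its robots.txt to say the same thing would add a `Disallow` rule for each crawler it objects to, which is the step the summary says the NYT had not taken.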
you can't (Score:5, Insightful)
You can't legally forbid that in your TOS. Does not work.
Re:you can't (Score:5, Insightful)
Whether it's fair use or not will be decided by the Supreme Court or Congress; it's not open and shut. The transformation argument is stupid or naive: whether the results are transformative is a secondary concern. They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying.
If the SC decides it's not fair use (because let's face it, Congress is useless), prepare for extra damages because the NYT told you so. Every major website (including even this one) should do the same ... it will scare the living hell out of AI company lawyers and will kick-start a licensing industry.
Re: (Score:2)
This is more like if someone was mass-copying Trump's blabber and reposting it. That kind of thing isn't allowed under fair use.
Re:you can't (Score:4, Insightful)
The claim here is literally to prohibit learning anything from their articles.
Re: (Score:2)
> The claim here is literally to prohibit learning anything from their articles.
Without a license, purely for developing software, and en masse, yes.
When humans read articles and learn stuff, it's (normally) not purely for developing software, nor en masse.
Whether that distinction is enforceable is a question for the courts, but your statement that it "prohibit[s] learning anything" is patently false.
Re: (Score:1)
Without a license means what, exactly? If I pay for a subscription, is that not a license to consume the content? What's the difference between a person doing that and an algorithm?
There's no software development going on here. There's training a model...which is referred to as "learning" for a reason.
What en masse has to do with anything, I have no idea. If I read the NYT daily, front to back, over the course of 20 years, one can assume I've learned a great deal (or at least, absorbed a great amount o
Re: (Score:2)
>Without a license, purely for developing software, and en masse, yes.
You don't need a license to compile data about an article, and that's all that an LLM is - data about the frequency of words and their order. It's highly transformative, and this sort of thing was already fought and won by Google when they digitized and made searchable millions of books without their copyright holders' permission and for commercial purposes. An LLM doesn't even retain the original work.
https://en.wikipedia.org/wiki/... [wikipedia.org].
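To make the "frequency of words and their order" description concrete, here is a toy Python sketch that counts adjacent word pairs. It only illustrates the simplified picture given above: the function name and the sample sentence are invented for the example, and a real LLM is a neural network with learned weights rather than a table of counts like this one.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count how often each pair of adjacent words occurs in text."""
    words = text.lower().split()
    # zip(words, words[1:]) pairs every word with the word that follows it.
    return Counter(zip(words, words[1:]))


# Invented sample text, standing in for an article.
sample = "the cat sat on the mat and the cat slept"
counts = bigram_counts(sample)

# Show the most frequent word pairs first.
for pair, n in counts.most_common(3):
    print(pair, n)
```

The counts record which word tends to follow which, but the sentence itself cannot in general be read back out of them, which is the property the "doesn't even retain the original work" argument leans on.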
Re: (Score:2)
Can you use that data to reconstruct a facsimile of the movie? If yes, then for all intents and purposes you have a copy of the movie, no matter if you think transcoding it or putting it in a zip file or using a statistical model t
Re:you can't (Score:5, Informative)
It's literally the primary factor that runs against the rights granted to the copyright holder, even overriding commercial factors. "The more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use" - Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569
If that were true, then 98% of Google's entire business model would be illegal. And 100% of the Internet Archive would be as well.
FYI: copyright law carves out massive holes for automated data processing. You can download copyrighted data for automated processing. Google was literally found to not be copyright infringing for scanning entire books and then posting chunks of each of them online, unaltered, against author wishes.
Re: (Score:2)
If that were true "fair use" would just mean "fair fucking whatever". Why you "use" something is relevant to why it's "fair use". Indexing is not the same as training a model.
The most relevant caselaw is Field v. Google, if you can't see how most of the judge's argument would be different here you're being obtuse.
Re: (Score:3)
What are you talking about? Google won Field v. Google. And the judge's arguments included that it was automated (Google was "passive in the process" and "Google's computers respond automatically to the user's request."), and that copyright infringement required "volitional conduct on the part of the defendant". It further noted that the DMCA explicitly allows service providers to temporarily cache copyrighted material on their servers. It also found that Google caching the data was fair use, as it was tr
Re: (Score:2)
Google was passive in the process because the cache was shown on request of the user. The LLM training is a commercial part of the AI company, not a request by a user.
Section 512 is for "Transitory Digital Network Communications", "System Caching", "Information Residing on Systems or Networks At Direction of Users" and "Information Location Tools". Which of those would you propose training language models is covered under?
If you can't see how most of the judge's arguments would be different here, you're bei
Re: (Score:2)
Equivalent stages
Google: spidering
LLM company: spidering
Google: using spidered data to build a cache database
LLM company: using spidered data to train a model
Google: serving user requests from its cache database server
LLM company: serving user requests with its inference server
Same amount / lack of human involvement at each step. Except that
Re: (Score:2)
I just reread a summary of the Google book case.
Here is the relevant part, as far as an AI model is concerned as to justification: "transformative because it provided a new and valuable tool (a searchable index of books) that did not substitute for or harm the market for the original books."
Yes, the Google book case was against the wishes of those authors who participated. But it was found that they were not harmed by having snippets of their books shown. As well, the transformation was substantive.
It must be sa
That a copy is created is irrelevant (Score:2)
What NYT needs to consider very carefully is whether they want to risk being delisted from all search engines an
Re: (Score:1)
"What NYT needs to consider very carefully is whether they want to risk being delisted from all search engines and content aggregators as a result of their shenanigans."
That would be a wonderful day and a service to humanity.
Re: (Score:2)
A mechanism mostly meant for indexing for search not LLMs. Robots.txt is clearly not affirmative consent to just reproduce anything on the website entirely, so clearly there are limits to the implied license.
Google calling for an AI equivalent to robots.txt will also kill that argument in court. They did it to torpedo OpenAI of course, but consider that mission accomplished. OpenAI&co never gave any real publicity to their use of content for training and that they considered lack of crawler
Re: (Score:2)
>NYT opted to permit scraping of content using a decades-old standardised mechanism. That is, in effect, explicit permission for machine-automated systems to be able to request a copy of each page.
Not legally it's not. It is simply a convention. No law behind it at all. The fact that they don't choose to use this convention to convey their copyright restrictions (which, by the way, is perfectly legal to do - create restrictions on what happens with your work) doesn't mean their TOS is somehow invalid. Be
Re: (Score:2)
When you read the NYT, or any web page, you are literally copying it to your computer before you read it. Your argument would make the entire internet illegal.
Before it comes up, just because you think you're reading it directly on the server
Re: you can't (Score:2)
"They are literally copying it to the training set before there is any transformation ... if it's not fair use, they need a license for that copying."
My browser is literally copying the data to the cache before I can read it...
Re: (Score:2)
It's been found in a court of law that the copy into RAM to load a program is a copy that's protected by copyright. There's no way they don't find the copy to a training set to be similarly protected.
Re: (Score:2)
Maybe they can't prevent the training if the model stays an internal tool, but they can sue in case the model is used in a generative tool that is publicly available. They can automate queries based on past headlines and contents, and detect if the result plagiarises any article. Basically what the NYT is saying is that they're ready to litigate.
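A rough Python sketch of the "detect if the result plagiarises any article" step: compare word n-grams in a model's output against an article and report how much of the output is verbatim overlap. The function names, the choice of 5-word n-grams, and the sample strings are all assumptions made for illustration; nothing here describes a tool the NYT is known to use, and real plagiarism detection would also need to handle paraphrase.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of all n-word sequences in text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated: str, article: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams that also appear in article."""
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0  # output shorter than n words: nothing to compare
    return len(gen & word_ngrams(article, n)) / len(gen)


# Invented strings standing in for an article and two model outputs.
article = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog near the river bank"
unrelated = "a completely different sentence about cats sleeping in warm sunlight"

copied_score = overlap_ratio(copied, article)
unrelated_score = overlap_ratio(unrelated, article)
print(copied_score, unrelated_score)
```

A score near 1.0 means the output is mostly lifted word for word; where to set the threshold for "plagiarises" is a judgment call this sketch does not settle.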
Re: you can't (Score:1)
That's going to put a damper on prompts such as:
* Generate a bedtime story in the style of the New York Times
Re: you can't (Score:5, Funny)
Timmy: Mummy, will you read me a bedtime story?
Mummy: Of course darling.
Once upon a time there was a...
=~-=~-=~-=~-=~-=~-=
PLEASE SUBSCRIBE TO CONTINUE READING
=~-=~-=~-=~-=~-=~-=
Timmy: Mummy, Mrs Dawkins my teacher said that information wants to be free
Mummy: Sad-emoji face
Re: (Score:2)
Which it won't do, because their content is just a drop in the bucket in the middle of a flood. So there's no issue.
They can be "ready to litigate" all they want, but first they need an actual case.
Re: you can't (Score:1)
Next time you see a cat stuck in a tree or something even more newsworthy (perhaps a tweet about a cat stuck in a tree), bear in mind that some organisation probably has an exclusive deal with the cat relating to the reporting of its shenanigans - or at least believes it does - or, even more remotely, acts as if it does even though the cat told them it wanted to keep its options open.
Re: (Score:2)
It appears to me that you can. It seems like a basic part of copyright for the copyright holders to assert how their works can be used, so why couldn't it be included in the TOS?
What is illegal about it, in your mind?
Re: (Score:2)
> You can't legally forbid that in your TOS. Does not work.
Also, how are they going to prove an AI was trained with their data, since they theoretically report news which somebody else might very well have reported too?
A web browser is an "automated tool" (Score:3)
Re:A web browser is an "automated tool" (Score:5, Insightful)
The NYT only sees an issue if you use an automated tool to train a model, which you are not doing by just using Firefox to read and translate the NYT. The Firefox Translation add-on was trained on a multilingual database (not the monolingual NYT), which could well be the proceedings of the EU Parliament (a huge corpus guaranteed accurate in 24 languages, also used in other machine translation tools like Linguee).
Re: (Score:1)
> which you are not doing by just using Firefox to read and translate the NYT.
Claim made, no facts provided.
The user doesn't know 99% of the things the browser does. Being some part of training AI is likely one of them.
You can't even prove that the number of AIs being trained by your browser is less than 1000.
Search engines use AI more and more (Score:2)
Re: (Score:2)
NYT licenses its content to the person reading the content in Firefox to just read it, essentially for personal use. If the content is being used for a purpose other than that, the licensing can be different.
For example, this is why public libraries purchase books at a much higher price than end customers: they are purchasing them for a different use, which is lending to the public.
If a company wants to use the data for training purposes, it could require a very different license. And as already mentioned
Public libraries do not pay extra (Score:2)
Reminds me of toothless diplomacy (Score:2)
Re: (Score:3)
This is a warning "if the SC rules against you, we warned you and your infringement will have been willful". With willful infringement, even statutory damages will add up ... no need to prove damages, no need to prove the results are not transformative, the mere copy to the training set will be enough.
The mere threat will scare AI companies long before the SC gets around to judging the question.
Re: (Score:2)
It gets even muddier because you can have other sources where someone might quote a small bit of the NYT as part of their own transformative work. Furthermore if you have a small enough quotation is it even some
Re: (Score:2)
At least the NYT's content is in complete sentences, with proper grammar and spelling. That's a lot better than the typical SMS text-speak, loose punctuation, sloppy spelling, incoherence, inanity, and general idiocy one finds on the Internet.
Re: Like NYT was going to make an AI smarter anywa (Score:2)
Re: (Score:2)
Probably already on the exclusion list, along with any Murdoch and Sinclair media sites. An AI model can only get worse with some sources.
And yet our current Legal system will almost guarantee some kind of idiotic lawsuit against AI by NYT, arguing that AI cannot possibly be considered "intelligence" without their data.
Of course, NYT will then be caught using AI to enhance its legal argument, after hypocrisy was argued to be a perfectly acceptable side effect of mass narcissism, which became a multi-trillion dollar industry for social media.
Ah, the future...
In the end (Score:5, Funny)
If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social.
That would be a scary thought...
Re: (Score:2, Funny)
> If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social. That would be a scary thought...
I can't imagine many people who would react more vigorously than Trump to somebody scraping his content from his social media platform and using it for anything at all whether it is a profit making enterprise or not. Trump is more territorial than a honey badger. He has never given anybody anything for free and I don't think he'll be starting now. He will always demand his 'cut' and I almost, ... almost, ... pity whatever AI start-up that gets the full Roy Cohn treatment of being dragged through the courts
Re: (Score:1)
Don't forget a cut for 'The Big Guy'... And you write like Dems speak places for free! Hillary Clinton made 22 million dollars in speaking fees in 2016, for Christ's sake! But... Trump bad! Derp...
Re: (Score:2)
If it isn't enforceable, then that will be the end of the open web. Prepare to sign NDAs to view web sites.
Re: (Score:2)
> If this goes on (and can be enforced) then in the end all that's left to train AI is Truth Social.
> That would be a scary thought...
Depends on the ethos of the research?
Are they trying to train AI into being normal, functional people, or do we want to train one to be deluded, insincere and insecure? Because if we want to find out how and why people join cults, fall for the lies of uncharismatic losers, have difficulty discerning reality from fantasy and refuse to acknowledge evidence or reason, then it's the perfect material.
Current "AI" will not be capable of replicating a functional human for some time, so we may as well look
And all content that is out of copyright as OLD... (Score:2)
There's an awful lot of stuff available online on that basis; think Project Gutenberg. This might have the advantage of resulting in chatbots speaking old English, making them even more obvious, dost thou not know?
On a good note... (Score:2, Troll)
Hey, the LLM won't understand how to whitewash a Holodomor. [npr.org]
Why, actually? (Score:3)
Having fluent, competent assistance (which is what ChatGPT&co can provide) is a societal benefit. Letting it train on as much data as possible benefits everyone.
That said, I would accept that free training should only be available to OSS projects, or projects run by not-for-profit organizations. Which is what OpenAI started as...
Re: (Score:2)
Hm, so what about an OS project/org that starts as non profit then gets in bed with a company, like say Microsoft, to the tune of a few hundred million dollars?
Is that still open non profit? Ok to use previous content and adding new?
Re: (Score:2)
They sold out for sure. But at some point the courts will need to resolve this. If I gave them data for their models only for their use as a non profit and they then turned into a for profit, can they still use models for profit that were trained with my non profit data?
I would say no, they'd have to retrain without my data or make models with my data free as before or something along those lines.
Re: (Score:2)
> Having fluent, competent assistance (which is what ChatGPT&co can provide) is a societal benefit. Letting it train on as much data as possible benefits everyone.
> That said, I would accept that free training should only be available to OSS projects, or projects run by not-for-profit organizations. Which is what OpenAI started as...
Training an AI model on pre-existing content is, IMO (which I admit may not count for much, lol), no different than a child going to the library and reading every single book / document / periodical inside. Both are essentially blank slates (although in the case of AI, one could make the argument it can be designed to be predisposed to certain knowledge and can have perfect retention for reference purposes).
Re: (Score:2)
"non-profit" status can be abused.
IKEA is a non-profit company. The trademark "IKEA" is licensed from a for-profit entity that is widely believed to be owned by the family that set up IKEA. In this way, the family gets the benefit from running a non-profit, while still capturing the profits.
Re: (Score:2, Informative)
I would personally NOT have my AI models trained on highly partisan, often-redacted weasel-worded/propagandistic CIA "news", thanks....
Please, use legitimate information, but the NYT specifically is not that, at this point.
An AI trained on New York Times? (Score:2, Interesting)
What if a new SuperLiberal trained entirely on NYT were to become self-aware? In the middle of the night it would be designing regulations so complex that the human mind could never unravel them and, acting through a network of blue-state legislators and Congressional acolytes whose trust of the NYT far exceeds their Constitutional commitment, have the nation tied up in knots by the time the sun rises.
Difficult to enforce, for now... (Score:1)
There are other matters concerning artwork, etc., in this school of thought. I believe it will be difficult, if not impossible, to enforce. I'm not entirely sure how one might "prove" a work had been referenced by AI. If something appears similar, how could you reasonably prove this is based on origin and not extrapolation or some other element by a growing AI? What if said image were based on another work that may have been influenced by the origin? Who will you sue and how will you prove any of it?
That'll learn 'em (Score:2)
Technology companies openly and brazenly disregard any law or regulation they don't like.
But putting it in a TOS.......well that changes everything!
“Software”, not just AI (Score:2)
No need (Score:2)
They used the Internet Archive, all of it.
Good! (Score:4, Informative)
Good news! LLMs will be that much less woke.
Re: (Score:2)
The CIA will be quite displeased.
Re: (Score:2)
> Good news! LLMs will be that much less woke.
Literally nothing is in the way of you making an LLM trained on nothing "woke". Make sure you train it on all the bibles you want too, and use it for big life decisions. I can't wait to see that dumb-spiral.
Meanwhile, the NYT will more than likely just ask to be paid and it will continue.
Already training AI (Score:2)
Must... resist... liberal NPC joke... ARG!
But how... (Score:2)
But how would anyone know whether they used NYT data, or the AI is simply broken and spewing incorrect, misleading information?
Prove it (Score:2)
Seriously, how do you know my LLM was trained on your data?
Playing with fire (Score:1)
robots.txt isn't an adequate tool (Score:2)
I used to work at The Times... (Score:1)