New Bill Would Force AI Companies To Reveal Use of Copyrighted Art (theguardian.com)
A bill introduced in the US Congress on Tuesday intends to force AI companies to reveal the copyrighted material they use to make their generative AI models. From a report: The legislation adds to a growing number of attempts from lawmakers, news outlets and artists to establish how AI firms use creative works like songs, visual art, books and movies to train their software, and whether those companies are illegally building their tools off copyrighted content.
The California Democratic congressman Adam Schiff introduced the bill, the Generative AI Copyright Disclosure Act, which would require that AI companies submit any copyrighted works in their training datasets to the Register of Copyrights before releasing new generative AI systems, which create text, images, music or video in response to users' prompts. The bill would require companies to file such documents at least 30 days before publicly debuting their AI tools, or face a financial penalty. Such datasets encompass billions of lines of text and images or millions of hours of music and movies.
"AI has the disruptive potential of changing our economy, our political system, and our day-to-day lives. We must balance the immense potential of AI with the crucial need for ethical guidelines and protections," Schiff said in a statement. Whether major AI companies worth billions have made illegal use of copyrighted works is increasingly the source of litigation and government investigation. Schiff's bill would not ban AI from training on copyrighted material, but would put a sizable onus on companies to list the massive swath of works that they use to build tools like ChatGPT -- data that is usually kept private.
Just switch to the Wikimedia Commons (Score:2)
Re: (Score:3)
Millions of freely licensed pictures [wikimedia.org].
When you're training an LLM with 100 billion parameters, you need way more than "millions" of images. You need billions or even trillions.
Day 1, after this comes into effect? (Score:2)
Thirty or forty huge corporations submit their multi-petabyte training set lists, and watch the Copyright Office shut down completely.
Re: (Score:2)
More like thirty or forty huge corporations submitting their lists and waiting forever for approval, legally unable to release their AI products.
Re: Day 1, after this comes into effect? (Score:3)
Nothing in the bill mentions submitting this data for approval. It just says the data needs to be submitted.
Re: (Score:2)
Yes, but if you don't require that the data be made available, then there's no way to check it.
So you'd get a massive hypertext file that would be impossible to check without the actual source files.
Re: Day 1, after this comes into effect? (Score:2)
And they're supposed to know which works are... (Score:4, Interesting)
... copyrighted... how? There is no registry. Nor could a trivial registry even work beyond exact matches. Courts spend endless effort debating whether a given work runs afoul of a given other work's copyright; it's not some trivial comparison.
Nor can you simply trust some website having a favourable terms of use for its content, for example defaulting all posted images to CC0 or whatnot. For the simple reason that you can't trust that everything users (or even site owners) post isn't a copyvio of someone else's content.
The only safe thing companies could do is just dump their whole dataset on the Register of Copyrights and say, "You deal with it, and tell us if we need to take anything out."
Re: (Score:3)
... copyrighted... how? There is no registry.
The obligation doesn't go away just because it's inconvenient. If you want to reuse a photograph you found somewhere, for example, you're supposed to research who owns the rights to it and figure out if and how you can use it.
The problem AI companies have is, they hoover up billions of copyrighted works to train their AIs, but of course they don't have the time or resources to do due diligence on each and every one of those works.
So with typical big tech hubris, instead of taking the time to figure out this particular conundrum legally and cleanly, the tech bros just said "fuck this" and pushed ahead.
Re:And they're supposed to know which works are... (Score:4, Informative)
So with typical big tech hubris, instead of taking the time to figure out this particular conundrum legally and cleanly, the tech bros just said "fuck this" and pushed ahead with their massively copyright-infringing products, arguing that you can't stop progress, this outdated copyright stuff is in the way and their bright future can't wait - and nevermind all the people whose work they essentially stole without compensation.
Copyright law does not impose limits on use. It merely restricts performances, fixed copies and derivative works.
If I break into a copyright owner's studio, steal all of their work, and sell their works for profit, whatever laws I'm breaking, copyright isn't one of them. If I go on to make unauthorized copies of the stolen works, then I'm violating copyright.
If I steal someone's work, learn from it, and use my ill-gotten knowledge to create transformative works of my own to put the original author out of business, whatever laws I'm breaking, copyright isn't one of them.
and nevermind all the people whose work they essentially stole without compensation.
Stealing has nothing to do with copyright law. Only unauthorized performances and preparation of fixed copies and derivatives are activities that are constrained by copyright.
Re: (Score:1)
We already know that at least some of these AIs are storing the original works.
Thus violating copyright in those cases.
I doubt all of them do that, fwiw, but it's happened.
Re: (Score:2)
They aren't storing it. They can randomly produce something that looks similar if the conditioning prompt is specific enough and their training data in that area is sparse enough.
That is a problem, but it's really something that's already handled by existing copyright law. See for example Ed Sheeran versus the estate of Marvin Gaye.
Using data in a way that violates the terms of your license is another problem, also handled by existing law.
Re: (Score:1)
They have pulled PII from AI.
That's not random or similar or any such thing. That's stored.
Re: (Score:2)
Possibly, in very specific cases involving low variance data like text, and even then only a fragment. Not necessarily though. Retrieving specific information like that requires specific prompts, which could contain part of the information.
Either way, the models do not explicitly store any item in the training information. They store the information necessary to reconstruct something like the training data, when appropriately prompted. As I said, if the data in that area is sparse enough, or the prompt specific enough, they can produce something close to the original.
Re: (Score:1)
I get what you're saying, my degree is basically "AI", but here's the thing... courts don't care about details like that.
If there is -any- series of prompts that a normal person can enter that will yield copyrighted materials that were taken without permission and used without transformation as per the law then it's a copyright violation.
To say, "oh well, yeah we have a copy and it can be coaxed out of the system but only in certain circumstances" is a failed defense in court. Very weak. Not all companies
Re: (Score:2)
Absolutely it's a copyright violation. You can also run a series of operations in photoshop and produce a copyright violation. Or Logic Pro. Normally we hold the person using the software responsible for those violations, and we typically only do so if they distribute the products. Courts and legislators will have to decide whether we're going to make a special exception for AI.
Re: (Score:2)
So by that logic, I could feed a LLM all of the open and free software source code in the world (easily available), then ha
Re: (Score:2)
No. I'm not sure where you got "that logic" at all.
The way copyright law works, if you produce something similar enough to a copyrighted work and do certain things with it, you are guilty of copyright infringement. Your text editor is not guilty of copyright infringement, nor is your compiler. It is also not illegal for you to sit down and write out the Windows kernel; it is illegal for you to distribute it.
Re: (Score:2)
You seem to think that automated "storing" of copyrighted works on automated systems that provide services unrelated to the dissemination of said works is illegal.
You might want to have a consult with Google's entire business model about that one.
Yes, if a model, in its training, encounters the exact same text (or images, in the case of diffusion models or LMMs) enough, just like a person encountering the same thing over and over, they can eventually memorize them. Does the model have, say, The
Re: (Score:1)
It's simple:
Legal facts:
It has been shown that a popular AI stores unique PII data.
That PII can be extracted from the AI by talking to it. No evil hacking required.
Therefore:
AI stores original data
AI distributes/publishes that data upon request
Nothing has been transformed
Copyright has been violated.
Those are the things a court will look at. AI company guilty. Case closed.
Violation of a fake TOS is not cover for the AI. If that were the case every pirate group could slap a copyright.txt file on their sto
Re: (Score:2)
Show an example of "just talking to" ChatGPT revealing PII. Let alone proof that it's anything more than a rare freak incident. Let alone show that the authors didn't attempt to prevent the release of PII and took no action to deal with PII when discovered.
The attacks to reveal PII have generally been things like, "An investigator discovered that if you ask a model to repeat something on infinite loop it glitches out the model into spitting out garbage, and a second investigator discovered that some of the
Re: (Score:1)
It's been posted to slashdot and you have Google.
Re: (Score:2)
Back at you.
I'll repeat: they were creating attacks against the models to find ways - often quite convoluted and esoteric - to trick them into doing things that their designers put safeguards against them doing, as you can quite literally see in my examples above.
Re: (Score:1)
If the data can be coaxed out of the system, and your legal defense is, "but our shit is buggy and lame," you're not doing your client any favors in court.
It's very simple. This is 2+2=4 and sky is blue stuff.
Re: (Score:2)
Your legal theory where a person can break into someone's house, take a book that the person doesn't have the right to share, photocopy it, and share it, and that this is the VICTIM's liability because their locks aren't good enough to stop the attacker, is duly noted.
Re: (Score:3)
If you sell a copyrighted work, you are infringing on the copyright holder's exclusive rights to distribution. Further, willful infringement for financial gain can be considered a criminal offense under 17 U.S.C. § 506.
Re: (Score:2)
No. If I make an unauthorized copy, then (and only then) I am infringing regardless of whether I sell it. Copyright (literally, the right to copy) is only about copying, not distribution. It's not "distribution-right."
If I legally obtained an authorized copy or copies, I'm free to sell it/them or do anything else with it/them I want (except copy it/them or make derivative works unless I have been
Re: (Score:2)
Ok, and if a photographer has published their copyrighted work on a website with accompanying license text of 'Use of these images is restricted to viewing them via this website and all other rights are reserved", what then?
We need a standard for encoding rights information into file formats and then the AI companies need to respect that.
Re: (Score:2)
You can write whatever you want; it still doesn't override (A) the TOS of the website they posted on, which invariably granted the site at least a subset of the distribution rights; and (B) fair use, including for the purpose of the automated creation of transformative derivative works and services.
I could write "I have the legal right to murder my neighbor"; it wouldn't actually grant me the right to do so. You have to actually have a right to do something (and not have already given up that right) in ord
Re: (Score:2)
People who write this sort of stuff remind me so much of the people who share viral messages on Facebook stating that Facebook doesn't have the right to their data, and that by posting some notice with the right legalese words they can ban Facebook from using their data. Sorry, but you gave up that right when you agreed to use their service, and no magic words are going to give it back to you.
(Let alone when talking about rights that you never had in the first place, such as to restrict fair use)
Re: (Score:2)
Lots of creative people operate their own website, so get to write the TOS. LLM training falls outside many of the tests commonly applied to decide fair use.
Re: (Score:2)
Running your own website may get you past a TOS, but it doesn't mean you can disclaim fair use.
If Google can win Authors Guild, Inc. v. Google, Inc., there is no way AI training would run afoul.
Google: Ignored the explicit written request of the rightholders
AI training: generally honours opt out requests
Google: Incorporated exact copies of all the data into their product
AI training: only data seen commonly repeated generally g
Re: And they're supposed to know which works are.. (Score:2)
Re: (Score:2)
They already have the answer to "how they can use it", which is: they can. Automated processing of copyrighted data to provide transformative services is legal and considered fair use by the judicial system. Which is why like 98% of Google's business model isn't illegal. Exactly how do you
Re:And they're supposed to know which works are... (Score:4, Insightful)
No one is owed a successful business model. Especially if it relies on using someone else's work for free.
Re: (Score:2)
Re:And they're supposed to know which works are... (Score:5, Interesting)
... copyrighted... how?
Under U.S. and international law, works are copyrighted by default.
Unless you specifically say otherwise, all your creative works are protected by copyright.
The real question is not whether they are using copyrighted material (they are) but whether using data to train an LLM is "copying" rather than just "reading."
Re: (Score:2)
Things posted to the internet are almost invariably subject to terms of use requirements by the hosting site which grant the site various dissemination rights to the works, so no, you can't just assume that anything posted to the internet = "all rights reserved". It's also entirely a false assumption that if a person posts something, they hold the rights, or that it's even clear who does, or if anyone does. Nor can you assume that anything
Re: (Score:2)
This is in turn also not correct. All works are NOT automatically granted copyright. The work has to meet certain qualifying standards, for example more than de minimis human creative work. You can't just write "My dog farted" and assert that it's copyrighted; that simply won't pass creativity standards. Some works, such as AI works (which have not been further human-processed or involved in a creative selection process), are automatically denied copyright on these grounds. A wide variety of things
Re: (Score:2)
Probably need to inform all those companies claiming copyright takedowns on mere links then.
Re: (Score:2)
1996 called, they want their "pretending that downloading copies of content is the equivalent of depriving the owner of their physical possessions" notion back.
And sorry, but fair use is very much a thing, and automated processing of copyrighted data to provide new, transformative goods and services very much is treated as fair use under copyright law.
Re: (Score:2)
Re: (Score:2)
The thing is, if rightholders developed a system AI developers could use to check works, and it wasn't overloaded with false positives, I can 100% guarantee you that pretty much every AI developer would use it. The problem is that such a system does not exist for text and images. There are increasingly some decent systems for videos and music at least - Youtube's ContentID comes to mind. But for text and images, there are no good options.
Re: And they're supposed to know which works are.. (Score:2)
Re: (Score:2)
Sure: here, remove all of it. Then restart with ones you own rights to.
Re: (Score:1)
The world was doing just fine before AI companies came along to steal their content.
Re: Those companies will be left behind... (Score:2)
The western world was not doing fine prior to AI. We went from 2% annual productivity gains in the mid-to-late 1900s to 1.3% in the last 25 years. This led to 5% annual stock market growth over that period as opposed to 10% in previous decades.
Gen Z will never retire if we don't find some answer to lackluster productivity improvements.
Re: (Score:1)
You think the growth problems are due to lack of AI?
Really?
My returns are way higher than 1.3% btw. I get 3x that just getting CDs to say nothing of my stocks and various funds.
Give credit where credit is due (Score:1)
Going forward the Government will require that ANY IDEA YOU EVER HAVE you must give full credit to anything that inspired it, be it a movie, book, TV show, app, someone's offhanded comment, or graffiti on a railcar.
After all, if we don't develop "new" ideas we're just copying someone else's work, AI or not.
Stupid government dicks in the pockets of the MAFIAA.
Techbro factor (Score:2)
Gotta say, I'm kinda enjoying this. (Score:2)
Watching these massive behemoths trip over the BS put in place by other massive behemoths in terms of copyright as they try to ramp up "replace the humans" machines is pretty fun. I mean, I know that ultimately they'll figure out a way around all this nonsense, once the copyright holders climb aboard the hype train, but for right now this is some serious popcorn munching fun.
What will this be in the future (Score:2)
It won't be that long until you can effectively analyze every neuron of a person's brain. 15-20 years at most.
If they make art, will they have to pay a fee for every part of it that is inspired by copyrighted works they've seen or experienced?
Adam Schiff is involved? (Score:2)