United States AI

New Bill Would Force AI Companies To Reveal Use of Copyrighted Art (theguardian.com)

A bill introduced in the US Congress on Tuesday intends to force AI companies to reveal the copyrighted material they use to make their generative AI models. From a report: The legislation adds to a growing number of attempts from lawmakers, news outlets and artists to establish how AI firms use creative works like songs, visual art, books and movies to train their software, and whether those companies are illegally building their tools off copyrighted content.

The California Democratic congressman Adam Schiff introduced the bill, the Generative AI Copyright Disclosure Act, which would require that AI companies submit any copyrighted works in their training datasets to the Register of Copyrights before releasing new generative AI systems, which create text, images, music or video in response to users' prompts. The bill would require companies to file such documents at least 30 days before publicly debuting their AI tools, or face a financial penalty. Such datasets encompass billions of lines of text and images or millions of hours of music and movies.

"AI has the disruptive potential of changing our economy, our political system, and our day-to-day lives. We must balance the immense potential of AI with the crucial need for ethical guidelines and protections," Schiff said in a statement. Whether major AI companies worth billions have made illegal use of copyrighted works is increasingly the source of litigation and government investigation. Schiff's bill would not ban AI from training on copyrighted material, but would put a sizable onus on companies to list the massive swath of works that they use to build tools like ChatGPT -- data that is usually kept private.


  • Thirty or forty huge corporations submit their multi-petabyte training set lists, and watch the Copyright Office shut down completely.

  • by Rei ( 128717 ) on Wednesday April 10, 2024 @06:37PM (#64384850) Homepage

    ... copyrighted... how? There is no registry. Nor could a trivial registry even work beyond exact matches. Courts spend endless effort debating whether a given work runs afoul of a given other work's copyright; it's not some trivial comparison.

    Nor can you simply trust some website having a favourable terms of use for its content, for example defaulting all posted images to CC0 or whatnot. For the simple reason that you can't trust that everything users (or even site owners) post isn't a copyvio of someone else's content.

    The only safe thing companies could do is just dump their whole dataset to the Register of Copyrights and say, "You deal with it, and tell us if we need to take anything out"

    • ... copyrighted... how? There is no registry.

      It's not because it's inconvenient that it doesn't exist. If you want to reuse a photograph you found somewhere for example, you're supposed to research who owns the rights to it and figure out if and how you can use it.

      The problem AI companies have is, they hoover up billions of copyrighted works to train their AIs, but of course they don't have the time or resources to do due diligence on each and every one of those works.

      So with typical big tech hubris, instead of taking the time to figure out this part

      • by WaffleMonster ( 969671 ) on Wednesday April 10, 2024 @07:31PM (#64385004)

        So with typical big tech hubris, instead of taking the time to figure out this particular conundrum legally and cleanly, the tech bros just said "fuck this" and pushed ahead with their massively copyright-infringing products, arguing that you can't stop progress, this outdated copyright stuff is in the way and their bright future can't wait - and nevermind all the people whose work they essentially stole without compensation.

        Copyright law does not impose limits on use. It merely restricts performances, fixed copies and derivative works.

        If I break into a copyright owner's studio, steal all of their work, and sell it all for profit, then whatever laws I'm breaking, copyright isn't one of them. If I go on to make unauthorized copies of the stolen works, then I'm violating copyright.

        If I steal someone's work, learn from it, and use my ill-gotten knowledge to create transformative works of my own to put the original author out of business, then whatever laws I'm breaking, copyright isn't one of them.

        and nevermind all the people whose work they essentially stole without compensation.

        Stealing has nothing to do with copyright law. Only unauthorized performances and preparation of fixed copies and derivatives are activities that are constrained by copyright.

        • We already know that at least some of these AI are storing the original works.

          Thus violating copyright in those cases.

          I doubt all of them do that, fwiw, but it's happened.

          • by ceoyoyo ( 59147 )

            They aren't storing it. They can randomly produce something that looks similar if the conditioning prompt is specific enough and their training data in that area is sparse enough.

            That is a problem, but it's really something that's already handled by existing copyright law. See for example Ed Sheeran versus the estate of Marvin Gaye.

            Using data in a way that violates the terms of your license is another problem, also handled by existing law.

            • They have pulled PII from AI.

              That's not random or similar or any such thing. That's stored.

              • by ceoyoyo ( 59147 )

                Possibly, in very specific cases involving low variance data like text, and even then only a fragment. Not necessarily though. Retrieving specific information like that requires specific prompts, which could contain part of the information.

                Either way, the models do not explicitly store any item in the training information. They store the information necessary to reconstruct something like the training data, when appropriately prompted. As I said, if the data in that area is sparse enough, or the prompt spec

                • I get what you're saying, my degree is basically "AI", but here's the thing... courts don't care about details like that.

                  If there is -any- series of prompts that a normal person can enter that will yield copyrighted materials that were taken without permission and used without transformation as per the law then it's a copyright violation.

                  To say, "oh well, yeah we have a copy and it can be coaxed out of the system but only in certain circumstances" is a failed defense in court. Very weak. Not all companies

                  • by ceoyoyo ( 59147 )

                    Absolutely it's a copyright violation. You can also run a series of operations in Photoshop and produce a copyright violation. Or Logic Pro. Normally we hold the person using the software responsible for those violations, and we typically only do so if they distribute the products. Courts and legislators will have to decide whether we're going to make a special exception for AI.

                • by tlhIngan ( 30335 )

                  Either way, the models do not explicitly store any item in the training information. They store the information necessary to reconstruct something like the training data, when appropriately prompted. As I said, if the data in that area is sparse enough, or the prompt specific enough, "something like it" could end up being similar enough to the training item that we consider it the same.

                  So by that logic, I could feed an LLM all of the open and free software source code in the world (easily available), then ha

                  • by ceoyoyo ( 59147 )

                    No. I'm not sure where you got "that logic" at all.

                    The way copyright law works, if you produce something similar enough to a copyrighted work and do certain things with it, you are guilty of copyright infringement. Your text editor is not guilty of copyright infringement, nor is your compiler. It is also not illegal for you to sit down and write out the Windows kernel; it is illegal for you to distribute it.

          • by Rei ( 128717 )

            You apparently seem to think that automated "storing" of copyrighted works on automated systems that provide services unrelated to the dissemination of said works is illegal.

            You might want to have a consult with Google's entire business model about that one.

            Yes, if a model, in its training, encounters the exact same text (or images, in the case of diffusion models or LMMs) enough, just like a person encountering the same thing over and over, they can eventually memorize them. Does the model have, say, The

            • It's simple:

              Legal facts:
              It has been shown that a popular AI stores unique PII data.
              That PII can be extracted from the AI by talking to it. No evil hacking required.

              Therefore:
              AI stores original data
              AI distributes/publishes that data upon request
              Nothing has been transformed
              Copyright has been violated.

              Those are the things a court will look at. AI company guilty. Case closed.

              Violation of a fake TOS is not cover for the AI. If that were the case every pirate group could slap a copyright.txt file on their sto

              • by Rei ( 128717 )

                Show an example of "just talking to" ChatGPT revealing PII. Let alone proof that it's anything more than a rare freak incident. Let alone show that the authors didn't attempt to prevent the release of PII and took no action to deal with PII when discovered.

                The attacks to reveal PII have generally been things like, "An investigator discovered that if you ask a model to repeat something on infinite loop it glitches out the model into spitting out garbage, and a second investigator discovered that some of the

                • It's been posted to slashdot and you have Google.

                  • by Rei ( 128717 )

                    Back at you.

                    I'll repeat: they were creating attacks against the models to find ways - often quite convoluted and esoteric - to trick them into doing things that their designers put safeguards against them doing, as you can quite literally see in my examples above.

                    • If the data can be coaxed out of the system, and your legal defense is, "but our shit is buggy and lame", you're not doing your client any favors in court.

                      It's very simple. This is 2+2=4 and sky is blue stuff.

                    • by Rei ( 128717 )

                      Your legal theory where a person can break into someone's house, take a book that the person doesn't have the right to share, photocopy it, and share it, and that this is the VICTIM's liability because their locks aren't good enough to stop the attacker, is duly noted.

        • by The Cat ( 19816 )

          If you sell a copyrighted work, you are infringing on the copyright holder's exclusive rights to distribution. Further, willful infringement for financial gain can be considered a criminal offense under 17 U.S.C. 506

          • If you sell a copyrighted work, you are infringing on the copyright holder's exclusive rights to distribution.

            No. If I make an unauthorized copy, then (and only then) I am infringing regardless of whether I sell it. Copyright (literally, the right to copy) is only about copying, not distribution. It's not "distribution-right."

            If I legally obtained an authorized copy or copies, I'm free to sell it/them or do anything else with it/them I want (except copy it/them or make derivative works unless I have been

        • Ok, and if a photographer has published their copyrighted work on a website with accompanying license text of "Use of these images is restricted to viewing them via this website and all other rights are reserved", what then?

          We need a standard for encoding rights information into file formats and then the AI companies need to respect that.

          • by Rei ( 128717 )

            You can write whatever you want; it still doesn't override (A) the TOS of the website they posted on, which invariably granted the site at least a subset of the distribution rights; and (B) fair use, including for the purpose of the automated creation of transformative derivative works and services.

            I could write "I have the legal right to murder my neighbor"; it wouldn't actually grant me the right to do so. You have to actually have a right to do something (and not have already given up that right) in ord

            • by Rei ( 128717 )

              People who write this sort of stuff remind me so much of the people who share viral messages on Facebook stating that Facebook doesn't have the right to their data, and that by posting some notice with the right legalese words they can ban Facebook from using their data. Sorry, but you gave up that right when you agreed to use their service, and no magic words are just going to give it back to you.

              (Let alone when talking about rights that you never had in the first place, such as to restrict fair use)

            • Lots of creative people operate their own website, so get to write the TOS. LLM training falls outside many of the tests commonly applied to decide fair use.

              • by Rei ( 128717 )

                Running your own website may get you past a TOS, but it doesn't mean you can disclaim fair use.

                LLM training falls outside many of the tests commonly applied to decide fair use.

                If Google can win Authors Guild, Inc. v. Google, Inc., there is no way AI training would run afoul.

                Google: Ignored the explicit written request of the rightholders
                AI training: generally honours opt out requests

                Google: Incorporated exact copies of all the data into their product
                AI training: only data seen commonly repeated generally g

        • I think it's amusing you flat out said "Copyright doesn't say I can't use a derivative work generation machine, it just says no derivative works".
      • by Rei ( 128717 )

        It's not because it's inconvenient that it doesn't exist. If you want to reuse a photograph you found somewhere for example, you're supposed to research who owns the rights to it and figure out if and how you can use it

        They already have the answer to "how they can use it", which is: they can. Automated processing of copyrighted data to provide transformative services is legal and considered fair use by the judicial system. Which is why like 98% of Google's business model isn't illegal. Exactly how do you

    • by iAmWaySmarterThanYou ( 10095012 ) on Wednesday April 10, 2024 @07:23PM (#64384984)

      No one is owed a successful business model. Especially if it relies on using someone else's work for free.

    • Comment removed based on user account deletion
    • by ShanghaiBill ( 739463 ) on Wednesday April 10, 2024 @07:41PM (#64385030)

      ... copyrighted... how?

      Under U.S. and international law, works are copyrighted by default.

      Unless you specifically say otherwise, all your creative works are protected by copyright.

      The real question is not whether they are using copyrighted material (they are) but whether using data to train an LLM is "copying" rather than just "reading."

      • by Rei ( 128717 )

        Under U.S. and international law, works are copyrighted by default.

        Things posted to the internet are almost invariably subject to terms of use requirements by the hosting site which grant the site various dissemination rights to the works, so no, you can't just assume that anything posted to the internet = "all rights reserved". It's also entirely a false assumption that if a person posts something, they hold the rights, or that it's even clear who does, or if anyone does. Nor can you assume that anything

      • by evanh ( 627108 )

        Probably need to inform all those companies claiming copyright takedowns on mere links then.

    • This enables media companies to develop tools to automatically scan lists/repositories/datasets/etc. in order to find works that they hold copyright to & to either charge a fee or litigate for compensation in the courts. It's just another way for them to charge people & other corporations for using stuff that they claim is theirs.
      • by Rei ( 128717 )

        The thing is, if rightholders developed a system AI developers could use to check works, and it wasn't overloaded with false positives, I can 100% guarantee you that pretty much every AI developer would use it. The problem is that such a system does not exist for text and images. There are increasingly some decent systems for videos and music at least - Youtube's ContentID comes to mind. But for text and images, there are no good options.

    • by whitroth ( 9367 )

      Sure: here, remove all of it. Then restart with ones you own rights to.

  • Going forward the Government will require that ANY IDEA YOU EVER HAVE you must give full credit to anything that inspired it, be it a movie, book, TV show, app, someone's offhanded comment, or graffiti on a railcar.

    After all, if we don't develop "new" ideas we're just copying someone else's work, AI or not.

    Stupid government dicks in the pockets of the MAFIAA.

  • They'll just make no changes, claim they're not infringing, and call it something else instead that means the same thing and hope their bullshit sticks. We saw this with illegal taxis, letting cruise control operate itself, and money laundering (repeatedly). The public and regulators need a teflon coating.
  • Watching these massive behemoths trip over the BS put in place by other massive behemoths in terms of copyright as they try to ramp up "replace the humans" machines is pretty fun. I mean, I know that ultimately they'll figure out a way around all this nonsense, once the copyright holders climb aboard the hype train, but for right now this is some serious popcorn munching fun.

  • It won't be that long until you can effectively analyze every neuron of a person's brain. 15-20 years at most.

    If they make art, will they have to pay a fee for every part of it that is inspired by copyrighted works they've seen or experienced?

  • Yea like this is a real thing!
