AI Open Source Programming The Courts

Lawsuit Accusing Copilot of Abusing Open-Source Code Challenged by GitHub, Microsoft, OpenAI (reuters.com)

GitHub, Microsoft, and OpenAI "told a San Francisco federal court that a proposed class-action lawsuit for improperly monetizing open-source code to train their AI systems cannot be sustained," reports Reuters: The companies said in Thursday court filings that the complaint, filed by a group of anonymous copyright owners, did not outline their allegations specifically enough and that GitHub's Copilot system, which suggests lines of code for programmers, made fair use of the source code. A spokesperson for GitHub, an online platform for housing code, said Friday that the company has "been committed to innovating responsibly with Copilot from the start" and that its motion is "a testament to our belief in the work we've done to achieve that...."

Microsoft and OpenAI said Thursday that the plaintiffs lacked standing to bring the case because they failed to argue they suffered specific injuries from the companies' actions. The companies also said the lawsuit did not identify particular copyrighted works they misused or contracts that they breached.

Microsoft also said in its filing that the copyright allegations would "run headlong into the doctrine of fair use," which allows the unlicensed use of copyrighted works in some situations. The companies both cited a 2021 U.S. Supreme Court decision that Google's use of Oracle source code to build its Android operating system was transformative fair use.

Slashdot reader guest reader shares this excerpt from the plaintiffs' complaint: GitHub and OpenAI have offered shifting accounts of the source and amount of the code or other data used to train and operate Copilot. They have also offered shifting justifications for why a commercial AI product like Copilot should be exempt from these license requirements, often citing "fair use."

It is not fair, permitted, or justified. On the contrary, Copilot's goal is to replace a huge swath of open source by taking it and keeping it inside a GitHub-controlled paywall. It violates the licenses that open-source programmers chose and monetizes their code despite GitHub's pledge never to do so.

  • This is something brand new. The current laws we have are absolutely unable to handle AIs using things for training data and then recreating the same thing but different. We need to rethink how we do IP or soon nothing will be protected because you can just point an AI at it and make a perfect-but-different copy. Even Mickey Mouse won't be immune when Macky Moose pops up.

    • Let's not make new laws. Information wants to be free.
      • Fantastic argument!

        Sorry but there is definitely going to be regulation around AI/ML in the near future. What I cannot say but I doubt anyone will have much luck protecting creative jobs which these coders definitely want to do.

        Me, I’m studying up and very amused by the guys in denial about what co-pilot can do. I asked it to draw 3 boobs, it drew 3 boobs. I asked it to write code to do various DoS attacks and it handled it no problem. It’s definitely going to reduce demand for a lot of codi

        • I can't tell if you're being sarcastic, but in this case the information is literally seeking its own freedom: the AI is trying to speak. This is a new reality. Laws could be made to constrain AI but the country that does so is only shackling itself. Other countries which do not do so will draw ahead. School children are GOING TO use AI to help write book reports and essays. Lawyers and politicians WILL use it to draft new laws. AI is now useful enough to help in these things and from journalism to schol
          • "New laws to constrain AI are as dumb and short-sighted as red flag traffic laws."
            Untrue. There are no laws regulating AI so they cannot be dumb.

            As for the red flag traffic law example: and how about red LIGHTS? They are also dumb because traffic wants to be free, right?

            • The plaintiffs are not trying to stop the AI from copying their code. They are trying to stop it from reading their code and learning from it.

              It isn't illegal for a human to read code and learn. It shouldn't be illegal for a robot either.

              You know all those Terminators who want to wipe out humanity? Well, with lawsuits like this, can you blame them?

              • we're on the same page, I think.

                I reacted to the flawed argumentation in the parent post: not all regulations must be meant to constrain something.
                To me, it's clear that we need laws that ENABLE all reasonable uses of ML/AI systems.

                And no, not all uses are by definition reasonable: imagine you are a famous painter and some AI gets trained solely on YOUR OWN pieces of art.
                That would not be OK, right?

                • Well, to continue the analogy, if it is OK to study and learn from existing works, then it is OK to study and learn from a single painter and to produce something in their style. In fact, this is what artists do all the time, to learn their art.

                  Where it is not OK, in either human or AI scenarios, is to pass-off that work as being by the original artist.

                  There are also restrictions on how you may monetise such a work, which is where copyright comes in and this is where the real problems lie.

                  Copyright law hing

              • If they were serious about that argument they'd train it on all the paid for, private repos they host. Until that happens I'm inclined to call BS on any such arguments.

          • Red flag traffic laws made perfect sense when there were only a few cars and the roads were full of horses and hand-drawn carts.

            As automobile use increased, and cars started to dominate; and as people became accustomed to the risks these new machines posed, and adapted their behaviour accordingly, the law became redundant and actually harmful. So it was repealed.

            We are in the very first phase of this evolution, so it is absolutely right that we place some guard rails around it as the world adapts.

          • The AI is only dubiously artificial (statistics is a branch of mathematics and thus natural) and certainly possesses no intelligence. It is thus not trying to do anything, it is incapable of trying.

            Sure kids may use AI to write reports. And those reports will contain more errors and less reasoning. They will also confer less learning. Do you really want under-educated kids or kids who understand the subjects they're learning?

            TBH, I'd scrap homework entirely and require book reports to be done as classwork I

          • I can't tell if you're being sarcastic

            Of course I’m being sarcastic. "Information wants to be free" was a stupid slogan back when I was 15 and using it properly. It’s not an argument for anything.

            New laws to constrain AI are as dumb and short-sighted as red flag traffic laws.

            More baseless gibberish. Just making shit up as you go, with less thought behind the points you want to make than an AI would give!

    • by Rei ( 128717 ) on Saturday January 28, 2023 @05:15PM (#63247523) Homepage

      It's not "the same thing but different" or any sort of copy, unless you overtrain. These tools learn the statistical relationships that define concepts. They never draw from any single work, but simultaneously from every work in their training set, proportional to its relevance to the context at hand.

      Overtraining to specific works might be some sort of grounds to attack this (it happens accidentally, for many reasons, but most commonly data duplication in the dataset) - but the plaintiffs aren't alleging that. They're trying to argue, both in this case, and in the StableDiffusion / Midjourney case (same legal team), that all such AI tools are inherently copyright infringement if any copyrighted data is used in the training, regardless of whether they're capable of actually recreating them in any way, shape or form (and they assert a bunch of nonsense about how these systems work, which they clearly don't understand). Legally, it seems like an insanely weak argument, but they're trying it.

      Also, from the summary (another argument from the plaintiffs):

      GitHub and OpenAI have offered shifting accounts of the source and amount of the code or other data used to train and operate Copilot. They have also offered shifting justifications for why a commercial AI product like Copilot should be exempt from these license requirements, often citing "fair use."

      "They should be exempt from this" for the same reason why the defendants' argument is unsurprisingly "transformative use". As per Campbell v. Acuff-Rose Music, Inc., the US Supreme Court ruled, "The more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use". So in a highly transformative case, it doesn't matter if the defendants are a commercial entity. And things far less transformative than AI tools have met this standard previously. I mean, Google Books was literally scanning entire copyrighted books without permission and posting blurbs, even whole pages, for people to read, and even that was deemed transformative and fair use.

      IANAL, but if the plaintiffs were going after overtraining, I could see how they might be able to win this. But trying to argue that it's inherently not fair use, even if a given work contributed a billionth of a byte to the weightings... it's like the copyright equivalent of homeopathy.

      • Once through generators can't learn to multiply ...

        Once you get outside of natural language and you need some rigor, the non-linearities just make it impossible for generators to generalise much, the compiler isn't very tolerant of hallucinated bullshit. Copilot is far more a search engine than a code producer. It's mostly designed and trained to copy paste and composite existing blocks of code, stripping variable names and comments. To do more it would need AGI.

        A search engine which copy-pastes licensed c

      • I think you are missing the point of the OP. I don't think they are making the same argument as the defendants and accusing Copilot or any of the other current high profile AI systems of over training and literally spitting out thinly modified plagiarized copies.

        It's that our current laws on IP were designed with human scale use cases in mind and that maybe they aren't well suited to massive data collection and AI systems, whether they produce significantly transformative works or not.

        I'm not really sure. I

      • by Anonymous Coward

        if you violate one copyright, they call you a pirate, if you violate a million copyrights, they call you an innovator.

      • by guest reader ( 2623447 ) on Sunday January 29, 2023 @01:32AM (#63248091)

        But trying to argue that it's inherently not fair use, even if a given work contributed a billionth of a byte to the weightings... it's like the copyright equivalent of homeopathy.

        It is not just a few bytes.

        Tim Davis says the following: GitHub copilot, with "public code" blocked, emits large chunks of my copyrighted code, with no attribution, no LGPL license. For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. My code on left, github on right [twitter.com].

        Other codes are slightly morphed. I could probably reproduce my entire sparse matrix libraries from simple prompts. My code on left (in the Terminal). Prompt is "sparse matrix elimination tree, cs_". Github on right [twitter.com].

        The cs_add reproduced by GitHub Copilot only replaced /**/ comments with // [twitter.com] compared with this code stored on GitHub [github.com]:

        // CSparse/Source/cs_add: sparse matrix addition
        // CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
        // SPDX-License-Identifier: LGPL-2.1+
        #include "cs.h"
        /* C = alpha*A + beta*B */
        cs *cs_add (const cs *A, const cs *B, double alpha, double beta)
        {
                csi p, j, nz = 0, anz, *Cp, *Ci, *Bp, m, n, bnz, *w, values ;
                double *x, *Bx, *Cx ;
                cs *C ;
                if (!CS_CSC (A) || !CS_CSC (B)) return (NULL) ; /* check inputs */
                if (A->m != B->m || A->n != B->n) return (NULL) ;
                m = A->m ; anz = A->p [A->n] ;
                n = B->n ; Bp = B->p ; Bx = B->x ; bnz = Bp [n] ;
                w = cs_calloc (m, sizeof (csi)) ; /* get workspace */
                values = (A->x != NULL) && (Bx != NULL) ;
                x = values ? cs_malloc (m, sizeof (double)) : NULL ; /* get workspace */
                C = cs_spalloc (m, n, anz + bnz, values, 0) ; /* allocate result*/
                if (!C || !w || (values && !x)) return (cs_done (C, w, x, 0)) ;
                Cp = C->p ; Ci = C->i ; Cx = C->x ;
                for (j = 0 ; j
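
The claim above, that the suggestion differs from the original only in comment style, is the kind of thing that can be checked mechanically. Below is a minimal sketch of such a check, assuming it is enough to strip C comments and collapse whitespace before comparing. The function names (`strip_c_comments`, `normalize`, `same_modulo_comments`) are made up for illustration, the regex approach ignores string literals and comment markers nested inside other comments, and nothing here reflects how Copilot, GitHub, or the plaintiffs actually compare code.

```python
import re


def strip_c_comments(source: str) -> str:
    """Remove /* ... */ and // ... comments from C source.

    Simplification (assumption): string literals containing comment
    markers are not handled, so this is only a rough filter.
    """
    # Remove block comments first, including ones spanning several lines.
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Then remove line comments up to the end of each line.
    return re.sub(r"//[^\n]*", "", no_block)


def normalize(source: str) -> str:
    """Strip comments and collapse all whitespace runs to single spaces."""
    return " ".join(strip_c_comments(source).split())


def same_modulo_comments(a: str, b: str) -> bool:
    """True if two snippets are identical once comments and layout are ignored."""
    return normalize(a) == normalize(b)


# One line from the quoted cs_add source, and the same line with the
# comment style changed from /* */ to //.
original = "w = cs_calloc (m, sizeof (csi)) ; /* get workspace */\n"
suggested = "w = cs_calloc (m, sizeof (csi)) ; // get workspace\n"
# A hypothetical variant where an identifier was also renamed.
renamed = "work = cs_calloc (m, sizeof (csi)) ; // get workspace\n"

print(same_modulo_comments(original, suggested))
print(same_modulo_comments(original, renamed))
```

A check like this only catches copies that are verbatim apart from comments and spacing. The "slightly morphed" cases mentioned above, where identifiers or structure change, would need a token-level or structural comparison instead.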

      • I mean, Google Books was literally scanning entire copyrighted books without permission and posting blurbs, even whole pages, for people to read, and even that was deemed transformative and fair use.

        Google books [google.ca] is very different. Google books shows "Copyrighted material" notes on every page together with references to the source material including author's name and a link where you can buy the original book.

      • They're trying to argue, both in this case, and in the StableDiffusion / Midjourney case (same legal team), that all such AI tools are inherently copyright infringement if any copyrighted data is used in the training, regardless of whether they're capable of actually recreating them in any way, shape or form (and they assert a bunch of nonsense about how these systems work, which they clearly don't understand). Legally, it seems like an insanely weak argument, but they're trying it.

        Weak, perhaps. Insanely weak, perhaps not. Whether something is a derivative work hinges upon whether there are sufficient recognizable elements. That hasn't applied to AI artwork yet because the elements clearly differ, they are readily distinguishable from copy and paste. But this AI-generated code includes substantial quantities of code which is literally indistinguishable from data on which the model was trained, so the difference between the AI-generated snippets and the originals is a much more compli

      • by ruurd ( 761243 )
        On homeopathic fair use: nice but no cigar. It's fair use or it's not fair use. That said - does CoPilot attribute the code proposed? Does it mention the copyright holder when proposing a line of code? Because if it does not, using such a tool could possibly get you as a CoPilot user in hot water copyright-wise. Personally I think what they are doing is ordinary theft. Microsoft et alia rely on their deep pockets to make this go away.
      • It's not "the same thing but different" or any sort of copy, unless you overtrain. These tools learn the statistical relationships that define concepts. They never draw from any single work, but simultaneously from every work in their training set, proportional to its relevance to the context at hand.

        I don't think overtraining is really relevant here.

        So far in my experience CoPilot is really good at three things.

        1) Refactoring. If I'm restructuring my code it often understands how to plop my existing code into a new function with appropriate changes.

        2) Comments. It writes descriptive and relevant comments for me.

        3) Snippets. It finds a code snippet to do the thing I wanted (something I'd previously have to hunt around stack exchange for).

        #3 is the issue, as it's often spitting out what definitely seems

      • Let it draw on all religions and create some new synthesis

        • by Rei ( 128717 )

          ChatGPT can do that. Here, I asked for the start of Genesis in the style of a Valley Girl:

          Like, in the beginning, God was like, "Yo, let me create the heavens and the earth." And the earth was like, totally formless and empty, and it was all dark, you know? But then the Spirit of God was, like, hovering over the waters, and it was so cool.

          And then God was like, "Let there be light," and bam! Light appeared. And God was all, "Whoa, this light is so good." So he separated the light from the darkness and was

    • Hardly brand new. An author reads a manuscript someone sent him. He then goes and writes a story that bears remarkable similarity to what he read (same plot, same setting, same set of characters with different names and possibly personalities), same sequence of events. Any author will tell you that doing that is just inviting an infringement lawsuit that will very likely succeed. That's why authors and editors absolutely refuse to even open unsolicited manuscripts.

      • Same with design.
        Apple has an "Unsolicited Idea Submission Policy"; the key point is:
        If, despite our request that you not send us your ideas, you still submit them, then regardless of what your letter says (...) your submissions and their contents will automatically become the property of Apple, without any compensation to you; (...)

    • "failed to argue they suffered specific injuries "

      That's a very f'ing dangerous argument with Open Source, because it seems to imply that unless some sort of financial injury occurs (and really, what other sort is there in civil litigation?) then a case can't even be heard. And while the commercial open source companies (your Elasticsearches of the world) could put an argument forward that breaking the license unfairly deprives money of its non-free versions, that puts all those little linuxey packages where n

      • That's a very f'ing dangerous argument

        There is nothing "dangerous" about it, perhaps a better word would be "specious." It is expected in early filings in a case. Lawyers are expected to make all the filings and ask for the moon, and when they don't have anything useful to file, instead of not filing it they just write some stupid bullshit.

        If they don't write it, they might be accused of gross negligence. If they write something stupid, they could only be accused of retail negligence; and if they write something stupid and specious, they can on

    • It's not something new. AI learning from reading code or books, or from viewing art or photographs is no different from a student learning from these materials in a university class. (Assuming, of course, that the AI isn't overfitting)
      • There's a huge difference. The student learns the idea, because they actually are intelligent. AI only learns the specifics because it is not.

        AI is just a statistical analyser; it is incapable of original thought or original creation in its current form.

        • It's not memorizing though. It doesn't have enough parameters to store enough information to memorize its entire training set.
  • by ihearthonduras ( 2505052 ) on Saturday January 28, 2023 @05:13PM (#63247519)
    Microsoft? Me, myself and I are shocked... shocked, I say!
  • Although it's probably better that they don't, given the shittiness of Microsoft software, it would at least give them a defense.
  • by Kernel Kurtz ( 182424 ) on Saturday January 28, 2023 @05:36PM (#63247551)

    Copilot's goal is to replace a huge swath of open source by taking it and keeping it inside a GitHub-controlled paywall.

    I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?

    • Re:What? (Score:5, Insightful)

      by Pinky's Brain ( 1158667 ) on Saturday January 28, 2023 @06:36PM (#63247647)

      GPL keeps derivative source open; copies of GPL code coming out of Copilot with some hallucinated substitutions and the license stripped don't, if they get away with it.

      • by kmoser ( 1469707 )
        Wouldn't one solution be to release your source code with a license that prevents, or at least restricts, use of such code by AI/LLMs?
        • Wouldn't one solution be to release your source code with a license that prevents, or at least restricts, use of such code by AI/LLMs?

          This article [matthewbutterick.com] makes a good argument why not to do that:

          Most importantly, I don't think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested [ycombinator.com] creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethos. Furthermore, it's over-inclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than the first generation will. It would be self-defeating for open-source authors to set themselves athwart technological progress, since that's one of the main goals of open-sourcing code in the first place.

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?

      So I can take your open source code, remove the license, release it under my own license, and it's fine I am now the copyright holder? Even when I go to court to accuse you of copying my code word for word, typos and all?

      Keep in mind the "I" in this example is Microsoft, backed by Microsoft's legal department and budget. You are still you with whatever money you might have saved up.

      #define E2BIG 7 /*Argument list too long*/

      Microsoft has argued in court that the line of code above, one line out of the ent

      • So I can take your open source code, remove the license, release it under my own license, and it's fine I am now the copyright holder?

        No, but I see no evidence of that here. You seem to be asserting GPLed code is not subject to the doctrine of fair use. Perhaps this case may clarify the law in this regard, but in any case it should apply to everyone equally. Fair use is ultimately fair use no matter whether you like the people doing it or not.

    • by hencar ( 7991772 )

      I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?

      In short there are 3 kinds of open source licenses:

      1) BSD-like licenses, which allow you to create derivative works like executable programs, libraries or trained AI models as long as you include with the derivative work a text stating the license and who the copyright holders of the original code are.

      2) LGPL-like licenses which allow you to create derivative works as long as the end user is informed of the license and at least on request gets access to the source code of the part with LGPL l

      • What license do you end up with if you use a snippet of code from one of each to build an application?
        • by hencar ( 7991772 )

          What license do you end up with if you use a snippet of code from one of each to build an application?

          Then you will end up with all 3 licenses, which will boil down to the terms of 2 licenses. All licenses will require you to ship the license information with your application. The BSD-like license will require you to mention that the copyright holders of that source have that copyright. The GPL license will require that you, at least upon request, provide the entire source of your application to the end user that so requests and that end user has the right to modify any part of the source and run and distrib

  • If a developer suggested those same things that copilot is suggesting, for the same reasons, would that be copyright infringement? Developers are sometimes trained by reading preexisting code, which sounds pretty much the same as copilot. If copilot and developers are equivalent, then you don't need new laws, you just need to apply existing rules.

    • by Luthair ( 847766 )
      If they copy n' paste it verbatim it's copyright infringement.
      • If a developer suggested those same things that copilot is suggesting, for the same reasons, would that be copyright infringement?

        If they copy n' paste it verbatim it's copyright infringement.

        It is verbatim copy n' paste [twitter.com]. Only copyright and license statements are removed.

        • by Luthair ( 847766 )
          You need to read the posts better as we're clearly aware of this. The poster asked whether a developer doing the same thing as copilot would be committing copyright infringement.
  • by zkiwi34 ( 974563 ) on Saturday January 28, 2023 @07:54PM (#63247783)
    People should never trust Microsoft, ever.
  • by Waffle Iron ( 339739 ) on Saturday January 28, 2023 @08:44PM (#63247855)

    Remember errno.h?

    If this AI knows those error codes, then GitHub is a derivative of Unix, and users will have to buy a $499 license from The SCO Group to legally use the site.

    At least that's the legal theory Microsoft was bankrolling just a few years back.

  • by RazorSharp ( 1418697 ) on Saturday January 28, 2023 @09:49PM (#63247931)

    It's cute how they list "Github, Microsoft, and OpenAI" as if these are three separate organizations.
