Lawsuit Accusing Copilot of Abusing Open-Source Code Challenged by GitHub, Microsoft, OpenAI (reuters.com)
GitHub, Microsoft, and OpenAI "told a San Francisco federal court that a proposed class-action lawsuit accusing them of improperly monetizing open-source code to train their AI systems cannot be sustained," reports Reuters:
The companies said in Thursday court filings that the complaint, filed by a group of anonymous copyright owners, did not outline their allegations specifically enough and that GitHub's Copilot system, which suggests lines of code for programmers, made fair use of the source code. A spokesperson for GitHub, an online platform for housing code, said Friday that the company has "been committed to innovating responsibly with Copilot from the start" and that its motion is "a testament to our belief in the work we've done to achieve that...."
Microsoft and OpenAI said Thursday that the plaintiffs lacked standing to bring the case because they failed to argue they suffered specific injuries from the companies' actions. The companies also said the lawsuit did not identify particular copyrighted works they misused or contracts that they breached.
Microsoft also said in its filing that the copyright allegations would "run headlong into the doctrine of fair use," which allows the unlicensed use of copyrighted works in some situations. The companies both cited a 2021 U.S. Supreme Court decision that Google's use of Oracle source code to build its Android operating system was transformative fair use.
Slashdot reader guest reader shares this excerpt from the plaintiffs' complaint: GitHub and OpenAI have offered shifting accounts of the source and amount of the code or other data used to train and operate Copilot. They have also offered shifting justifications for why a commercial AI product like Copilot should be exempt from these license requirements, often citing "fair use."
It is not fair, permitted, or justified. On the contrary, Copilot's goal is to replace a huge swath of open source by taking it and keeping it inside a GitHub-controlled paywall. It violates the licenses that open-source programmers chose and monetizes their code despite GitHub's pledge never to do so.
AI needs whole new laws (Score:2, Interesting)
This is something brand new. The current laws we have are absolutely unable to handle AIs using things for training data and then recreating the same thing but different. We need to rethink how we do IP, or soon nothing will be protected, because you can just point an AI at it and make a perfect-but-different copy. Even Mickey Mouse won't be immune when Macky Moose pops up.
Re: (Score:1)
Re: (Score:1)
Fantastic argument!
Sorry, but there is definitely going to be regulation around AI/ML in the near future. What form it will take I cannot say, but I doubt anyone will have much luck protecting creative jobs, which is what these coders definitely want to do.
Me, I’m studying up and very amused by the guys in denial about what co-pilot can do. I asked it to draw 3 boobs, it drew 3 boobs. I asked it to write code to do various DoS attacks and it handled it no problem. It’s definitely going to reduce demand for a lot of codi
Re: (Score:1)
no dumb laws, or no laws? (Score:1)
"New laws to constrain AI are as dumb and short-sighted as red flag traffic laws."
Untrue. There are no laws regulating AI so they cannot be dumb.
As for the red flag traffic law example: and how about red LIGHTS? They are also dumb because traffic wants to be free, right?
Re: (Score:3)
The plaintiffs are not trying to stop the AI from copying their code. They are trying to stop it from reading their code and learning from it.
It isn't illegal for a human to read code and learn. It shouldn't be illegal for a robot either.
You know all those Terminators who want to wipe out humanity? Well, with lawsuits like this, can you blame them?
Re: (Score:1)
We're on the same page, I think.
I reacted to the flawed argumentation in the parent post: not all regulations must be meant to constrain something.
To me, it's clear that we need laws that ENABLE all reasonable uses of ML/AI systems.
And no, not all uses are by definition reasonable: imagine you are a famous painter and some AI gets trained solely on YOUR OWN pieces of art.
That would not be OK, right?
Re: no dumb laws, or no laws? (Score:1)
Well, to continue the analogy, if it is OK to study and learn from existing works, then it is OK to study and learn from a single painter and to produce something in their style. In fact, this is what artists do all the time, to learn their art.
Where it is not OK, in either the human or the AI scenario, is to pass off that work as being by the original artist.
There are also restrictions on how you may monetise such a work, which is where copyright comes in and this is where the real problems lie.
Copyright law hing
Re: no dumb laws, or no laws? (Score:2)
If they were serious about that argument they'd train it on all the paid-for, private repos they host. Until that happens I'm inclined to call BS on any such arguments.
Re: AI needs whole new laws (Score:1)
Red flag traffic laws made perfect sense when there were only a few cars and the roads were full of horses and hand-drawn carts.
As automobile use increased, and cars started to dominate; and as people became accustomed to the risks these new machines posed, and adapted their behaviour accordingly, the law became redundant and actually harmful. So it was repealed.
We are in the very first phase of this evolution, so it is absolutely right that we place some guard rails around it as the world adapts.
Re: AI needs whole new laws (Score:2)
The AI is only dubiously artificial (statistics is a branch of mathematics and thus natural) and certainly possesses no intelligence. It is thus not trying to do anything; it is incapable of trying.
Sure, kids may use AI to write reports. And those reports will contain more errors and less reasoning. They will also confer less learning. Do you really want under-educated kids, or kids who understand the subjects they're learning?
TBH, I'd scrap homework entirely and require book reports to be done as classwork I
Re: (Score:1)
I can't tell if you're being sarcastic
Of course I’m being sarcastic. "Information wants to be free" was a stupid slogan back when I was 15 and using it properly. It's not an argument for anything.
New laws to constrain AI are as dumb and short-sighted as red flag traffic laws.
More baseless gibberish. Just making shit up as you go, with less thought behind the points you want to make than an AI would give!
Re:AI needs whole new laws (Score:4, Insightful)
It's not "the same thing but different" or any sort of copy, unless you overtrain. These tools learn the statistical relationships that define concepts. They never draw from any single work, but simultaneously from every work in their training set, proportional to its relevance to the context at hand.
Overtraining on specific works might be some sort of grounds to attack this (it happens accidentally, for many reasons, but most commonly from data duplication in the dataset); but the plaintiffs aren't alleging that. They're trying to argue, both in this case and in the StableDiffusion / Midjourney case (same legal team), that all such AI tools are inherently copyright infringement if any copyrighted data is used in the training, regardless of whether they're capable of actually recreating the works in any way, shape or form (and they assert a bunch of nonsense about how these systems work, which they clearly don't understand). Legally, it seems like an insanely weak argument, but they're trying it.
Also, from the summary (another argument from the plaintiffs):
"They should be exempt from this" for the same reason why the defendants' argument is unsurprisingly "transformative use". As per Campbell v. Acuff-Rose Music, Inc., the US Supreme Court ruled, "The more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use". So in a highly transformative case, it doesn't matter if the defendants are a commercial entity. And things far less transformative than AI tools have met this standard previously. I mean, Google Books was literally scanning entire copyrighted books without permission and posting blurbs, even whole pages, for people to read, and even that was deemed transformative and fair use.
IANAL, but if the plaintiffs were going after overtraining, I could see how they might be able to win this. But trying to argue that it's inherently not fair use, even if a given work contributed a billionth of a byte to the weightings... it's like the copyright equivalent of homeopathy.
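To make the "statistical relationships" point concrete, here is a minimal sketch in Python. It is a toy bigram model, nothing like Copilot's actual architecture, and the corpus strings are invented for illustration; the point is only that generation samples from counts accumulated over the whole training set, and that duplicating one work heavily ("overtraining") skews those counts until the most likely continuation is a verbatim copy.

from collections import Counter, defaultdict
import random

def train(tokens):
    # Count how often each token follows each other token, over the whole corpus.
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n=6):
    # Sample a continuation token by token from the accumulated statistics.
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        toks, weights = zip(*followers.items())
        out.append(random.choices(toks, weights=weights)[0])
    return " ".join(out)

work = "total = price * quantity + shipping".split()  # one "work" in the corpus
rest = "queue . pop ( )".split()                      # stand-in for everything else
model = train(work * 1000 + rest)                     # heavy duplication = overtraining
print(generate(model, "total"))                       # emits the duplicated line verbatim

With a balanced corpus the counts blend every work; with the duplicated corpus above, the chain of most likely next tokens simply retraces the memorized line.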
Re: (Score:3)
Once-through generators can't learn to multiply ...
Once you get outside of natural language and you need some rigor, the non-linearities just make it impossible for generators to generalise much; the compiler isn't very tolerant of hallucinated bullshit. Copilot is far more a search engine than a code producer. It's mostly designed and trained to copy-paste and composite existing blocks of code, stripping variable names and comments. To do more it would need AGI.
A search engine which copy-pastes licensed c
Re: (Score:2)
I think you are missing the point of the OP. I don't think they are making the same argument as the defendants and accusing Copilot or any of the other current high-profile AI systems of overtraining and literally spitting out thinly modified plagiarized copies.
It's that our current laws on IP were designed with human-scale use cases in mind, and that maybe they aren't well suited to massive data collection and AI systems, whether they produce significantly transformative works or not.
I'm not really sure. I
Re: (Score:1)
If you violate one copyright, they call you a pirate; if you violate a million copyrights, they call you an innovator.
Re:AI needs whole new laws (Score:5, Informative)
But trying to argue that it's inherently not fair use, even if a given work contributed a billionth of a byte to the weightings... it's like the copyright equivalent of homeopathy.
It is not just a few bytes.
Tim Davis says the following: GitHub copilot, with "public code" blocked, emits large chunks of my copyrighted code, with no attribution, no LGPL license. For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. My code on left, github on right [twitter.com].
Other codes are slightly morphed. I could probably reproduce my entire sparse matrix libraries from simple prompts. My code on left (in the Terminal). Prompt is "sparse matrix elimination tree, cs_". Github on right [twitter.com].
GitHub Copilot reproduced cs_add, changing only the /* */ comments to // [twitter.com], from this code stored on GitHub [github.com].
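For what it's worth, "same code with only the comment style changed" is mechanically checkable. Here is a minimal sketch in Python of a hypothetical comparison (the C snippets below are invented for illustration, not Tim Davis's actual CSparse code): strip comments and collapse whitespace before comparing, and a copy whose only change is /* */ comments rewritten as // still matches verbatim.

import re

def normalize(c_source):
    # Remove /* ... */ block comments, then // line comments,
    # then collapse all whitespace, leaving only the code itself.
    text = re.sub(r"/\*.*?\*/", " ", c_source, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", " ", text)
    return " ".join(text.split())

original = "int m = A->m; /* number of rows */ return m;"
emitted  = "int m = A->m; // number of rows\nreturn m;"
assert normalize(original) == normalize(emitted)  # identical once comments are ignored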
Re: (Score:2)
Even ignoring this, the person entirely ignored the comment about overtraining.
That said, overtraining happens unintentionally much more easily with text than with images.
But I agree with you; this is a straightforward, uncopyrightable example.
Re: (Score:2)
I mean, Google Books was literally scanning entire copyrighted books without permission and posting blurbs, even whole pages, for people to read, and even that was deemed transformative and fair use.
Google Books [google.ca] is very different. Google Books shows "Copyrighted material" notes on every page, together with references to the source material, including the author's name and a link where you can buy the original book.
Re: (Score:2)
They're trying to argue, both in this case, and in the StableDiffusion / Midjourney case (same legal team), that all such AI tools are inherently copyright infringement if any copyrighted data is used in the training, regardless of whether they're capable of actually recreating them in any way, shape or form (and they assert a bunch of nonsense about how these systems work, which they clearly don't understand). Legally, it seems like an insanely weak argument, but they're trying it.
Weak, perhaps. Insanely weak, perhaps not. Whether something is a derivative work hinges upon whether there are sufficient recognizable elements. That hasn't applied to AI artwork yet because the elements clearly differ; they are readily distinguishable from copy and paste. But this AI-generated code includes substantial quantities of code which is literally indistinguishable from data on which the model was trained, so the difference between the AI-generated snippets and the originals is a much more compli
Re: (Score:1)
Re: (Score:2)
It's not "the same thing but different" or any sort of copy, unless you overtrain. These tools learn the statistical relationships that define concepts. They never draw from any single work, but simultaneously from every work in their training set, proportional to its relevance to the context at hand.
I don't think overtraining is really relevant here.
So far in my experience Copilot is really good at three things.
1) Refactoring. If I'm restructuring my code it often understands how to plop my existing code into a new function with appropriate changes.
2) Comments. It writes descriptive and relevant comments for me.
3) Snippets. It finds a code snippet to do the thing I wanted (something I'd previously have had to hunt around Stack Exchange for).
#3 is the issue, as it's often spitting out what definitely seems
They should have AI write a new Bible (Score:2)
Let it draw on all religions and create some new synthesis.
Re: (Score:3)
ChatGPT can do that. Here, I asked for the start of Genesis in the style of a Valley Girl:
Re: (Score:3)
Hardly brand new. An author reads a manuscript someone sent him. He then goes and writes a story that bears remarkable similarity to what he read (same plot, same setting, same sequence of events, same set of characters with different names and possibly different personalities). Any author will tell you that doing that is just inviting an infringement lawsuit that will very likely succeed. That's why authors and editors absolutely refuse to even open unsolicited manuscripts.
Re: (Score:2)
Same with design.
Apple has an "Unsolicited Idea Submission Policy"; the key point is:
If, despite our request that you not send us your ideas, you still submit them, then regardless of what your letter says (...) your submissions and their contents will automatically become the property of Apple, without any compensation to you; (...)
Re: (Score:2)
"failed to argue they suffered specific injuries "
Thats a very f'ing dangerous argument with Open Source, because it seems to imply that unless some sort of financial injury occurs (And really what other sort is there in civil litigation) then a case cant even be heard. And while the commercial open source companies (Your elastic searches of the world) could put an argument forward that breaking the license unfairly deprives money of its non-free versions, that puts all those little linuxey packages where n
Re: (Score:2)
That's a very f'ing dangerous argument
There is nothing "dangerous" about it; perhaps a better word would be "specious." It is expected in early filings in a case. Lawyers are expected to make all the filings and ask for the moon, and when they don't have anything useful to file, instead of not filing it they just write some stupid bullshit.
If they don't write it, they might be accused of gross negligence. If they write something stupid, they could only be accused of retail negligence; and if they write something stupid and specious, they can on
Re: (Score:2)
Re: AI needs whole new laws (Score:2)
There's a huge difference. The student learns the idea, because they actually are intelligent. AI only learns the specifics because it is not.
AI is just a statistical analyser; it is incapable of original thought or original creation in its current form.
Re: (Score:2)
Challenged by Microsoft, Microsoft and... (Score:5, Funny)
Is Microsoft code also used as source for hints ? (Score:1)
Re: Is Microsoft code also used as source for hint (Score:2)
Of course not. Microsoft are only interested in violating the copyrights of others.
Re: Times changing (Score:2)
According to some of the other comments above, that's exactly what they are doing. Regurgitating large chunks of code sans licensing agreement.
Re: (Score:2)
> that's exactly what they are doing. Regurgitating large chunks of code sans licensing agreement.
My junior devs have been trying to do that since someone told them StackOverflow was a thing.
What? (Score:3)
Copilot's goal is to replace a huge swath of open source by taking it and keeping it inside a GitHub-controlled paywall.
I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?
Re:What? (Score:5, Insightful)
GPL keeps derivative source open; copies of GPL code emitted by Copilot, with some hallucinated substitutions and the license stripped, don't, if they get away with it.
Re: (Score:2)
Re: (Score:3)
Wouldn't one solution be to release your source code with a license that prevents, or at least restricts, use of such code by AI/LLMs?
This article [matthewbutterick.com] makes a good argument why not to do that:
Most importantly, I don't think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested [ycombinator.com] creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethos. Furthermore, it's over-inclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than the first generation will. It would be self-defeating for open-source authors to set themselves athwart technological progress, since that's one of the main goals of open-sourcing code in the first place.
Re: (Score:2, Insightful)
I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?
So I can take your open source code, remove the license, release it under my own license, and it's fine, I am now the copyright holder? Even when I go to court to accuse you of copying my code word for word, typos and all?
Keep in mind the "I" in this example is Microsoft, backed by Microsoft's legal department and budget. You are still you, with whatever money you might have saved up.
#define E2BIG 7 /* Argument list too long */
Microsoft has argued in court that the line of code above, one line out of the ent
Re: (Score:2)
So I can take your open source code, remove the license, release it under my own license, and it's fine, I am now the copyright holder?
No, but I see no evidence of that here. You seem to be asserting GPLed code is not subject to the doctrine of fair use. Perhaps this case may clarify the law in this regard, but in any case it should apply to everyone equally. Fair use is ultimately fair use no matter whether you like the people doing it or not.
Re: What? (Score:2)
See the examples given by developers elsewhere in the replies.
Re: (Score:1)
I'm sorry, I don't get it. How is open source any less open regardless of what this AI does with it?
In short, there are 3 kinds of open source licenses:
1) BSD-like licenses, which allow you to create derivative works like executable programs, libraries or trained AI models, as long as you include with the derivative work a notice giving the license and the copyright holders of the original code.
2) LGPL-like licenses, which allow you to create derivative works as long as the end user is informed of the license and at least on request gets access to the source code of the part with the LGPL l
Re: (Score:2)
Re: (Score:1)
What license do you end up with if you use a snippet of code from one of each to build an application?
Then you will end up with all 3 licenses, which will boil down to the terms of 2 licenses. All licenses will require you to ship the license information with your application. The BSD-like license will require you to mention that the copyright holders of that source hold that copyright. The GPL will require you, at least upon request, to provide the entire source of your application to the end user who requests it, and that end user has the right to modify any part of the source and run and distrib
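As a concrete sketch of what those obligations look like in practice (a hypothetical layout, not any project from the case), an application combining snippets under all three licenses might ship something like this:

myapp/
  LICENSE               terms for the combined work (the GPL part makes the whole GPL)
  THIRD-PARTY-NOTICES   BSD copyright notice and license text, reproduced verbatim
  COPYING.LGPL          LGPL text, plus where to get the source of the LGPL-covered part
  COPYING.GPL           GPL text, plus an offer of the complete corresponding source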
analogous to developer (Score:2)
If a developer suggested those same things that Copilot is suggesting, for the same reasons, would that be copyright infringement? Developers are sometimes trained by reading preexisting code, which sounds pretty much the same as Copilot. If Copilot and developers are equivalent, then you don't need new laws, you just need to apply existing rules.
Re: (Score:2)
Re: (Score:3)
If a developer suggested those same things that copilot is suggesting, for the same reasons, would that be copyright infringement?
If they copy n' paste it verbatim, it's copyright infringement.
It is verbatim copy n' paste [twitter.com]. Only copyright and license statements are removed.
Re: (Score:2)
Which is why (Score:3)
Re: Which is why (Score:2)
Agreed. Microsoft should be split up or chewed up, but never trusted.
Re: Citing Oracle vs. Google (Android)? (Score:2)
We must assume the courts are consistent, but Microsoft's Java implementation was stopped by the courts (and Microsoft will still be smarting from that, I suspect). So there has to be a difference, in the eyes of the law, between Microsoft's reimplementation and Google's.
If we understood that difference, we'd understand a bit more about what the courts see as important.
How quickly things change (Score:3)
Remember errno.h?
If this AI knows those error codes, then GitHub is a derivative of Unix, and users will have to buy a $499 license from The SCO Group to legally use the site.
At least that's the legal theory Microsoft was bankrolling just a few years back.
Cute (Score:3)
It's cute how they list "Github, Microsoft, and OpenAI" as if these are three separate organizations.