To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.
It depends on the license terms. If you obtained the material legally under a license whose terms you agreed to, and those terms rule out training, then using it for that purpose would not be legal.
One of the craziest experiences in this "post AI" world is to see how quickly a lot of people in the "information wants to be free" or "hell yes I would download a car" crowds pivoted to "stop downloading my car, just because it's on a public and openly available website doesn't make it free".
I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.
Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?
Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.
This principle is also explicitly declared in US law:
> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)
Re-encoding a video file doesn't get rid of the copyright, so doing some automatic processing on copyrighted material doesn't remove the copyright either.
The problem is that OpenAI has too much money. But if I did what they are doing, I'd get into massive legal trouble.
If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code, and produce something similar is a grey area.
Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is just the name given to a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal a million pictures to learn to draw.
> And a human artist doesn't need to steal a million pictures to learn to draw.
They... do? Not just pictures, but also real life data, which is a lot more data than an average modern ML system has. An average artist has probably seen- stolen millions of pictures from their social media feeds over their lifetime.
Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.
For property to give you power, you need to concentrate a lot of it, or own something expensive or exclusive. That's what capitalists do: a capitalist owns an expensive lathe and you don't, so you have no choice but to work for the capitalist on the capitalist's terms (you can replace the lathe with a GPU farm for a more modern analogy).
Owning a song, a book or a picture doesn't give you much power by itself.
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights.
In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.
Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that anyone can remove them: they would be considered "further restrictions", which can be removed (see section 7 of the GNU GPL version 3).
Freedom 0 is not violated. GPL includes restrictions for how you can use the software, yet it's still open source.
You can do whatever you want with the software, BUT you must do a few things. For GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source"?
Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.
That is not really correct. The GNU GPL doesn't have any terms whatsoever on how you can use the program, or modify it to do things. You're free to make a GNU GPL program do anything (i.e., use).
I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.
> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
"A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.
I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.
Edit: I know it may not be the best section (the one after, regarding non-source forms, could be better), but in spirit it's exactly the same imo as the GPL forcing you to keep the GPL license on the work.
I think maybe you're mixing up distribution and running a program, at least taking your initial comment into account, "if you train/run/use a model, it must be open source".
I should have been more precise: "If you train and distribute an AI model on this work, it must use the same license as the work".
With the AGPL as the base instead of the GPL (so that network access counts as distribution), any user of the software would have the right to the source code and weights of the AI model.
My goal is not to impose more restrictions on the AI maker, but to guarantee rights to the user of software that was trained on my open source code.
Yet the GPL imposes requirements for me and we consider it free software.
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.
And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply. As it stands, the courts have found training an AI model to be a sufficiently transformative action and fair use which means the resulting output of that training is not a "copy" for the terms of copyright law.
> And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply.
So if I print a Harry Potter book in red ink, I won't have any copyright issues?
I don't think changing how the information is stored removes copyright.
If it is sufficiently transformative, yes, it does. That’s why “information” per se is not eligible for copyright, no matter what the NFL wants you to think. No, printing the entire text of a Harry Potter book in red ink is not likely to be viewed as sufficiently transformative. But if you take the entirety of that book and publish a list of every word and its frequency, it’s extremely unlikely to be found a violation of copyright. If you publish a count of every word with the frequency weighted by what word came before it, you’re also very likely to not be found to have violated copyright. If you distribute the MD5 sum of the file that is a Harry Potter book, you’re also not likely to be found to have violated copyright. All of these are “changing how the information is stored”.
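To make those three transformations concrete, here is a minimal Python sketch using only the standard library. The sample sentence is a made-up stand-in for the book's text, and the function names are purely illustrative. In each case what gets stored is a set of counts or a digest rather than the prose itself.

```python
import hashlib
import re
from collections import Counter

# Stand-in text; an assumption for illustration, not a quote from any book.
TEXT = "the cat sat on the mat and the cat slept"


def words(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs."""
    return Counter(words(text))


def bigram_frequencies(text: str) -> Counter:
    """Count each word together with the word that came before it."""
    tokens = words(text)
    return Counter(zip(tokens, tokens[1:]))


def md5_digest(text: str) -> str:
    """Return the MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(word_frequencies(TEXT).most_common(3))
    print(bigram_frequencies(TEXT).most_common(3))
    print(md5_digest(TEXT))
```

Whether any of these outputs counts as "transformative" is still a legal question, not something the code settles.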
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
Honest question: why don’t you think it is fair use?
I can see how it pushes the boundary, but I can’t lay out logic that it’s not. The code has been published for the public to see. I’m always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn’t they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
Some projects, like Wine, forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.
I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.
> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.
> I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.
This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn’t contain the legacy code (because that would pollute the new result). The analogy isn’t too apt, though, since there is a difference between having something in your context (which you can control and is very targeted) and the code that the model was trained on (which all AI instances will share unless you use different models, and which isn’t supposed to be targeted anyway).
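A rough sketch of that two-stage setup in Python, for what it's worth. Everything here is hypothetical: `Model` is just any callable from prompt to text, and `clean_room_rewrite` and the prompt wording are made up for illustration, not a real AI SWE API. The only point is that the second instance is handed the spec and never the legacy code.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to generated text.
Model = Callable[[str], str]


def clean_room_rewrite(
    legacy_code: str, spec_writer: Model, implementer: Model
) -> tuple[str, str]:
    """Run a two-stage rewrite where only the spec crosses between contexts."""
    # Stage 1: this instance sees the legacy code and describes its behavior.
    spec = spec_writer(
        "Describe the behavior of this system in plain English:\n" + legacy_code
    )
    # Stage 2: a fresh instance sees only the spec, never the legacy code.
    new_code = implementer("Implement a system that meets this spec:\n" + spec)
    return spec, new_code


if __name__ == "__main__":
    # Stub models so the sketch runs without any real AI service.
    def fake_spec_writer(prompt: str) -> str:
        return "Adds two integers and returns the sum."

    def fake_implementer(prompt: str) -> str:
        return "def add(a, b):\n    return a + b\n"

    legacy = "int add(int a, int b) { return a + b; }"
    spec, code = clean_room_rewrite(legacy, fake_spec_writer, fake_implementer)
    print(spec)
    print(code)
```

This only controls what is in the context. It does nothing about what the underlying model saw during training, which is the caveat above.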
Before LLMs, programmers had a pretty good intuition for what the GPL license allowed. It is of course clear that you cannot release a closed source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, but this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?
The fair use prong that's problematic is that the fair use can't decimate the value of the original work. It's the difference between me imitating your art style for a personal project and me making 1,000,000 copies of your art so that your art isn't worth much anymore. One is a fair use, the other is exploitative extraction.
Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.
"Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."
It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.
That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated or it's compiled. Anyway there's never that much stuff that's so special that it needs big legal stuff to prevent it from being copied, and if the LLM produces it you can just use another LLM to copy the same feature. And say it's 99% LLM and 1% human, who's going to know what the 1% is that's not safe to copy?
You probably misunderstand how the "infection" of the GPL works (which is very common).
If your closed-source project uses some GPL code, it doesn't automatically put your whole project in the public domain or under the GPL. It just means you're infringing the rights of the code author and they can sue you (for money and to stop you using their code, not to make your whole project GPL).
In the simplest terms, GPL is:
if codebase.is_gpl_compatible:
    gpl_code.give_permission(codebase)
elif codebase.is_using(gpl_code):
    # the copyright owner and the court deal with that under the usual copyright laws
    raise CopyrightInfringement
GPL can't do much more than that. A license over a piece of code cannot automatically change the copyright status of another piece of code. There simply isn't a legal framework for that.
Similarly, AI code's copyleft status can't affect the rest of the codebase, unless we make new laws specifically saying that.
Also similarly, even if GitHub loses the class action, that will NOT automatically release the model to the public under the GPL. It will open the possibility for all the GPL repo authors to ask Microsoft for compensation for stealing their code.
> It just means you're infringing the right of the code author and they can sue you (for money and stopping using their code, not for making your whole project GPL).
they can sue you and settle for whatever you will accept that makes them happy.
if you lose, then the alternative to making your code GPL is to make your code disappear, that is, you are no longer allowed to sell your product.
consequently, if AI code is subject to the GPL then the rest of the codebase is too, or the alternative would be that the code could not be distributed.
First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.
Secondly, GPL can't "make your (proprietary) code disappear." Violating GPL is essentially just stealing code. One cannot distribute the version that includes stolen code. But they can remove the stolen part and replace it with their own code. Of course they still need to settle/pay for the previous infringement.
GPL simply can't affect the copyright status of the rest of the codebase, because it's a license, not a contract. It cannot restrict the user's rights further than the copyright laws do.
Again, it's a very common misunderstanding of the GPL's "virality." There has been a decades-long debate about whether the GPL should be treated like a contract instead of a mere license, but there is no ruling giving it this special legal status (yet), at least in the US.
> First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.
if AI generates something that is equal to existing code, then the license of that code applies. the AI generated product as a whole can't be copyrighted, but the portions that reproduce copyrighted code retain the original copyright.
> they can remove the stolen part and replace it with their own code
sure, if they can do that, then they can distribute their code again. but until then they can't.
> if AI generates something that is equal to existing code, then the license of that code applies.
No, it doesn't, if the generation is independent of the existing code. If a person using AI uses existing code and makes a literal copy of it, then, yes, the copyright (and any license offer applicable in the circumstances) of the existing code may apply (it may also not, the same as with copies of portions of code made by other means). And it's less than clear (especially for small portions of code) that such a copy has legally been made just because a work is in the training set.
Copyright protects against copying. It doesn't protect against someone creating the same content by means other than copying.
btw, this is the reason why people who at some point in time may have had access to Windows source code are not allowed to work on Wine. because if Wine accidentally reproduces Windows code, the only defense is that none of the contributors have ever seen that code before.
if AI has seen that code in training, then this defense is no longer possible.
No. That defense would almost certainly be practically impossible if you reproduced the entire work and there was evidence that you had perused it, because it would be very hard to convince a trier of fact that the duplication really was coincidence rather than copying. But it might be a very different story if you had read Harry Potter and then wrote another work that includes the text “Up!” she screeched. (which appears verbatim in the first volume of the series).
You can use GPL code in proprietary code. You just can't distribute said proprietary code if you don't also distribute its sources in accordance with the GPL, and that is how the "infection" happens.
But then we would need a way to prove that some code was LLM generated, right?
Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?
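For the plain copy-paste case, the check can be as crude as diffing the suspect code against the known GPL source. A minimal sketch with Python's standard `difflib`; the `normalize` and `similarity` helpers are invented for illustration, and a high score only says two texts look alike. It says nothing about whether the code was LLM-generated or independently written.

```python
import difflib


def normalize(code: str) -> list[str]:
    """Strip whitespace and drop blank lines so formatting changes don't matter."""
    return [line.strip() for line in code.splitlines() if line.strip()]


def similarity(suspect: str, known: str) -> float:
    """Return a 0..1 similarity ratio between two pieces of code, line by line."""
    matcher = difflib.SequenceMatcher(None, normalize(suspect), normalize(known))
    return matcher.ratio()


if __name__ == "__main__":
    known = "total = 0\nfor x in items:\n    total += x\n"
    reindented = "total = 0\n\nfor x in items:\n        total += x\n"
    print(similarity(reindented, known))
    print(similarity("print('hello')", known))
```

Renamed variables or reordered statements would already defeat a line-based comparison like this one.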
The same way a current product advertises itself as hand-crafted, or organic, etc. Just pure trust. And if you can charge a higher price because this is considered valuable by your customers, then you get to sustain it as a business.