We need a new license that forbids all training. That is the only way to stop bi...

maxloh · 2025-11-27T14:30:56 1764253856

To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.

If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.

rileymat2 · 2025-11-27T14:50:45 1764255045

It depends on the license terms. If you obtained it legally under a license whose terms you agreed to, then using it for a purpose those terms forbid would not be legal.

But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...

justin_murray · 2025-11-27T14:40:30 1764254430

This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.

michaelmrose · 2025-11-27T14:47:58 1764254878

It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.

MangoToupe · 2025-11-27T14:58:53 1764255533

Maybe you have some legalistic point that escapes comprehension, but I certainly consider my house to be very much private and the internet public.

basilgohar · 2025-11-27T14:58:41 1764255521

That isn't even remotely a sensible analogy. Equating copyright violation with stealing physical property is an extremely failed metaphor.

tpmoney · 2025-11-27T19:42:48 1764272568

One of the craziest experiences in this "post AI" world is to see how quickly a lot of people in the "information wants to be free" or "hell yes I would download a car" crowds pivoted to "stop downloading my car, just because it's on a public and openly available website doesn't make it free".

voidfunc · 2025-11-27T21:48:22 1764280102

"Rules for thee, but not for me"

colechristensen · 2025-11-27T14:42:47 1764254567

I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.

conartist6 · 2025-11-27T21:47:13 1764280033

Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?

LtWorf · 2025-11-27T19:01:43 1764270103

Fair use was for citing and so on not for ripping off 100% of the content.

maxloh · 2025-11-27T19:18:06 1764271086

Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.

This principle is also explicitly declared in US law:

> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)

https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...

LtWorf · 2025-11-27T23:18:19 1764285499

Re-encoding a video file doesn't get rid of the copyright; therefore, doing some automatic processing on copyrighted material doesn't remove the copyright.

The problem is that OpenAI has too much money. But if I did what they are doing, I'd get into massive legal trouble.

1gn15 · 2025-11-28T04:45:53 1764305153

Not true. You can train on copyrighted material and post the resulting model on HuggingFace, and you won't get into trouble. Pinky promise.

mr_toad · 2025-11-27T19:39:48 1764272388

Fair use doesn’t need a license, so it doesn’t matter what you put in the license.

Generally speaking licenses give rights (they literally grant license). They can’t take rights away, only the legislature can do that.

anticensor · 2025-11-28T19:20:31 1764357631

Exclusive or co-exclusive licences can nullify your default fair use in certain jurisdictions.

munchler · 2025-11-27T14:43:25 1764254605

By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.

psychoslave · 2025-11-27T14:49:05 1764254945

Isn’t it the very reason why we need cleanroom software engineering:

https://en.wikipedia.org/wiki/Cleanroom_software_engineering

mr_toad · 2025-11-27T19:45:16 1764272716

If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code, and produce something similar is a grey area.

bluefirebrand · 2025-11-28T00:25:16 1764289516

There is absolutely no reason that LLMs (or Corporations) should have the same rights as humans

codedokode · 2025-11-27T14:58:42 1764255522

Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is just the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).

Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.

1gn15 · 2025-11-27T15:55:57 1764258957

> And a human artist doesn't need to steal millions of pictures to learn to draw.

They... do? Not just pictures, but also real life data, which is a lot more data than an average modern ML system has. An average artist has probably seen- stolen millions of pictures from their social media feeds over their lifetime.

Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.

codedokode · 2025-11-28T15:23:40 1764343420

For property to give you power, you need to concentrate a lot of it, or own something expensive or exclusive. That's what capitalists do: a capitalist owns an expensive lathe, and you don't, so you have no choice but to work for the capitalist on the capitalist's terms (you can replace the lathe with a GPU farm for a more modern analogy).

Owning a song, a book or a picture doesn't give you much power by itself.

James_K · 2025-11-27T14:21:25 1764253285

Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.

Orygin · 2025-11-27T14:27:27 1764253647

My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights

tpmoney · 2025-11-27T19:48:53 1764272933

In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.

fouronnes3 · 2025-11-27T14:31:03 1764253863

Not sure why the FSF or any other organization didn't release a license like this years ago.

amszmidt · 2025-11-27T14:38:16 1764254296

Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that you can remove them, they would be considered "further restrictions" and can be removed (see section 7 of the GNU GPL version 3).

Orygin · 2025-11-27T14:54:42 1764255282

Freedom 0 is not violated. GPL includes restrictions for how you can use the software, yet it's still open source.

You can do whatever you want with the software, BUT you must do a few things. For GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source".

Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.

amszmidt · 2025-11-27T15:17:00 1764256620

That is not really correct, the GNU GPL doesn't have any terms whatsoever on how you can use, or modify the program to do things. You're free to make a GNU GPL program do anything (i.e., use).

I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.

Orygin · 2025-11-27T15:29:58 1764257398

> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

"A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.

I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.

Edit: I know it may not be the best section (the one after regarding non-source forms could be better) but in spirit, it's exactly the same imo as GPL forcing you to keep the GPL license on the work

amszmidt · 2025-11-27T15:44:05 1764258245

I think maybe you're mixing up distribution and running a program, at least taking your initial comment into account, "if you train/run/use a model, it must be open source".

Orygin · 2025-11-27T16:08:31 1764259711

I should have been more precise: "If you train and distribute an AI model on this work, it must use the same license as the work".

Using AGPL as the base instead of GPL (where network access is distribution), any user of the software will have the rights to the source code of the AI model and weights.

My goal is not to impose more restrictions to the AI maker, but to guarantee rights to the user of software that was trained on my open source code.

amszmidt · 2025-11-27T14:36:04 1764254164

It isn't that difficult: a license that restricts how the program is used is a non-free software license.

"The freedom to run the program as you wish, for any purpose (freedom 0)."

Orygin · 2025-11-27T14:58:20 1764255500

Yet the GPL imposes requirements for me and we consider it free software.

You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.

helterskelter · 2025-11-27T14:54:33 1764255273

Running the program and analyzing the source code are two different things...?

amszmidt · 2025-11-27T15:18:14 1764256694

In the context of Free Software, yes. Freedom one is about the right to study a program.

LtWorf · 2025-11-27T19:05:52 1764270352

But training an AI on a text is not running it.

tpmoney · 2025-11-27T19:52:00 1764273120

And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply. As it stands, the courts have found training an AI model to be a sufficiently transformative action and fair use which means the resulting output of that training is not a "copy" for the terms of copyright law.

LtWorf · 2025-11-27T23:43:48 1764287028

> And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply.

If I print a Harry Potter book in red ink then I won't have any copyright issues?

I don't think changing how the information is stored removes copyright.

tpmoney · 2025-11-28T00:38:45 1764290325

If it is sufficiently transformative, yes, it does. That’s why “information” per se is not eligible for copyright, no matter what the NFL wants you to think. No, printing the entire text of a Harry Potter book in red ink is not likely to be viewed as sufficiently transformative. But if you take the entirety of that book and publish a list of every word and its frequency, it’s extremely unlikely to be found a violation of copyright. If you publish a count of every word with the frequency weighted by what word came before it, you’re also very likely to not be found to have violated copyright. If you distribute the MD5 sum of the file that is a Harry Potter book, you’re also not likely to be found to have violated copyright. All of these are “changing how the information is stored”.
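For concreteness, here is a minimal Python sketch of the three transformations described above: a word-frequency list, word counts conditioned on the preceding word, and an MD5 digest. The function names and the sample sentence are made up for illustration, the tokenizer is a crude whitespace split, and nothing here says anything about how a court would treat the outputs.

```python
import hashlib
from collections import Counter, defaultdict


def word_frequencies(text):
    """Count how often each word occurs (crude whitespace tokenizer)."""
    return Counter(text.lower().split())


def bigram_frequencies(text):
    """For each word, count which words follow it and how often."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for previous, current in zip(words, words[1:]):
        table[previous][current] += 1
    return table


def md5_digest(data):
    """Return the hex MD5 digest of the raw bytes of a file."""
    return hashlib.md5(data).hexdigest()


# Placeholder text standing in for the contents of a book.
sample = "the cat sat on the mat and the cat slept"

freqs = word_frequencies(sample)
bigrams = bigram_frequencies(sample)
digest = md5_digest(sample.encode("utf-8"))

print(freqs.most_common(2))
print(dict(bigrams["the"]))
print(digest)
```

None of the three outputs lets you read the original text back in order, although each is derived from all of it, which is the distinction the comment is drawing.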

tomrod · 2025-11-27T14:36:57 1764254217

Model weights, source, and output.

tensor · 2025-11-27T20:24:28 1764275068

So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.

WithinReason · 2025-11-27T14:28:05 1764253685

Wouldn't it be still legal to train on the data due to fair use?

gus_massa · 2025-11-27T14:36:52 1764254212

I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.

justin_murray · 2025-11-27T14:50:23 1764255023

Honest question: why don’t you think it is fair use?

I can see how it pushes the boundary, but I can’t lay out logic that it’s not. The code has been published for the public to see. I’m always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn’t they have kept it to themselves?

These agents are just doing a more sophisticated, faster version of that same act.

gus_massa · 2025-11-27T15:44:58 1764258298

Some projects, like Wine, forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.

I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quatro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.

[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...

> Who can't contribute to Wine?

> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.

seanmcdirmid · 2025-11-27T17:51:08 1764265868

> I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quatro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.

This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn’t contain the legacy code (because that would pollute the new result). The analogy isn’t too apt, though, since there is a difference between having something in your context (which you can control and is very targeted) and the code that the model was trained on (which all AI instances will share unless you use different models, and anyway, it isn’t supposed to be targeted).

mixedbit · 2025-11-27T15:02:26 1764255746

Before LLMs, programmers had a pretty good intuition for what the GPL allowed. It is of course clear that you cannot release a closed source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, but this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?

WithinReason · 2025-11-27T15:28:46 1764257326

> this is pretty much what LLMs are doing

I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?

mixedbit · 2025-11-27T16:02:34 1764259354

Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.

WithinReason · 2025-11-27T16:36:44 1764261404

Lots of people on this board are philosophically opposed to them, so it was a reasonable question, especially in light of your description of them.

conartist6 · 2025-11-27T21:49:43 1764280183

The fair use prong that's problematic is that the fair use can't decimate the value of the original work. It's the difference between me imitating your art style for a personal project and me making 1,000,000 copies of your art so that your art isn't worth much anymore. One is a fair use, the other is exploitative extraction

LtWorf · 2025-11-27T19:03:03 1764270183

Just corporations, their shills, and people who think llms are god's gift to humanity disagree with you.

cryptonector · 2025-11-28T07:25:48 1764314748

Not if it's an EULA and you make the bot click through an "I agree" button.

conartist6 · 2025-11-27T21:44:58 1764279898

Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.

themafia · 2025-11-27T21:56:40 1764280600

Taken to an extreme:

"Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."

It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.

d0mine · 2025-11-28T02:02:48 1764295368

> cartel continues to profit

It doesn't follow. The reverse is more likely: If you end prohibition, you end the mafia.

scotty79 · 2025-11-27T14:31:46 1764253906

We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.

joegibbs · 2025-11-28T01:04:37 1764291877

That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated or it's compiled. Anyway there's never that much stuff that's so special that it needs big legal stuff to prevent it from being copied, and if the LLM produces it you can just use another LLM to copy the same feature. And say it's 99% LLM and 1% human, who's going to know what the 1% is that's not safe to copy?

raincole · 2025-11-27T18:49:30 1764269370

It's more or less already the case though. Pure AI-generated works without human touches are not copyrightable.

LtWorf · 2025-11-27T19:08:44 1764270524

We need it to be infecting the rest like GPL does.

raincole · 2025-11-27T19:36:14 1764272174

You probably misunderstood how "infection" of GPL works. (which is very common)

If your closed-source project uses some GPL code, it doesn't automatically put your whole project in public domain or under GPL. It just means you're infringing the right of the code author and they can sue you (for money and stopping using their code, not for making your whole project GPL).

In the simplest terms, GPL is:

    if codebase.is_gpl_compatible:
        gpl_code.give_permission(codebase)
    elif codebase.is_using(gpl_code):
        # the copyright owner and the court deal with this under the usual copyright laws
        raise CopyrightInfringement

GPL can't do much more than that. A license over a piece of code cannot automatically change the copyright status of another piece of code. There simply isn't legal framework for that.

Similarly, AI code's copyleft status can't affect the rest of the codebase, unless we make new laws specifically saying that.

Also similarly, even if Github lost the class action, it will NOT automatically release the model behind GPL to the public. It will open the possibility for all the GPL repo authors to ask Microsoft for compensation for stealing their code.
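The pseudocode above can be turned into a small runnable sketch. Everything here is hypothetical and only models the argument in this comment: the `Codebase` class, its `gpl_compatible` and `uses_gpl_code` flags, and the `distribute` function are invented for illustration and are not any real tool or legal test.

```python
from dataclasses import dataclass


class CopyrightInfringement(Exception):
    """Raised when GPL code is distributed without following the GPL."""


@dataclass
class Codebase:
    name: str
    license: str
    gpl_compatible: bool         # hypothetical flag: own license is GPL-compatible
    uses_gpl_code: bool = False  # hypothetical flag: includes someone's GPL code


def distribute(codebase):
    """Mirror the branches of the pseudocode, in the same order."""
    if codebase.gpl_compatible:
        return "permitted"
    elif codebase.uses_gpl_code:
        # The license grants nothing here, so ordinary copyright law applies.
        raise CopyrightInfringement(codebase.name)
    return "no GPL code involved"


library = Codebase("libfoo", "GPL-3.0-or-later", gpl_compatible=True, uses_gpl_code=True)
product = Codebase("app", "Proprietary", gpl_compatible=False, uses_gpl_code=True)

print(distribute(library))
try:
    distribute(product)
except CopyrightInfringement as error:
    print("infringement:", error)

# The point being made: the infringing project's own license is unchanged.
print(product.license)
```

The failure branch raises an error and nothing else: no step in the sketch rewrites `product.license`, which matches the claim that a license on one piece of code cannot change the copyright status of another.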

em-bee · 2025-11-27T21:55:09 1764280509

> It just means you're infringing the right of the code author and they can sue you (for money and stopping using their code, not for making your whole project GPL).

they can sue you and settle for whatever you will accept that makes them happy.

if you lose, then the alternative to making your code GPL is to make your code disappear, that is, you are no longer allowed to sell your product.

consequently, if AI code is subject to the GPL then the rest of the codebase is too, or the alternative would be that the codebase could not be distributed.

raincole · 2025-11-27T23:11:28 1764285088

First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.

Secondly, GPL can't "make your (proprietary) code disappear." Violating GPL is essentially just stealing code. One cannot distribute the version that includes stolen code. But they can remove the stolen part and replace it with their own code. Of course they still need to settle/pay for the previous infringement.

GPL simply can't affect the copyright status of the rest of the codebase, because it's a license, not a contract. It cannot restrict the user's rights further than the copyright laws.

Again, it's a very common misunderstanding of GPL's "virality." There has been a decades-long debate about whether GPL should be treated like a contract instead of a mere license, but there is no ruling giving it this special legal status (yet), at least in the US.

[0]: https://lwn.net/Articles/61292/

[1]: https://en.wikipedia.org/wiki/GNU_General_Public_License#Leg...

em-bee · 2025-11-27T23:27:09 1764286029

> First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.

if AI generates something that is equal to existing code, then the license of that code applies. the AI generated product as a whole can't be copyrighted, but the portions that reproduce copyrighted code retain the original copyright.

> they can remove the stolen part and replace it with their own code

sure, if they can do that, then they can distribute their code again. but until then they can't.

dragonwriter · 2025-11-27T23:42:15 1764286935

> if AI generates something that is equal to existing code, then the license of that code applies.

No, it doesn't, if the generation is independent of the existing code. If a person using AI uses existing code and makes a literal copy of it, then, yes, the copyright (and any license offer applicable in the circumstances) of the existing code may apply (it may also not, the same as with copies of portions of code made by other means), and it's less than clear (especially for small portions of code) that legally such a copy has been made when a work is in the training set.

Copyright protects against copying. It doesn't protect against someone creating the same content by means other than copying.

em-bee · 2025-11-28T01:01:13 1764291673

> if the generation is independent of the existing code

well, that's the big question, isn't it? if the code is used for training AI and the AI reproduces the same code, is that really independent?

i don't think so.

> Copyright protects against copying. It doesn't protect against someone creating the same content by means other than copying.

if the code is the same, how do you prove it's not a copy?

it's the same problem as with plagiarism, isn't it?

em-bee · 2025-11-28T22:22:48 1764368568

btw, this is the reason why people who at some point of time may have had access to windows source code are not allowed to work on wine. because if wine accidentally reproduces windows code, the only defense is that none of the contributors have ever seen that code before.

if AI has seen that code in training, then this defense is no longer possible.

LtWorf · 2025-11-27T23:48:47 1764287327

If I read Harry Potter and randomly rewrite it, you think I have a chance against Rowling?

dragonwriter · 2025-11-28T00:56:03 1764291363

No, almost certainly it would be practically impossible if you reproduced the entire work, on top of evidence that you had perused it, because it would be very hard to convince a trier of fact that the duplication really was coincidence rather than copying. But it might be a very different story if you had read Harry Potter and then wrote another work that includes the text “Up!” she screeched. (which appears verbatim in the first volume of the series.)

LtWorf · 2025-11-28T06:51:13 1764312673

And what if I reproduced just a chapter or a few paragraphs?

cryptonector · 2025-11-28T07:27:35 1764314855

You can use GPL code in proprietary code. You just can't distribute said proprietary code if you don't also distribute its sources in accordance with the GPL, and that is how the "infection" happens.

palata · 2025-11-27T14:43:04 1764254584

But then we would need a way to prove that some code was LLM generated, right?

Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?

chii · 2025-11-28T04:49:40 1764305380

you could have the inverse - proof that the code was _not_ LLM generated. It's like a mark of origin/country of origin for produce.

palata · 2025-11-28T09:33:19 1764322399

How would you do it? How can you prove that the message you sent was not LLM generated?

chii · 2025-11-28T09:52:25 1764323545

The same way a current product advertises itself as hand-crafted, or organic, etc. Just pure trust. And if you can charge a higher price because this is considered valuable by your customers, then you get to sustain it as a business.

michaelmrose · 2025-11-27T14:50:18 1764255018

Laws exist to protect those who make and have money. If trillions could be made harvesting your kids' kidneys it would be legal.

basilgohar · 2025-11-27T15:04:02 1764255842

It's done extrajudicially in warzones such as Palestine where hostages are returned from Israeli jails, with missing organs, dead or alive [0].

[0] https://factually.co/fact-checks/justice/evidence-investigat...

myth_drannon · 2025-11-28T14:07:07 1764338827

If you talking about an organ such as a brain, they went in without one. Such is the curse of cousin marriages, brother. Salam.

BeFlatXIII · 2025-11-27T17:20:19 1764264019

How is that enforceable against the fly-by-night startups?

cryptonector · 2025-11-28T07:24:45 1764314685

So an EULA?