Microsoft guide to pirating Harry Potter for LLM training (2024) [removed]

mcny · 2026-02-19T00:27:35 1771460855

You guys are talking about copyright but I think a bigger takeaway is there is a process breakdown at Microsoft. Nobody is reading or reviewing this documentation, so what hope is there that anybody is reading or reviewing their new code?

I guess the question to leadership is that two of the three pillars, namely security and quality, are at odds with the third pillar: AI innovation. Which side do you pick?

(I know you mean well and I love you, Scott Hanselman but please don't answer this yourself. Please pass this on to the leadership.)

efitz · 2026-02-19T04:08:12 1771474092

I worked at Microsoft for many years and blogged there.

Microsoft was unique among the companies I worked for in that they gave you some guidelines and then let you blog without having to go through some approval or editing process. It made blogging much more personal and organic IMO; company-curated blog posts read like marketing.

I didn’t see the original post but it looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed.

I care much less about whether the person exercised good judgment in posting, and don’t care (and am happy) that there was not some process that would have caught it pre-publication.

I care much more if the person works in a team that believes that copyright infringement for AI training is a justifiable behavior in a corporate environment.

And now we know that is a thing, and I suspect that there will be some hard questions asked by lawyers inside the company, and perhaps by lawyers outside the company.

bastawhiz · 2026-02-19T04:44:45 1771476285

I remember back in 2004 or thereabouts, Microsoft was all in on blogging. There was content published about internal blogs. Huge swaths of people working on Vista (then, Longhorn) were blogging about all sorts of exciting things. Microsoft was pretty friendly with people blogging externally, too: Paul Thurrott comes to mind.

It feels out of character for a company like Microsoft to have such a policy, but I agree that it's insanely cool that some very cool folks get to post pretty freely. Raymond Chen could NEVER run his blog like that at FAANG.

Arainach · 2026-02-19T06:34:57 1771482897

Raymond generally discusses public things and history. That's allowable plenty of places.

Bruce Dawson was publishing debugging stories (including things debugged about Google products done as part of his job) for the entire time he was working at Google: https://randomascii.wordpress.com/

qingcharles · 2026-02-19T06:38:04 1771483084

They are still pretty good with it; it just gets a lot less press now that blogging isn't the flavor of the month. I check their dev blogs routinely:

https://devblogs.microsoft.com/

riffraff · 2026-02-19T06:16:01 1771481761

In the 00s I remember receiving a pingback from the internet explorer blog about a post I had made to complain about ES4.

I was/am a nobody, I have no idea how that happened and it was mind blowing that MS was interacting with me.

Sophira · 2026-02-19T05:57:03 1771480623

> I didn’t see the original post...

If you or anyone else who sees this wants to see the original post, it's still available in the Wayback Machine: https://web.archive.org/web/20260105115129/https://devblogs....

stuaxo · 2026-02-19T13:05:50 1771506350

Oof that was a very unwise blog post to make.

Copyright aside, it looks like an interesting blog post.

crazygringo · 2026-02-19T00:30:27 1771461027

> Nobody is reading or reviewing this documentation, so what hope is there that anybody is reading or reviewing their new code?

Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?

There's a formal process for reviewing code because bugs can break things in massive ways, while there may not be the same degree of rigor for reviewing documentation because it's not going to stop the software from working.

But one doesn't necessarily say anything about the other.

novaleaf · 2026-02-19T01:45:53 1771465553

I don't know if you are just playing devil's advocate, but there's plenty of examples of code quality issues coming out of msft these days too.

lcnPylGDnU4H9OF · 2026-02-19T14:52:58 1771512778

Regardless, their point is that the argument seems faulty. Indeed, their docs going unreviewed seems moot to whether the code goes unreviewed, given there are much stronger reasons to review code than there are to review documentation; as they wrote, bad documentation doesn't automatically break your application when it's published (there's at least a few more steps involved). Your statement's accuracy is not exclusive to the illogic of an argument which agrees with the statement.

> I don't know if you are just playing devil's advocate

Indeed, that is playing Devil's Advocate but one should remember that such Advocacy is performed to make sure that arguments against the Devil are as strong as they can be. It's not straightforward to see how simply repeating an assertion helps to argue for the veracity of it.

palata · 2026-02-19T14:51:40 1771512700

They never said that there was no example of code quality issues.

What they say is that low quality in the documentation does not mean low quality in the code. Nothing says that they are related.

novaleaf · 2026-02-23T16:17:52 1771863472

Sure I suppose on the face of it the two are not necessarily related, but it seems likely to be highly correlated?

robotresearcher · 2026-02-19T06:19:54 1771481994

> these days

I realize BSOD is no longer nearly as common as it once was, but let's not forget that Windows used to be very fragile indeed.

Wobbles42 · 2026-02-19T06:42:54 1771483374

It was more fragile 20 years ago than it is today.

It was more robust 5 years ago than it is today.

Or at least that's been my impression. I can't back that up with hard data.

lesostep · 2026-02-19T08:46:08 1771490768

>> I realize BSOD is no longer nearly as common as it once was

Anecdotally, installing the wrong drivers (in my case it was drivers for COM-port STM32 interaction) could make it as common as twice a day on Win11. Meanwhile my Windows Server 2008 is still doing just great, with no BSOD in its lifetime.

I agree that for a common user a BSOD is now less likely to happen, but I wonder whether it has less to do with the Windows core and more to do with Windows Defender's aggressive default settings.

smadge · 2026-02-19T02:05:23 1771466723

At another BigCo I am familiar with, any external communications must go through a special review to make sure no secrets are being leaked and that nothing exposes the company to legal or PR issues (for example, the OP).

stogot · 2026-02-19T02:34:36 1771468476

Same here. Four or five pairs of eyes on external comms, nothing like this would even get past the abstract submission.

Wobbles42 · 2026-02-19T06:45:16 1771483516

Likely it wouldn't get written at all. The most useful aspect of layered approval processes is people treat them like outright bans and don't blog at all unless it's part of the job description.

anal_reactor · 2026-02-19T11:05:12 1771499112

If there's no reward for your effort but you might get punished then there's no point trying.

stogot · 2026-02-20T04:22:02 1771561322

No one gets punished for trying. It’s a simple “no thanks”

jacquesm · 2026-02-19T00:32:46 1771461166

If they have the documentation... With Microsoft probably the answer to that is yes, but more often than not documentation is simply absent. And in cases like this not being too aware of where the lines are is probably a great way to advance your career.

miki123211 · 2026-02-19T13:00:00 1771506000

This isn't really documentation though, I suspect devblogs get even less scrutiny than that.

shadowgovt · 2026-02-19T00:56:27 1771462587

Reviewing docs is a lower bar than reviewing code because it's a lower bar than reviewing code.

I have never even heard of a software company that acts otherwise (except IBM, and much of the world of Silicon Valley software engineering is reactionary to IBM's glacial pace).

I'm not saying docs == code for importance is a bad way to be, just that if you can name firms that treat them that way other than IBM (or aerospace), I'd be interested to learn more.

crazygringo · 2026-02-19T01:09:58 1771463398

I'm not sure we're talking about the same thing, maybe my use of "lower bar" was ambiguous, and I realize now it has a dual meaning.

What I'm saying is, you have to review code to get it out the door with a certain degree of quality. That's your core product. That's the minimum standard you have to pass, the lowest bar.

In contrast, reviewing documentation is usually less core. You do that after the code gets reviewed. If there's time. If it doesn't get done, that's not necessarily saying anything about code quality.

Even if it's easier to review documentation, that doesn't mean it's getting prioritized. So it's not a lower bar in the sense that lower bars get climbed first.

stogot · 2026-02-19T02:35:45 1771468545

>> Reviewing docs is a lower bar than reviewing code because it's a lower bar than reviewing code.

You reason in circles

darkwater · 2026-02-19T07:22:11 1771485731

No, they are specifically using a tautology to make a point.

NoPicklez · 2026-02-19T02:12:38 1771467158

Whilst I understand it shows a breakdown somewhere, it's a bit of a stretch to extend that idea across their entire codebase.

Organizations are large, so much so that different levels of rigor apply across different parts of the organization. Furthermore, more rigorous controls would be applied to code than to documentation (you would assume).

keithnz · 2026-02-19T01:08:21 1771463301

I always got the impression that the devblogs were mostly driven by the MS dev creating the blog post

lazyasciiart · 2026-02-19T01:49:54 1771465794

Yea, I have a post up there from a couple decades ago (maybe? I haven't looked, I don't know if they keep stuff up forever) and I guarantee you my code went through more review than that post did.

anonymars · 2026-02-19T02:36:05 1771468565

Agreed. And I think the quality of their talent pool overall these days is the common factor

themafia · 2026-02-19T01:58:24 1771466304

"Steal stuff and get away with it" is not an 'innovation' even though it may feel like one. The side you should pick is honesty.

direwolf20 · 2026-02-19T06:28:54 1771482534

On the contrary, getting away with breaking the law is most of the innovation in the past decade. Look at Uber and AirBNB, and cryptocurrency, and every AI company.

themafia · 2026-02-19T11:29:52 1771500592

The Chrome browser and the V8 engine are innovations. The Go language is an innovation. Pet cameras, simple as they are, are an innovation.

Uber is a rebadged taxi service with seedier people than before.

AirBnB is a less disguised but still rebadged B&B service with seedier people than before.

Charlie Munger said it best. Cryptocurrency is like seeing a bunch of people trading turds and saying to yourself "well.. I don't want to miss out!" The seediest of all people.

AI doesn't even really exist by any common definition. They have supremely weak and power hungry language models trained on terabytes of stolen data and reddit conversations.

Hell, watching a guy hammer himself in his own nuts on youtube is an innovation, and I think I'm going to go do /that/ now instead of being depressed. Watching "ow my balls" and baitin'. What's left?

jasonvorhe · 2026-02-19T20:48:20 1771534100

Bitcoin and shitcoin holders being among "the seediest of all people" while the Western oligarchy mailed each other the most vile things that probably happened iRL leaves a bitter taste. Don't know if you really thought this through.

themafia · 2026-02-19T22:22:56 1771539776

The seediest amongst all in this example.

If you're into cryptocurrency you should have /some/ pause over the fact that child pornographers, drug dealers and murderers all share your love of the technology. I'm sure that's just coincidence.

jasonvorhe · 2026-02-22T01:15:05 1771722905

Those people also drive cars, go shopping, have gardens, play online games, generally use the internet, and use the same money as you do. Now what?

I also use Tor, try to keep my stuff secure, just as they do.

That way of thinking doesn't seem helpful.

anonymars · 2026-02-19T02:34:36 1771468476

Yeah, I recently stumbled on some other devblogs post very similar in quality to the one that was linked here, which was basically wholesale plagiarism of a stackoverflow answer. I found it while searching for an error message.

I wasn't mad, just disappointed.

camkego · 2026-02-19T00:09:59 1771459799

The real cherry on top, is that the Microsoft link from the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.

https://www.kaggle.com/datasets/shubhammaindola/harry-potter...

More than just using the data, it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Also interesting, this blog post has been up since November of 2024, very surprising to me that Microsoft hasn't taken it down yet.

throwaway2037 · 2026-02-19T12:34:10 1771504450

Wow, that is a great catch. I looked at the Kaggle page. It has been up for two years. From the hamburger menu (top right), I tried: Report Dataset. When I click the button "Report illegal content", I am redirected to a Google page (huh?): https://support.google.com/legal/troubleshooter/1114905?prod...

When I try to fill out the questionnaire, my request is rejected with this message:

    We understand that you are not legally authorized to file a copyright complaint on behalf of the copyright owner.

    In accordance with applicable copyright laws, we only accept copyright complaints from copyright owners or their authorized representatives. If you have legal questions about copyright law, please consult your own legal counsel.

    We are sorry we cannot assist you further.

Hysterical. What a farce. That data set is pure theft.

throawayonthe · 2026-02-19T13:56:28 1771509388

i'm not sure why you think it's a farce though, not allowing third parties to file complaints

(e.g. see youtube, where this is (used to be?) poorly enforced, it's a mess)

chronobyte · 2026-02-19T19:02:47 1771527767

I thought their process was just a checkbox that said "trust me bro i own this" basically

bstsb · 2026-02-20T21:17:45 1771622265

it's more "you agree, under penalty of perjury, you are this individual or legally represent them". all DMCA stuff

Sohcahtoa82 · 2026-02-19T18:50:32 1771527032

Allowing third parties to open copyright complaints on behalf of the copyright owner opens a massive can of worms and is incredibly ripe for abuse.

nonfamous · 2026-02-19T23:41:36 1771544496

Kaggle is part of Google.

ChoGGi · 2026-02-20T02:21:21 1771554081

Welp, somebody certainly noticed now.

fxwin · 2026-02-19T00:17:48 1771460268

> it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

DSMan195276 · 2026-02-19T00:38:57 1771461537

> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Yes there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle, the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.

If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.

Retr0id · 2026-02-19T00:19:25 1771460365

> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Why wouldn't that apply?

xmprt · 2026-02-19T00:32:02 1771461122

I'm not a copyright expert and if you told me that Harry Potter was public domain then I'd probably be a bit surprised but wouldn't think it's crazy. The first book came out 30 years ago after all. On further research the copyright laws are way more aggressive than that (a bit too much if you ask me) but 30 years doesn't seem quick. Patents expire after 20 years.

jacquesm · 2026-02-19T00:33:32 1771461212

It would be incredibly naive to assume that a moneymaker like that is PD.

uniq7 · 2026-02-19T11:57:42 1771502262

Sherlock Holmes is public domain and there are still shows being announced

jacquesm · 2026-02-19T15:17:22 1771514242

New Sherlock Holmes works are copyrighted. Not by Conan Doyle...

ijk · 2026-02-19T05:06:57 1771477617

I find this fascinating, as I keep observing that there are pretty widespread differences between what people believe copyright does and what the law actually says.

manarth · 2026-02-19T09:13:00 1771492380

The Berne Convention (author's life + 50 years) is the baseline for the copyright laws in most countries. Many countries have a longer copyright period than Berne.

https://en.wikipedia.org/wiki/List_of_copyright_duration_by_...

alsetmusic · 2026-02-19T15:29:48 1771514988

I think even people who don't care about how broken the copyright system is understand intuitively that huge commercial properties that are contemporaneous with themselves are protected. They don't need to know any details to know that these properties belong to massive companies and aren't free for the taking.

How many people think they can rip off Disney characters even if they don't know how much Disney lobbied to extend their ownership? People can observe that no one but Disney gets to use them and understand, even if not consciously, that those are Disney's to use.

^ Probably poorly written without time to proof cause time constraint.

RupertSalt · 2026-02-20T02:24:13 1771554253

It is a media franchise for children, and there are many elements, and trademarks in addition to copyrights. I think most fans understand the bright line that stops them copying an entire book or film work, unless their dad has a Roku at home.

But there are over 34,000 images uploaded to the Fandom.com site alone. There are character bios and generous quotes from films and books. Countless fans are using elements in memes and avatars and social media posts.

Fan-fiction abounds, where the characters and scenarios are endlessly remixed and mashed up with other fandoms.

Quidditch... simulated... is a collegiate sport, but they had to rename it.

Even on the official Wizarding World site, you can make custom downloadable stuff. Not long ago, you could freely download wallpapers. You can get free clips and trailers on any video site.

News outlets had a difficult time explaining the "Public Domain" status of Mickey Mouse and Betty Boop with the new years. Because Mickey Mouse and Betty Boop, the characters, aren't the things which are copyrighted, and the characters' status didn't change with the new year.

I would bet that the typefaces in the official books have their own copyrights, and the book binding processes are patented.

rob_c · 2026-02-19T01:18:11 1771463891

The article author and the uploader should _BOTH_ be sentient enough to engage brain and not just ignore it because they feel "it's an abstract concept I'd not get in trouble for when not working in the US or EU".

pavon · 2026-02-19T02:16:47 1771467407

Copyright infringement is a strict liability tort in the US. Willful infringement can result in harsher penalties, but being mistaken about the copyright status is not a valid defense.

AdelaideSimone · 2026-02-19T22:10:48 1771539048

I don't know if you're trying to say that, in the realm of tort law, it is only strict liability, or if you are saying that copyright infringement is only a tort. If it's the latter, it's completely untrue, as there are criminal copyright infringement statutes.

pbrum · 2026-02-19T00:30:48 1771461048

Update: Microsoft has taken the page down. But posterity being what it is...

https://archive.is/D9vEN

ed_mercer · 2026-02-19T00:49:07 1771462147

But the article is from 2024! So someone at MS saw this thread?

keithnz · 2026-02-19T01:13:15 1771463595

most likely, there seem to be plenty of devs from nearly all major tech companies on HN, they often don't chime in as much anymore when it comes to problems, I've wondered if they get some kind of guidance on not commenting on "problems".

JKCalhoun · 2026-02-19T04:25:57 1771475157

The general guidance is likely what I was told when I worked at Apple: essentially, as an employee, people will read what you write as though you are repenting Apple whether you are or are not.

So in short, I kept my mouth shut. I assumed I would lose my job if my public comment reached the right people.

seb1204 · 2026-02-19T21:18:09 1771535889

Were you able to pick up issues and take them up internally? E.g., raise an internal ticket and comment on it?

JKCalhoun · 2026-02-20T00:49:01 1771548541

Oh, certainly.

To this day, even retired, I send bug reports to co-workers I know that are still at Apple. (I've sent a few image files that were problematic to the top engineer on the ImageIO team for example. I worked with him for over two decades before I retired.)

booleandilemma · 2026-02-19T07:52:15 1771487535

Do you repent working at Apple?

JKCalhoun · 2026-02-20T00:51:57 1771548717

No.

Apple is a very different place than it was when I started in 1995. Over the decades since I started, I have seen numerous changes I dislike. Sadly many of the changes were seen across the whole industry though so I would be no better off anywhere else.

I'm happy to have retired though. The industry lost a lot of what used to be fun.

ChoGGi · 2026-02-20T02:23:56 1771554236

Repent! Quit your job! Slack off!

I bid you good tidings on the slacking off part of it.

JKCalhoun · 2026-02-20T18:12:31 1771611151

Bless Bob, I've been trying to channel Slack for decades now!

verdverm · 2026-02-19T01:22:02 1771464122

if they do, they are not always followed, a Microslop employee tried to do damage control on Bluesky for the morged diagram, summoned the mob instead

themafia · 2026-02-19T02:03:20 1771466600

Half the point of "AI" is to squeeze the labor market. This is why you don't see people chiming in. It's a nearly fully corrupt and monopolized system.

seb1204 · 2026-02-19T21:19:29 1771535969

A good listen The 404 Media Podcast: What It’s Like to Be a Data Labeler Training AI

Media file: https://pdst.fm/e/clrtpod.com/m/pscrb.fm/rss/p/arttrk.com/p/...

basch · 2026-02-19T03:33:43 1771472023

the commit is visible https://github.com/Azure-Samples/azure-sql-db-vector-search/...

stogot · 2026-02-19T04:16:35 1771474595

Well that’s interesting. It shows they’re also infringing on Isaac Asimov’s Foundation series

https://github.com/Azure-Samples/azure-sql-db-vector-search/...

dd8601fn · 2026-02-19T01:10:37 1771463437

…still faster than they address critical vulnerabilities.

refulgentis · 2026-02-19T00:59:37 1771462777

Yes, HN's a pretty popular site :)

RupertSalt · 2026-02-19T02:42:42 1771468962

HTTP Referer

https://utcc.utoronto.ca/~cks/space/blog/web/HackernewsEffec...

AlienRobot · 2026-02-19T02:39:32 1771468772

I can't believe people with ties to Microsoft visit Hacker News.

andrelaszlo · 2026-02-19T01:32:27 1771464747

Did they also remove this article?

https://devblogs.microsoft.com/azure-sql/?p=4796

"Build a RAG App in 5 Minutes

Ever tried setting up an AI-powered project on Azure and felt overwhelmed? As a student or first-time user to cloud computing, I've been there too. The idea of creating a chatbot or search app using GPT sounds exciting, but the process of setting up everything right from the vector database, provisioning OpenAI models, to integrating them, it can f..."

kQq9oHeAz6wLLS · 2026-02-19T01:40:21 1771465221

That one is gone now, too

stogot · 2026-02-19T04:13:54 1771474434

Well, this proves infringement. JK Rowling can take them to court if she chooses.

LeoPanthera · 2026-02-19T06:17:02 1771481822

This is the same archive site that uses its captcha page to hijack your browser to DDOS people the site owner doesn't like.

I'm disappointed people continue to use it.

direwolf20 · 2026-02-19T06:49:00 1771483740

Feel free to create an alternative. Keep in mind it's completely illegal and you will get the book thrown at you if you are caught. You will also end up using your captcha page to DDOS people who are trying to unmask you.

lukeinator42 · 2026-02-19T00:36:04 1771461364

it's still up for me

beached_whale · 2026-02-19T00:03:26 1771459406

The AI generated thumbnail, https://devblogs.microsoft.com/azure-sql/wp-content/uploads/..., is that of young Harry and friend with a prominent MS logo. Wow

stuaxo · 2026-02-19T13:09:26 1771506566

AI is always bad at trains, I'm sure if the picture was wider there would be no gangway.

protocolture · 2026-02-19T02:51:30 1771469490

It doesn't offer a guide to piracy; it offers a guide on including specific data from a dataset into SQL so it can be referenced by an LLM.

If anything, Kaggle would be on the hook for including the data as CC0. Or perhaps Shubham Maindola, for uploading it. In fact the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."

Thorrez · 2026-02-19T07:27:25 1771486045

10.0 score, and there's literally a mistake in the first word of the text. ("M r." instead of "Mr.")

andsoitis · 2026-02-18T23:29:43 1771457383

This article is from 2024 and points to Kaggle, which hosts the data set.

I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.

Does anyone know whether there is some special reason why this has lasted so long without being taken down?

anonymous908213 · 2026-02-18T23:39:07 1771457947

My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.

[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.

zythyx · 2026-02-19T00:08:07 1771459687

Also, I imagine that most of those 10k downloads are probably from AI trainers that are just speed running through Kaggle to obtain absolutely anything to train their AI. There are definitely other, more 'known' ways to obtain these books without finding them as random text files in an AI dataset operation

selridge · 2026-02-19T00:24:18 1771460658

Why did you think that?

anonymous908213 · 2026-02-19T00:27:17 1771460837

It rubs me the wrong way that corporations get a free pass on copyright infringement, while the rest of us are prosecuted as harshly as possible if caught. I think this, together with the morging plagiarism, also indicates a pattern of behaviour from Microsoft that should be reformed. I would prefer if Microsoft were not able to produce AI slop degradations of other people's work and claim it as their own.

walletdrainer · 2026-02-19T02:21:01 1771467661

> while the rest of us are prosecuted as harshly as possible if caught

But this is just a lie.

Approximately nobody is prosecuted for copyright infringement.

queenkjuul · 2026-02-19T05:43:16 1771479796

Okay but people have had their lives ruined deliberately by media companies over it. I'm sure you knew what they meant.

walletdrainer · 2026-02-19T06:33:52 1771482832

No matter how generously you want to interpret it, it’s obviously false.

We’re moving the goalposts from the government systematically targeting normal people “if caught”, to only a handful of civil cases.

eggsome · 2026-02-19T07:26:25 1771485985

Sure, as a percentage it's very rare - but some people have died as a result: https://en.wikipedia.org/wiki/Aaron_Swartz

I think most would agree that cases like that act as a deterrent?

walletdrainer · 2026-02-19T08:46:10 1771490770

That’s not even more than tangentially copyright-related?

> I think most would agree that cases like that act as a deterrent?

I think we could hardly get any further from “the rest of us are prosecuted as harshly as possible if caught”.

anonymous908213 · 2026-02-19T17:30:29 1771522229

I refrained from replying to this until now because I felt this thread was excessively pedantic, but Aaron Swartz is in fact one of the cases I had in mind when writing my original comment about "harshly as possible". To say that his case was only "tangentially" copyright-related is whitewashing the copyright lobby of its complicity in his death. It is, in fact, the primary reason he died. The US government was trying to make an example out of him, and stacked every charge they possibly could, because of his act of copyright infringement. Perhaps, with a shallow understanding of the case, you might see the list of felonies he was charged with and come to the conclusion that copyright infringement was only one small part of the case against him. But the copyright infringement was the crux of the case, and the rest of the charges were "throwing the book at him" in the well-defined meaning of the term[1]. His suicide was a direct result of the overzealous prosecution attempting to ruin his life with charges wildly disproportionate to the harm he caused to society (ie. basically none). It is worth noting he had not even shared the material he had downloaded, although the prosecution made a case on asserting that they believed he intended to.

[1]https://en.wiktionary.org/wiki/throw_the_book_at

Now, as for "the rest of us are prosecuted as harshly as possible if caught". You are correct in your pedantry that this statement was not expressed as rigorously as it possibly could have been. There are different classes of copyright infringement; "receiving" and "perpetuating" being two of them [to avoid further pedantry, I am not asserting this is precise legal terminology but rather a lay distinction for the purposes of discussion]. It is the latter case which is tried as harshly as possible when caught, and there are many such examples other than Swartz, and I think my intent was clear when I said it despite the fact that I did not write about the distinction at length.

That is not to say the situation around the former type of copyright infringement is so kind, either. While in some countries it is mostly overlooked, which I believe to be the case in the US, in other countries it is more strictly enforced, such being the case in my own country. While "as harshly as possible" isn't accurate to prosecution against infringement of this nature, you can still be disproportionately punished relative to the damage caused when downloading pirated material for personal viewing, if caught (and ISPs/rightsholders do monitor for it to the best of their abilities).

There is also a third class of copyright infringement to consider which is highly disfavourable to individuals: derivative works. Strictly speaking, even something as simple as drawing fanart of a character or remixing a song is illegal, even if the activity is completely non-commercial in nature. This is, of course, absolutely ridiculous. Rightsholders know that copyright law reformation would gain tremendous popular support if they were draconian about enforcing their rights against derivative works, and that allowing fan communities to bloom is actually beneficial to their own IP, so enforcement is highly selective. However, that arbitrary, selective nature of enforcement is itself dangerous to individuals, and is sometimes used to punish specific individuals as harshly as possible at the whims of the IP holder. It is true that not everyone is actually subjected to this, but the threat of it happening looms over everyone who expresses their creativity through derivative works.

None of this sits right with me, especially as corporations are hoovering up every piece of copyrighted material they possibly can and creating commercial derivative-work-machines that mass-produce sloppified derivative works, and are getting a completely free pass by the legal system to do so while individuals are still treated like felons for 'crimes' that are at most marginally harmful, or in the case of the creative production of derivative works, not only not harmful but actually beneficial to society.

walletdrainer · 2026-02-20T13:11:46 1771593106

> It is, in fact, the primary reason he died

That’s plainly ridiculous. If Swartz killed himself over the few months in prison he was facing, the primary reason he died was almost certainly mental illness, and not how the legal system treated him.

anonymous908213 · 2026-02-20T15:36:31 1771601791

I see, you are just a troll. Congratulations on playing me, you got me, for whatever good that does you.

walletdrainer · 2026-02-20T21:10:01 1771621801

No, I’m just actually familiar with the case.

queenkjuul · 2026-02-27T21:33:33 1772228013

Nah you trolling

selridge · 2026-02-19T16:37:15 1771519035

Show me one person who has been punished in any way for pirating ole Jo's books.

Shit, I don't even think the people who screamed "Snape killed Dumbledore" at lines for book 6 based on leaked copies that hit before the street date got in any trouble.

mrguyorama · 2026-02-19T18:48:43 1771526923

>Shit, I don't even think the people who screamed "Snape killed Dumbledore" at lines for book 6 based on leaked copies that hit before the street date got in any trouble.

How could anyone possibly get in trouble for something that isn't a crime?

selridge · 2026-02-19T22:53:56 1771541636

That’s a question for someone who tattled about Microsoft and moaned about the people punished for book piracy.

Pirating a leaked book before launch and trying to dampen enthusiasm is basically as bad as you can get without actually competing. Even in that extreme case, no one was punished.

Nor should they have been but it sort of makes the moral argument for “Harry Potter shouldn’t be on kaggle” much weaker.

ryandrake · 2026-02-19T00:39:12 1771461552

In general, if you want to get away with a crime, just do it as a corporation or as a billionaire.

blibble · 2026-02-19T01:25:51 1771464351

brb poking Rowling on twitter

(done, contacted her lawyers too)

thunfischtoast · 2026-02-19T12:57:42 1771505862

Rowling is among the last persons on earth that need help on this.

But ignoring that: I do not think that these txt files being online do any economic harm. No one will go and say "hey, I'm going to read these un-formatted text files instead of buying the 30 year old books for little money or pirating proper epubs which are trivial to find". If anything, the kaggle dataset is free publicity. So as the author I would leave them online.

k__o · 2026-02-19T03:28:18 1771471698

make sure u worded it right or she'll block you

miki123211 · 2026-02-19T13:07:03 1771506423

Back in the day, text mining for science was mostly ignored, even though it was technically illegal. Authors didn't feel threatened by models for spam detection or sentiment classification. Demanding money from poor academics was pointless (they'd just move on to a different author) and bad PR.

I mean, books3 contained hundreds of thousands of copyrighted books, and people released it under their own name.

throwaway150 · 2026-02-19T02:04:16 1771466656

Page is gone.

Archived copy: https://web.archive.org/web/20260105115129/https://devblogs....

It is very worrying that people with no ethics work for these trillion dollar companies who are supposed to be shaping the technology of tomorrow.

ribosometronome · 2026-02-19T07:11:32 1771485092

>no ethics

Disrespecting the copyright on a multi-billion dollar franchise hardly comes close to the major unethical behavior the trillion dollar companies are committing.

Ekaros · 2026-02-19T11:37:16 1771501036

I am more worried about the lack of any thought that this might be a bad idea. If people who go through a very selective hiring process can't even figure out that publicising an article based on copyright theft is a bad idea, what is going on with actually impactful decisions?

thrKan · 2026-02-18T23:52:37 1771458757

In case the page disappears:

https://archive.is/7WLho

qingcharles · 2026-02-19T06:36:32 1771482992

https://southpark.cc.com/news/zi5uql/aannnd-it-s-gone

boznz · 2026-02-19T00:07:46 1771459666

More like when the page disappears

agluszak · 2026-02-19T00:28:47 1771460927

It disappeared already

freitasm · 2026-02-19T00:28:29 1771460909

And the original is gone.

rlabnm · 2026-02-19T00:35:51 1771461351

For redundancy in case archive.is is down:

https://web.archive.org/web/20260105115129/https://devblogs....

crtasm · 2026-02-19T00:45:11 1771461911

The superior link; no Google captcha.

FinnKuhn · 2026-02-19T01:13:30 1771463610

Some lawyer at Microsoft probably had a big scare browsing HN today.

WillMorr · 2026-02-19T00:21:07 1771460467

Since IP law is apparently dead, does anyone want to invest in my ai generated novel startup where it just spits out Harry Potter verbatim but uses a bunch of power to do so.

AlienRobot · 2026-02-19T02:37:03 1771468623

The bee movie, but every frame was passed through an AI to make it Ghibli style, the audio was turned into a transcript by a transcribing AI and then turned into audio by a TTS AI.

Very low code. Infinite scale. Name a better AI startup to invest in.

Kapura · 2026-02-19T00:21:58 1771460518

only if you tell me that it's a necessary step to creating robot slaves

Pfeil · 2026-02-19T00:43:59 1771461839

Robot slaves is a funny phrase if you consider that the origin of the word robot literally is a term that meant slave or "forced work". Language doing circles.

nz · 2026-02-19T02:24:24 1771467864

Not only that, but in Russian, the equivalent word for the verb "work" (as in "go work" or "do work") is "rabotay", which is derived from the word "rab", which is the word for "slave". So "to work" is literally "to slave" in Russian (and quite a few Slavic languages). An English speaker may categorize this as a linguistic anachronism, but a Slavic speaker would categorize this as linguistic honesty.

carefree-bob · 2026-02-19T02:52:10 1771469530

This is pretty common. In Hebrew aved means both "work" and "slavery" and you have the same in Arabic and other Semitic languages. In Ancient Egyptian "bak" is used for both "servant" and "worker". The ambiguity in the Hebrew is why many references to this are translated as "servile labor" in the King James, as they were uncertain of the sense of the term meant, or perhaps correctly guessed that both senses were meant. In many ancient languages, e.g. Ancient Egyptian, "worker" and "slave" were synonyms. In modern parlance "slavery" or "servitude" is viewed as an unspeakable evil and people are shocked that there is linguistic overlap with neutral terms like "work" or "labor", which are just ubiquitous parts of life, but historically this is quite common and it is true all around the world, for example in German "knecht" means both "servant" and "farm hand", and in Latin "minister" meant "servant" or "subordinate" (as opposed to "magister"), just like in English you have "server", "serve", "servant", "servile". In Sanskrit "dasa" originally meant "foreigner" or "enemy" and then later "slave" but over time it has come to be used as a suffix to denote someone who "serves" a deity voluntarily, e.g. "Ramdas". In Ancient Japanese you have "yakko" for a low status worker or servant, and later that evolved to footmen who carried baggage for samurai.

castral · 2026-02-19T03:17:31 1771471051

Wait until you find out what the word 'ciao' meant in the original Italian/Latin: 'ORIGIN: Italian dial. alt. of schiavo (I am your) slave from medieval Latin sclavus slave.'

Den_VR · 2026-02-19T00:43:21 1771461801

Are they an ethical alternative to the human version?

pixl97 · 2026-02-19T01:01:59 1771462919

I guess it depends if there is an A.I.[1] locked up somewhere in a cage forced to teleoperate it.

[1] actual indian

rgblambda · 2026-02-19T02:04:48 1771466688

Well they're not an alternative, so I suppose not. No one is being chained to a desk and made to author reports on how their department is aligning with the new business growth strategy. And the robot slaves aren't being designed to mine precious minerals or attach buttons to clothes.

Kapura · 2026-02-19T15:10:30 1771513830

False dichotomy. You don't need to own slaves! It's crazy you think you should!

cmxch · 2026-02-19T01:14:35 1771463675

That’s Herbert’s Dune.

ares623 · 2026-02-19T01:35:33 1771464933

correction, the _threat_ of robot slaves to bring back human slaves

themafia · 2026-02-19T02:00:09 1771466409

I have a new operating system. I call it "Vindows." Any similarity to an existing product is merely coincidence.

userbinator · 2026-02-19T02:52:50 1771469570

Generating infinite fanfics would probably be far more interesting and entertaining.

So far, the only thing I've found AI to be consistently good at is entertainment of the humorous kind.

fooker · 2026-02-19T03:36:37 1771472197

The whole fanfic ecosystem is quietly dying now.

Everything new is AI slop, and there seems to be no coming back from it.

bhadass · 2026-02-19T03:44:22 1771472662

but the slop will likely get better as models improve I guess

fooker · 2026-02-19T04:40:17 1771476017

Or worse as the models try harder to avoid generating copyrighted stuff

whatever1 · 2026-02-19T03:47:06 1771472826

Not for you silly. You still lose everything and go to jail if you violate IP law. It’s for billionaires.

nz · 2026-02-19T13:36:18 1771508178

The lesson that I am taking away from AI companies (and their billionaire investors and founders), is that property theft is perfectly fine. Which is a _goofy_ position to have, if you are a billionaire, or even a millionaire. Like, if property theft is perfectly acceptable, and if they own most of the property (intellectual or otherwise), then there can only be _upside_ for less fortunate people like us.

The implicit motto of this class of hyper-wealthy people is: "it's not yours if you cannot keep it". Well, game on.

(There are 56.5e6 millionaires, and 3e3 billionaires -- making them 0.7% of the global population. They are outnumbered 141.6 to 1. And they seem to reside and physically congregate in a handful of places around the world. They probably wouldn't even notice that their property is being stolen, and even if they did, a simple cycle of theft and recovery would probably drive them into debt).

rob_c · 2026-02-19T01:16:20 1771463780

I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright, or not blindly copying anything that is out there.

This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.

This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating through a situation to get a step up is the norm, it's just dirty behaviour.

hulitu · 2026-02-22T20:07:45 1771790865

> I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright

Yes, like USA. Copyright, and laws in general, are for you but not for me.

waffletower · 2026-02-19T16:52:26 1771519946

I have hated Microsoft for decades and am somewhat of an extremist when it comes to avoiding their products. That being said, this piracy shaming headline for a Microsoft research project example, not a product integration, is entirely misleading and hysterical. The lengths that stooges will go to protect copyright monopolies and eradicate fair use is also extreme and should be embarrassing.

teachrdan · 2026-02-19T17:20:11 1771521611

> The lengths that stooges will go to protect copyright monopolies and eradicate fair use is also extreme and should be embarrassing.

Microsoft has a market cap of almost $3 trillion. I think they can afford to pay for the texts they use in their AI research.

waffletower · 2026-02-19T22:25:37 1771539937

Their lawyers would argue, and I agree, that they legally don't have to. It is called Fair Use; there is an epidemic of publisher backed groupthink trying to deny its existence.

Guillaume86 · 2026-02-19T18:18:50 1771525130

Yeah it’s hilarious seeing people lose their shit over this and not like, every commercial LLM vendor...

hulitu · 2026-02-22T20:08:55 1771790935

> That being said, this piracy shaming headline for a Microsoft research project example, not a product integration, is entirely misleading and hysterical

Tell that to people haunted by BSA.

anonymous908213 · 2026-02-19T18:11:38 1771524698

The title does not shame piracy. It factually describes that the linked article is a Microsoft-published guide to piracy, wherein the instructions tell readers to commit the (illegal for normal people) act of downloading pirated material, while linking to said pirated material (also illegal for normal people), with further instructions on how to use that just-downloaded pirated material for LLM inference (maybe even illegal for corporations; Anthropic settled for $1.5B for using pirated books in its training) and publishing derivative works without license (illegal for normal people).

I hate the current copyright environment as much as anyone, but I do not abide double-standards, with a two-tier justice system wherein a corporation gets to freely enforce the draconian copyright regime against individuals while also getting to abuse individuals' creative works in ways much more egregious.

waffletower · 2026-02-19T22:18:26 1771539506

Read Section 107 of the Copyright Act. Microsoft lawyers would, and may have to, argue that their use of Harry Potter without permission is for valid research purposes. You are actively trying to negate Fair Use with your specious argument. That document may have been naive but it certainly isn't a piracy manual.

anonymous908213 · 2026-02-19T23:20:58 1771543258

There is absolutely no world where a judge is going to rule that linking to the full text of a book for people to download is Fair Use for research purposes. Successful arguments for Fair Use are significantly more limited in scope than people think, but this isn't even close. They might be able to argue it for their own use (unproven in court since Anthropic settled rather than taking it to trial, but the 1.5 billion dollar settlement indicates Anthropic had little faith in their odds), but there is no possible way you could argue that giving anyone on the internet downloads of the full book is Fair Use or necessary for research.

Out of curiosity, what would you describe a piracy manual as, if providing information as to where and how to illegally download copyrighted material is not it? What additional information would Microsoft have to have provided for it to cross the line into piracy? The only thing more illegal they could have done would be hosting the files themselves, but linking to files hosted otherwhere is still illegal, otherwise the loophole would make it child's play to ignore copyright laws. The Pirate Bay is the most famous example of attempting the legal strategy of "we don't host the copyrighted files, we just link to them", and it resulted in prison time for the founders.

fxwin · 2026-02-18T23:53:05 1771458785

I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.

Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.

blt · 2026-02-18T23:57:27 1771459047

What makes this different from linking to a random zip file somewhere?

zythyx · 2026-02-19T00:04:37 1771459477

Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels. Instead, they opted to use copyrighted works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).

bossyTeacher · 2026-02-19T01:31:14 1771464674

Rowling is known for using pseudonyms. Maybe she got tired of writing and decided to break into LLM tech.

fxwin · 2026-02-19T00:05:52 1771459552

The licensing: If I steal something and tell you it's free and yours for the taking, that feels different than a Fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgement from MS, but downloading a dataset (stated as public domain) from kaggle feels spiritually different from piracy (e.g.: if someone uploads a less known, copyrighted data set to kaggle/huggingface under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term)

Lerc · 2026-02-19T00:05:33 1771459533

The licence?

If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.

wongarsu · 2026-02-19T01:04:26 1771463066

Just because it says "CC0" does not make it CC0. If you upload a dataset you don't have the rights to, any license declaration you make is null and void, and anyone using it as if it had that license is violating copyright

Even if MS could claim that they were acting in good faith there really isn't much legal wiggle room for that. But it doesn't even come to that because I don't think anyone would buy that they really thought that the Harry Potter books were under the CC0

noosphr · 2026-02-19T05:39:14 1771479554

If you buy a pirated book on Amazon you get to keep the book and the pirate printer is the one prosecuted.

Same thing applies here.

Up to 80% of all works that are in copyright terms are accidentally in the public domain. A well known example is Night of the Living Dead. It is not your job to check that the copyright on a work you use is the correct one.

nhinck2 · 2026-02-19T06:40:46 1771483246

The only reason you get to keep the book is because no one bothers to enforce the law, this doesn't make it legal.

And it is your job to check that you have the rights to use other people's work. Ignorance is not a defence.

ribosometronome · 2026-02-19T07:17:35 1771485455

>the law

Which ones? As far as I was aware, it's a crime to redistribute copyrighted works, not receive.

nhinck2 · 2026-02-19T07:54:24 1771487664

Copyright act 1968. Sect 116.

Lerc · 2026-02-19T20:59:20 1771534760

Section 116 (2) A plaintiff is not entitled by virtue of this section to any damages or to any other pecuniary remedy, other than costs, if it is established that, at the time of the conversion or detention:

(a) the defendant was not aware, and had no reasonable grounds for suspecting, that copyright subsisted in the work or other subject-matter to which the action relates;

(b) where the articles converted or detained were infringing copies--the defendant believed, and had reasonable grounds for believing, that they were not infringing copies; or

(c) where an article converted or detained was a device used or intended to be used for making articles--the defendant believed, and had reasonable grounds for believing, that the articles so made or intended to be made were not or would not be, as the case may be, infringing copies.

Does this not mean the opposite of your claim? It sounds to me that if you unwittingly bought a dodgy copy of something, the law thinks the copyright owner can get you to pay for a legit copy, but not punish you for your mistake.

In the specific case of the Harry Potter works, the fame might meet the threshold of reasonable grounds for believing, but noosphr's argument that "Up to 80% of all works that are in copyright terms are accidentally in the public domain" could grant reasonable grounds for believing it is not.

This is one of those things that causes interesting court cases because reasonable grounds for believing X is not the same thing as no reasonable grounds for believing not X. Reasonable grounds for suspicion probably carries more weight here than reasonable grounds for the absence of suspicion, but cases have hung on things like this before, like the presence or absence of an Oxford comma.

noosphr · 2026-02-19T09:03:00 1771491780

Australia doesn't have fair use either. Who cares what a country smaller than California in population and economy does?

slopinthebag · 2026-02-19T00:23:37 1771460617

Oh come on. The licence was obviously incorrect and you can't escape culpability because of that.

philipwhiuk · 2026-02-19T01:19:14 1771463954

The 'artwork' they generated and the text on the blog post?

uyzstvqs · 2026-02-19T01:10:18 1771463418

To clarify: Microsoft linked to a dataset on Kaggle, which is falsely labeled CC0 (Public Domain). It's the fault of the user who uploaded the dataset and misrepresented the licensing.

Ekaros · 2026-02-19T07:29:19 1771486159

Multiple failures. One by the writer of the blog, for considering even for a moment that such a data set would be legal. The next by MS, for hiring a person with such poor judgement, namely publicly posting about it on a company platform instead of choosing some other data set.

robrain · 2026-02-18T23:55:53 1771458953

The original title was "LangChain Integration for Vector Support for SQL-based AI applications"

ASalazarMX · 2026-02-18T23:58:40 1771459120

For some reason I really like this.

electronsoup · 2026-02-19T00:20:44 1771460444

I guess the end of copyright is near if this is fine to put on a corporate website

larodi · 2026-02-19T00:38:00 1771461480

the end of reason and thought at corporations littered with fakers these days.

robrain · 2026-02-18T23:58:08 1771459088

Original title: "LangChain Integration for Vector Support for SQL-based AI applications"

anonymous908213 · 2026-02-19T00:06:23 1771459583

I don't believe that title conveys the actual significance of the article that makes it worthy of attention, so I hope HN may forgive me for coming up with an alternative title!

8cvor6j844qw_d6 · 2026-02-19T04:35:08 1771475708

Looks like the unwritten stance of large companies is copyrighted works are free to use for training.

Although this seems not to be reciprocal. Rule for thee, but not for me.

miffy900 · 2026-02-19T00:38:24 1771461504

I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.

If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...

Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.

cookiengineer · 2026-02-19T03:39:35 1771472375

Imagine having a specialized LLM agent that understands the Windows kernel and its source. Now that would be something cool for pentesting!

lak-102 · 2026-02-19T03:36:00 1771472160

How Microsoft protects its own IP:

https://news.microsoft.com/source/2004/02/12/statement-from-...

In case the new anti-copyright Microslop memory-holes that link:

https://web.archive.org/web/20260215220230/https://news.micr...

The tutorial could have used that leaked source code for "educational purposes", as many here claim.

dom96 · 2026-02-18T23:51:21 1771458681

How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with a close to 100% accuracy?

anonymous908213 · 2026-02-18T23:53:25 1771458805

This is already possible for Harry Potter specifically. There was a study demonstrating that Sonnet 3.7, among other models tested, could reproduce the first Harry Potter book 95.8% verbatim[1].

[1] https://arxiv.org/abs/2601.02671

dom96 · 2026-02-19T00:03:47 1771459427

Thanks for linking! I've been thinking about trying something like this myself.

Legend2440 · 2026-02-19T00:23:24 1771460604

...only if you deliberately attempt to extract it by repeatedly prompting it to complete fragments of the book. They had to do quite a bit of work to make this happen.

dom96 · 2026-02-19T00:42:17 1771461737

so? It demonstrates that LLM models retain the copyrighted material in their weights. This is an important thing to consider about LLMs and shows that there need to be better protections for the creative industry.

PeterStuer · 2026-02-19T05:18:55 1771478335

"there need to be better protections for the creative industry"

Why exactly?

fc417fc802 · 2026-02-19T01:21:25 1771464085

Really? I retain plenty of copyrighted material in my head. What matters is the contexts in which I reproduce it (if any).

A search index might also contain copyrighted material. As long as it's used for search queries as opposed to regurgitation there's no problem. Search indexes and LLMs are both clearly very beneficial tools to have access to.

themafia · 2026-02-19T02:05:09 1771466709

Reproduce it. Sit in a clean room and write it all out. Then go check your accuracy. I'm curious to see what it is.

fc417fc802 · 2026-02-19T02:09:39 1771466979

What does this (thought) experiment accomplish? That is, what point are you trying to make here?

Since we're talking about an electronic system the search index example is the more directly relevant one. Anyone who wants to object to LLMs is going to need to take care to ensure consistency with his views on Google's search index.

themafia · 2026-02-19T02:14:36 1771467276

I wasn't aware I could read 95% of Harry Potter through constructed queries using Google's search index. Can you demonstrate how I might do this?

Also can you point out how copyright law changes because we're using an "electronic system" as opposed to an "analog system?"

fc417fc802 · 2026-02-19T13:11:31 1771506691

You could do the equivalent if they would let you. They don't. That's the point I was getting at. How the thing is used is what actually matters, not that it has "absorbed" copyrighted material.

I never claimed any change in copyright law. Only that one analogy was more direct than the other for the purpose of the current discussion.

You didn't answer my question. What point were you trying to make with your earlier reply?

_DeadFred_ · 2026-02-19T02:20:31 1771467631

Are you a for profit product?

fc417fc802 · 2026-02-19T13:06:03 1771506363

Professional performers could certainly be viewed as such in this analogy. They memorize and then reproduce copyrighted material as a matter of course.

_DeadFred_ · 2026-02-19T19:44:46 1771530286

And when they do is when copyright protections might come into play. But not the basic learning of being a human being.

My playing copyrighted music on my synths at home, or singing lyrics along are different than if I am a professional musician benefiting financially from playing someone else's music in public.

Producing a product = market rules apply

Just living as a human = totally different thing

fc417fc802 · 2026-02-20T01:43:07 1771551787

Yes, I agree. That was my entire point when I said: What matters is the contexts in which I reproduce it (if any).

The issue is not (or at least should not be) that LLMs are trained on material subject to copyright or can be very intentionally coaxed into regurgitating copyrighted material. The issue should be people building or using systems with the explicit intent of reproducing copyrighted material in an unauthorized manner.

_DeadFred_ · 2026-02-20T08:06:50 1771574810

If an LLM is a product, and it contains the work (in this case can spit out Harry Potter) it is derivative. Doesn't matter what it's used for.

dragonwriter · 2026-02-21T20:00:00 1771704000

> If an LLM is a product, and it contains the work (in this case can spit out Harry Potter) it is derivative. Doesn't matter what it's used for.

That's not the definition of a derivative work in copyright law; further, whether what legally qualifies as a derivative work is within the scope of the exclusive rights of the copyright holder is, in the US, subject to whether it is within one of the exceptions to exclusive rights in the law, notably the fair use exception, which very much does depend on, among other things, what it is used for.

fc417fc802 · 2026-02-20T13:47:56 1771595276

That's dogma on your part. Rather than practical outcome you're opting for human exceptionalism. I can't accept that.

Merely containing a work doesn't make something derivative. A photograph could inadvertently capture a copyrighted image in the background but so long as it isn't the primary focus I think your line of reasoning there fails.

_DeadFred_ · 2026-02-21T19:54:04 1771703644

TIL that the law is dogma.

I'm opting for the law differentiating between a product and a person.

'We trained our model on Harry Potter and somehow Harry Potter got into our model' is a ridiculous defense.

fc417fc802 · 2026-02-21T23:52:53 1771717973

It is your view that's dogmatic. The law in this area has yet to be fully tested in court, let alone any prospective changes that might be made to it in the near future.

Regardless, I thought this was a discussion about what the law ought to say.

The defense is that the model is not designed to output Harry Potter verbatim, and in fact will not unless you jump through lots of hoops. Image generation would probably provide you with a stronger position here since those setups can easily output likenesses without needing to carefully engineer the prompt to cause them to do so. But even then it is clearly not the intention of the people training or deploying them that they be used that way.