I encourage everyone with even a slight interest in the subject to download a ra...

dist-epoch · 2025-10-21T20:29:50 1761078590

Karpathy made a point recently that the random Common Crawl sample is complete junk, and that something like an WSJ article is extremely rare in it, and it's a miracle the models can learn anything at all.

andai · 2025-10-21T22:39:33 1761086373

>Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information.

>The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out.

https://x.com/karpathy/status/1797313173449764933

Context: FineWeb-Edu, which used Llama 70B to [train a classifier to] filter FineWeb for quality, rejecting >90% of pages.

https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...

WA · 2025-10-22T05:32:58 1761111178

Don‘t forget the terabytes of torrented ebooks.

https://www.tomshardware.com/tech-industry/artificial-intell...

https://www.classaction.org/news/1.5b-anthropic-settlement-e...

nativeit · 2025-10-27T01:35:08 1761528908

I suspect this is the "secret sauce" to making a coherent model.

jojobas · 2025-10-21T23:30:06 1761089406

From the current WSJ front page:

Paul Ingrassia's 'Nazi Streak'

Musk Tosses Barbs at NASA Chie After SpaceX Criticism

Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

A Small North Carolina College Becomes a Magnet for Wealthy Students

Cracker Barrel CEO Explains Short-Lived Logo Change

If that's the benchmark for high quality training material we're in trouble.

anigbrowl · 2025-10-22T03:45:13 1761104713

In general I find WSJ articles very well written. It's not their fault if much of today's news is about clowns.

dclowd9901 · 2025-10-22T05:12:31 1761109951

Their editorial department is an embarrassment imo. Sycophancy for conservatism thinly veiled as intellectualism.

anigbrowl · 2025-10-22T06:38:08 1761115088

I also hate their editorial department, I'm just saying that the news articles are well written in a technical sense rather than because I like their editorial positions or choice of subject mattter.

stocksinsmocks · 2025-10-22T00:40:24 1761093624

There is very, very little written work that will stand the test of time. Maybe the real bitter lesson is that training data quality is inversely proportional to scale and the technical capabilities exist but can never be realized

throwaway314155 · 2025-10-21T18:16:41 1761070601

> But I assume the data cleaning process removes such content before pretraining? ;)

I didn't check what you're referring to but yes, the major providers likely have state of the art classifiers for censoring and filtering such content.

And when that doesn't work, they can RLHF the behavior from occurring.

You're trying to make some claim about garbage in/garbage out, but if there's even a tiny moat - it's in the filtering of these datasets and the purchasing of licenses to use other larger sources of data that (unlike Common Crawl) _aren't_ freely available for competition and open source movements to use.

jedimastert · 2025-10-22T11:25:40 1761132340

> purchasing of licenses to use other larger sources of data

https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...