Hacker News

Why do companies need all of these hard drives? I understand their need for memory and for boot. But storing text training data and text conversations isn't that space-intensive. A few companies are doing video models, so I can see how that takes a tremendous amount of space. Is it just that?



Hearing about their scraping practices, it might be that they are storing the same data over and over and over again. And yes, audio and video are likely something they are planning for or already gathering.

And if they produce a lot of video, they might keep copies around.


All the latest general-purpose models are multimodal (except DeepSeek, I think). Transfer learning allows them to improve results even after they have exhausted all the text on the internet.

I am surprised by that too. I thought everyone had moved to SSDs or NVMe?

I was toying with getting a 2 TB HDD for a BSD system I have. I guess not now :)


Everyone moved to SSDs or NVMe. If you're right, that includes manufacturers. HDDs still have advantages over SSDs for specific needs, like more reliable long-term unpowered storage. It's also possible that the high price of SSDs made HDDs an option again.

Really, if you're writing large contiguous files, hard drives aren't that bad. If you can have the system write out one file per drive at a time, then you'll avoid a lot of the fragmentation.

Storing training data: for example, Anthropic bought millions of second-hand books and scanned them:

https://www.washingtonpost.com/technology/2026/01/27/anthrop...


All of Anna's Archive can fit on 40 drives.

Not if you "scan" them by recording 4K video of someone flipping page after page, you know, to teach multimodal models.

Facts. Anything less than 4K/120fps simply won't cut it in '26. Anthropic ain't just flipping pages, they're flipping the world.

Speaking from personal experience: we treat cloud storage like an infinitely deep bucket. At-rest data efficiency is not really a consideration because compute costs are so absurd. Why worry about a $2M/year storage bill when your compute bill is $500M? It's not worth the engineering time to optimize.

I think the somewhat hallucinatory canned response is that they distribute data across drives for massive throughput. Though idk if that even makes sense technically...


