DeepSeek's smallpond: Bringing Distributed Computing to DuckDB

OutOfHere · 2025-03-04T11:30:33 1741087833

DeepSeek is the real "open<something>" that the world needed. With these three projects, DeepSeek has addressed not only efficient AI but also distributed computing:

1. smallpond: https://github.com/deepseek-ai/smallpond

2. 3fs: https://github.com/deepseek-ai/3FS

3. deepep: https://github.com/deepseek-ai/DeepEP

swyx · 2025-03-04T13:57:10 1741096630

how many companies will actually adopt 3FS now that it's open source?

not a hater, just know that there's a lot of hurdles to adoption even if something is open source - for example not being an industry standard. i don't know a ton about this space - what is the main alternative?

skeeter2020 · 2025-03-04T16:19:34 1741105174

to me this seems to target a pretty small audience: very big data and specific problem domains, where you need killer devops chops, expensive & specialized infrastructure and a desire to build out on bleeding edge architecture. I'd suspect most with these characteristics will stick with what they've got, "medium Big Data" companies should probably go with hosted services, and the rest of us stick with a single node DuckDB.

0cf8612b2e1e · 2025-03-04T17:40:19 1741110019

Bingo. Very few organizations have petabytes of data that they are trying to process efficiently for machine learning. Such organizations already have personnel and technology in place offering some kind of solution. Maybe this is an improvement, but it is quite unlikely to be offering new capabilities to such teams.

datadrivenangel · 2025-03-04T20:04:07 1741118647

And the organizations that get large enough to be sad with DuckDB performance will have options like MotherDuck for cloud hosting

huntaub · 2025-03-04T16:03:15 1741104195

For example, in AWS, you can get a similar FSx for Lustre file system for just 11% more cost, which could be worth it to avoid the management costs of running your own storage cluster.

dkdcwashere · 2025-03-04T13:58:32 1741096712

thank goodness, we’ve had nothing open to do efficient distributed computing with for years!

OutOfHere · 2025-03-04T22:39:40 1741127980

At least there hasn't been anything for distributed DuckDB before it afaik. For anyone with a substantial DuckDB project, they might now go distributed without having to rewrite it in something else.
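To make the "go distributed without a rewrite" idea concrete, here is a minimal sketch of the general pattern: hash-partition the rows by key, run the same unchanged SQL on each partition independently, and concatenate the results. This is not smallpond's API. The stdlib `sqlite3` module stands in for DuckDB so the example runs anywhere, and the names `hash_partition`, `run_partition`, `distributed_min_max` and the `prices` table are all hypothetical.

```python
import sqlite3
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key_index, n_parts):
    """Route each row to a partition using a stable hash of its key column."""
    parts = defaultdict(list)
    for row in rows:
        part_id = crc32(str(row[key_index]).encode()) % n_parts
        parts[part_id].append(row)
    return parts


def run_partition(rows):
    """Run the unchanged SQL on one partition.

    Assumption: sqlite3 is only a stand-in here for a per-worker DuckDB.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
    con.executemany("INSERT INTO prices VALUES (?, ?)", rows)
    result = con.execute(
        "SELECT ticker, MIN(price), MAX(price) FROM prices GROUP BY ticker"
    ).fetchall()
    con.close()
    return result


def distributed_min_max(rows, n_parts=3):
    """Partition by ticker, aggregate each partition, concatenate results."""
    out = []
    # All rows for a given ticker land in the same partition, so the
    # per-partition GROUP BY results need no further merge step.
    for part_rows in hash_partition(rows, 0, n_parts).values():
        out.extend(run_partition(part_rows))
    return sorted(out)
```

In a real deployment each partition would run on a separate worker rather than in a loop; the point is only that the SQL itself does not change.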

jakozaur · 2025-03-04T12:52:58 1741092778

It was already on HN recently:

https://news.ycombinator.com/item?id=43200793

https://news.ycombinator.com/item?id=43232410

ogarten · 2025-03-04T07:08:16 1741072096

Looks like we are approaching the "distributed" phase of the distributed-centralized computing cycle :)

Not saying this is bad, but it's just interesting to see after being in the industry for 8 years.

antupis · 2025-03-04T07:49:48 1741074588

Was it already happening when platforms started supporting stuff like Iceberg? But it is kinda nice to see. Things like Snowflake definitely have their place in the ecosystem, but too often at the margins, especially with huge workloads, Snowflake creates more issues than it solves.

greenavocado · 2025-03-04T16:16:17 1741104977

Were you there when we had to work with our data in Teradata and SAS and hundreds of multi hundred MB Excel spreadsheets containing analytical data? 30+ minute queries were the norm. Snowflake was a breath of fresh air.

data_marsupial · 2025-03-04T18:24:07 1741112647

I work with Teradata every day and can query years of event data in seconds.

ogarten · 2025-03-04T07:53:45 1741074825

Yes, not saying this is bad at all, just kind of funny. When you think about it, it makes sense though. Why wouldn't someone want the possibility to distribute an efficient engine?

nemo44x · 2025-03-04T19:12:50 1741115570

Isn’t the whole point of DuckDB that it’s not distributed?

this_user · 2025-03-04T20:57:38 1741121858

1. Our technology isn't powerful enough, we need to scale by distributing it.

2. The distributed technology is powerful but complex, and most users don't need most of what it offers. Let's build a simple solution.

3. GOTO 1

calebm · 2025-03-05T19:13:24 1741202004

I had the same question - what does this add beyond normal DuckDB?

biophysboy · 2025-03-04T20:46:19 1741121179

I thought the same thing; perhaps it's distributed into fewer chunks.

benrutter · 2025-03-04T19:56:51 1741118211

I'm not massively knowledgeable about the ins and outs of DeepSeek, but I think I'm in the right place to ask. My understanding is DeepSeek:

- Created comparable LLM performance for a fraction of the cost of OpenAI using more off-the-shelf hardware.

- Seems to be open sourcing lots of distributed stuff.

My question is, are those two things related? Did distributed computing allow the AI model somehow? If so how? Or is it not that simple?

zwaps · 2025-03-04T20:20:16 1741119616

These types of models need to be trained across thousands of GPUs, which requires distributed engineering on a much higher level than "normal" distributed systems.

This is true for DeepSeek as well as for others. There are a few companies giving insights or open-sourcing their approaches, such as Databricks/Mosaic and, well, DeepSeek. The latter also did some particularly clever stuff, but if you look into details so did Mosaic.

OpenAI and Anthropic likely have distributed tools of even greater sophistication. They are just not open source.

benrutter · 2025-03-05T09:46:00 1741167960

Thanks, that's a really great/helpful explanation!

maknee · 2025-03-04T20:36:25 1741120585

Does anyone have blogs with benchmarks to show the performance of running smallpond let alone 3fs + smallpond?

A lot of blogs praise these new systems, but don't really provide any numbers :/

cmollis · 2025-03-04T15:49:41 1741103381

spark is getting a bit long in the tooth.. interesting to see duckdb integrated with Ray for data-access partitioning across (currently) 3FS. probably a matter of time before they (or someone) support S3. It should be noted that duckdb (standalone) actually does a pretty good job scanning s3 parquet on its own.