
[WarpStream co-founder and CTO here]

1. Each WarpStream Agent flushes a file to S3 with all the data for every topic-partition it has received requests for in the last ~100ms or so. This means the S3 PUT operation costs scale with the number of Agents you run and the flushing interval, not the number of topic-partitions. We do not acknowledge Produce requests until data has been durably persisted in S3 and our cloud control plane.

2. We think people shouldn't have to choose between reliability and costs. WarpStream gives you the reliability and availability of running in three AZs but with the cost of one.

3. We have a custom metadata database running in our cloud control plane which handles ordering.
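The PUT-cost scaling described in point 1 can be sketched with a back-of-the-envelope calculation (the S3 price below is an assumed us-east-1 list price, not a figure from the thread):

```python
# Back-of-the-envelope PUT cost (assumed price of $0.005 per 1,000 PUTs).
# PUT count scales with agents x flush rate, independent of partition count.
def puts_per_second(num_agents: int, flush_interval_s: float) -> float:
    return num_agents / flush_interval_s

S3_PUT_PRICE = 0.005 / 1000   # assumed price per PUT request
MONTH_S = 30 * 24 * 3600

# 10 agents flushing every 100 ms, whether they host 10 or 10,000 partitions:
monthly = puts_per_second(10, 0.1) * MONTH_S * S3_PUT_PRICE
print(round(monthly, 2))  # 1296.0 (dollars/month)

# Shrinking the flush interval 100x (100 ms -> 1 ms) raises PUT cost 100x:
assert puts_per_second(10, 0.001) == 100 * puts_per_second(10, 0.1)
```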



It sounds like there's a sweet spot here. If you are not ACKing Produce requests for 100ms, then there's a huge amount of latency. If the user wants to reduce that latency from 100ms to, say, 1ms, then their S3 PUT request costs just went up by 100x.


[WarpStream co-founder here]

We've done lots of customer research here and, combined with the experience my co-founder and I have, we can confidently say most Kafka users (especially high-throughput users) would happily accept a trade-off of increased end-to-end latency in exchange for a massive cost reduction and the operational simplicity provided by WarpStream.


Replying here instead of below because we hit the depth limit. WarpStream definitely isn't magical; it makes a very real trade-off around latency.

On the read side, the architecture is such that you'll have to pay for 1 GET request per 4 MiB of data produced, for each availability zone you run in. If you do the math, it is much cheaper than manually replicating data across zones and paying for interzone networking.
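That math can be sketched under assumed list prices (roughly $0.0004 per 1,000 GETs and ~$0.02/GiB of inter-AZ transfer; both are assumptions, not figures from the thread):

```python
# Compare the stated read cost (1 GET per 4 MiB produced, per AZ) against
# classic cross-AZ replication. All prices here are assumptions.
GIB = 1024**3
GET_PRICE = 0.0004 / 1000   # assumed price per GET request
INTER_AZ = 0.02             # assumed $/GiB of inter-AZ transfer (both directions)
AZS = 3

def warpstream_read_cost(bytes_produced: float) -> float:
    """1 GET per 4 MiB of data produced, per availability zone."""
    gets = bytes_produced / (4 * 1024**2) * AZS
    return gets * GET_PRICE

def replication_cost(bytes_produced: float) -> float:
    """Replicating each byte to two other zones crosses the AZ boundary twice."""
    return bytes_produced / GIB * 2 * INTER_AZ

tb = 1024 * GIB  # 1 TiB produced
print(f"GETs across 3 AZs: ${warpstream_read_cost(tb):.2f}")   # $0.31
print(f"inter-AZ transfer: ${replication_cost(tb):.2f}")       # $40.96
```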

RE: deletes. Deleting files in S3 is free; it can just be a bit annoying to do, but the WarpStream agents manage that automatically. It's creating files that is expensive, but the WarpStream storage engine is designed to minimize this.

I will do a future blog post on how we keep S3 GET costs minimal, it’s difficult to explain in a HN comment on mobile. Feel free to shoot us an email at founders@warpstreamlabs.com or join our slack if you care for a more in depth explanation later!


Very interesting trade-off! I was curious what you and Ryan were cooking post DDOG. "cost-effective serverless kafka" is a very interesting play. And congrats on the public announcement for "shipping Husky", finally. --Marc


It could be easy to operate when everything is fine, but what about incidents? If I understand correctly, there is a metadata database (BTW, is it multi-AZ as well?). But what if there is a data-loss incident and some metadata is lost? Is it possible to recover from S3? If so, I guess that can't be very simple and should require a lot of time, because S3 is not that easy to scan to find all the artefacts needed for recovery.

Also, this metadata database looks like a bottleneck. All writes and reads go through it, so it could be a point of failure. It's probably distributed, and in that case it has its own complex failure modes and has to be operated somehow.

Also, putting data from different partitions into one object is something I'm not very keen on. You're introducing a lot of read amplification, and S3 bills for egress. So if the object has data from 10 partitions and I only need 1, I'm paying for 10x more egress than I need. The doc mentions fanout reads from multiple agents to satisfy a fetch request; I guess this is the price to pay for that.

This also affects the metadata database. If every object stores data from one partition, the metadata can be easily partitioned. But if an object can have data from many partitions, it's probably difficult to partition. One reason Kafka/Redpanda/Pulsar scale very well is that their data and metadata can be easily partitioned, and those systems don't have to handle as much metadata as I think WarpStream has to.


[WarpStream CTO here]

I'm not going to respond to your comment directly (we've already solved all the problems you've mentioned), but I thought I should mention for the sake of the other readers of this thread that you work for Redpanda which is a competitor of ours and didn't disclose that fact. Not a great look.

https://github.com/Lazin


I'm not asking anything on behalf of any company; I'm just genuinely curious (and I don't think we're competitors, since both systems are designed for totally different niches). I'm working on a tiered-storage implementation, btw. It looks like the approach here is the total opposite of what everyone else is doing. I see some advantages but also disadvantages to this. Hence the questions.


I don’t want to disagree with the research here, but what is not evident from the article is that this is not a magical solution that improves upon Kafka hands down, but rather one that makes trade-offs someone might be willing to entertain. I think the query side may be quite suboptimal in this setup, if I understand it correctly. Correct me if I am wrong, but if two agents write to a single topic, I would need to read two files to consume it. Also, I remember infamous stories about the cost of deleting data from S3; how do you tackle that if you have that many individual files? With these trade-offs, how does the solution compare to using Aurora?


Is it possible to have a 'knob' here? Some topics might need low latency even if most don't. My sense, reading this, is that while most topics/use cases will be fine on WarpStream, some will not be.


Yes, Kafka is definitely in an awkward latency spot.


Won't that be a problem for high-traffic topics? Kafka latency is usually in the single-digit milliseconds. For a topic with high throughput, a typical Java client instance can send thousands of messages per second. When the acknowledgement latency increases to 1000ms, the producer client would need multiple threads to handle the blocking calls. Either the producer will have to scale to multiple instances, or it risks crashing with out-of-memory errors.


(WarpStream cofounder)

Yeah, you have to produce in parallel and use batching, but it works well in practice. We've tested it up to 1 GiB/s of throughput without issue.
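A rough Little's-law sketch (with hypothetical numbers) shows why batching and parallelism are needed: the producer must keep roughly throughput times ack-latency bytes in flight.

```python
# Little's law: sustained throughput = in-flight bytes / ack latency, so the
# producer must buffer throughput x latency bytes. Numbers are hypothetical.
MIB = 1024**2

def in_flight_bytes(throughput_bps: float, ack_latency_s: float) -> float:
    return throughput_bps * ack_latency_s

# 100 MiB/s at Kafka-like ~5 ms acks vs. object-storage-like ~400 ms acks:
print(round(in_flight_bytes(100 * MIB, 0.005) / MIB, 2))  # 0.5 (MiB buffered)
print(round(in_flight_bytes(100 * MIB, 0.400) / MIB, 2))  # 40.0 (MiB buffered)
```

The extra buffering is why a single synchronous producer thread would stall; many in-flight batches keep the pipe full.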


Does WarpStream guarantee correct order inside a partition only for acknowledged messages, or also among acknowledged messages in different batches? If the latter, how do you keep clocks synchronized between the agents?


(WarpStream founder)

It guarantees correct ordering inside a partition for all acknowledged messages regardless of which batch they originated from. We don't synchronize clocks; the agents call out to our cloud control plane, which runs a per-cluster metadata store that assigns offsets to messages at commit time. Effectively, "committing" data involves three steps:

1. Write a file to S3.

2. "Commit" that file to the metadata store, which assigns the offsets at commit time.

3. Return the assigned offsets to the client.
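A toy version of that commit flow (hypothetical code, not WarpStream's actual implementation) shows why no clock synchronization is needed: ordering comes entirely from the sequence in which the metadata store commits files.

```python
# Toy metadata store: offsets are assigned per partition, in the order files
# are committed, so agent clocks never matter for ordering.
from dataclasses import dataclass, field

@dataclass
class MetadataStore:
    next_offset: dict = field(default_factory=dict)  # partition -> next offset

    def commit_file(self, file_id: str, record_counts: dict) -> dict:
        """Assign a contiguous offset range to each partition in the file."""
        assigned = {}
        for partition, n in record_counts.items():
            start = self.next_offset.get(partition, 0)
            assigned[partition] = (start, start + n - 1)
            self.next_offset[partition] = start + n
        return assigned

store = MetadataStore()
print(store.commit_file("s3://bucket/file-1", {"topic-0": 3}))
# {'topic-0': (0, 2)}
print(store.commit_file("s3://bucket/file-2", {"topic-0": 2}))
# {'topic-0': (3, 4)}  (strictly after every offset in the first commit)
```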


So the ordering is at the batch level? Say batch1 is committed with a smaller timestamp than batch2; then all messages in batch1 are considered prior/earlier to any message in batch2?


It’s getting late and I’m not 100% sure what you’re asking, but I believe the answer to your question is yes. When you produce a message/batch, you get back offsets for each message you produced. If you produce again after receiving that acknowledgement, the next set of offsets you receive is guaranteed to be larger than any of the offsets from your previous request.


> We have a custom metadata database running in our cloud control plane which handles ordering.

This is the secret sauce, right there. Anybody can host a bunch of topics and artefacts on S3, and have them locally replicated if they want. However, there is no world where you can have that without a synchronisation service that ensures cursors are unique and properly ordered.


Flushing every 100ms means you would end up with lots of tiny files (a few bytes each) in S3, unless you have something out-of-process rewriting them into larger blobs, similar to Delta Lake's OPTIMIZE?

Lots of tiny files would be really inefficient from a throughput and API-call perspective in blob storage.

With the acks, you have up to 100ms of waiting for the buffer to fill, plus the S3 PUT request, plus your metadata request/response. For high throughput, that must mean very high latency, putting back pressure on partitions?
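The tiny-files concern can be quantified with a quick sketch (throughput figures are hypothetical): the per-flush file size is the per-agent throughput multiplied by the flush interval.

```python
# File size per flush = (cluster throughput / number of agents) x interval.
# Throughput figures below are hypothetical.
MIB = 1024**2

def file_size_bytes(cluster_bps: float, agents: int, interval_s: float = 0.1) -> float:
    return cluster_bps / agents * interval_s

print(round(file_size_bytes(100 * MIB, 10) / MIB, 3))  # 1.0 (MiB per file)
print(round(file_size_bytes(1 * MIB, 10) / 1024, 2))   # 10.24 (KiB per file)
```

At high throughput the files are reasonably sized; the tiny-file worry mostly applies to low-throughput clusters with many agents.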


Behind the scenes, if they're sinking to S3 using Iceberg, it handles compaction via its maintenance API.


This is continuing the trend of cloud pricing driving system designs more than the underlying hardware. AWS overcharges for inter-AZ traffic between EC2 instances, but undercharges for inter-AZ traffic between EC2 instances and S3.

It makes perfect sense to design this way, and as your blog post mentions, people have made similar realizations for columnar databases, and map-reduce frameworks.


Related to 1: if I understood correctly, the agent generates a single object per flushing interval containing all the data across all topics it has received. Does this mean that when reading, the consumer needs to fetch data from multiple partitions just to access a single partition? And how does the WarpStream Agent handle horizontal scaling/partitioning of the stream on the consuming side?


[WarpStream co-founder here]

That is correct about flushing. RE: consuming, the TL;DR is that the agents in an availability zone cluster with each other to form a distributed file cache such that no matter how many consumers you attach to a topic, you will almost never pay for more than 1 GET request per 4 MiB of data, per zone. Basically, when a consumer fetches a block of data for a single partition, that triggers an "over read" of up to 4 MiB of data that is then cached for subsequent requests. This cache is "smart" and will deduplicate all concurrent requests for the same 4 MiB blocks across all agents within an AZ.

It's a bit difficult to explain succinctly in an HN comment, but hopefully that helps.
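A minimal sketch of such a block cache (behavior inferred from the description above; not actual WarpStream code) might look like this:

```python
# Round every read out to aligned 4 MiB blocks and cache them, so repeated or
# overlapping reads of the same region cost at most one GET per block.
BLOCK = 4 * 1024**2

class BlockCache:
    def __init__(self, fetch_from_s3):
        self.fetch_from_s3 = fetch_from_s3  # (file, block_start) -> 4 MiB bytes
        self.blocks = {}                    # (file, block_start) -> cached bytes
        self.s3_gets = 0

    def read(self, file: str, offset: int, length: int) -> bytes:
        out = b""
        first_block = offset // BLOCK * BLOCK
        for start in range(first_block, offset + length, BLOCK):
            key = (file, start)
            if key not in self.blocks:      # "over read" the whole 4 MiB block
                self.blocks[key] = self.fetch_from_s3(file, start)
                self.s3_gets += 1
            lo = max(offset - start, 0)
            hi = min(offset + length - start, BLOCK)
            out += self.blocks[key][lo:hi]
        return out

cache = BlockCache(lambda file, start: bytes(BLOCK))  # stub S3 fetch
cache.read("f1", 0, 1024)      # cold: fetches one 4 MiB block
cache.read("f1", 1024, 2048)   # warm: served from cache
print(cache.s3_gets)           # 1
```

In the real system the cache is shared across agents in an AZ with request deduplication, which is what keeps it to roughly one GET per block per zone.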


Is there a reason you built that cache layer yourself (rather than each node "just" running its own sidecar MinIO instance that writes through to the origin object store)?


(WarpStream co-founder)

The cache is for reads, not writes. There is no cache for writes.

We built our own because it needed to behave in a very specific way to meet our cost/latency goals. Running a MinIO sidecar means every agent would effectively have to download every file in its entirety, which would not scale well and would be expensive. We also have a pretty hard-and-fast rule about keeping WarpStream deployment as simple as rolling out a single stateless binary.


Does Azure’s Append Blob support (and AWS’ lack thereof) provide any inherent performance advantages for Azure vs. AWS?



