How Palo Alto Networks Replaced Kafka with ScyllaDB for Stream Processing (scylladb.com)
126 points by cyndunlop on June 15, 2022 | hide | past | favorite | 55 comments


Well, that was very uninformative.

> To meet our performance expectations, Kafka must work from memory, and we don’t have much memory to give it... ...Even the smallest customer required two or three Kafka instances

A) What performance expectations? Timeliness?

B) What's "not much" memory?

C) And why don't you have that much?

D) When you say instance, do you mean "broker", or actual clusters? Why did the smallest customer need 2 to 3 of them?


D) Sounds like clusters to me: "Palo Alto Networks provides each customer a unique deployment."


Why does scylla not need clustering when kafka does?


Err, Scylla does... here's some link text from their docs.

> Provision and manage a Scylla cluster in your environment.


I get not wanting to add yet-another-system to reduce operational complexity, but it seems more economical to use a system like Flink to do a time-windowed join and emit single records to be written to a persistence store. The Flink time window can be sufficiently large to encompass the disparity between ingest and event time without much RAM consumption by using a RocksDB state backend on the operator. Let me know if I'm missing something, every use case is different :)


I have written Flink pipelines that solved problems similar to this post's: sessionization of time-skewed data points from multiple sources with variable and possibly large delays; simple and economical is not how I would describe it :) .

If one of your data sources lags behind, Flink will buffer a huge amount of data waiting for it to catch up, due to how watermarks work when joining two streams, and you can still hit out-of-memory errors from time to time, even with RocksDB, if your session window gets too large. In addition, with our state size frequently reaching hundreds of GBs, recovering from failure was not exactly fast either.


As I understood, they wanted to eliminate the MQ layer. If they used Flink, they would still need to keep Kafka and they would be introducing another layer of complexity with Flink.

So, instead of simplifying they would make the stack more complex.


This, especially when some records can be "several megabytes".


They don't go into much of the detail of their "event consumers" but it certainly sounds like something Flink can handle, although Flink itself can be yet another operational set of headaches on top of Kafka. It also seems like something simple enough (simple processing-time windows) that they might not need Flink for the consumer.

I have a hard time recommending Flink over dumber consumers or a more micro-batch vs. true streaming approach unless you're doing something that really needs the long-lived in-worker keyed state and the ability to do things like streaming joins and all. Otherwise the Flink gotchas and nasty surprises can outweigh the ease with which it lets you do what you want to do.


From earlier in the article:

> Clock skew across different sensors: Sensors might be located across different datacenters, computers, and networks, so their clocks might not be synchronized to the millisecond.

And later on in their final solution

> Implementation4 cons: Producers and consumers must have synchronized clocks (up to a certain resolution)

How do they reconcile this skew in their final solution?


It's simple, and the same in all implementations: `Wait some amount of time to allow related events to arrive`.

For implementation 4, I suppose it's done by reading only data points with `insert_time < NOW() - some_delays`.

I think the requirement `Producers and consumers must have synchronized clocks (up to a certain resolution)` is simply so that there's no unnecessary lag between producers and consumers.
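A rough Python sketch of that delayed-read idea (all names here are my own guesses at the scheme, not code from the article): consumers only read rows older than `now - delay`, so late or out-of-order events still have a chance to land before the read passes them.

```python
import time

def poll_ready_events(events, last_watermark, delay, now=None):
    """Return events safely past the grace window, plus the new watermark.

    `events` is an iterable of (insert_time, payload) pairs; `delay` is the
    wait that gives late/out-of-order events time to arrive before we read
    past them. Hypothetical sketch, not the article's implementation.
    """
    if now is None:
        now = time.time()
    cutoff = now - delay                     # read only rows older than this
    ready = sorted(
        (e for e in events if last_watermark < e[0] <= cutoff),
        key=lambda e: e[0],                  # sorting fixes out-of-order arrival
    )
    new_watermark = ready[-1][0] if ready else last_watermark
    return ready, new_watermark
```

Each poll advances the watermark only over rows it has actually read; an event that shows up with an `insert_time` already below the watermark is the "safe to discard" case, and the delay is sized to make that rare.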


Can you walk me through how waiting some amount of time actually improves their guarantee? It's not like they're using Google's TrueTime api that provides bounds on time. How is "unnecessary lag" avoided between Producers & Consumers?


> Clock skew across different sensors: Sensors might be located across different datacenters, computers, and networks, so their clocks might not be synchronized to the millisecond.

We have to assume that although the sensors' clocks are not synchronized to the millisecond, they are still accurate to a certain degree (say, ~1 second). Then, factoring in the time for data to travel from the sensors to the collector, if you wait for 5 seconds, most events will have arrived and been stored in the database. Events that arrive after that watermark are probably safe to discard, or have become irrelevant. Waiting also solves the out-of-order problem, since you can sort the events again.

> Implementation4 cons: Producers and consumers must have synchronized clocks (up to a certain resolution)

So let's say the consumer's clock is slower than the producer's by 5 seconds, i.e. when the producer thinks the time is 10, the consumer thinks the time is only 5. In this case, even if the producer has already written `insert_time = 10`, the consumer will only read up to `insert_time = 5`, and there's an extra 5 seconds' worth of lag on top of the wait introduced by the algorithm.
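To make that arithmetic concrete, here's a tiny hypothetical helper (mine, not the article's) computing the latest `insert_time` a consumer will read when its clock runs behind the producer's:

```python
def visible_cutoff(producer_now, consumer_clock_lag, wait):
    """Latest insert_time the consumer reads, expressed in the producer's clock.

    `consumer_clock_lag` is how many seconds the consumer's clock runs
    behind the producer's; `wait` is the grace period from the algorithm.
    Toy model of the example above, not code from the article.
    """
    consumer_now = producer_now - consumer_clock_lag   # consumer's view of "now"
    return consumer_now - wait                         # it reads insert_time <= this
```

With `producer_now = 10`, a 5-second clock lag, and zero wait, the consumer reads only up to `insert_time = 5`: the skew alone adds 5 seconds of end-to-end lag, which is presumably why implementation 4 wants clocks synchronized up to a certain resolution.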


Does any one on here have some real world experience with scylla?

We currently make heavy use of dynamo and are interested in something cheaper/faster. The marketing material is pretty compelling but I'm unsure of how hard scylla is to operate at scale.


Used it at my past adtech startup, handling billions of HTTP requests per day (with the backend making several times more calls).

Solid product; it has since reached parity with Cassandra and added even more features. Great team, very helpful. It's far easier to run than Cassandra because of its performance: you need fewer nodes overall (node count being the single biggest issue in C* ops). The enterprise edition also has better compaction strategies and an automated system to schedule it all.

What scale are you looking at?


> far easier to run than Cassandra

I know precisely nothing about Scylla, but somehow I still agree with this. Cassandra is far and away the most horrendous software I've ever had to work with (Kafka coming in a close second). This was true even when we hired a major core contributor to help manage it; even he wasn't 100% comfortable with it. The word 'anticompaction' is still enough to give me nightmares.

I very much welcome – not that it's Cassandra's only problem – this trend of rewriting '00s/early-'10s Java software in simpler languages. (And, languages aside, I think Cassandra - certainly inter alia - simply sits at a level of complexity which is beyond the comprehension of any one human being. And software which is beyond the comprehension of any one human being inevitably begets bugs, even as an operator rather than a developer.)


I'm intrigued: The Apache Cassandra developer community is quite small and well known to each other, and the only major contributor I know to have left the community to run a cluster did so almost a decade ago?

Cassandra had a lot of rough edges until very recently, and was only really suitable for sophisticated users, as most contributors would have attested. The project has matured a lot over the past couple of years in particular, as the broader community stepped up in the face of DataStax's (temporary) withdrawal. The investment by some large scale users has transformed it, and the next couple of years will do so even more.

None of the issues the project faced were really related to the language chosen, in my opinion; they had more to do with the maturity of the project and how it grew at the time.


His initials were LT, if that helps (if not, I can clarify his whole name over email or however people privately communicate on HN?). I may well be exaggerating his 'major'ness - I'm not that familiar with the Cassandra contributorship, but that was my understanding from my team!

As for Cassandra itself, we were certainly quite sophisticated users. There's a decent chance you'd know the company in question. It's a fair point about the language: I'm not a fan of Java and it probably colours my opinion a little; I was speaking more from a philosophical standpoint about reducing the theoretical complexity of software to make it more deterministic & understandable (out of the tar pit and all that), more than to any specific deficiencies in Java that caused any actual issues for us, of which there were no direct examples I can recall. (Pathological GC did cause some occasional degradation, I suppose, though at best that's semi-specific to Java.)

I think most of the actual issues, from my on-call years, were as a result of stuff like: (as I mentioned) anticompaction and suchlike causing pathological performance; our own misconfiguration of things like asynchronous replication / bootstrapping (which once caused a very severe incident, as in 'endemic data corruption and loss' severity); application-layer issues from product engineers misconfiguring consistency, choosing poor keys for partitioning, constructing poor data models that require table scans, all the ordinary stuff for which Cassandra is at most very obliquely to blame.

Also, I agree it was stupid of us to use Cassandra in (what was) a very serious environment, in probably one of the most safety-critical sectors outside of medicine and rockets. We knew that. We did the same with several technologies. Literally, we had a diktat saying no engineers could mention it in blog posts. On reflection it's quite unfair to blame Cassandra for our decision - to a large extent, yeah, we were holding it very wrong. I would not have made that choice myself, at all.


I probably know who you are referring to, but I don't feel it would be appropriate to discuss a specific identifiable individual in a public forum, no matter how innocuous that discussion might be! My fault for bringing it up.

Cassandra was super easy to shoot yourself in the foot with, and it remains quite easy. You mention a few foot guns that have gotten better, a few that remain but that will get better, and a few that are sort of inherent to distributed databases.

Anti-compaction for instance should now not be a huge issue if you're running regular incremental repairs, and I hope even the few remaining caveats will be alleviated soon. Bootstrap is something you mention that is also going to get much easier for users this coming year, so that unsophisticated users can manage their cluster membership safely.

Application-misconfigured consistency levels is a really obvious one that isn't strictly Cassandra's fault but for which much better help could be given to the user, and I expect some major improvements here in the next year or two, so that users can configure tables with consistency properties that the database guarantees (to some extent, the user will always be able to screw it up by providing the wrong consistency identifier, but at least the scope is reduced to accidental misconfiguration rather than misunderstanding). This is something we're considering as part of the introduction of general purpose transactions later this year.

Poor data models and partition keys are things the database can offer less help with, though I anticipate much better support for ordered partitioning in future, that would help poorly-selected partition keys, as clustering keys can be used for partitioning there too.

Regarding the choice of language, Java has upsides and downsides. GC spirals are something we have control over at the end of the day, and we continue to do better at (as does the JVM), but guaranteeing no segfaults (and not worrying about the ABA problem) is a big benefit we get in return. I wish we had more control over things like memory placement and execution, but these things may be coming to the JVM to some extent (Loom I expect to benefit Cassandra hugely, and value types later), but equally distributed systems problems often give you enough things to worry about.

The visibility you have into a Java process is fantastic, however, and we are starting to make use of the ease with which you can modify the code Java runs for system validation, using byte weaving to permit us to simulate clusters as they are run, with adversarial event orderings, to ensure those notoriously hard distributed systems problems are correctly solved.

If you do want to speak privately, about anything Cassandra related, the lowercase part of my username (i.e. without the _ prefix) at apache.org reaches me.


Yeah, I 100% agree that many of those problems are inherent to distributed databases. There are an interesting few which kinda straddle the line in that respect – stuff like counters and the aforementioned individually tunable consistency, where it makes it too easy for (in practice) individual engineers to trigger classic dist-sys failure modes – but largely its problems are the problems of distributed systems. Lots of its other problems are the problems of using an over-complex eventually-consistent write-optimal (etc) distributed system for a problem that doesn't require it (where e.g. Redis Cluster would be far better). I'd submit to throw maybe a few on top: it feels like a general theme of many incidents we encountered was around "we did something complex/accidentally-pathological and Cassandra froze up entirely due to [consistency / compaction / repair / GC] stuff". It did feel from many of those issues like it was a victim of its own complexity, more than anything else. (That theme also applies to lots of the 'operator errors'.)

Also, sorry, I think I was a bit unfair to Java. I'm not an anti-GC militant. I'd be the first person to point out the haziness of the distinction between tracing and (say) reference counting in the first place, or indeed with the indexing/defragmentation/etc space + work required of a malloc implementation. I'd consider a GCed language like Go - though I personally hate it and feel it utterly joyless - to be an improvement. It's more about the inherent complexity of adding a p-code machine like the JVM on top of the already-colossal complexity of a modern database. For what it's worth, for clarity, I've barely written any Java and I'm intimately unfamiliar with Java development, and despite giving my opinion I'm well aware it's not a very informed one. I do agree with your point about its isolating the 'unit' of your software from the particularities of any given hardware and making it more easily jepsenable - I hadn't considered that. And some of the stuff happening in the Java space, like Graal and (as you say) Loom, is very impressive.

I'll amend my original comment to make it a bit clearer that most of this is not really Cassandra's fault, and its faults aren't really more numerous and more severe than those of any other database. It's evidently a huge success and its value to people is undeniable - I don't mean to depreciate your work. I forget that there's a non-negligible chance of relevant people reading my comments on here (like, less congenially, the time I accidentally summoned Br*ndan E*ch: https://news.ycombinator.com/item?id=28792436). I don't work with Cassandra any more, so I'm probably unlikely to have many practical questions, but thanks for the offer and I'll certainly reach out if I find myself in that space again! Really appreciate your being so magnanimous about my not-very-magnanimous (pusillanimous?) comment.


Please, no need to apologise! Your criticisms were all entirely well founded and the pain points you mentioned very real. I thought you expressed them considerately (I have seen plenty of vents that did not). I may reflexively defend Cassandra, but healthy and honest discussion around these things is great IMO.

> It did feel from many of those issues like it was a victim of its own complexity

I think there’s some truth in this, but I think the bigger problem was failing to give this complexity its proper respect (which would have been very costly and slowed feature development - perhaps consigning Cassandra to an also-ran position like others in the space, who knows?)


> I think Cassandra - certainly inter alia - simply sits at a level of complexity which is beyond the comprehension of any one human being

I think there are a handful of humans that actually get it. But it is a true handful, not more than 5.


The underlying Dynamo architecture is great, and the original Facebook and community releases did deliver on the overall promises. Unfortunately the implementations suffered, partly because of baggage from the Java language and its execution model.

Scylla using C++ helped in bringing the low-level planning and development experience that is common in that area (and harder to do in Java), but also the ability to use all the low-level libraries like DPDK and advanced threading/reactor systems (like the underlying Seastar framework that Scylla made).

The opensource Cassandra versions are now implementing the same model, although still using Java.


NOTE: I can't now edit this comment, but I wanted to clarify, per below, that many of these issues are more about my experience of Cassandra in the specific context in which I experienced it, and not the database itself. It's a complex tool, but, for the constituency of users for whom it's appropriate, it's as competently architected as just about any other database. My criticisms mostly centre on the culture of engineering teams which choose to use a very complicated tool that requires significant skill, in contexts where it's simply not needed.


It sounds like you might be a future version of me!

We are an adtech startup in the hundreds of millions per day right now, expecting billions per day by the end of the year. Management would be happy if I said I have a plan for tens of billions.

The main difference is that we would probably lean towards the cloud version instead of self hosted (do we still need to think about clusters then??)


If you use the cloud version then there are no ops to worry about.


Not sure why you would use it over Cassandra. Last I checked, ScyllaDB was a C++ rewrite that had impressive initial feature releases but has since lagged badly in feature support.

I would migrate to cassandra first, and then prototype your workloads on scylladb. You'll want loadtest execution ability regardless so you can tune for your workloads.

I'd guess ScyllaDB and Cassandra are comparable for headaches. Our clusters never go down, but there's always SOMETHING to look at. But cassandra has more community and commercial support.


1. Performance https://www.scylladb.com/2021/08/24/apache-cassandra-4-0-vs-...

2. Features https://www.scylladb.com/2021/11/17/cassandra-and-scylladb-s...

ScyllaDB used to chase Cassandra's feature set. Now, in many ways [MV, compaction strategies, workload prioritization, LWT, CDC] we can either do things Cassandra cannot do, or we're a better implementation of Cassandra's same or similar functionality.

Cassandra c. 2018-2020 was a pretty moribund project, not shipping a major new release until last year [2021]. I am pretty excited to see the new work done on 4.0 and now 4.1. During the same interval ScyllaDB went from 2.0 to now 5.0 which is coming out Very Soon Now.

The CEP process Cassandra maintains shows some very interesting directions they are heading in. And we showed off some of our roadmap at Scylla Summit 2022 describing where we're taking our product next.

Whichever path users decide, the good news is that year-over-year users have increasingly better databases to choose from.

[Disclosure: I work at ScyllaDB.]


Re: your comparison, LWTs in Cassandra in 4.1 (about to release) offer global 1RT reads, and 2RT for writes. Your comparison suggests you take 3RTs? I suspect you may be moving to offering 1RT for those in the local region, and 2RT for all operations in another region?

You make a point of mentioning your MVs (and presumably your global secondary indexes) may get out of sync with the base table, which is the reason the Cassandra community decided to mark them experimental. The Cassandra community has become very conservative since it began being driven by users of the software. I should not draw any conclusions around the relative merits of features from this conservatism.

The Cassandra community paused feature development for several years, due to this very conservatism, in order to focus on delivering safety and reliability at scale. Now that's done, you can expect a lot more visible activity.


There will definitely be a lot of jostling as Cassandra gets ahead here, and ScyllaDB there. Yes, we've talked about our plans to get to 1RT LWTs.

And you are totally correct to point out that there is a need for conservatism among people running at scale. Hence we recently made our Enterprise offering available as either a short-term feature release or a long-term support release.

The MV issues have been hardened in ScyllaDB, so the likelihood of them getting out of sync is minimized. And we're implementing Raft into the architecture to make them even more consistent with the base table. [Not available yet, but part of our roadmap.]


I believe that when you get to 1RT, your LWTs will still effectively be 2RT if you include the coordinator's latency to the leader (except in the leader's region).


They have some high-availability functionality in the open source version but not the commercial one. If the experts won't run it in that mode, why do you think you'll do a better job?


What are you talking about?

High-availability is a core part of the Cassandra data model and node architecture. Scylla is designed to be highly available, and the enterprise version only adds features to make it easier to operate.


Just to note that ScyllaDB Open Source is sometimes months to a year ahead of the Enterprise release in terms of features. Though I am not sure what specific "high availability" feature the poster above is referring to.

Otherwise, as you note, ScyllaDB Enterprise will be a superset of our Open Source offering.

Also, to minimize the lag between Open Source and Enterprise feature availability, we just announced new feature releases for Enterprise, which will be coming later this year and into the future.

https://www.scylladb.com/2022/06/08/new-scylladb-enterprise-...


If a general system becomes good enough, you see it displace specialized systems. In this case the Kafka paradigm can be replaced because there is such a performant NoSql DB.

It's kind of like how standalone cameras became less and less desirable as phone cameras got better. Standalone cameras could deliver better quality, but this matters less once both options are really good. There is some 'good enough' point where you hit vastly diminishing returns and simplifying to just phones becomes worthwhile.

Databases (certainly Scylla) may be hitting a point where specializing, actively optimizing, etc. are less desirable than just reusing one good system.


I'm not seeing the "stream processing" piece here.

Looks like they went from polling an RDBMS to some triggered querying of Scylla, and then on to polling Scylla.

i.e. they went from polling an RDBMS to polling Scylla. They didn't replace Kafka with anything, so now their implementation isn't reactive.

This is effectively no different than implementing a message queue in a database, with all the negatives that brings.

They are sharding per consumer to prevent duplicate consumption, given the lack of locks. What if a consumer goes down? How does it manage its own state? These are all things managed by Kafka (or pretty much any MQ) out of the box, and now they have to implement ALL of that themselves - none of which is mentioned in the article.
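To illustrate what "implement all of that themselves" entails, here's a minimal sketch of the DB-as-queue pattern under discussion: one shard per consumer, with the consumer persisting its own watermark. Everything here (class and method names, the at-least-once behavior) is a hypothetical illustration, not the article's code.

```python
class ShardConsumer:
    """One consumer owns one shard; no locks needed, but the consumer must
    persist its own read position - bookkeeping Kafka does for free."""

    def __init__(self, shard_id, state_store):
        self.shard_id = shard_id
        self.state = state_store  # must survive restarts (e.g. a DB row)
        self.watermark = state_store.get(shard_id, 0.0)

    def poll(self, fetch_batch):
        """fetch_batch(shard_id, after) -> [(insert_time, payload), ...]"""
        batch = fetch_batch(self.shard_id, self.watermark)
        for insert_time, payload in batch:
            self.process(payload)
            # A crash between process() and the store write below means
            # this payload is re-processed on restart: at-least-once only.
            self.watermark = insert_time
            self.state[self.shard_id] = insert_time
        return len(batch)

    def process(self, payload):
        pass  # application logic goes here
```

A replacement consumer constructed with the same `state_store` resumes from the persisted watermark, but deciding that the old consumer is actually dead (and not just slow) is exactly the group-membership problem an MQ normally handles for you.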


This reads like sales copy. It's freemium FOSS-washed crippleware with a radioactive license (AGPL). Hard pass.

I'll stick to FOSS solutions that don't require licenses to unlock closed-source components and can be patched by a community and/or yourself.

Edit: There are other commercial-ish OSS NoSQL solutions for large-scale apps that are less proprietary, with better licenses, like Couchbase (not Cassandra CQL).


ScyllaDB Open Source is a fully functional NoSQL database. It is far from "crippleware" as many of our FOSS users can and have publicly attested.

Contributors can submit their own patches to ScyllaDB.

Contributing Guidelines: https://github.com/scylladb/scylla/blob/4bd4aa2e887278c4912f...


You forgot your

[Disclosure: I work at ScyllaDB.]


Well, it's in their bio (but I agree that it would have been nice). Full disclosure: 3rd time I hear about ScyllaDB.


Ah! Had that on a prior post. But yes. It me.


Why are you so hostile to AGPL? If you are a product user with enough skills internally to modify the software, what's the issue with contributing it back? Also, in the end they are basically trying to force AWS etc. to give back to the community, even if it's just internally used. I don't think it's that bad.


Then you don't understand how software licenses work in the real world. AGPL means it's radioactive and, furthermore, it can't be made into a cloud service.

BSD-2 clause is a far more permissive license.

Please come back and speak when you understand how these things work.


I perfectly understand how those things work, and I also dare to guess that being radioactive for cloud services was PRECISELY the motivation for ScyllaDB's developers to choose the AGPL. Redis, Elasticsearch, Cortex, etc. taught that lesson already.


why the hostility? they're trying to build a sustainable business so their product can have a future, which in turn benefits those who can avail themselves of their open source

is paying for support really such a raw deal if you are using their product to generate your own revenue?


You don't understand AGPL and have a consumer, pay-to-play mindset... sorry, but no sale.

Come back when all of the features are available to everyone and it's a BSD-2 license with optional support.


ScyllaDB and its related parts like Seastar always struck me as real performance-oriented programming, though it was based on leveraging language tech (C++14 early on) that was painful. I wonder if a nicer approach is possible nowadays.


Is there any more recent technical review of ScyllaDB than this?

https://jepsen.io/analyses/scylla-4.2-rc3


So they're using Scylla to manually do what Kafka does: basically, processors polling for new records in a shard and updating their watermark once they're done processing. I'm surprised that this is faster than just using Kafka alone, though one of the reasons they wanted to avoid Kafka was dealing with the deployment complexity and memory usage of Kafka clusters.


>Like the first solution, normalized data is stored in a database – but in this implementation, it’s a NoSQL database instead of a relational database.

What does data normalization mean in a NoSQL context? I think most normal forms only make sense in a context where we have tables, rows, and relational algebra.


Interesting to see space for multiple commercial backers of Cassandra

Anyone seeing Cassandra adoption for new use cases in the public cloud?


[flagged]


It's a 27 min video from an actual customer doing real things with this product. The product in particular has a reputation for being obsessive about performance.

Personally, I think this could be interesting and useful for those with similar needs.


And it was a conference talk. If the company didn't actually believe in the product, they wouldn't have sent someone to do the talk.


You have a fairly naive view of "conference" talks. These are not conferences in the sense most understand them.


This is clearly just an advertisement, and not even an informative one!



