At reddit it's much easier for us to stand up a new Cassandra column family than...

fidget · on April 13, 2017

Are you suggesting that you had a concurrency bug that was solvable without changing your entire storage layer? Heresy..

developer2 · on April 14, 2017

Your parent commenter seems to have no idea as to the true scale you planned for. Most of the criticism I've read here on HN and on Reddit threads regarding your implementation seems to have come from people who have never had to code something that has real-world scaling requirements. This wasn't some pet project initially launched to 100 concurrent users, with the ability to slowly and incrementally scale to millions of users over a period of weeks or months. You had one shot to get it right. A majority of those criticizing would have crashed their entire production stack upon deploying. Hundreds, possibly thousands, of queries per second returning one million rows each? Not going to happen, no matter which database backend you choose. The foresight you had to get it right the first time was well played on your part.

Ideally, you would have also used redis to limit the per-user activity without having to hit Cassandra. Also not sure why you hit Cassandra instead of redis for the single-pixel fetch endpoint (redis GETBIT operation rather than a database hit); if you already conceded to not-quite-atomic operations across the entire map, a GETBIT would have rarely returned a stale data point. But these are minor nice-to-have criticisms that would have pushed the scaling capabilities even further beyond your expected requirements. All in all, again I highly commend your results. You had one minor snafu, and managed to overcome it. Well done!

Aside: my brain is spinning as to how I would provide a 100% guaranteed atomic version of /r/place - without any point of failure such as a redis server not failing/restarting, or a single-server in-memory nodejs data structure. Really tough to do so without any point of failure or concession to atomicity. :)

Second aside: more than anything, I am surprised you have a CDN that allows 1-second expiries. While perfect for this kind of project, too many CDNs find a 1-second expiry as a risk to permit, as they tend to expect too much abuse/churn. ie: How is a CDN supposed to trust you enough to use a 1-second expiry for reasonably high traffic, rather than cycling so much caching effort for something that could have used a 5 minute expiration? I can't imagine being the developer of a CDN that trusts its users to use a 1-second expiry that wastes an insane number of CPU cycles for an origin that is not legitimately sustainable.

tldr (still long, but on point): You guys did an amazing job for something that lasted, what was it, 3 days? Great job! Many of your critical audience members would not have managed any better, let alone being viable and functional. I would submit my résumé to work for you, but I fear my personality is far too... um... abrasive... to get along with the organisation as a whole. In any case, your team as a cohesive unit - design, backend, and frontend (especially the mobile support) - did an incredible job. +1 to the Reddit team here, you should be immensely proud of yourselves for pulling this off.

gooeyblob · on April 18, 2017

Thanks for the support!

We don't have a normal CDN, we have Fastly. They are really incredible at what they do, and this would not have been possible with our previous CDN partners.