Relevant amusing bit from the Amazon FAQ: "S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years."
I think my favorite part of that is "on average", as if you will be making repeated ten-million-year trials of this effectively brand new technology.
The point is that once you get into several nines of reliability, really rare events that are impossible to model start to dominate your risk budget.
Many people have far more than 10k objects though. Based on that math, if you stored 10 billion objects in S3 (certainly well within the realm of possibility), you'd lose an object on average every 10 years, which is a length of time one can think about.
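Rough arithmetic, if anyone wants to sanity-check that (a quick sketch; the only input is the 11-nines figure from the FAQ quote above):

```go
package main

import "fmt"

func main() {
	// 11 nines of durability per object per year, i.e. the FAQ figure above.
	const annualLossRate = 1e-11

	for _, objects := range []float64{1e4, 1e10} {
		perYear := objects * annualLossRate // expected objects lost per year
		fmt.Printf("%.0e objects: expect to lose one roughly every %.0f years\n",
			objects, 1/perYear)
	}
}
```

10,000 objects works out to one loss per ten million years, 10 billion objects to one loss per decade.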
Your main point (that making high reliability systems more conventionally reliable is building your high wall yet higher still) is definitely valid. But the lossage rate is actually a meaningful number given the extremely large number of objects stored in S3.
I completely agree with you. Where they are intentionally misleading is in pretending that keeping one object for a zillion years is the same as keeping a zillion objects for one year.
The Long Now foundation (http://longnow.org) has some other interesting projects besides the clock.
If you like this kind of thing, I'd suggest reading Neal Stephenson's Anathem. It might be hard to get into, and it may be his most impenetrable book yet, but it really pays off later.
Several of Neal Stephenson's books are like this ... I finished Cryptonomicon earlier this year and have just started REAMDE. I often find myself reading the same sentence several times because they are so convoluted ...
This is S3 which isn't comparable to Google's persistent disks. S3 is equivalent to Google Cloud Storage which has "99.999999999%" durability as per https://cloud.google.com/storage/
To accurately compare them you'd need to look at AWS EBS: "Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% - 0.2%, where failure refers to a complete or partial loss of the volume, depending on the size and performance of the volume."
I don't think so; PD and GCS are fundamentally different products. PD is like a local disk, and you cannot assume it lasts forever, while GCS is durable storage and is safer.
"
This outage is wholly Google's responsibility. However, we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data.
"
Even in a cloud-based infrastructure, one still needs to know the difference. And the ultra-low numbers come from the math, I think (number of replicas, possibly spread across DCs/geographies).
My recollection: around 20GB... however, they clearly were willing to try to get some of it back if that was helpful. It was just a redundant offsite backup for me, so there was no need.
Or the bigger problem: assuming your future reliability looks just like your past. For example, say they have a problem with year 2038. Everything works great until everything doesn't work at all.
Indeed "on average" covers the entire period and from initial local-node intake of the data it has to then be propergated to another site to increase integrity probability. So if it takes even 5 seconds to propergate to several sites at different locations then the durability scales from when the clock starts.
Heck, once in every 10,000,000 years: the odds of winning the US lottery are around one in 175,000,000, and yet people play and somebody wins. So it is always worth looking at odds from another perspective and accepting that an event is impossible, probable, or merely possible. And as we know, the return on investment in durability is logarithmic: the cost keeps going up while the return gets smaller.
Still, one of the biggest issues in any datacentre is not just the power but the quality of the power; even the slightest noise can and does increase the odds of some electronics going wrong. UPSs (though it's been many years since I checked) will put out a modulated square wave, while the ideal is a perfect sine wave. It is truly educational to see the quality of many power sources on an oscilloscope. Equipment on the same circuit can also induce noise, so it can vary from rack to rack.
Do they spell out what assumptions they're working under? Because, for example, I'm pretty sure that the odds of a civilization-ending asteroid or comet strike in the next year, while quite low, are higher than what's implied by 99.999999999% durability on trillions of objects.
Well, the guarantee needs to be reasonable only as far as you and the courts are alive to sue them. (Corollary: you only need to sustain DDOS attacks when the rest of the internet infrastructure can sustain the traffic).
Why do you specify trillions of objects? They practically promise data loss at that scale.
Also, the typical civilization-ending comet or asteroid is going to destroy 0 or 1 geographically-diverse data centers. Remember that this is a retention promise, not an uptime promise.
Just because it's how many objects they store. But now that you mention it, it has no relevance to the probability of data loss due to global catastrophe, since it cancels out.
If your proposed reliability is a data loss rate of one in 100 billion per year, that assumes the risk of e.g. global thermonuclear war is no more than that amount, or you're ignoring it for the purposes of the calculation. Which is why I wonder about their assumptions.
As for a comet destroying at most one data center, if it really ends civilization then the data will probably be lost before too much longer even if the facility physically survives.
That's one of the reasons that I'm excited about decentralized storage. Data stored on dozens or hundreds of nodes spread across multiple countries and jurisdictions is much more robust to things like earthquakes, storms, government intervention, and companies going out of business or changing their profitability model. It has a much stronger defence against black swan events.
When you are storing many billions of files consuming many millions of disks, you become vulnerable to black swan events, simply because you are rolling the dice so many times.
The irony is that, by that time, Amazon will almost certainly not be around. So doing it in terms of time is actually a different calculation, and this is based on wrong assumptions :)
Well, the expectation is different for different players. If Facebook loses 1 photo for 1 customer out of 10M customers, FB wouldn't care; the user may just assume it was a glitch. For a small business, losing 1 photo for 1 customer out of 10,000 customers can be a big deal, but nonetheless not the worst case: offer a sincere apology, do extra backup/replication if that data is extremely valuable to the customer, and roll out a new service.
Another problem is that any errors in assumptions or omissions made when calculating the odds will be enormously magnified.
I wouldn't trust any of these figures unless they have ongoing efforts to test them empirically. E.g. create distributed databases of 100 trillion objects, mess with them in various ways, and perform correctness checks on them.
Which would be embarrassing if this had happened to a product offering such an SLA. Instead, this is just zone-local HDD storage. It will only affect customers that decided to run a system in a single site, making no snapshots or using cross-region replication, etc. Which may be quite a few seeing as there seems to be this misunderstanding that VMs "in the cloud" have physics-defying redundancy attributes. (Not saying you have this idea, just that multiple clients have expressed surprise when their single-VM setup goes down because "isn't this what the cloud is for?".)
Well, a single VM is not "the cloud". It's just a VM on some dude's computer. Your clients seem to have been misled by the incredibly content-free marketing around "the cloud".
Sounds like this wasn't caused by power surges reaching the equipment but rather an effect of repeated power loss to drive arrays not fully designed to handle it. The article is pretty unclear. Still sounds like an infrastructure problem though.
AFAIK google doesn't use hardware-based arrays in their servers.
I think the reference to batteries would more likely be in reference to a DRUPS (Diesel Rotary UPS) which most datacenters run, sometimes with some form of battery in combination with the flywheel. Typically the combination of kinetic energy in the flywheel (and potentially batteries) only hold enough power to last for 30 seconds or so (often as low as 10 seconds), which gives the diesel power generators enough time to come online and take over from there.
My guess at what might have happened: grid power was lost, they switched over to UPS fine the first time, grid power came back so they swapped back; repeat a few times, and on one of those cycles the batteries didn't have enough charge left to keep things going through the generator swap-over.
Ah, so UPS on the cheap. A real UPS is always running off the battery, telco style, and you test your gensets at least weekly and check the tanks every month or so.
At BT they were tested every day for the really sensitive systems.
A proper storage array refuses to boot if the batteries are not sufficient to de-stage cache in a timely fashion, or at the very least keep cache available for the advertised (generally 72 hours) time window if it doesn't have the ability to de-stage the cache to a more permanent medium.
I work on cell sites, grounding system design and repair is a primary design element, even then, the presumption in the industry is that if a site takes a direct hit - or for that matter a nearby strike - the equipment is a total loss.
The surge suppression gear we put in (lead-ins at power feeds, RF feeds, etc.) is mostly to prevent a fire and to ensure the extra energy goes largely to ground, but it won't prevent dead gear.
Are you saying essentially that "There is no such thing as a surge protector, they don't physically exist. Only surge reducers exist." Because that's what it sounds like to me.
EDIT:
All right, I'll rephrase. According to Google's infobox from nat'l geographic, lightning generates up to 1 billion volts.
-> Are surge protectors at even the highest-end data centers simply not rated to a billion volts of surge protection?
There's a "protector", but the protection isn't guaranteed to be 100% effective.
That's an OK use of the language in this context and has many parallels in other fields. E.g. using a condom as protection against pregnancy and disease. Many people have learned the hard way that it's not 100%.
This raises a relevant concern that's been on my mind: what's the best way to back up cloud services? Given that services like S3 and Google Drive have many more nines of durability than any local storage system I could devise, are backups even worth the trouble?
There are a lot of cloud-to-cloud backup services out there, but to me that seems like the blind leading the blind, especially with regards to malicious data destruction. For instance, I've recently been experimenting with Cloudally to automatically back up Google Drive, which seems like a good solution at first -- until you think about the fact that Cloudally uses Google accounts for authentication (and doesn't use 2FA for native authentication). In other words, an attacker with access to my primary data (Google Drive) would also have access to my backups. Worse than that, Cloudally actually increases the attack surface, since its lack of 2FA presumably makes it easier to crack than my Google account.
Similarly, I'm guessing a lot of cloud backup services share data centers with the services they are backing up.
If you really care about durability, your best bet is erasure-coding + a wide geographic distribution of shards. For example, you could encode 1 TB of data into four shards, each shard containing 500 GB. You distribute these to servers in SF, NYC, Berlin, and Sydney. The key here is that you only need two shards to recover your 1 TB of data, and they can be any two shards. So if lightning strikes Berlin, and the Big One hits SF, your data is still safe. And thanks to erasure-coding, you can achieve this with only 2x redundancy (instead of 4x).
Well, I have to shamelessly plug my startup, www.siacoin.com, which implements the scheme I described above using a peer-to-peer network, with payments made on a blockchain. We are using Klaus Post's excellent pure-Go implementation (https://github.com/klauspost/reedsolomon) which exceeds 1GB/s throughput.
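For the curious, here's roughly what the 2-of-4 scheme above looks like with that library (a minimal sketch on a small in-memory byte slice rather than 500 GB shards, and not how we actually wire it up in production):

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 2 data shards + 2 parity shards: any 2 of the 4 can rebuild the whole thing.
	enc, err := reedsolomon.New(2, 2)
	if err != nil {
		log.Fatal(err)
	}

	data := []byte("pretend this is your 1 TB of data")

	// Split the input into 2 data shards (plus room for parity), then compute parity.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two of the four sites: drop one data shard and one parity shard.
	shards[1], shards[2] = nil, nil

	// The two surviving shards are enough to reconstruct the rest.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("all shards consistent after rebuild:", ok, err)
}
```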
Durability is not the same as "protection"; it doesn't cover things like government seizure or hacking. Your own drives, in your physical possession, fill that gap (somewhat).
This is how, for example, you can know whether a wildfire was started by lightning - once a point of origin is determined, simply check the data for strikes.
Assuming the wild 1 PB guess above... that could be anything from 100 MB of a single volume lost, all the way up to 100 million volumes each losing one byte of data. There isn't enough information to determine the number of volumes affected. 0.000001% of disk space isn't very informative. Publishing the percentage and/or number of persistent volumes affected would give us an idea of the scale of the problem.
So Google said: "...although... the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk."
I thought the battery is supposed to cover writing the entire write-buffer cache to disk in case of power loss. Sounds like they had some badly designed gear which didn't account for partial battery charge; the cache should be downsized to match the battery's remaining capacity.
Quoting the data lost as a percentage of disk space is both accurate and misleading. It makes the impact sound tiny because only recent writes were affected. Obviously writes that were in flight at the time of the incident are going to be a tiny percentage of overall storage. What they don't tell us is what percentage of persistent disks which were in use at the time were affected. That percentage is likely far higher. If only 0.000001% of volumes in use were affected it would never have made the news.
I think it only impacted people who were using Compute Engine and storing data on a hard drive instead of using Cloud Storage. I suspect that Drive data has redundant copies stored in multiple data centers in addition to frequent backups.
Not for block storage (aka persistent disks): "we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data."
Achieving RPO=0 generally requires synchronous replication to a different datacenter which adds significant latency.
Amazon builds datacenter redundancy in the same geographic locale. You can then set up synchronous replication between datacenters without atrocious latency, as all the disks are fairly close by, albeit powered [in an emergency] by independent generators.
OTOH, Google does datacenter redundancy across different locales, which make synchronous replication perform much worse, like you noted.
I stand corrected. Looking at https://cloud.google.com/compute/docs/zones?hl=en, it seems that GCE has multiple zones in the same locale (region). I might have confused with Google's internal setup, which during my tenure there, was heavy on using replication across locales, up to systems like Megastore and the 200ms synchronously replicated commit.
This could potentially explain a lingering error on my Google Drive. Might there be movement within Google to contact the owners of data that was lost?
The article mentions Compute Engine, which is an external offering that isn't really used internally. It's hard to say what else might have been affected, if anything, but going by the article this shouldn't be causing your problem.
Well, imagine if file systems, file formats, and version control systems were built from the ground up together. Currently file syncing is harder because we expect our file server to handle an infinite variety of legacy file formats, and the file server cannot try to know anything about the document structure (whereas a version control system can examine the contents of plain text files).
Surely this is why Google Docs can manage collaborative working more easily -- because they control the whole stack.
A better way to put it than the OP: syncing arbitrary files between arbitrary systems is hard.
What if I edit the same pixel, in the same image, in the same nanosecond on two different devices, and then sync?
There are situations where conflicts are necessary. The focus should be on making conflict resolution easily accessible, not on trying to be smart and overwriting files at random.
Actually, quite a lot of serious but non-power users do want this. You only have to look at a random person's laptop and see the file names "final version.docx", "final version 2.docx", "final final version.docx", etc. to see this. They certainly don't want a console-based solution, but a cross between git and Word-style track changes would make lots of people pretty happy.
I've lost attachments to old gmail messages before, I never thought it was impossible or unlikely for Google to lose data.
I'm sure the data wasn't truly lost. If I'd called them up and they'd made it their priority to find my old files, they could have done so, having so many redundant backups. But of course no one at Google is taking calls like that or acting on individual requests. The data was effectively lost, not technically lost. But I'm sure it's uncommon.
As mentioned in the article, it only affected recently written data of Google Compute Engine services. GCE allows users to launch VMs and generate arbitrary data on the server.
Normally, Google redundantly distributes data to at least 3 geographically distinct locations. Check out the 'BigTable' white paper [0] for more info.
For 99% of cases (and pretty well all user cases), this would not cause data loss. The key here is that the data was generated on the servers and did not have a chance to be replicated before the event.
I can't find the presentation anywhere, but at HBaseCon 2014 one of the lead developers of Bigtable stated that they went to Reed-Solomon encoding, even for their older databases.
> Beyond security, this highlights one of the main issues with the cloud. Was there no backup?
Unless I am specifically paying for backup service, I wouldn't expect/want them to do that for me. And even if I was, I'd still have offsite backups if the system was important.
If you aren't being responsible with your backups in a noisy environment like Google Cloud/AWS, understand that you are vulnerable to freak accidents like this. Google/AWS's job in all of this is to try to reduce the frequency of issues and to minimize the impact.
If you were running the data center instead of Google, it'd suddenly stop being cloud. How would that change lightning and power loss?
Google Compute Engine offers customers the option to make snapshots for backup, or use a true "cloud" storage engine. If anyone lost data here it was customers explicitly not doing backups and only using a single zone. I don't know why anyone would expect different. GCE easily allows you to network machines in multiple data centers, but close geographically. So you'd only need to handle region-wide disasters.
The article states that it was "some recently written data [...] located on storage systems which were more susceptible to power failure". Perhaps it hadn't made it to a backup.
This was closer to a colocated server than a cloud service. GCE gives you a VM that you install your favorite OS on. Google keeps the VM running, but the rest (such as whether the server backs itself up) is up to you.