Relevant amusing bit from the Amazon FAQ: "S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years."
I think my favorite part of that is "on average", as if you will be making repeated ten-million-year trials of this effectively brand new technology.
The point is that once you get into several nines of reliability, really rare events that are impossible to model start to dominate your risk budget.
Many people have far more than 10k objects though. Based on that math, if you stored 10 billion objects in S3 (certainly well within the realm of possibility), you'd lose an object on average every 10 years, which is a length of time one can think about.
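Rough arithmetic, if anyone wants to sanity-check that (a quick sketch; the only input is the 11-nines figure from the FAQ quote above):

```go
package main

import "fmt"

func main() {
	// 11 nines of durability per object per year, i.e. the FAQ figure above.
	const annualLossRate = 1e-11

	for _, objects := range []float64{1e4, 1e10} {
		perYear := objects * annualLossRate // expected objects lost per year
		fmt.Printf("%.0e objects: expect to lose one roughly every %.0f years\n",
			objects, 1/perYear)
	}
}
```

10,000 objects works out to one loss per ten million years, 10 billion objects to one loss per decade.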
Your main point (that making high reliability systems more conventionally reliable is building your high wall yet higher still) is definitely valid. But the lossage rate is actually a meaningful number given the extremely large number of objects stored in S3.
I completely agree with you. Where they are intentionally misleading is in pretending that keeping one object for a zillion years is the same as keeping a zillion objects for one year.
The Long Now foundation (http://longnow.org) has some other interesting projects besides the clock.
If you like this kind of thing, I'd suggest reading Neal Stephenson's Anathem. It might be hard to get into, and it may be his most impenetrable book yet, but it really pays off later.
Several of Neal Stephenson's books are like this ... I finished Cryptonomicon earlier this year and have just started REAMDE. I often find myself reading the same sentence several times because they are so convoluted ...
This is S3 which isn't comparable to Google's persistent disks. S3 is equivalent to Google Cloud Storage which has "99.999999999%" durability as per https://cloud.google.com/storage/
To accurately compare them you'd need to look at AWS EBS: "Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% - 0.2%, where failure refers to a complete or partial loss of the volume, depending on the size and performance of the volume."
I don't think so; PD and GCS are fundamentally different products. PD is like a local disk, and you cannot assume it lasts forever, while GCS is durable storage and is safer.
"
This outage is wholly Google's responsibility. However, we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data.
"
Even in a cloud-based infrastructure, one still needs to know the difference. And the ultra-low numbers come from the math, I think (number of replicas, possibly spread across DCs/geographies).
My recollection: around 20GB... however, they clearly were willing to try to get some of it back if that was helpful. It was just a redundant offsite backup for me, so there was no need.
Or the bigger problem: assuming your future reliability looks just like your past. For example, say they have a problem with year 2038. Everything works great until everything doesn't work at all.
Indeed "on average" covers the entire period and from initial local-node intake of the data it has to then be propergated to another site to increase integrity probability. So if it takes even 5 seconds to propergate to several sites at different locations then the durability scales from when the clock starts.
Heck, once in every 10,000,000 years: the odds of winning the US lottery are around one in 175,000,000, and yet people play and somebody wins. So it is always worth looking at odds from another perspective and accepting that an event is impossible, probable, or merely possible. And as we know, the return on investment in durability is logarithmic: the cost keeps going up while the return gets smaller.
Still, one of the biggest issues in any datacentre is not just the power but the quality of the power; even the slightest noise can and does increase the odds of some electronics going wrong. UPSs (though it's been many years since I checked) will put out a modulated square wave, while the ideal is a perfect sine wave. It is truly educational to see the quality of many power sources on an oscilloscope. Equipment on the same circuit can also induce noise, so it can vary from rack to rack.
Do they spell out what assumptions they're working under? Because, for example, I'm pretty sure that the odds of a civilization-ending asteroid or comet strike in the next year, while quite low, are higher than what's implied by 99.999999999% durability on trillions of objects.
Well, the guarantee needs to be reasonable only as far as you and the courts are alive to sue them. (Corollary: you only need to sustain DDOS attacks when the rest of the internet infrastructure can sustain the traffic).
Why do you specify trillions of objects? They practically promise data loss at that scale.
Also, the typical civilization-ending comet or asteroid is going to destroy 0 or 1 geographically-diverse data centers. Remember that this is a retention promise, not an uptime promise.
Just because it's how many objects they store. But now that you mention it, it has no relevance to the probability of data loss due to global catastrophe, since it cancels out.
If your proposed reliability is a data loss rate of one in 100 billion per year, that assumes the risk of e.g. global thermonuclear war is no more than that amount, or you're ignoring it for the purposes of the calculation. Which is why I wonder about their assumptions.
As for a comet destroying at most one data center, if it really ends civilization then the data will probably be lost before too much longer even if the facility physically survives.
That's one of the reasons that I'm excited about decentralized storage. Data stored on dozens or hundreds of nodes spread across multiple countries and jurisdictions is much more robust to things like earthquakes, storms, government intervention, and companies going out of business or changing their profitability model. It has a much stronger defence against black swan events.
When you are storing many billions of files consuming many millions of disks, you become vulnerable to black swan events, simply because you are rolling the dice so many times.
The irony is that, by that time, Amazon will almost certainly not be around. So doing it in terms of time is actually a different calculation, and this is based on wrong assumptions :)
Well, the expectation is different for different players. If Facebook loses 1 photo for 1 customer out of 10M customers, FB wouldn't care; the user may just assume it was a glitch. For a small business, losing 1 photo for 1 customer out of 10,000 customers can be a big deal, but nonetheless not the worst case: offer a sincere apology, do extra backup/replication if that data is extremely valuable to the customer, and roll out a new service.
Another problem is that any errors in assumptions or omissions made when calculating the odds will be enormously magnified.
I wouldn't trust any of these figures unless they have ongoing efforts to test them empirically. E.g. create distributed databases of 100 trillion objects, mess with them in various ways, and perform correctness checks on them.
Which would be embarrassing if this had happened to a product offering such an SLA. Instead, this is just zone-local HDD storage. It will only affect customers that decided to run a system in a single site, making no snapshots or using cross-region replication, etc. Which may be quite a few seeing as there seems to be this misunderstanding that VMs "in the cloud" have physics-defying redundancy attributes. (Not saying you have this idea, just that multiple clients have expressed surprise when their single-VM setup goes down because "isn't this what the cloud is for?".)
Well, a single VM is not "the cloud". It's just a VM on some dude's computer. Your clients seem to have been misled by the incredibly content-free marketing around "the cloud".
Sounds like this wasn't caused by power surges reaching the equipment but rather an effect of repeated power loss to drive arrays not fully designed to handle it. The article is pretty unclear. Still sounds like an infrastructure problem though.
AFAIK google doesn't use hardware-based arrays in their servers.
I think the reference to batteries would more likely be in reference to a DRUPS (Diesel Rotary UPS) which most datacenters run, sometimes with some form of battery in combination with the flywheel. Typically the combination of kinetic energy in the flywheel (and potentially batteries) only hold enough power to last for 30 seconds or so (often as low as 10 seconds), which gives the diesel power generators enough time to come online and take over from there.
My guess at what might have happened: grid power was lost, they switched over to UPS fine the first time, grid power came back so they swapped back; repeat a few times, and on one of those cycles the batteries didn't have enough charge left to keep things going through the generator swap-over.
Ah, so UPS on the cheap. A real UPS is always running off the battery, telco style, and you test your gensets at least weekly and check the tanks every month or so.
At BT they were tested every day for the really sensitive systems.
A proper storage array refuses to boot if the batteries are not sufficient to de-stage cache in a timely fashion, or at the very least keep cache available for the advertised (generally 72 hours) time window if it doesn't have the ability to de-stage the cache to a more permanent medium.
I work on cell sites, grounding system design and repair is a primary design element, even then, the presumption in the industry is that if a site takes a direct hit - or for that matter a nearby strike - the equipment is a total loss.
The surge suppression gear we put in (lead-ins at power feeds, RF feeds, etc.) is mostly to prevent a fire and to ensure the extra energy goes largely to ground, but it won't prevent dead gear.
Are you saying essentially that "There is no such thing as a surge protector, they don't physically exist. Only surge reducers exist." Because that's what it sounds like to me.
EDIT:
All right, I'll rephrase. According to Google's infobox from nat'l geographic, lightning generates up to 1 billion volts.
-> Are surge protectors at even the highest-end data centers simply not rated to a billion volts of surge protection?
There's a "protector", but the protection isn't guaranteed to be 100% effective.
That's an OK use of the language in this context and has many parallels in other fields. E.g. using a condom as protection against pregnancy and disease. Many people have learned the hard way that it's not 100%.
This raises a relevant concern that's been on my mind: what's the best way to back up cloud services? Given that services like S3 and Google Drive have many more nines of durability than any local storage system I could devise, are backups even worth the trouble?
There are a lot of cloud-to-cloud backup services out there, but to me that seems like the blind leading the blind, especially with regards to malicious data destruction. For instance, I've recently been experimenting with Cloudally to automatically back up Google Drive, which seems like a good solution at first -- until you think about the fact that Cloudally uses Google accounts for authentication (and doesn't use 2FA for native authentication). In other words, an attacker with access to my primary data (Google Drive) would also have access to my backups. Worse than that, Cloudally actually increases the attack surface, since its lack of 2FA presumably makes it easier to crack than my Google account.
Similarly, I'm guessing a lot of cloud backup services share data centers with the services they are backing up.
If you really care about durability, your best bet is erasure-coding + a wide geographic distribution of shards. For example, you could encode 1 TB of data into four shards, each shard containing 500 GB. You distribute these to servers in SF, NYC, Berlin, and Sydney. The key here is that you only need two shards to recover your 1 TB of data, and they can be any two shards. So if lightning strikes Berlin, and the Big One hits SF, your data is still safe. And thanks to erasure-coding, you can achieve this with only 2x redundancy (instead of 4x).
Well, I have to shamelessly plug my startup, www.siacoin.com, which implements the scheme I described above using a peer-to-peer network, with payments made on a blockchain. We are using Klaus Post's excellent pure-Go implementation (https://github.com/klauspost/reedsolomon) which exceeds 1GB/s throughput.
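For the curious, here's roughly what the 2-of-4 scheme above looks like with that library (a minimal sketch on a small in-memory byte slice rather than 500 GB shards, and not how we actually wire it up in production):

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 2 data shards + 2 parity shards: any 2 of the 4 can rebuild the whole thing.
	enc, err := reedsolomon.New(2, 2)
	if err != nil {
		log.Fatal(err)
	}

	data := []byte("pretend this is your 1 TB of data")

	// Split the input into 2 data shards (plus room for parity), then compute parity.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two of the four sites: drop one data shard and one parity shard.
	shards[1], shards[2] = nil, nil

	// The two surviving shards are enough to reconstruct the rest.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("all shards consistent after rebuild:", ok, err)
}
```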
Durability is not the same as "protection"; it doesn't cover things like government seizure or hacking. Your own drives, in your physical possession, fill that gap (somewhat).
This is how, for example, you can know whether a wildfire was started by lightning - once a point of origin is determined, simply check the data for strikes.
Assuming the wild 1 PB guess above... that could be anything from 100 MB of a single volume lost, all the way up to 100 million volumes each losing one byte of data. There isn't enough information to determine the number of volumes affected. 0.000001% of disk space isn't very informative. Publishing the percentage and/or number of persistent volumes affected would give us an idea of the scale of the problem.
So Google said: "...although... the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk."
I thought the battery is supposed to cover writing the entire write-buffer cache to disk in case of power loss. Sounds like they had some badly designed gear which didn't account for partial battery charge; the cache should be downsized to match the battery's remaining capacity.
Quoting the data lost as a percentage of disk space is both accurate and misleading. It makes the impact sound tiny because only recent writes were affected. Obviously writes that were in flight at the time of the incident are going to be a tiny percentage of overall storage. What they don't tell us is what percentage of persistent disks which were in use at the time were affected. That percentage is likely far higher. If only 0.000001% of volumes in use were affected it would never have made the news.
I think it only impacted people who were using Compute Engine and storing data on a hard drive instead of using Cloud Storage. I suspect that Drive data has redundant copies stored in multiple data centers in addition to frequent backups.
Not for block storage (aka persistent disks): "we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data."
Achieving RPO=0 generally requires synchronous replication to a different datacenter which adds significant latency.
Amazon builds datacenter redundancy in the same geographic locale. You can then set up synchronous replication between datacenters without atrocious latency, as all the disks are fairly close by, albeit powered [in an emergency] by independent generators.
OTOH, Google does datacenter redundancy across different locales, which make synchronous replication perform much worse, like you noted.
I stand corrected. Looking at https://cloud.google.com/compute/docs/zones?hl=en, it seems that GCE has multiple zones in the same locale (region). I might have confused with Google's internal setup, which during my tenure there, was heavy on using replication across locales, up to systems like Megastore and the 200ms synchronously replicated commit.
This could potentially explain a lingering error on my Google Drive. Might there be movement within Google to contact the owners of data that was lost?
The article mentions Compute Engine, which is an external offering that isn't really used internally. It's hard to say what else might have been affected, if anything, but going by the article this shouldn't be causing your problem.
Well, imagine if file systems, file formats, and version control systems were built from the ground up together. Currently file syncing is harder because we expect our file server to handle an infinite variety of legacy file formats, and the file server cannot try to know anything about the document structure (whereas a version control system can examine the contents of plain text files).
Surely this is why Google Docs can manage collaborative working more easily -- because they control the whole stack.
A better way to put it than the OP: syncing arbitrary files between arbitrary systems is hard.
What if I edit the same pixel, in the same image, in the same nanosecond on two different devices, and then sync?
There are situations where conflicts are necessary. The focus should be on making conflict resolution easily accessible, not on trying to be smart and overwriting files at random.
Actually, quite a lot of serious but non-power users do want this. You only have to look at a random person's laptop and see the file names "final version.docx", "final version 2.docx", "final final version.docx", etc. to see this. They certainly don't want a console-based solution, but a cross between git and Word-style track changes would make lots of people pretty happy.
I've lost attachments to old gmail messages before, I never thought it was impossible or unlikely for Google to lose data.
I'm sure the data wasn't truly lost. If I'd called them up and they'd made it their priority to find my old files, they could have done so, having so many redundant backups. But of course no one at Google is taking calls like that or acting on individual requests. The data was effectively lost, not technically lost. But I'm sure it's uncommon.
As mentioned in the article, it only affected recently written data of Google Compute Engine services. GCE allows users to launch VMs and generate arbitrary data on the server.
Normally, Google redundantly distributes data to at least 3 geographically distinct locations. Check out the 'BigTable' white paper [0] for more info.
For 99% of cases (and pretty well all user cases), this would not cause data loss. The key here is that the data was generated on the servers and did not have a chance to be replicated before the event.
I can't find the presentation anywhere, but at HBaseCon 2014 one of the lead developers of Bigtable stated that they went to Reed-Solomon encoding, even for their older databases.
> Beyond security, this highlights one of the main issues with the cloud. Was there no backup?
Unless I am specifically paying for backup service, I wouldn't expect/want them to do that for me. And even if I was, I'd still have offsite backups if the system was important.
If you aren't being responsible with your backups in a noisy environment like Google Cloud/AWS, understand that you are vulnerable to freak accidents like this. Google/AWS's job in all of this is to try to reduce the frequency of issues and to minimize the impact.
If you were running the data center instead of Google, it'd suddenly stop being cloud. How would that change lightning and power loss?
Google Compute Engine offers customers the option to make snapshots for backup, or use a true "cloud" storage engine. If anyone lost data here it was customers explicitly not doing backups and only using a single zone. I don't know why anyone would expect different. GCE easily allows you to network machines in multiple data centers, but close geographically. So you'd only need to handle region-wide disasters.
The article states that it was "some recently written data [...] located on storage systems which were more susceptible to power failure". Perhaps it hadn't made it to a backup.
This was closer to a colocated server than a cloud service. GCE gives you a VM that you install your favorite OS on. Google keeps the VM running, but the rest (such as whether the server backs itself up) is up to you.