Replacing a silently failing disk in a ZFS pool (imil.net)
125 points by rodrigo975 on July 3, 2019 | hide | past | favorite | 66 comments


> What? there’s a shitton of docs on this topic! Are you stupid?

I've been using one form of Unix or another since 1994. I've been employed as a sys admin over many of those years. I think the git CLI is just dandy. I love iptables. lsof? No problem. No CLI has made me feel more dumb than ZFS's. I have more "how to" notes for it than any other Unix CLI that I use. So no, you aren't stupid. (The Linux "ip" CLI isn't my favorite either – when you need a rosetta stone for your CLI it's not a great sign https://access.redhat.com/sites/default/files/attachments/rh...)


Before I put my first ZFS system in production, I prepared for the following scenarios, using the actual hardware: corrupted bits on three disks; cut power to a disk; pulled a disk out while running; decided I hated a healthy disk and replaced it with a different disk; pulled a disk from the zfs root mirror and installed a new root mirror without backups. So I now know that it has the ability to recover from most things I could come up with. If only I had kept notes on how to do it, but I deferred that with "I'll do the google and maybe set up a play-system in the same state to test stuff before doing it on the real system"... not a good admin, this one (me, myself, not the author; kudos to them for writing about their experience and making us all smarter)..


I think every single person in charge of a production storage system has pulled their hair out for a while dry running these things. If only we built playbooks for our software that showed exactly what to do in these simple failure modes.

This post is a very strange failure mode so it'd probably not be in that guide.


This actually isn't all that strange a failure mode. We have several large ZFS arrays in service and replace 1-2 failed disks every month. About 90% of the time the first warning you get is exactly this: a message in the syslog from the CAM controller saying it failed a read. Neither ZFS nor SMART tends to notice these until they get pretty bad/frequent. By the time they're bad enough for other software to notice, your pool is performing pretty poorly.

We deal with this by watching for these errors, printing them to a log specifically for Icinga to watch and alert on, and preemptively replacing the disks. It would be nice if the other software (ZFS, SMART) noticed these in time for them not to become severe.
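A minimal sketch of that kind of log watch (the log path and the exact CAM message pattern are assumptions; your alerting system just reads whatever log you print the summary to):

```shell
# Sketch of a syslog watch for CAM errors, assuming FreeBSD-style messages like
# "(da5:mps0:0:5:0): CAM status: Command timeout". Prints "<count> <device>"
# per affected disk, so a monitor such as Icinga can alert on rising counts.
cam_error_counts() {
    grep 'CAM status' "$1" \
        | sed -n 's/.*(\(da[0-9][0-9]*\):.*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Run from cron against /var/log/messages, appending the summary to the log the alerting system watches.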


Yeah, it's somewhat strange; then again, disks and drivers each trying their best to serve probably results in many silent retries that just appear as slight performance jitter. It's not like the FS would know the first time a disk re-reads a sector; the sector might not even be relocated the first time it happens. Maybe we need some mechanism for reporting "yeah, we got your bits, but it wasn't as smooth sailing as usual" to filesystems such as ZFS.


As marcosdumay says, smart should handle this, but you do have to look.

Iffy sectors get marked as Pending, and Reallocated sectors are also marked. There are some SMART stats for seek times too, I believe (but those weren't as clearly associated with failure as the sector counts).

There is a danger that manufacturers will avoid acknowledging problems in SMART, but so far I haven't heard about that happening too often. There are certainly some drives that are dead or dying with fine SMART values, but it seems rare, and there's often a non-obvious dependency where a drive that is very unhealthy may not be able to record new SMART values.
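For reference, the two attributes mentioned can be pulled out of `smartctl -A` output with something like this (a sketch; the attribute names are the usual smartmontools spellings, and the raw value is the last column of the attribute table):

```shell
# Hedged sketch: extract raw Reallocated/Pending sector counts from saved
# `smartctl -A` output. Attribute rows look like:
#   5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" { print $2, $NF }' "$1"
}
```

Anything nonzero in either count is worth a closer look, and a growing Pending count doubly so.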


In my experience, SMART doesn't detect anything of value. 95% of my disk failures are:

- Seek time gets very high (>80%).
- Drive drops dead and won't respond.

SMART doesn't detect either of these.

Setting an alert on seek times > 30ms has been by far the best predictor of drive failure. If the seek times go over ~400ms then other components start complaining about slow/missing I/O, but by that point you've already gone crazy because your array is unusably slow.

If a drive develops bad sectors SMART will report it, but that's rare enough that I don't bother watching for it. It's a non-event with ZFS because of the checksumming. These drives never have one random error; they start raising hundreds in the space of minutes, and that's the moment you replace the drive.
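The >30ms seek-time alert can be approximated by parsing `iostat -x`-style output; a sketch, assuming the device name is in column 1 and the average wait in column 10 (column layout varies between iostat versions, so check yours before trusting it):

```shell
# Hypothetical sketch: print devices whose average I/O wait exceeds a threshold
# (default 30 ms), given a file of `iostat -x`-style rows. The column positions
# ($1 = device, $10 = await) are an assumption; verify against your header.
slow_disks() {
    awk -v limit="${2:-30}" 'NR > 1 && $10+0 > limit { print $1, $10 }' "$1"
}
```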


SMART shouldn't be used for this; the SATA transport layer should be updated to allow indicating it, and syscalls should pass this information along so filesystems can react to it. Otherwise we'd need a new SMART field, as the semantics of the existing fields are insufficiently described, and reacting to them would require adding a SMART query to every filesystem call that touches disk.

Thing is, "iffy" depends on what the firmware decides is iffy. Maybe a firmware relocates a sector the very first time it fails to read; maybe it relocates after try 200. What we want is to be notified, when we request a sector, whether reading it was 100% painless. Depending on your demands for reliability, maybe it's okay to re-read a sector a few times and still not reallocate it, but maybe your filesystem would like to gather these stats from the disks anyway.


> It’s not like the FS would know the first time a disk re-reads a sector

I can understand why the default mode of operation for disks involves the disk presenting a clean “you get a block or you don’t” interface to the OS over e.g. SATA; but I’m surprised that it’s what we’re stuck with 100% of the time. We have ECC memory (where the memory controller reports errors to the OS), but no ECC disks.

Maybe this is what Apple is trying to do with putting their T2 chip in charge of being a storage controller. That way, it doesn’t just get to do encryption things; it can also make highly abstract policy-level decisions on what to do about these sorts of error events. Too bad such an approach is only really tenable if you’re building your own storage system out of your own raw NAND; it’d be nice to be able to have an external low-level storage controller built into e.g. the RAID card of a RAID array of regular HDDs, that those HDDs would then slave themselves to, and which ran OS-uploadable firmware.


Isn't that what smart was supposed to do?


This reminds me of an episode (the pilot) of House, MD. The full script is here [1]. The excerpt:

> In a typical case, if you don't cook pork well enough, you digest live tapeworm larvae. They've got these little hooks, they grab onto your bowel, they live, they grow up, they reproduce. Reproduce? There's only one lesion, and it's nowhere near her bowel. That's because this is not a typical case.

New professionals always expect things to be similar to what they studied. What they learn is that the books and lectures usually covered the "most common" situations, but all those edge cases and weird scenarios come up on a daily basis.

[1]: https://www.springfieldspringfield.co.uk/view_episode_script...


On this topic (and related in many ways) is Bryan Cantrill's talk Zebras All the way Down.[1]

The gist is that in medicine there's the concept of diseases that are zebras and horses, rare and common, and the saying goes "when you hear hoof beats, think horses, not zebras". Unlike medicine, in software most of the easy problems have already been abstracted away or made trivial by our software stacks, so often what's left are zebras. What used to be exceedingly rare bugs now accounts for a large proportion of the odd behavior we see in very robust tech stacks, because the easy bugs have already been hunted down and eliminated.

1: https://www.youtube.com/watch?v=fE2KDzZaxvE


“Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.”

Boris Beizer, Software Testing Techniques. Second edition. 1990


Extremely topical given the outages we're seeing lately.


In the early days I used one monolithic OS for everything, but I have since virtualized everything with VMware and no longer have to worry about my OS failing and then having to check my phone for solutions on how to fix my broken OS.

One obvious caveat to virtualisation is the overhead, but with time you can learn to use virtualisation with ease and re-provision an entire new OS (and virtual disk) if needed. In terms of how this relates to disk health: the answer is that you now have multiple healthy disks and only rarely have to deal with failing ones; due to compartmentalization, the risk of a failing disk ruining your day is minimized, and hopefully the failing disks have nothing too crucial on them. Throw in some cloud storage solutions for backups and this makes things even more bearable.


I have pretty much embraced containerization... as long as the source is out there, and data is well backed up, I'm relatively safe. To me, if it isn't a scripted deployment, it's not really exercised.

Aside: One thing I've done a few times is set up a queue and service to replicate data in an RDBMS to Elasticsearch or Mongo as completely denormalized records to search against. When I did that, it became very easy to store a .json.gz of these same records in s3/blob storage as a secondary backup.


Big kudos to ZFS (on Linux), it's just amazing how sane and stable it is. It saved my ass multiple times in the past years. I've also just finished upgrading a RAIDZ1 vdev by replacing one disk at a time with bigger ones. Resilvering took 15h for each disk and there was some trouble with an (already replaced) disk failing during that. Panic mode set in, but ZFS provided the means to fix it quite easily - all good. Best decision ever to pick ZFS.


you can replace multiple disks at a time, assuming you have enough drive slots.


And assuming the array has sufficient redundancy to allow multiple disks to be offline, which a RAIDZ1 vdev doesn't.


If you have space you can “zpool attach” the new drives to make mirror vdevs with the disks they’re replacing, and then “zpool detach” the old drives to break the mirror when done.

This is preferable because you’re never exposed to any additional risk of data loss and can replace more disks at once than original pool redundancy would allow.


Not on a raidz vdev you can't.

You can't make mirrors out of anything other than existing mirrors or single disk vdevs.

(You _can_ run zpool replace on more disks than the pool redundancy has, assuming you don't need to disconnect the old disks to put the new ones in...running zpool replace on two disks in a four-disk raidz1 is perfectly legal, as long as the old disks are still there.)


zpool replace poolname olddisk newdisk

Keeps the olddisk online and utilizes it until newdisk has resilvered, then detaches it. It seems equivalent in safety to

    zpool attach poolname olddisk newdisk
    (wait for resilver...)
    zpool detach poolname olddisk


zpool replace does not require you to offline the drives. TFA only suggests it as a workaround for the resilvering being slowed down by a faulty drive.


yeah, unfort. I did not :(


Replacing one disk at a time in a ZFS pool that's almost full is a bad idea, because all new writes will go to the sole new disk.


I don't think that's correct for the case that dvdgsng is describing. IIRC, the RAIDZ vdev won't allow the use of any extra space until all of its drives have been replaced with bigger ones.

What you're describing occurs when growing a ZFS pool by adding a new vdev, i.e. extra disks that sit alongside the existing ones.

(Edit: Didn't realize that mbreese had already addressed this point in a different reply.)


True, I didn't pay attention that OP was growing the vdev and not the pool.


But that's the only way to make an existing pool bigger, isn't it? Replace each disk, one at a time, and then auto-expand the pool at the end. The usable space doesn't change until the pool expands at the end.

I think you're thinking about adding a new vdev to an existing pool to make the pool larger. This is almost always a bad idea, because like you said, the writes all end up on the new vdev.


If you can have downtime: you could create a new pool, turn off writes to the old pool, snapshot the entire pool into the new pool using the built in snapshot & transfer mechanics, and then kill the old pool.

If you're ok with write performance degradation & doing multiple resilvers, then what the OP did is fine too.


That assumes you have enough extra space to do it and you didn't start full. Ex: I just bought a used Dell R710 with 6 bays; it came with 2TB drives... the plan is to play around a bit with those... but I don't have an extra, supported controller with enough slots (even if the cover is off) to populate it while I migrate. In the end, it'll be full of shucked 8TB drives, but I'm pretty sure I'll just start over at that point. My old NAS's drives are 6 years old in a decade-old Synology box, and my trust level is failing before any drives do.


Something that's important to consider is whether your drives can handle a resilver. Each resilver essentially reads the entire disk (with lots of asterisks), whereas setting up a parallel ZFS system to provision your disks and using the transfer I described will only read your system's fs once.

You don't even need to use the same system. You can set up a different system, run the transfer, turn off the main system, and swap the disks.

It's a hassle, but it's an option at your disposal.


If I'm using RAID-Z2, isn't there a way to drop the old drive and add a new drive to take its place?


Interestingly enough, I had the reverse scenario a few weeks ago: ZFS alerted on r/w errors on a drive, while smartctl and the kernel were perfectly fine with it. The drive was in fact failing, the physical noises (!) made that clear, but other than ZFS, nothing was reporting it.


Hardware failing is a noisier (literally in your case) version of silent data corruption, which is one of the core original use cases and justifications for ZFS:

> BILL MOORE We had several design goals, which we’ll break down by category. The first one that we focused on quite heavily is data integrity. If you look at the trend of storage devices over the past decade, you’ll see that while disk capacities have been doubling every 12 to 18 months, one thing that’s remaining relatively constant is the bit-error rate on the disk drives, which is about one uncorrectable error every 10 to 20 terabytes. The other interesting thing to note is that at least in a server environment, the number of disk drives per deployment is increasing, so the amount of data people have is actually growing at a super-exponential rate. That means with the bit-error rate being relatively constant, you have essentially an ever-decreasing amount of time until you notice some form of uncorrectable data error. That’s not really cool because before, say, about 20 terabytes or so, you would see either a silent or a noisy data error.

> JEFF BONWICK In retrospect, it isn’t surprising either because the error rates we’re observing are in fact in line with the error rates the drive manufacturers advertise. So it’s not like the drives are performing out of spec or that people have got a bad batch of hardware. This is just the nature of the beast at this point in time.

> BM So, one of the design principles we set for ZFS was: never, ever trust the underlying hardware.

> […]

> Greenplum has created a data-warehousing appliance consisting of a rack of 10 Thumpers (SunFire x4500s). They can scan data at a rate of one terabyte per minute. That’s a whole different deal. Now if you’re getting an uncorrectable error occurring once every 10 to 20 terabytes, that’s once every 10 to 20 minutes—which is pretty bad, actually.


> But… at less than 40K/s! Turns out that very logically the failing disk and its timeouts was slowing down the silvering, so I learned that to avoid this kind of situation, you should offline the failing disk from the zpool:

Yeah, I ran into this too. I wish raidz would route around the slow drive by re-balancing reads to the more performant drives based on IO queue depth (reconstructing data from parity if needed). They do it for mirrors, but not raidz.


I just spent about a week trying to get a raidz pool going for a home server under Ubuntu. It was one of the most frustrating experiences in recent memory. My goal was to run the root system off an SSD with the heavily used folders offloaded onto a ZFS raidz pool.

I followed the Ubuntu Root on ZFS guide with changes I thought would be appropriate.

I gave up eventually and came to the disgruntled conclusion that it works for beginner users who follow guides to a T and for advanced users who know the changes that need to be made from institutional knowledge, but not for people in between.


That depends on your OS. Ubuntu is, frankly, very bad at it -- though few other distributions are very much better.

I'd recommend NixOS. ZFS is a proper first-class filesystem there, and you can use it almost however you like.


Thanks... have a used server I'm going to play with in a couple weekends... won't have the final drives for the new shared drive setup for a couple months (moving away from an old nas box) and going to try a few scenarios with an nvme as the boot drive via adapter, and most storage/backup to the RAID array.

Was planning on trying both unraid and windows server as host OSes, and maybe even using a NAS distro in a VM managing the individual drives, etc. Would prefer something that supported docker, full vms and a friendly NAS UI, but not sure such a beast exists.


FreeNAS is a linux distro with a friendly NAS UI: https://freenas.org/ - but still linux, so you get all the usual linux stuff - docker, full VMs, etc...


FreeNAS is based on FreeBSD, not the Linux kernel. It's from a completely different lineage of operating systems.


Ooops, thanks for the correction!


As mentioned in another comment, FreeNAS is based on FreeBSD and they dropped Linux/Docker compatibility, and full VMs are less similar to other platforms. If I did FreeNAS, it would be in a VM with the drives mapped to the VM.


Ooops, I thought it was Linux, sorry!


Great, thank you. I will look into it.


What exactly did you try? On the surface that sounds pretty straightforward.

The problem I most often see people run into is trying to use mismatched drives they have lying around. With ZFS you’ll have a much better time just purchasing matched disks to fit the zpools you have in mind. It was built for well planned, huge arrays in datacenters and that legacy shows.

I’ve used ZFS for years in a variety of configurations on Solaris, FreeBSD and macOS. My only training was the Solaris and FreeBSD man pages, so it’s not that dark an art.


I have a 4-bay QNAP x64 appliance that I wanted to put a full Linux distro. I wanted to have a matched 3-drive raidz pool for /var, /usr/local and /home and have the rest of the system live off an SSD.

My goal was to have a media server for the home with DVR capabilities, using a Hauppauge USB tuner within Plex. QNAP pulled DVB support 2-3 years ago, unfortunately, so a native solution wasn't working. I considered FreeNAS, but LCDproc is not available, which is a must-have for me.

I tried several ways of going about this: installing the system to the SSD and copying over those specific directories to the raidz pool with appropriate changes to mounts. I tried doing the debootstrap method, and chrooted to create the raidz within the new install.

However, I ran into issues either with configuring grub or with the remapped directories not mounting at boot. I am sure there was a simple element that I was missing, but within the time constraints of personal life and a skillset not up to par with an IT admin's, I gave up and relegated myself to the QNAP environment, without the DVR function I so badly wanted.


So, what does one do if all they have laying around are mismatched drives?


You _can_ mix drives, but you'll get the worst from each. For each vdev (mirror pair, or raid stripe) writes will be limited by the slowest drive, and space will be limited by the smallest drive.

So if they're not highly mismatched, it can work ok. For example if you have a 1 TB and a 1.5 TB disk, but from the same series and as such with otherwise similar spec, it would work fine as a mirror vdev. Write speed will be limited to the slowest of them (most likely the 1 TB), and you'd only get 1 TB of available space, but it's more than nothing.


Ah, ok. I have 2 3TB and a 4TB running on a Proxmox v3 system. Was wanting to upgrade to Proxmox v5 and try ZFS.

The 4TB is a different make than the 2 3TB drives, though.


I've been running a mixed-make mirror array for half a year now. Like I said, speed is limited by slowest drive but apart from that it works just fine.


Thank you. I appreciate the info.


Note it's probably not a good idea (if possible) to mix 512 and 4096 byte sector drives in the same vdev.


If you create the pool with ashift=12, you're fine.

The only hazard is if you have an ashift=9 pool (the default, IIRC) and later add a 4096 byte sector drive, you might lose some write perf. It probably doesn't matter for homelab use.

There aren't many 512 byte drives any more. You'll be fine.
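For the record, ashift is set per vdev at creation time and can't be changed afterwards; a fragment, not runnable as-is (the pool name and device paths are placeholders):

```shell
# Fragment: force 4 KiB alignment (2^12 bytes) when creating the pool.
# "tank" and the device names are placeholders.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Check what an existing pool was created with:
zdb -C tank | grep ashift
```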


You can partition your drives based on the size of the smallest and make your pool on those partitions.


Yes. I tried this maybe a year or so back, and came to much the same conclusion, and I say this as someone who's been running ZFS systems for over a decade now. When it comes to actually booting from them, the potential for "fit and finish" issues with the distro (and grub...) is very significant.

Proxmox works really well for me, you can basically treat it as a non-desktop Debian distro, with ZFS built-in, and a largish community of people who use it that way.


Note that his zpool is in "raidz" mode, not "mirror" mode. The latter is going to be far better and faster at handling replacement of dead/dying drives, because you can pretty much just remove the old one, plug the new one in, and then update the zpool (in that order). It's also faster in normal use due to the lack of performance issues related to calculating and storing parity bits in raidz.


I've got a RAID-Z2 array that is being kind of weird, and I'm not entirely sure where the problem is, but it has never lost data. At first I thought it was a marginal disk. Now I'm thinking it might be the controller.

It's amazing how easy it is to work with ZFS though. I was able to take each drive out of the pool, run badblocks on it (read/write to exercise it and look for errors), and then add it back in once it passed the burn in test. Easy peasy. Even with each drive being a dm-crypt.

I've been using ZFS for a very long time, and never had data loss on it. Even back in the early days of ZFS+FUSE under Linux. At one time I had 5 big systems running ZFS for backups of ~150 machines.


Has anyone investigated Stratis? I only recently heard about it, and it sounds like it has a ways to go to match feature parity, but it is targeting ZFS and Btrfs, seems like it has made a lot of progress in the last 2 years, and has Redhat behind it (?). It's not clear to me how closely it will match ZFS features, but it seems like it might be on a path to maturity before Btrfs.

https://stratis-storage.github.io/

I really wish we had HAMMER on Linux.


It's nowhere near ZFS/HAMMER/BTRFS ...

Check their FAQ here: https://stratis-storage.github.io/faq/

In short:

In terms of its design, Stratis is very different from ZFS/BTRFS, since they are both in-kernel filesystems. Stratis is a userspace daemon that configures and monitors existing components from Linux’s device-mapper subsystem, as well as the XFS filesystem.

Red Hat's move seems very strange here, because it would seem better just to hire several BTRFS developers and join the BTRFS Linux ecosystem instead of writing something new ... I still do not know what for ...


> it would be better just to hire several BTRFS developers and join the BTRFS Linux ecosystem instead of writing something new

I found this previous discussion rather enlightening on the subject:

https://news.ycombinator.com/item?id=14907771


They have a bunch of device-mapper, LVM, and XFS developers on staff though. So I don't think it's strange they're leveraging the people they already have. In a sense they are hedging their bets. From their perspective, if Btrfs pans out by community effort, then Red Hat can change their minds yet again and support it. And if Stratis pans out better, in particular as it relates to support contracts, then at least they got a product they can ship and support.

The open Stratis question for me is: with so many layers it depends on, how are bugs going to get handled? And how fail-safe will it really be when there are problems? And what do repairs entail, and how long do they take on large volumes? What sorts of goofy unexpected edge cases will users run into, and how long will it take to find workarounds and permanent fixes? Is it really offering the general Linux community, including even Fedora in particular, something they'll really want? I don't know the answer to any of that. I have no idea if it'll see widespread adoption like LVM has.


If you already have a spare disk installed doesn't it make more sense to just add it to pool anyway and just bump up the redundancy to Z2 or Z3?


That is one of those Great Debates.

If your spare disk is a "hot spare" where it is plugged in and powered up, yes.

If your spare disk philosophy is "cold spare" where the drive is in a box on a shelf, probably not.

Hot spares are faster and easier to turn into replacement drives but the "hot spare" is subject to usage-based (electrical, thermal, and mechanical) failure modes. For a given drive, the probability of failure is largely proportional to power-on hours[1].

Cold spares are not accumulating power-on hours so they likely won't fail while waiting to be used and presumably won't fail for quite some time after being installed as a replacement.

[1] Intuitively, but hard to prove because of many confounding factors.

Backblaze has the best drive failure rate data that I'm aware of: https://www.backblaze.com/b2/hard-drive-test-data.html


Yes, assuming: (1) You haven't already created the pool, as you can't alter an existing vdev from (e.g.) RAIDZ to RAIDZ2, and (2) your pool only contains a single vdev, or you're happy with an extra disk per vdev.

Otherwise you're into dealing with spares, as gvb describes.


I have a few ZFS pools, mostly for temp space for tape backup. I tried using an old attached Dell storage array for ZoL with LUNs, but ZoL had locking issues. But we've pretty much replaced all our cheap Solaris ZFS storage with a mix of flash/consumer-drive Nimble.


    Jul  2 12:51:02 <kern.crit> newcoruscant kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Looks like the FreeBSD kernel has a similar configuration to Linux: a command timer of ~30 seconds. If the drive hangs longer, then the whole link is reset. And the reason why the drive was hanging is lost in that reset.

The likely reason why the drive is hanging, if it's a consumer drive, is that it has a very high bad-sector recovery time, which can approach 3 minutes. That's pretty crazy.

Anyway, it's central to any RAID to be able to get a discrete read error from the drive. And that read error will include the LBA for the bad sector. And that information is needed to know what data is affected and where to get a good copy (from mirror or from reconstruction using parity). This is the same on md raid on Linux, ZFS, Btrfs, and even hardware RAID will depend on it. There are too many commands in the queue to just assume one of those commands and therefore one of those requested sectors (which could be thousands) is the bad sector. A discrete read error with LBA is necessary.

And the link reset prevents that.

On Linux, this is set per drive (this is a Linux SCSI command timer, which applies to all PATA and SATA drives too, and it is not a drive setting, it's a kernel setting) timeout:

    $ cat /sys/block/sda/device/timeout 
    30
Ideally the drive supports SCT ERC (that's for SATA, there's a SCSI/SAS equivalent) and you use 'smartctl -l scterc' to change it to a value less than the kernel command timer. That way the drive itself gives up faster, and issues a discrete read error, and now the kernel (and ZFS) can figure out how to repair the problem.
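A fragment showing that (7 seconds is the conventional value; /dev/sdX is a placeholder, and this needs root and a drive that actually supports SCT ERC):

```shell
# Fragment: set the drive's error recovery timeout to 7.0 seconds for reads and
# writes (the units are tenths of a second), then read the setting back.
smartctl -l scterc,70,70 /dev/sdX
smartctl -l scterc /dev/sdX
```

Note the setting is typically lost on power cycle, so it belongs in a boot script.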

If the drive doesn't support configurable SCT ERC, you have to raise the kernel command timer to an obscene level.

    # echo 180 > /sys/block/sdN/device/timeout
And now the kernel will just wait and wait and wait and wait until finally the drive does give up on recovery and issues a discrete read error, and the kernel and ZFS can fix it.

The late tl;dr is: this is not a ZFS problem. This is the result of long-established kernel command timers and a refusal by kernel developers to update them for the incredibly deep (some might say crazy) recoveries that consumer drives use. But here's the thing: macOS and Windows know about these long recoveries and tolerate them without doing link resets. And that's why they eventually recover bad sectors, but the user notices them as performance slowdowns.

And what fixes them? A clean install. And that's because a sector write that fails due to a bad sector, causes a remap. Which is the same mechanism that ZFS, Btrfs, md raid, and hardware raid will all depend on. The read error results in getting good data from a copy (mirror or reconstruction from parity), the good copy is both sent to the application layer as well as results in an overwrite command to the sector that had the read error. If that write fails the drive firmware itself remaps that LBA to a spare sector.

Anyway, this is a misconfiguration, the question is who is to blame for it? And even that's complicated because the drive doesn't announce its SCT ERC support or value. You have to poll it. Could the kernel poll for this? I guess? But should it? Probably not its domain? Is it a distribution question rather than an upstream kernel question? Perhaps, yes, the distros should use high command timers by default, and expect sysadmins will reduce them if the use case requires it.



