ZFS fans, rejoice – RAIDz expansion will be a thing soon (arstechnica.com)
266 points by rodrigo975 on June 18, 2021 | hide | past | favorite | 193 comments


The article is a great example of all the somewhat surprising peculiarities in ZFS. For example, the conversion will keep the stripe width and block size, meaning your throughput of existing data won't improve. So it's not quite a full re-balance.

Other fun things are the flexible block sizes and their relation to the size you're writing and compression ... Chris Siebenmann has written quite a bit about it (https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSLogicalV...).

One thing I'm particularly interested in is whether this new patch offers a way to decrease fragmentation on existing, loaded pools (allocation behavior changes if pools are too full, and this patch will for the first time allow us to avoid building a completely new pool).

[edit] The PR is here: https://github.com/openzfs/zfs/pull/12225

I also recommend reading the discussions in the ZFS repository - they are quite interesting and reveal a lot of the reasoning behind the filesystem. Recommended even to people who don't write filesystems for a living.
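For the curious, the expansion in that PR is driven through `zpool attach` pointed at a raidz vdev rather than at a disk. A rough sketch of the workflow, with pool and device names (`tank`, `sde`) as placeholders and the syntax as described in the PR, so subject to change before merge:

```shell
# Expanding a 4-wide raidz1 vdev to 5 disks (names are hypothetical).
zpool status tank               # note the vdev name, e.g. "raidz1-0"
zpool attach tank raidz1-0 sde  # attach a fifth disk to the raidz vdev
zpool status tank               # reports reflow/expansion progress
```

Existing blocks keep their old data-to-parity ratio after the reflow, which is exactly the stripe-width caveat described above.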


> The article is a great example of all the somewhat surprising peculiarities in ZFS. For example, the conversion will keep the stripe width and block size, meaning your throughput of existing data won't improve. So it's not quite a full re-balance.

This is generally in line with other ZFS operations. For example, changing compression settings will not rewrite existing data; only new data is affected.

It simplifies some code paths and keeps performance predictable no matter what: you don't get a surprising reduction in performance.
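As a concrete illustration (the dataset name `tank/data` is a placeholder): the property change only affects blocks written afterwards, so old data has to be rewritten to pick it up:

```shell
# Only blocks written after the property change are compressed.
zfs set compression=zstd tank/data
# Existing blocks stay as-is until they are rewritten, e.g.:
cp bigfile bigfile.tmp && mv bigfile.tmp bigfile
# (rewriting breaks block sharing with snapshots, so snapshot space usage grows)
```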


I'm immensely pleased with the approach they've chosen, because it offers the possibility of improving your stripe size without requiring a commitment to it. New blocks are written with the new stripe width, so all you have to do to improve your stripe width is rewrite all of your files (and deal with the snapshot churn that will cause). It also offers the tempting possibility of the reverse, where you can write new files with narrower stripes and then rewrite all your data to reduce capacity.

I used to work at Isilon, and the design of this feature looks very familiar to me: that's a good thing. I'm excited about the possibilities for ZFS going forward. I hope they'll come up with a restriper tool that will allow you to rewrite all of your files with the new block layout without breaking snapshots, etc., but either way, this is a huge advancement for my own personal use of ZFS.


There is also a linear degradation of metadata/block-info performance (on the order of the number of rewrites) due to the need to check multiple possible locations for the block pointer. In practice this probably just means more RAM utilization, but... eh.

Even as a home user, I've become convinced the juice is not worth the squeeze. For serious use, just buy four drives at a time and upgrade four drives at a time. That gives you 4-drive RAIDZ1 in one VDEV and if you want you can expand to a second VDEV or upgrade the capacity of the drives in your existing VDEV.

If you don't want to do four drives at once, you can still use ZFS with 1 drive = 1 pool (or 2 mirrored drives = 1 pool for redundancy) and just manually distribute files over the volumes. No, it won't automatically do it like Unraid but perhaps you can layer unraid over the top somehow if you really insist (otherwise there's also solutions like git-annex). It still gets you far better data integrity guarantees and far better reliability than any available alternatives.
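A minimal sketch of the layouts described above, with pool and device names as placeholders:

```shell
# 4-drive RAIDZ1 in one vdev:
zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# ...or the 1-drive-per-pool / mirrored-pair approach:
zpool create media1 /dev/sdf                  # single drive, no redundancy
zpool create media2 mirror /dev/sdg /dev/sdh  # 2-way mirror
```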

Maybe it is just Stockholm syndrome, but do you really, really need to dynamically expand your array by an arbitrary number of disks that badly? There is ultimately no way around the reliability question, so you can't expand forever, and if you're serious about setting up a nice fileserver that'll be reliable for the long run, putting 4 drives in it isn't that bad. And ZFS is no worse than the alternatives in 1-drive or 2-drive scenarios in terms of reliability/expansion. Meanwhile ZFS is far far better than the alternatives (btrfs) in terms of data integrity and reliability.

There is also the possibility of manually doing some weird shit like putting 2 partitions on each disk, so you could move one member of the VDEV over to another disk when you expand. If you start with 2 disks with 2 partitions each, you could grow to 4 disks without issue, which would lower the increment even further.


> Meanwhile ZFS is far far better than the alternatives (btrfs) in terms of data integrity and reliability.

No, it's not. Btrfs is nowadays perfectly reliable as long as you avoid the in-development features, which are all explicitly marked as such and will warn you. That's why Facebook uses it in production. It also has some nice advantages for home use: Btrfs pools are a lot simpler to manage and have fewer quirks than ZFS.

> Maybe it is just stockholm syndrome but do you really, really need to dynamically expand your array by an arbitrary number of disks that badly? There is ultimately no way around the reliability question so you can't expand forever and if you're serious about setting up a nice fileserver that'll be reliable for the long run, putting 4 drives in it isn't that bad.

Btrfs will do that just fine for example. That's really a limitation of ZFS.

I wish ZFS users would just stop FUDing about btrfs. Most of them haven't touched it for a decade and keep parroting the same old things.


I have used SLES for a long time, and there is a strong recommendation from SUSE itself on how you should use btrfs with SLES:

-Just use it for the OS, XFS for data

-Just use it in mirror configuration

-Don't touch anything else

-Making and deleting snapshots is stable

Following those points, I too have never had a problem.


Novell controls the XFS development and Redhat has bought a competing technology. Obviously they are not going to push Btrfs. Its use amongst large companies speak for itself however.

I see that the ZFS fanboys are out in force, however. I mean, people should be free to use a second-class citizen with a terrible license and pool management made for masochists if they want.


Why is Novell's statement that BTRFS is only narrowly useful trivially ignored, while large companies (e.g. Facebook) using it is meant to be interpreted as positive proof, handily ignoring that Facebook is using it for ephemeral data?


>ignoring that Facebook is using it for ephemeral data

What, really?? That's hilarious!! Probably just because of zstd compression, I would think.


>Novell controls the XFS development and Redhat has bought a competing technology.

What? Are you trolling? Redhat is the biggest contributor to XFS

> Developer(s): Silicon Graphics, Red Hat

https://en.wikipedia.org/wiki/XFS

>Its use amongst large companies speak for itself however.

Tell me please? Facebook and.....?

>second class citizen zoth a terrible license

RHEL does not support btrfs anymore (since 2017). I wonder why that is... oh right, RHEL is not an experimental distribution.

I would say btrfs is not first class at all (ext4 and XFS are); I'll just wait until Linus and his Fedora produce some entertaining emails.


I'm starting to get concerned about the ZFS issue list, there are a ton of gotchas hiding in using OpenZFS that will cause data loss:

* Swap on ZVOL (data loss)

* Hardlocking when removing the ZIL (this has caused data loss for us)


> Hardlocking when removing ZIL

You cannot remove the ZIL, it's an integral part of ZFS. I assume you mean a SLOG[1].

That said, do you have an issue link? I'm curious. I know people have had some issues with importing pools with a bad or missing SLOG device, though most couldn't be reproduced.

[1]: https://www.ixsystems.com/blog/o-slog-not-slog-best-configur...
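For reference, a SLOG is an optional separate log vdev and, unlike the in-pool ZIL, can be added and removed at runtime. A sketch with hypothetical pool and device names:

```shell
zpool add tank log /dev/nvme0n1  # add a dedicated SLOG device
zpool remove tank /dev/nvme0n1   # remove it (pool falls back to the in-pool ZIL)
zpool status tank                # confirm the log vdev is gone
```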


Yes, my mistake, it was the SLOG. This is easily reproducible with TrueNAS: add a cache, manually remove the cache from the CLI, hardlock. There are multiple reports of this in the OpenZFS issue tracker.


It took me a while to get the ZIL vs. SLOG terminology to stick.

I tried searching the bug tracker, but no open issues came up. I do run TrueNAS though, so that should be easy enough to test. Cheers!


Can you please link me to the issue with swap on zvol causing data loss.



I prefer to just have mirrors, but it's cool that this is slowly coming; some people seem to really want this feature.

ZFS has been amazing to me, I have zero complaints.

I just wish it hadn't taken so long to come to root (/) on Linux. Even today you have to do a lot of work unless you want to use the new support in Ubuntu.

This license snafu is so terrible: open-source licenses excluding each other. Crazy. The world would have been a better place if Linux had incorporated ZFS long ago. (And no, we don't need yet another legal discussion; my point is just that it's sad.)


ZFS on root is super easy in NixOS. (And I agree with your licensing point.)


With native encryption and ZFSBootMenu? Because that's where I want to get to.

I still haven't played around with Nix, but I know I should.


I was disappointed by the lack of RAIDZ2 resize when I built my ZFS fileserver, but it turns out that my data growth is slower than the growth in the size of HDDs, so I just replace the drives every 4 or 5 years and copy the data over. Now drives are so big that I might just go with mirroring instead of RAIDZ2.


I quite like XFS + LVM. LVM now has a high-level wrapper for kernel RAID and dm-integrity.

For precious data I can't bear to go less than RAID 6 (equivalent), and I require ECC RAM. I've had several events where, after one drive failed, I discovered minor errors on a second drive...

Currently kernel RAID can't use RAID 6 to decide a majority win if a bit error is discovered. dm-integrity seems to cost a fair bit of performance (relative to ZFS). So I like either plain LVM + RAID 6, or adding the integrity option if I want to defend against bit rot.
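A sketch of that setup, assuming a recent lvm2 that supports `--raidintegrity` on RAID logical volumes; the VG/LV names (`vg0`, `data`) are placeholders:

```shell
# Plain raid6 logical volume, 4 data stripes over the VG's PVs:
lvcreate --type raid6 -i 4 -L 1T -n data vg0
# Same, with dm-integrity layered in to detect bit rot:
lvcreate --type raid6 --raidintegrity y -i 4 -L 1T -n data vg0
mkfs.xfs /dev/vg0/data
```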

It's simple to operate, loads of experts are available if it breaks, it's well tested, and it's easy to expand or even drastically reshape.

It lacks "send", and snapshot performance can be lower (try thin pools). You can easily add SSD caching, but the performance improvement is possibly not as high as with the alternatives.

works well enough for me...


This will be very useful!

TIL FreeNAS is now TrueNAS.


Actually there's more!

A new version of TrueNAS is in the works. It's called TrueNAS SCALE, and it's going to be Linux-based (no more FreeBSD).

I'm frankly happy, because TrueNAS is great as a NAS operating system, but I really wanted to run containers where my storage is, and having to run a VM adds really unnecessary overhead (plus, it's another machine to manage).


TrueNAS supports creation and management of FreeBSD jails through its WebUI.


Ah, so Linux is approaching feature parity with FreeBSD jails?


I'm not sure that talking about feature parity even makes sense.

What do FreeBSD jails do that Linux can't? I know there's no single jail system call that you can call so you need to assemble the container "manually", but that's an implementation detail as far as I am concerned.

On the other hand, true network namespaces (VNET on FreeBSD) have been usable on Linux for longer than on FreeBSD.

It's a bit disingenuous to say that Linux is the one "approaching" feature parity when they're both missing features that the other has.


Ah, that bait got you.

It's a smug comment that claims something but offers no example at all, not even an anecdotal one.

If FreeBSD jails were so much better than Linux namespaces/containers etc we would be all using FreeBSD by now. The license is even more permissive. Containers have been out for ~ten years now and yet... Here we are.


I suspected it was bait, but FreeBSD jails being better than whatever Linux has is such a common meme that I'm asking anyway just in case someone actually knows enough about them to answer.

I don't know enough to say for sure, but my current impression is that Linux namespaces are actually more powerful than jails, but harder to use because the kernel doesn't provide a "simple" interface to create an isolated container, and userspace software implementing containers get to use the individual bits as they see fit.


And if ZFS was so much better than Linux ext3 we would be all using Solaris, right? ;-)

The market doesn’t work like that.


Just upgraded my home NAS; swapping all 8 drives took 7 days... And it doubled the size of the array, when I would have been much happier with an incremental increase.


I did that once, and the experience was a big part of why I use unraid now.


With RAID10, one could swap out 2 drives to get a size increase.

With two 4-disk vdevs, one could swap out 4 drives for a size increase.

So I'm assuming you have a single 8 disk vdev, and no spare places to put disks.


I have two boxes that are mainly used as NASes, and I just wait to upgrade until I can fit the contents of both onto one new array. I bring one box down, set it up with the new array, and copy the contents of the one still up onto it; then I swap the now-copied array with the array I removed. It's a juggle, basically, but since I use mdadm everything sets itself up pretty painlessly.


I look forward to a day, if it will ever happen, that I can have a ZFS RAIDZ2 with arbitrary number of disks of arbitrary sizes.


This might sound like a troll comment, but it's coming from someone with almost zero experience with RAID. What is the purpose of ZFS in 2021 if we have hardware RAID and Linux software RAID? BTRFS does RAID too. Why would people choose ZFS in 2021 if both Oracle and open-source users have two competing ZFSes? Are they interoperable?


ZFS RAID is the best RAID implementation in many respects. Hardware RAID is bad at actually fixing errors on disk (as opposed to just transparently correcting) and surfacing errors to the user.

BTRFS is frequently not considered stable enough for production usage.

ZFS has dozens of useful features besides RAID. Transparent compression, instant atomic snapshots, incremental snapshot sync, instant cloning of file systems, etc etc.

Yes, different ZFS implementations are mostly compatible in my experience, and they should become totally compatible as everyone moves to OpenZFS. FreeBSD 13 and Linux currently have ZFS feature parity I believe.
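A quick sketch of a few of those features (the dataset and host names are hypothetical):

```shell
zfs set compression=lz4 tank/home               # transparent compression
zfs snapshot tank/home@pre-upgrade              # instant atomic snapshot
zfs clone tank/home@pre-upgrade tank/home-test  # instant writable clone
# incremental snapshot sync to another machine:
zfs send -i tank/home@monday tank/home@tuesday | ssh backup zfs recv pool/home
```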


>BTRFS is frequently not considered stable enough for production usage.

I'm not sure this is still true, especially after Facebook deployed it on millions of servers.

https://btrfs.wiki.kernel.org/index.php/Production_Users



Do they really care if any one of those servers loses its data though? At that scale, you need higher level redundancy so it makes sense to prioritize filesystems for other features than how well they actually retain your data.

For example, detecting corruption would be more important than never having any, so that you can get IO errors and immediately kill the server before it serves bad data.


Specifically BTRFS's RAID5/RAID6 implementation is unstable and susceptible to data loss. Facebook doesn't use those RAID levels in production.

https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-Wa...

https://btrfs.wiki.kernel.org/index.php/Gotchas#Parity_RAID


So what you’re saying is, btrfs works if you can afford an ops/sre team on par with Facebook’s?


> BTRFS is frequently not considered stable enough for production usage.

Synology NASes use btrfs by default. What is not stable?


Btrfs in single-drive (and mirror/RAID1) modes is fine. The instability is in the RAID5/6 implementation[1]. Synology runs Btrfs (in single-drive mode) on top of their own mdadm-based RAID setup.

[1]: https://btrfs.wiki.kernel.org/index.php/RAID56


> What is the purpose of ZFS in 2021 if we have hardware RAID and linux software RAID?

Others have touched on the main points, I just wanted to stress that an important distinction between ZFS and hardware RAID and linux software RAID (by which I assume you mean MD) is that the latter two present themselves as block devices. One has to put a file system on top to make use of them.

In contrast, ZFS does away with this traditional split, and provides a filesystem as well as support for a virtual block device. By unifying the full stack from the filesystem down to the actual devices, it can be smarter and more resilient.

The first few minutes of this[1] presentation does a good job of explaining why ZFS was built this way and how it improves on the traditional RAID solutions.

[1]: https://www.youtube.com/watch?v=MsY-BafQgj4


I don't want to be trolling either, but a simple Google search gives you really detailed answers. Or just look at Wikipedia: https://en.wikipedia.org/wiki/ZFS

Some highlights: hierarchical checksumming, CoW snapshots, deduplication, more efficient rebuilds, extremely configurable, tiered storage, various caching strategies, etc.


I can't speak with much experience, but here is what I have gleaned:

- You generally want to avoid hardware RAID; if the card dies you'll likely need to source a compatible replacement, versus grabbing another SATA/SAS expander and reconstructing the array.

- ZFS handles the stack all the way from the drives to the filesystem, allowing the layers to work together (i.e. filesystem usage info can better dictate what gets moved around tiered storage, or enable better RAID recovery).


My understanding is that hardware RAID is mainly a thing in the Windows world, because apparently its software RAID implementation is garbage


I'm not a big Windows Server guy, but as I understand it, in Windows you would use Storage Spaces instead of traditional RAID.


Storage Spaces is their branded RAID solution.

Apparently in the enterprise server versions it's fairly decent, but the desktop versions are pretty trash re: performance.


The last time I had to use HW RAID it was horrible. The software for managing the RAID array was a poorly documented, difficult-to-use proprietary blob. I used it for years and the experience never improved. And this is a thing where, if you make a mistake, you can destroy the very data that you've gone to such lengths to protect. Having switched to ZFS several years ago, I lack the words to express how much I don't miss having to deal with that.


> What is the purpose of ZFS in 2021 if we have hardware RAID

Hardware RAID is actually older than ZFS-style software RAID. ZFS was specifically designed to fix the issues with hardware RAID.

The problem with hardware RAID is that it has no idea what is going on on top of it, and even worse, it's mostly a bunch of closed-source firmware from a vendor. And they cost money.

You can find lots of terrible stories about those.

ZFS is open-source and battle tested.

> linux software RAID

Not sure what you are referring to.

> BTRFS does RAID too.

BTRFS basically copied many of the features done in ZFS, but BTRFS has a history of being far less stable, and ZFS is far more battle tested. They say it's stable now, but they have said that many times. It ate my data twice, so I have not followed the project anymore. A filesystem, in my opinion, gets exactly one chance with me.

They each have some features the other doesn't but broadly speaking they are similar technology.

The new bcachefs is also coming up and adding some interesting features.

> Why would people choose ZFS in 2021 if both Oracle and Open Source users have 2 competing ZFS?

Not sure what that has to do with anything. Oracle is an evil company; they tried to take all these great open-source technologies away from people, and the community fought against it. Most of the ZFS team left after the merger.

The open-source version is arguably better, and has far more of the original designers working on it. The two code bases have diverged a lot since then.

At the end of the day, ZFS is incredibly battle tested and works incredibly well at what it does. It has had an incredible reputation for stability basically since it came out. The question, in my opinion, is not "why ZFS" but "why not ZFS".


> ZFS is far more battle tested. They say it's stable now, but they have said that many times. It ate my data twice

Did you mean "it ate my data" to apply to ZFS? Or did you mean BTRFS?


It was probably BTRFS.

I never fell for the BTRFS meme but many friends of mine did, and many of them ended up with a corrupted filesystem (and lost data).


He was referring to mdadm RAID.


It's pretty easy. Having lots of experience with both HW RAID and SW RAID, software is the way to go, because:

1. Do you trust firmware? I don't. I can tell you stories about freaking-out SANs... I never had that with Solaris or FreeBSD and ZFS.

2. Why have an additional abstraction layer? HW RAID caching vs. FS caching, no transparency for error correction, no smart RAID rebuild, etc.

The list can go on and on, but HW RAID is a thing of the past (exceptions are specialized SANs, etc.).


> What is the purpose of ZFS in 2021 if we have hardware RAID

Hardware RAID controllers predate ZFS by a long time. ZFS is a much more modern design, and because it integrates the whole storage layer it can offer all the features it does, which a RAID controller hiding behind a disk interface cannot.

When ZFS came out many people (me included) considered that the end of relevance for hardware RAID controllers. I used to use hardware RAID pre-ZFS but have never again after switching to ZFS when Solaris 10 first included it.


They are not interoperable, but they're barely competing, as Solaris is dead. Does Oracle Linux even offer Oracle ZFS? I assume they stick to btrfs, considering they are the original developers.

RAID does not feature the data protection offered by a copy on write filesystem, and OpenZFS is the most stable and portable option.


The counter-question is: what is the point of Btrfs and Linux software RAID when ZFS is better in many ways?

Btrfs is not stable in all configurations. mdadm and the others don't do checksumming, scrubs, health checks, etc. as well as ZFS (or at all). That's not even touching on built-in encryption, compression, snapshots, boot environments, and more.
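For example, a scrub walks every allocated block, verifies checksums end to end, and repairs from redundancy where possible, whereas mdadm's consistency check can only compare copies without knowing which one is right. Pool name hypothetical:

```shell
zpool scrub tank      # read and verify every block, repair from parity/mirrors
zpool status -v tank  # per-device read/write/checksum error counters
```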


> hardware RAID

That's just the worst of all worlds: usually proprietary, and you get hardware vendors' extreme aversion to improvement (or any change, really).

This ZFS change is coming, and it may end up being complex for users to apply... but it's happening. At the risk of being hyperbolic: something like this would never be possible with a HW RAID system unless it had explicitly been designed for it from the start.

Also: ZFS does much more than any hardware RAID ever did.


With hardware RAID, if you have data corruption that doesn't bring down a whole drive, you have no idea which data copy is the correct one (at least not in an automated way that doesn't require manual work).

With ZFS, checksumming means it knows which data copy is the correct one.

As drives get bigger, the odds of bit decay, while low on a per-bit level, become great enough for the drive as a whole to be a concern.

File systems have had checksumming, but if there is no visibility into the RAID layer, the filesystem can't use that checksum to recover from bit-decay errors, since it can't control which drive serves a request.

This is why ZFS operates at both layers.

ZFS vs. btrfs is a harder call, and I'd guess it's the ZFS layered read and write caching that entices people.


The TL;DR is that, IIUC, RAID depends on disks being 100% perfect with exactly zero read errors, ever, until they go *clonk* and die all at once.

Today's many-TB hard drives use mind-boggling engineering to shrink data fluctuations down almost to the size of individual atoms (not quite there yet, but making progress (!)), while higher I/O speeds push the onboard processors' ECC mechanisms almost beyond breaking point.

This basically means that the likelihood of HDDs returning uncaught bit errors has gone from a maybe-once-in-a-lifetime event (with very large magnetic flux sizes on disk in the 80s) to something that individual power users using disks with great-looking SMART output should generally expect to see every few months or years.

My rule of thumb is that any storage device over maybe 250GB-500GB in size needs to be redundant. If you can shove that much data into one place, the odds that something significant is hiding somewhere in that data approach certainty once you go past that size range, IMO.


No matter what happens, people will seemingly forever declare BTRFS is not as stable and not as safe. There's a status page that details what BTRFS thinks of itself[1], and I doubt many of the people knocking BTRFS have read, know, or care what that page says. There is one issue still being worked out to completion, a "write hole" problem involving two separate failures (an unplanned/power-loss shutdown followed by a second disk failure) which can result in some data being lost[2] in RAID5/6 scenarios.

Other than that one extreme double-failure scenario being worked out, BTRFS has proven remarkably stable for a while now. A decade ago it wasn't quite as bulletproof, but today the situation is much different. Personally, it feels to me like there is a persistent & vocal small group of people who either have some agenda that makes them not wish to consider BTRFS, or are unwilling to review & reconsider how things might have changed in the last decade. Not to belabor the point, but it's quite frustrating, and it feels a bit odd that BTRFS is such a persistent target of slander & assault. Few other file systems face anywhere near as much criticism, and never so casually; honestly, in the end, it just seems like there's some contingent of ZFS folks with some strange need to make themselves feel better by putting others down.

One big sign of trust: Fedora 35 Cloud looks likely to switch to BTRFS as default[3], following the Fedora 33 desktop making the move last year. A number of big names use BTRFS, including Facebook. I have yet to see any hyperscalers interested in ZFS.

I'm excited to see ZFS start to get some competent expandability. Expanding ZFS used to be a nightmare. I'll continue running BTRFS for now, but I'm excited to see file systems flourish. Things I wouldn't do? Hardware RAID. Controllers are persnickety weird devices, each with their own invisible sets of constraints & specific firmware issues. If at all possible, I'd prefer the kernel figure out how to make effective use out of multiple disks. BTRFS, and now it seems ZFS perhaps too, do a magical job of making that easy, effective, & fast, in a safe way.

Edit: the current widely-adopted write-hole mitigation is to use RAID1, RAID1c3, or RAID1c4 (3-copy/4-copy RAID1) for metadata and RAID5/6 for data.

[1] https://btrfs.wiki.kernel.org/index.php/Status

[2] https://btrfs.wiki.kernel.org/index.php/RAID56

[3] https://www.phoronix.com/scan.php?page=news_item&px=Fedora-C...
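The mitigation from the edit above looks roughly like this, assuming btrfs-progs and a kernel at 5.5 or newer for the raid1c3/raid1c4 profiles; device names and mount point are placeholders:

```shell
# New filesystem: 3-copy metadata, parity-RAID data:
mkfs.btrfs -m raid1c3 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# Or convert the metadata profile of an existing filesystem in place:
btrfs balance start -mconvert=raid1c3 /mnt
```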


> it feels to me like there is a persistent & vocal small group of people who seemingly either have some agenda that makes them not wish to consider BTRFS

Look, here is the problem. The BTRFS people declared multiple times that it is stable. For a few years it was repeatedly "Btrfs is stable now, if you use it in such and such a way."

But then it destroyed a drive.

Then, a few years later: now it was actually stable.

But then it destroyed data again.

I know what they are saying about themselves, but unfortunately the project has simply lost credibility with a lot of people. A filesystem should be developed to be stable first and then slowly add features, never being unstable. It shouldn't still regularly destroy people's data after years and years of development.

Meanwhile, ZFS has been basically stable since the early days; many developers at Sun switched their root drives to ZFS before it was even officially released. It never had years and years of routine instability.

> Few other file systems seem to face anywhere near as much criticism,

Few other filesystems claimed to be stable for years while routinely losing data. A filesystem has one job first and foremost: don't destroy the user's data.

This is not a conspiracy; this is simply the reality that tons of people have lost data because BTRFS repeatedly claimed stability when it wasn't stable.

It's now another couple of years later, and likely this wouldn't happen again. But the fact that after all these years they still haven't managed to get RAID 5/6 fully working doesn't exactly scream confidence. So I, and I assume many others, have simply lost confidence in the project and the approach it takes to development.


"If a filesystem loses your data, you never use it again."

-- Kirk McKusick, creator of UFS2


Netflix has been using ZFS in production for many years now. Unnamed research companies are using ZFS to move PBs of data. NetApp is FreeBSD-based and was at the forefront of what we now call ZFS. I'm totally biased: I've designed many production-critical systems with ZFS at their core in one way or another. The power of ZFS's send and receive functions is tremendous, to say the least; it beats any file-based synchronization method.


>Netflix has been using ZFS in production

User of FreeBSD myself, but that is BS: Netflix uses FreeBSD and UFS2 on the Open Connect devices.

https://openconnect.netflix.com/en/

If you watch YouTube talks about Netflix, they state clearly that they use UFS.

I "think" it's that one

https://www.youtube.com/watch?v=veQwkG0WdN8

Microservice-backend is linux/aws


We use UFS for content serving, and a mix of UFS and ZFS for boot/log/config filesystems.

We use UFS for content because we rely on zero-copy async sendfile for our high-performance video serving data path. With ZFS, sendfile is not async, and because ZFS uses the ARC rather than the page cache, serving requires a copy from the ARC to the network, so it's not yet ready for our high-performance workload.

We don't use RAID at all.


>ZFS for boot/log/config filesystems.

Thanks for the clarification. So can I assume that bectl/beadm is in full swing?


Send/recv and snapshots are by far my favorite features. I keep snapshots of my root drive on my NAS, so if the main M.2 drive in my desktop dies, I can just boot a live CD, send/recv to a new drive, reboot, and I'm back where I left off.

I also keep a snapshot of a clean Debian install on my NAS, so I can just send it to new machines rather than run through the whole setup. Works great.
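The flow described above can be sketched like this (pool, dataset, and host names are hypothetical):

```shell
# Back up the root pool to the NAS:
zfs snapshot -r rpool@nightly
zfs send -R rpool@nightly | ssh nas zfs recv -F backup/desktop
# After a drive failure, from a live CD, restore onto the new drive:
ssh nas zfs send -R backup/desktop@nightly | zfs recv -F rpool
```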


Snapshots & send/receive are my favorite BTRFS features too! I too love snapshotting Debian installs & sending them to new machines for atomic updates, or to build live USB images. I've been using Poettering's "Revisiting How We Put Together Linux Systems" naming scheme for my subvolumes, and that's worked fairly well[1]. I even did some fiddling around with enhancing Debian-live to load btrfs into RAM a while back, so I could atomically update, unmount the boot drive, &c[2].

Oh you are talking about ZFS, aren't you. ;)

Playing the "who came up with it first" game doesn't particularly interest me. Again, it feels like a spirit of competition when to me, that seems like a bad spirit: we should be cooperative & boosting each other. We're both open source, we're both trying to make civilization possible & to share greatnesses.

[1] http://0pointer.net/blog/revisiting-how-we-put-together-linu...

[2] https://github.com/rektide/debian-live-boot


>we should be cooperative & boosting each other.

Is that why you sound like a sour apple?


I'm sour because it feels like ZFS folks in particular have it in for btrfs, and it's very tiring. There are so many OSes which default to btrfs that there'd be some real modern evidence if there were problems, but we are constantly dogged by massive negativity.


>OSes which default to btrfs

No, just two GNU/Linux distributions:

openSUSE/SLES and Fedora

And SUSE (SLES) strongly recommends using btrfs only for the OS (for data it recommends XFS, just like RHEL), and only in a mirror configuration.

Look, I've worked with SLES since ~forever, and if you operate it in mirror mode (just the OS), don't touch it, and just make snapshots, I've never had problems. But I HAVE had complete data loss with btrfs many times when I did, for example, a defrag and re-compress; those are native btrfs tools, and that is not acceptable to me. The filesystem is THE place in an OS where errors like that are not acceptable (to me).


One guess I can make for the "hate" BTRFS gets: everyone loves their data and doesn't expect to "fight" with a file system to get access to it.

E.g. Sailfish OS is perhaps the only mobile OS I know that uses / used BTRFS in production (and they adopted it nearly 6-7 years ago!). And some of its users have had issues with BTRFS in the earlier versions - https://together.jolla.com/questions/scope:all/sort:activity... ... in fact, I too remember that once or twice, we had to manually run the btrfs balancer before doing an OS update. For Sailfish OS on Tablet Jolla even experimented with LVM and ext4, and perhaps even considered dropping BTRFS. (I don't know what it uses for newer versions of Sailfish OS now - I think it allows the user to choose between BTRFS or LVM / EXT4).

Most users consider a file system (be it ZFS or BTRFS) to be really low-level system software with which they only wish to interact transparently (even I got anxious the first time I had to run the btrfs balancer on Sailfish OS, worrying what would happen if there was not enough free space for the operation and hoping I wouldn't lose my data). Even on older systems, everybody got frustrated over the need to run a defragmenter.

Perhaps because of improper expectations or configurations, some of the early adopters of BTRFS got burnt with it after possibly even losing their precious data. It's hard to forget that kind of experience, and thus perhaps the "continuing hate" you see for BTRFS - a PR issue that BTRFS' proponents need to fix.

(It's interesting to see the progress BTRFS has made. Thanks to your post, I may consider it for future Linux installations over EXT4. Except for the hands-on tinkering it required once or twice, I remember it as being rock-solid on my Sailfish mobile.)


Suse uses btrfs in production for the root filesystem, and they have done so for years.

https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha...


And I get why (snapshots are wonderful), but openSUSE, albeit Tumbleweed, is also the only OS where I've had it lose its root filesystem and force me to reinstall. Some of us distrust btrfs for a reason.

(Details: corrupted filesystem, happened twice, ~2019 IIRC, on a single-disk system, so not even touching the RAID code. The first time it couldn't be repaired; the second time I didn't try. It hasn't happened again, and it wasn't just a checksum error, so I doubt the hardware is at fault, but I could be wrong.)


I've lost data in BTRFS setups each of the three times I've given it a try over the course of 6 or so years. The root drive just became unrecoverable. These were all single-disk setups.

Meanwhile I've been running ZFS for close to the same time and have never lost anything.

I get that it's an anecdotal viewpoint, but that's a very hard reputation for BTRFS to rebuild.


Counter-anecdote: I've been using BTRFS for the last 4 years on multiple drives and multiple systems. No data loss has occurred.


If it were so bad that nobody had positive anecdotes, it would have to be nightmarish. It's like car model reliability: a bad model fresh off the line is still likely to give years of trouble-free usage, just fewer on average. The difference between reliable and unreliable is the difference between two small failure probabilities compounded over many iterations.
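To put rough numbers on "two small probabilities over many iterations" (the 2%/year and 10%/year failure rates below are invented purely for illustration):

```shell
# Chance of at least one failure after compounding an annual failure
# rate over 5 years: 1 - (1 - p)^years
awk 'BEGIN {
  years = 5
  printf "reliable   (2%%/yr): %.1f%% chance of >=1 failure\n", (1 - 0.98^years) * 100
  printf "unreliable (10%%/yr): %.1f%% chance of >=1 failure\n", (1 - 0.90^years) * 100
}'
# prints 9.6% for the reliable model and 41.0% for the unreliable one:
# most owners of either model see zero failures, matching the anecdotes.
```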


I really just don't care about anecdotal data points like this, especially when these attempts were probably made 8+ years ago under who knows what conditions. How long ago did you give up on btrfs? How long has it been since your 6 years of trying it?


I think you may have unearthed the problem with btrfs: I know little about file systems, and I know it might eat my data.

If btrfs failed for you, you'll remember and likely won't try again. I had that experience with a now-defunct hard drive maker: after the second replacement failed, I vowed never again.

They probably should rename it.


> BTRFS has proven remarkably stable for a while now. A decade ago that wasn't quite as absolutely bulletproof, but today the situation is much different.

When I did a fresh install of Fedora 33 on my primary workstation 3 months ago, I had exactly the same rationale for sticking with the default BTRFS selection. "I'm sure it has come quite a long way since I last tried it, I would like to have some of those features, I know there are some large production installations now, and the fact that it's the default in Fedora is a sign of confidence from the community."

After 2 months of use, I ended up with a corrupt filesystem in the middle of my work day and could not find any way to recover from it other than to do a full reinstall and restore my files from a backup. This was on a single NVMe drive in a system running a few small VM's, a browser, a chat client, and a few terminals. Thankfully I only lost a day's worth of work, but that's the last time I install BTRFS on any of my personal systems.

> [...] it feels a bit odd that BTRFS is such a persistent target of slander & assault.

My experience is obviously anecdotal (as are all individual experiences), so I won't be surprised if you dismiss my comment like you did nullwarp's comment. But "slander & assault" just seems like a weird way to dismiss all critics of BTRFS at once, as if everyone is out to get BTRFS. Filesystems have a thankless job. Do it right, and most users will never even think about it. But lose a user's data once, and you've likely lost that user forever.

> A number of big names use BTRFS, including Facebook. I have yet to see any hyperscalers interested in ZFS.

ZFS is excellent for large arrays of spinning disks, but if you're using a bunch of fast SSDs, performance really suffers. There is a lot of lock contention contributing to that which isn't noticeable on slower devices. I can't speak to FB's environment, but if they're managing large numbers of SSDs like most hyperscalers, then ZFS would probably get ruled out based on performance comparisons.


I think there have been at least 2-3 cycles where BTRFS declared itself good and stable, only for people to experience yet another data loss.

So once they are burnt, they won't come back.


My resizing consists of buying 8 more hard drives that are 2x the previous 8 and moving data over every few years (:


FYI you don't have to move data. You can just replace each disk one at a time, and after the last replacement you magically have a bigger pool.
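A sketch of that procedure, assuming a pool named `tank` and illustrative device names:

```shell
# Let the pool grow automatically once all members are bigger:
zpool set autoexpand=on tank
# Swap one disk at a time, waiting for each resilver to complete:
zpool replace tank /dev/sdb /dev/sdd
zpool status tank      # watch resilver progress before the next swap
# Repeat for every disk; after the last resilver the extra capacity
# becomes available (with autoexpand=on set as above).
```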


I'll believe it when I see it. Why anyone uses BTRFS (or UnRaid, or any other form of software RAID that isn't ZFS) is still beyond me. At least when we're not talking SSDs ;)

ZFS is incredible, curious to mess around with these new features!


BTRFS was useful for me. When those (RAID5) parity patches got rejected many, many years ago for non-technical reasons like not matching a business case/goal or similar, it changed my view of open source.

That was the day I realized that some open source participants and supporters are interested in having open source projects that are good enough to act as a barrier to entry, but not good enough to compete with their commercial offerings.

Judge the world from that perspective for a while and it can help to explain why so much open source feels 80% done and never gets the last 20% of the polish needed to make it great.


> (RAID5) parity patches got rejected many, many years ago

Ooooh. (Booo!)

I wouldn't mind a citation/mailinglist reference for this, if you have one. (I honestly have no idea what I'd Google.)


http://web.archive.org/web/20150301234243/blog.ronnyegner-co...

It's in the quotes from the offline (not on a mailing list) follow up, so it's all hearsay.

I should be clear: I don't necessarily mean I think the developers are complicit in that. I think what happens is more subtle, i.e. companies sponsor the project just enough for it to be the biggest open source product in the space, but not enough for the developers to make it great.

Or as an alternative conspiracy theory, companies sponsor the projects of great developers that build awesome core features, but never give enough support for someone to turn that into a marketable product. That way they can usurp the work for their own products.

I know there's a lot of speculation there, but, if you watch for it, you can see how most entrenched tech companies are really, really taking advantage of open source developers. Basically the people who are passionate and want to build great things are getting hugely ripped off by people with yachts and rockets.


Thanks very much for replying, and for the link!

One of the biggest differences I've noticed between open source and "the cathedral" is that commercial endeavors that revolve around productization tend (as a rule) to manifest sufficient runway to fully round out the implementation of an idea to the point the implementation can participate in the market cohesively by representing itself attractively/competitively. This is often a broad-spectrum effort that requires domain specialization across a huge number of skills, and the burden of sustaining cohesive focus is typically only viable in a commercial context; I think similar levels of adequate collective focus (many individuals, one goal) are only typically raised in cult-type contexts.

Besides the passion-project foundation you mentioned, a lot of open source seems to come into existence because an $employer needed a really specific thing one time and they let the developer license the code under GPL and here it is and there's the 2.8 pages of documentation and it's got some speling misteaks in it and hopefully it works. (...Woops, I just described NPM, and some percentage of PyPI.)

Very very problematically, there's no collective language in the FOSS scene to distinguish between passion projects and commercially-driven JIT-developed code-dumps. After all, the code probably has just as many bugs per 1,000 lines, and the different contexts produce results that work the same, so...?

IMO, being able to encode that attribution to our communication would make SO MUCH difference in terms of user support, project coordination, etc! Coming from a perspective that's still optimistic :), arguing that "this is great, but it doesn't fit our business use case" translates for me to "my contract doesn't extend to me implementing/grokking/mentally integrating/testing/maintaining this new code, and it's not interesting enough for me to figure it out out of hours either" - so the invitation really is there, "send patches in if they're important enough to you", but it requires working cue perception (and possibly lack of cynicism) in all readers in order to be interpreted correctly. IF this is in fact the message that was being sent (!).

That the followup work was not done does indeed waste the effort made by the patch author, and is generally arguably stupid. But this to me brings up questions about the patch author's motivations, and why they didn't have a go at hammering everything into place - because, assuming fully adequate motivation/stamina and sufficient free time to iterate on the patch until the mailinglist likes it, eventually you'll reach a point where either the patch is in a staging tree somewhere, or the list has exploded into a flamewar about why the patch hasn't been accepted already, at which point (continuing to assume ideal circumstances) the patch author could go follow up on all the raised points.

I guess the outcome depends on whether the patch author considers the above AbSoLuTeLy ToO MuCh WoRk SeRiOuSlY ArE YoU KiDdInG Me, or welcomes the community involvement/participation/feedback and does their best to negotiate it to the point of getting the code merged. That the patch author didn't do this is something only they can provide extra context and judgement about; as you noted about speculation, I could only come up with uncited hypotheses here.

Looping back to the first paragraph, there are indeed many instances, for example in the audio/image/video editing scene, where software availability for Linux is incredibly restricted compared to Windows. The options are there, except they don't really work, or they fall over really quickly, or they feel really clunky. I think this sadly comes down to market demand. For example I've tried poking around with simple audio editing tasks - literally just loading a couple of tracks and crossfading them together - on Linux, and come up blank. Pop culture saturation might also be a factor, for example Renderman and Maya have been around for Linux for ages, but it's sliiightly (I suspect) easier to find "how to <jmp> over the license check" thingys for After Effects or Photoshop, for example.

However, with all of this being said, I have noticed a few industries where the comparison of feature parity in what's available in open source vs what's available commercially is sufficiently vast it directly leads to actual mental disorientation ("wait, this is where things are really at??"). It's incredibly difficult in this situation to stay non-cynical and not draw the types of conclusions you allude to in these settings (that perhaps there are agreements in place to not implement certain features, for example). And it's kind of interesting how "get Photoshop" is kind of a thing - an obscure thing, but still a thing that happens, while this seems (generally speaking) to happen less to Eyewateringly-Expensive Software™ targeted at Linux (?).

I guess this reply was me doing the (googles) denial-anger-bargaining-depression-acceptance thing while reasoning through your comment and trying to not be cynical :D, haha. It certainly is one of those worldview classification grey areas, where it's almost like it can be both things at once (except it can't, because that wouldn't make sense)...

I do also definitely agree that a lot of open-source development work is unfairly leveraged, and additionally that the "learning to code" movement (yay, more labor externalization!) is overhyped, almost to the point of the cult thing I noted above.


For my big media volume, which had existed for around 10 years, I use snapraid.

Because of several things:

* I can mix disk sizes

* I can add new disks over time as needed

* If something dies, up to the entire server, I can just stick any data disk in another system and read it

I didn't want to become a zfs expert (and the learning curve seems steep!), and I didn't want to spend thousands of dollars on new gear (dedicated NAS box and a bunch of matched-size disks).

I repurposed my old workstation into a server, spent a few hours getting it set up, and it works. I've had two disks fail (one data, one parity, and recovered from both). Every time I've added a new disk, it's been 50-100% larger than my existing disks.

I've also migrated the entire setup to a new system (newer old retired workstation), running proxmox, and was pleasantly surprised it only took about an hour to get that volume back up (incidentally, that server runs zfs as well.. I just don't use it for my large media storage volume).


UnRaid and Synology user here and I completely agree with all your points. The knowledge that at worst I will lose the data on just 1 disk (or 2 if I fail during a rebuild) is very calming. If not for UnRaid there is no way I could manage the size of the media volume I maintain (from a time, energy, and money perspective). I mean if you know ZFS well and trust yourself then more power to you but UnRaid and friends fill a real gap.


The funny thing here is that I went with ZFS lately because it saved me money. As a poor student trying to maximize GB/$, raid6 was the best balance point between capacity and redundancy.


The learning curve of ZFS compared to every alternative out there is significantly lower IMO. The interface is easier and the guides online are great.

There are drawbacks, like the one discussed here, but as a Linux user who doesn't want to mess with the FS and uses ZFS for the backup server, the experience has been great so far.


I just put two 8TB drives into btrfs because it's a home server; I can't provision everything up front. One day I may put in a third 8TB drive and turn this RAID1 into RAID5. btrfs lets me do that, zfs doesn't, simple as.
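For reference, the reshape being described looks roughly like this (the mount point is made up, and the RAID5/6 stability caveats for btrfs apply):

```shell
# Add the third drive to the existing two-disk filesystem:
btrfs device add /dev/sdc /srv/media
# Rebalance data into RAID5 while keeping metadata mirrored (RAID1):
btrfs balance start -dconvert=raid5 -mconvert=raid1 /srv/media
btrfs filesystem usage /srv/media   # confirm the new profiles
```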

One day I may switch the whole thing to bcachefs, which I've donated and am looking forwards to. For the moment, btrfs will have to do.

EDIT: downvoted by... the filesystem brigade?


I disagree with this statement on multiple fronts. On the first layer of the onion, RAID1 is for high reliability and BTRFS has historically low reliability. But peel back that layer, and you are presenting the ability to transition from BTRFS RAID1 to RAID5 as an appealing feature of BTRFS vs ZFS, and yet this just isn't so.

BTRFS has been promising usable RAID5 since 2009, when it was "heading for 1.0", and yet one of the most recent developments, not but 3 months ago, was to add the following warning to btrfs-progs on creation or conversion:

"RAID5/6 support has known problems is strongly discouraged to be used besides testing or evaluation,"

Worse, this feature was presented as usable around 2011/12 before being revealed in 2016 to eat data in ways unfixable without substantial rewrites, and 5 years later it remains so.

Your hardware might need to be replaced before you can avail yourself of the benefit you posit.

Meanwhile an approach that would actually work on both BTRFS and ZFS would be to add 2 drives to go from RAID1 to RAID10.

The last layer of the onion is complaining about downvotes, which invites more downvotes. If I had to guess, people downvoted you because you presented a feature that has been a massive pain point for BTRFS as a proposed advantage.


>RAID5

I wish you lots of fun with that on btrfs :)

Edit:

https://btrfs.wiki.kernel.org/index.php/Status

RAID56 Unstable n/a write hole still exists

> treated as if I'm storing business data or precious memories without backups, guess I'm just dumb

No, you're not, but don't use unstable features in a filesystem.


Well, that's the idea! This a low I/O media server where all the important stuff (<5G of photos) has 2+ redundancy, once remotely, and on every workstation I sync, with the rest of the data being able to crash and burn without much repercussion.

The whole point of me using RAID1 (and maybe later RAID5) is that if a disk goes bust, odds are I can still watch a movie from it until I can get another disk. What's more, if I ever fill the RAID1 and I don't feel like breaking the piggy bank for another disk, I can go JBOD as far as my usecase is concerned.

But hey, if the orange website tells me all servers are supposed to be treated as if I'm storing business data or precious memories without backups, guess I'm just dumb. On that note: donations welcome, each 8TB disk costs close to 500 USD here in Uruguay, so if anyone's first world opinion can buy me a couple so I can use the Right Filesystem™, I'd appreciate it!


Look, I really don't care what FS you use, just don't use unstable features. If you pay so much for your disks JUST for your movies, it seems they have some value for you.

>so if anyone's first world opinion can buy

Oh boohoo, says the guy who can afford a NAS for his movies and 2+ workstations. Stop with your wannabe victim role.


At $500 a disk, you could fly out a consultant to demonstrate a high-capacity storage solution consisting of many $200 disks, who just happens to forget to bring the disks back!


There is a large group of people who really dislike BTRFS. I think they were probably burned by it at some point, but I've never had trouble, and I've been using it since it became the default on Fedora.


btrfs does have some advantages over zfs

   - no data duplicated between page cache and arc
   - no upgrade problems on rolling distros
   - balance allows restructuring the array
   - offline dedup, no need for huge dedup tables
   - ability to turn off checksumming for specific files
   - O_DIRECT support
   - reflink copy
   - fiemap
   - easy to resize
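A couple of those in practice (paths are purely illustrative):

```shell
# reflink copy: instant CoW copy that shares extents until modified
cp --reflink=always big.img big-clone.img
# online resize of a mounted filesystem, in either direction:
btrfs filesystem resize -10G /mnt     # shrink by 10 GiB
btrfs filesystem resize max /mnt      # grow to fill the device
```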


>- ability to turn off checksumming for specific files

This is something that has always confused me. BTRFS users are always advised to disable copy-on-write (thus preventing checksumming or compression) for VM images or database files to avoid massive performance hits from fragmentation. Even Facebook still stores its databases mostly on XFS filesystems. However, the ZFS community seems to indicate that you can achieve reasonable performance for databases and VMs just by tuning the recordsize (e.g. https://pg.uptrace.dev/zfs/). How does ZFS mitigate the problems from fragmentation?


But the main job of an FS is to preserve your files... btrfs can't even tick that most important box.


The checksumming helps to spot faulty hardware; that's a step above most other filesystems, and often above SMART info too.


Checksums don't help against bugs. You are much less likely to lose your whole disk with ext4 or ZFS than BTRFS.


I see this a lot but have never had problems with BTRFS and I’ve used it both on my larger disks (2+tb) and my root (250gb ssd) across multiple computers for the last four years.


And even included in the kernel


    - defragmentation


Some of these are fair points but zfsonlinux/OpenZFS has had O_DIRECT since 0.8.x.


ZFSOnLinux just ignores the O_DIRECT flag if I remember correctly. Granted, this is what btrfs should do by default as well since there is an ugly issue where software can modify the O_DIRECT buffer after it was submitted causing btrfs checksum errors even though nothing was corrupted (and there is nothing to be done about it except disabling O_DIRECT or creating a buffer copy).


Yes (and as you mention this is the right thing to do) but I think it means that the Linux kernel will bypass the page cache which is still useful.


> why anyone uses BTRFs (UnRaid or any other form of software raid that isn't ZFS) is still beyond me.

BTRFS can do after-the-fact deduplication (with much better performance than ZFS dedup) and copy-on-write files. And you can turn snapshots into editable file systems.
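Illustrative commands for those features (paths are made up, and `duperemove` is just one of several dedup tools that drive the btrfs dedup ioctls):

```shell
# After-the-fact, block-level deduplication of existing files:
duperemove -dr /mnt/data
# Snapshots are writable by default; read-only ones can be flipped back:
btrfs subvolume snapshot /mnt/data /mnt/data-snap       # writable
btrfs subvolume snapshot -r /mnt/data /mnt/data-ro      # read-only
btrfs property set /mnt/data-ro ro false                # make it editable
```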


I've had 3 catastrophic BTRFS failures. In two cases, the root filesystem just ran out of space and there was no way to repair the partition. The last time, the partition was simply rendered unmountable after a reboot. All data was lost. No such thing has ever happened to me with ZFS.


A recent Fedora install here came with a new default of BTRFS rather than ext4, so I'm curious about your experience: were any of those catastrophic failures recent? Do you know of any patches entering the kernel that purport to fix the issues you experienced?


Last one was two years ago. I was told that it was a hardware issue. Same SSD is still going strong with ext4 now.


I've had some annoying failures too. But I wasn't listing pros and cons, I was explaining that there are some very notable features that ZFS lacks.


That's fair. However, when listing notable features for the sake of comparing software, I think it's important to also list other characteristics of a given piece of software. If we were to compare software by feature sets alone, one might argue that Windows has the most features, so Windows must be best OS.


I think cloning a zfs snapshot into a writeable filesystem matches at least the functionality of btrfs writeable snapshots, but I could be ignorant about some use-cases.


Let's say you want to clear out part of a snapshot of /home, but keep the rest.

So you clone it and delete some files. All good so far, but the snapshot is still wasting space and needs to be deleted.

But to make this happen, your clone has to stop being copy-on-write. All the data that exists in both /home and the clone will now be duplicated.

And you could say "plan ahead more", but even if you split up your drive into many filesystems, now you have the problem that you can't move files between these different directories without making extra copies.


To put it another way, zfs doesn't support rebasing a clone onto a newer snapshot. Otherwise you could, e.g., create a clone, snapshot it, create two new clones and promote them, and then delete the original snapshot. But what zfs actually does is re-attach the original snapshot to the promoted clone of the original volume, and it remains the base that the other clone refers to.
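Concretely, the sequence that runs into this (names hypothetical):

```shell
zfs snapshot tank/home@base
zfs clone tank/home@base tank/clone
zfs promote tank/clone
# promote re-parents the snapshot: it is now tank/clone@base, and
# tank/home has become a clone depending on it, so this still fails
# with "snapshot has dependent clones":
zfs destroy tank/clone@base
```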


I’m a beginner in ZFS, but copying the modified clone and then destroying the clone and the snapshot would solve your problem, wouldn’t it?


Licensing. Similarly, otherwise it would've been included in macOS a long time ago (as the default fs according to some..)


The reason it didn’t end up in macOS is because NetApp sued Sun for patent infringement. Apple wanted nothing to do with that lawsuit and quickly abandoned the project.

As others have stated, dtrace has the exact same license and has been in MacOS for years.


The licensing is nothing to do with it on OSX - indeed DTrace (also under the CDDL) has been shipping in it for years.


And it's arguably even a bigger issue on Linux distros.


It’s a moderate pain on Linux and then only really that if you’re running on something bleeding-edge like Arch. Otherwise it’s just a kernel module like any other.


But it doesn't ship with either Red Hat or SUSE distros, which is an issue for supported commercial use.


What's Oracle's play here? Do they somehow make money off ZFS, which makes them reluctant to re-license it?


Is there a CLA for OpenZFS/ZoL? I don't believe there is, so I don't think Oracle can unilaterally relicense it.


Even if there were a CLA for OpenZFS, it wouldn't affect Oracle's inability to relicense the whole thing.

They could relicense their codebase, of course, but the number of changes that have happened since they diverged is not small.


A CLA and copyright assignment was how Oracle were able to make (now Oracle) ZFS proprietary again in the first place. As you say though, OpenZFS and Oracle ZFS have diverged quite a bit, and most of the world is now based around the OpenZFS variant that acts as the upstream for Linux, FreeBSD and even Windows variants.


I do believe that the license was fine for macOS but when Oracle bought Sun that killed it cold.

Jobs never liked anybody other than himself holding all the cards. Having Ellison and Oracle holding the keys to ZFS was just never going to fly.


It's a combination of the license and the fact that it's Oracle, of all entities, that owns the copyright. Perhaps either one by itself wouldn't be a dealbreaker but the combination is. And, of course, Oracle could have changed the license at any time after buying Sun.

(Of course, Jobs may have just decided he didn't want to depend on someone else for the MacOS filesystem in any case.)

ADDED: And as others noted, there were also some storage patent-related issues with Sun. So just a lot of potential complications.


That makes absolutely no sense. Jobs and Ellison were best friends. Oracle acquiring Sun would have made it MORE attractive, not less.

https://www.cnet.com/news/larry-ellison-talks-about-his-best...


I had ZFS on a Mac from Apple for a short amount of time during one of the betas :( I think Time Machine was going to be based on it, but they pulled out.


FYI there is a third-party effort for making OpenZFS usable on macOS.

https://openzfsonosx.org/

I used it for a while, but unfortunately, since there are not many people working on it and they are not working on it full time, it can take a good while from when a new version of macOS is released until OpenZFS is usable with it. This was certainly the case a while ago, and it's why I stopped using OpenZFS on macOS and went back to only using ZFS on FreeBSD and Linux. So on my Mac computers I only use APFS.


Jobs and Ellison were really close friends


And also cold hearted clear eyed businessmen unlikely to allow friendship to affect their corporations.

I’d love to be a fly on the wall for some of those conversations.


Simplicity. There's a lot of complexity in ZFS I'd rather not depend on, and because it does so many things it's a big investment and liability to switch to.

While I understand why it would be useful in a corporate setting, for personal use I've found the combination of LUKS+LVM+SnapRAID to work well and don't see the benefit of switching to ZFS. Two of those are core Linux features, and SnapRAID has been rock solid, though thankfully I haven't tested its recovery process, but it seems straightforward from the documentation. Sure I don't have the real-time error correction of ZFS and other fancy features, but most of those aren't requirements for a personal NAS.


What about if you were just starting today, with 0 knowledge about basically anything related to storage and how to do it right?

That's my case, I'm learning before setting up a cheap home lab and a NAS, and I'm wondering if biting into ZFS is just the best option that I have given today's ecosystem.


I was in the same place 6 or 7 years ago. Due to indecision, I ended up using btrfs, zfs, and mdadm (technically, Synology hybrid raid) on various devices. They all work, more or less.

Looking back, the lessons that come to mind are:

- Always have 2 backups (not counting the primary copy), at least 1 "cold" (inaccessible without human intervention) and at least 1 offsite. Backup frequently and retain old backups. With backups, bad decisions are reversible.

- With btrfs or zfs, using a collection of 2-disk mirrors was useful because it provided flexibility (to expand the array, just add another pair of disks) and seemed to have better performance than a single disk. Try to pair disks from different manufacturing batches though. I saw two disks from the same batch and _used in the same mirror_ fail in the same month, which was disconcerting.

- The only data corruption I had to deal with was from RAM that started off good and went bad after a couple years.

- Standardizing on btrfs or zfs from the beginning would have allowed backup by sending snapshots, which would have been a lot easier than cobbling together a solution using rsync.

- Scrub on a regular schedule. Set up monitoring software to notify you of the outcome of each scrub and of any SMART errors.
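One way to implement those last points (the pool name and device are examples; many distros ship a ready-made scrub timer or cron job):

```shell
# Monthly scrub via a cron.d entry:
echo '0 3 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub
# Things worth wiring into monitoring/notifications:
zpool status -x        # prints "all pools are healthy" when nothing is wrong
smartctl -H /dev/sda   # overall SMART health verdict per disk
```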


Thank you. I need to start small; otherwise I feel overwhelmed by too many moving pieces to keep in mind and plan for.

So I'm starting small, powering up a ThinkCentre M910 I had lying around, with an internal disk that can be used to store backups. I have zero need for performance, so my idea is to extend storage with an external USB 3 HDD enclosure. For now, I don't have the space or the machine to install dual hard disks for a decent RAID. Time will tell.


> That's my case, I'm learning before setting up a cheap home lab and a NAS, and I'm wondering if biting into ZFS is just the best option that I have given today's ecosystem.

ZFS is the simplest stack you can learn, IMHO. But if you want to learn all the moving parts of an operating system for (e.g.) professional development, then something more complex may be more useful.

If you want to create a mirrored pair of disks in ZFS, you do: sudo zpool create mydata mirror /dev/sda /dev/sdb

In the old school fashion, you first partition with gdisk, then you use mdadm to create the mirroring, then (optionally) LVM to create volume management, then mkfs.
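Spelled out, that old-school stack is roughly the following (device names and sizes are examples):

```shell
# Partition both disks for Linux RAID (type fd00):
sgdisk -n1:0:0 -t1:fd00 /dev/sda
sgdisk -n1:0:0 -t1:fd00 /dev/sdb
# Mirror them, layer LVM on top, then make a filesystem:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 100G -n data vg0
mkfs.ext4 /dev/vg0/data
```

Versus the single `zpool create` line above, every layer here is a separate tool to learn and administer.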


I dove into ZFS for my home lab as a relative novice.

There are a few new concepts to come to grips with, but once you have them down, it's not terrible.

If you don't plan on RAIDing, IMO, ZFS is overkill. The checksumming is nice, but you can get that from other filesystems.

Maintenance is fairly straight forward. I've even done a disk swap without too much fuss.

The biggest issue I had was that setting up RAIDz on root with Ubuntu was a PITA (at the time at least, March of this year). I ended up switching over to Debian instead. Once set up, things have been pretty smooth.


A few things I like about it, as per what I've read so far:

* Checksumming

* As you mention, easy maintenance

* Snapshots and how useful they are for backups

In the end what I value is stuff that works reliably, doesn't get in the way, and requires minimal supervision. And in the particular case of filesystems, I'd like to adopt one that helps avoid bit rot in my data.

Could you drop some names that you would consider good alternatives to ZFS?


For close to ZFS feature parity but much younger, BTRFS.

Otherwise it's sort of figuring out what features you want to drop. XFS and ext4 are probably where I'd look for a single disk hard drive.

Like I said, you could do ZFS, but definitely feels a bit like overkill. Setting up a vdev with one disk just to get snapshots and checksums seems like a lot.


I would still go with a collection of composable tools rather than something as monolithic as ZFS, and to avoid the learning curve. But again, that's for personal use. If you're planning to use ZFS in a professional setting it might be good to experiment with it at home.


As mentioned in the sibling comment, one thing I like is having systems that don't require me to supervise them, fix things, etc. In part that's why I've always been a user of ext4: it just works.

But I've recently found bit rot in some of my data files, and now that I happen to be learning how to build a NAS, I wanted to make the jump to some FS that helps me with that task.

Could you mention which tools you would use to replace ZFS? Think of checksumming, snapshotting, and to a lesser degree, replication/RAID.


I would argue that a collection of mostly composable tools can easily be much more complex (and bug-prone!) than a single “monolith”. Fewer moving parts can be good sometimes, and I would argue that filesystem/volume management is a very compact problem domain where better integration between the tools is more important than extensibility.


> LUKS+LVM+SnapRAID

+ your fs

Yeah that sounds like a lot less complexity


ZFS has all of these features and more. If I don't need those extra features, then by definition mine is a less complex system.

Using composable tools is also better from a maintenance standpoint. If tomorrow SnapRAID stops working, I can replace just that component with something else without affecting the rest of the system.


> If tomorrow SnapRAID stops working, I can replace just that component with something else without affecting the rest of the system.

Can you actually? If some layer of that storage stack stops working then you can no longer access your existing data, because all these layers need to work correctly to correctly reassemble the data read from disk.


It's a hypothetical scenario :) In reality if there's a project shutdown there would be enough time to migrate to a different setup. Of course it would be annoying to do, but at least it's possible. With a system like ZFS I'm risking having to change the filesystem, volume manager, storage array, encryption and whatever other feature I depended on. It's a lot to buy into.


Since all those tools are from different devs, the system gets more complex. But hey, if you really think that ZFS is too complex to hold 55 petabytes because it has too many potential bugs, you should tell them:

https://computing.llnl.gov/projects/zfs-lustre


Thankfully I don't have to manage 55 petabytes of data, but good luck to them.

Did you miss the part where I mentioned "for personal use"?

> Since all those tools are from different dev's the system gets more complex.

I fail to see the connection there. Whether software is developed by a single entity or multiple developers has no relation to how complex the end user system will be.

But many small tools focused on just the functionality I need allows me to build a simpler system overall.


>Did you miss the part where I mentioned "for personal use"?

Since ZFS is simpler to use than your setup, and has been used to store 55PB of data without a single bit error since 2012, I don't see why someone should use inferior stuff, even for "personal use".

>But many small tools focused on just the functionality I need allows me to build a simpler system overall.

Sometimes monoliths are better, for example the network stack and storage... maybe kernels (big maybe here).


> Whether software is developed by a single entity or multiple developers has no relation to how complex the end user system will be.

The first part of this sentence is probably true, as far as I can see, but the complexity of a system as perceived by the user depends primarily on the "surface" of the system. That surface includes the UI, the documentation, and the important concepts you have to understand to use the system effectively. And in that regard, ZFS wins hands down against LUKS + LVM + SnapRAID + your FS of choice. Some questions a user of that stack has to answer aren't even asked of a ZFS user, e.g. how to split the space between volumes, or how to change the size of volumes.


RAM?

Every time I looked into setting up a FreeNAS box, every hardware guide insisted that ungodly amounts of absolutely-has-to-be-ECC RAM were essential, and I just gave up at that point.


The "you need at least 32GB of memory and it has to be ECC, or don't even bother trying to use ZFS" crowd has done some serious harm to ZFS adoption. Sure, that's what you need if you want excellent data integrity guarantees and to use all of ZFS' advanced features. If you're fine with merely way-better-than-most-other-filesystems data integrity guarantees and using only most of ZFS' advanced features, you don't need those.


I really don't know where the "You gotta have ECC RAM!" thing started. I've been running a ZFS RAID on Nvidia Jetson Nanos for years now and haven't had any issues at all with data integrity.

I don't see why ZFS would be more prone to data integrity issues spawning from a lack of ECC than any other filesystem.


Relevant quote from one of ZFS's primary designers, Matt Ahrens: “There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. ... I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS."


Yeah, I remember reading that a few years ago.

If I were running a server farm or something, then yeah, I'd probably use ECC memory, but I think if you're running a home server, then the argument that ZFS necessitates ECC more than Ext4 or Btrfs or XFS or whatever doesn't really seem to be accurate.


> the argument that ZFS necessitates ECC more than Ext4 or Btrfs or XFS or whatever doesn't really seem to be accurate

Agreed.

> If I were running a server farm or something, then yeah, I'd probably use ECC memory, but I think if you're running a home server

Then you should still use ECC RAM, regardless of what filesystem you're using.

No, really. ECC matters (https://news.ycombinator.com/item?id=25622322) generally.


Fair enough, though AFAIK none of the SBC systems out there have ECC, and I generally use SBCs due to the low power consumption.


Years ago I saw it at:

https://www.truenas.com/community/threads/ecc-vs-non-ecc-ram...

(the gist of the scary story is that faulty RAM while scrubbing might kill "everything"). However, in the end ECC appears to NOT be so important, e.g., see

https://news.ycombinator.com/item?id=23687895


There is literally only one feature that uses massive amounts of memory: online deduplication relies on keeping an in-RAM table of deduplicated blocks. This means the more data you deduplicate, the larger the table is.

FreeBSD Mastery: ZFS by Michael Lucas around pg 174

Deduplication Memory Needs

"For a rough-and-dirty approximation, you can assume that 1 TB of deduplicated data uses about 5 GB of RAM. You can more closely approximate memory needs for your particular data by looking at your data pool and doing some math. We recommend always doing the math and computing how much RAM your data needs, then using the most pessimistic result. If the math gives you a number above 5 GB, use your math. If not, assume 5 GB per terabyte."

https://www.tiltedwindmillpress.com/?product=fmzfs

This is not to say you need 5GB of RAM for every 1TB of data. It doesn't even mean you need 5GB for every 1TB on which you have enabled dedup; it means you need approximately 5GB for each TB of data that is both deduplicated and residing on a dataset with dedup enabled. Because the memory cost of dedup rises in direct proportion to its utility, it's only useful in cases where you can plan ahead for its requirements. 99% of users are unlikely to use dedup, but this doesn't stop some (not you, obviously) from promoting the idea that ZFS requires 5GB of memory per TB or some similarly absurd figure.
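As a back-of-envelope check on that rule of thumb (my own assumptions: roughly 320 bytes of RAM per DDT entry and a 128 KiB average block size; the book's pessimistic 5GB/TB figure corresponds to smaller average blocks):

```shell
# Assumed: ~320 bytes of RAM per dedup-table entry, 128 KiB blocks.
# 1 TiB of unique 128K blocks => 2^23 = 8388608 entries.
blocks_per_tb=$(( 1024 * 1024 * 1024 * 1024 / (128 * 1024) ))
ddt_mib=$(( blocks_per_tb * 320 / 1024 / 1024 ))
echo "~${ddt_mib} MiB of DDT per TiB of deduped 128K blocks"
```

With smaller recordsizes the entry count (and thus RAM) scales up proportionally, which is where the "assume 5 GB and do your own math" advice comes from.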

As an aside, I really liked the book: easy to read and understand, and very informative. Despite being focused on FreeBSD, it's mostly applicable to Linux as well.


Heh so you have that backwards. All RAM should be ECC if you care about what’s stored in it. It’s not a ZFS requirement, it’s just that ZFS specifically cares about data integrity so it advises you to use ECC RAM. But it’s not like any other file system is immune from random RAM corruption: it’s not, it just won’t tell you about it.


Neither quantity nor ECC is essential.

ZFS defaults to assuming it is the primary reason for your box to exist, but it only takes two lines to define more reasonable RAM usage: zfs_arc_min and zfs_arc_max. On a NAS type server, I would think setting the max to half of your RAM is reasonable. Maybe 3/4 if you never do anything except storage.
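On Linux those are kernel module parameters. A sketch of capping the ARC at 8 GiB with a 1 GiB floor (the values are illustrative; pick your own based on total RAM):

```shell
# Persistent across reboots, via module options (values are in bytes):
echo "options zfs zfs_arc_max=8589934592 zfs_arc_min=1073741824" \
    > /etc/modprobe.d/zfs.conf
# Or apply immediately at runtime without reloading the module:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
```

After changing it at runtime, the ARC shrinks gradually as memory pressure demands, not instantly.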

ECC is not recommended because ZFS has some kind of special vulnerability without it; ECC is recommended because ZFS has taken care of all the more likely chances of undetectable corruption, so that's the next step.


It is not that simple regarding ECC. Since ZFS uses more memory, the probability of hitting a memory error is simply higher with it.


But it doesn’t really use more memory. The ARC gives the impression of high memory usage because it’s different than the OS page cache and usually called out explicitly and not ignored in many monitoring tools like the OS cache is. Linux—without ZFS—will happily consume nearly all RAM with any filesystem if enough data is read and written.


This is correct. Any filesystem using the kernel's filesystem cache will do this, too.

For a long running, non-idle system, a good rule of thumb is that all RAM not being actively used is being used by evictable caching.


A colleague who was used to other UNIXes was transitioning to Linux for a database. He saw in free that used memory was at more than 90%, so he added more RAM. But to his surprise it was still at 90%! He kept adding RAM. I told him that he had to subtract the buffers and cached values (this was before free had the Available column).


Before the Available column there was the -/+ buffers/cache line that provided the same information. Maybe it was too confusing.

             total      used      free   shared buffers    cached
      Mem: 12286456  11715372    571084        0   81912   6545228
  -/+ buffers/cache:  5088232   7198224
     Swap: 24571408     54528  24516880


https://www.linuxatemyram.com/ is one of my favorite single-serving sites.


ZFS likes RAM and uses it to get better performance (and don't think about using dedup without huge ram), but you don't need it and can change the defaults.

ECC tends to attract zealots chasing a perfectly error-free existence, which ECC tends toward but doesn't deliver; it just reduces errors. I personally don't care about a tiny amount of bit rot (ZFS will prevent most of this) or rebooting my storage machine now and then.

You can run ZFS/freenas on a crappy old machine and you'll be just fine as long as you aren't hosting storage for dozens of people and you aren't a digital archivist trying to keep everything for centuries.

Real advice:

* Mirrored vdevs perform way better than raidz, I don't think the storage gain is worth it until you have dozens of drives

* Dedup isn't worth it

* Enable lz4 compression everywhere

* Have a hot spare

* You can increase performance by adding a vdev set and by adding RAM

* Use drives with the same capacity
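Several of those bullets translate directly into one-liners (a sketch; `tank` and the device names are placeholders):

```shell
zfs set compression=lz4 tank                 # lz4 everywhere
zpool add tank spare /dev/sdx                # designate a hot spare
zpool add tank mirror /dev/sdy /dev/sdz      # grow/speed up by adding a vdev
```

Note `zpool add` is permanent for data vdevs: once the mirror is added you can't remove it without rebuilding the pool (on older releases, at least), so double-check before hitting enter.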


> Dedup isn't worth it

To add to that, ZFS dedup is a lie and you should forget its existence unless you have a very specific scenario of being a SAN with a massive amount of RAM, and even then, you had better be damn sure.

I really wish ZFS had either an option to store the Dedup Table on a NVMe like Optane, or to do an offline deduplication job.


It does have the former, these days - the "allocation_classes" feature lets you make the permanent home of certain subsets of data on "special" vdevs - which includes methods of specifying "store dedup table there".

Now, that becomes the only place entries on it are stored, so you best make it redundant if you don't want to lose your pool from a single NVMe failing, but the feature is there.
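Under those assumptions (OpenZFS 0.8+ with the allocation_classes feature), attaching a mirrored dedup-class vdev looks roughly like:

```shell
# Mirrored dedup-class vdev, so a single NVMe failure doesn't take
# the DDT - and with it the pool - along. Device names are placeholders.
zpool add tank dedup mirror /dev/nvme0n1 /dev/nvme1n1
```

There is also a more general `special` vdev class that holds metadata (and optionally small blocks) in addition to the DDT.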

The latter I would predict seeing approximately when the sun burns out, on ZFS. It _really_ doesn't like the idea of data changing locations retroactively.


Thanks for this. I completely missed this feature in the run up to 0.8.

I'm going to have to do some test setups with this.


> Enable lz4 compression everywhere

Is the perf penalty low enough now that it just doesn't matter? I've always disabled compression on datasets I know are going to store only high-entropy data, like encoded video, that has a poor compression ratio.

I second the hot spare recommendation many times over. It can save your bacon.


It's generally the other way around, actually, aside from storing already highly compressed data (e.g. video). The compression from lz4 will get you better effective performance because of the lower amount of IO that has to be done, in both throughput and latency on ZFS. This is because your CPU can usually do lz4 at several GB/s per core, compared to the couple hundred MB/s you might get from spinning rust disks.


Neat! Makes sense.


Does rebooting help with soft errors in non-ECC RAM? I would have thought bit flips would be transient in nature, but I'm not really familiar.


Having run ZFS (FreeNAS/TrueNAS) on 2 home-made NAS devices for years and years, I can say it is rock solid, without ever using ECC RAM due to lack of choices. I can bet there were many soft errors in all those years, but so far I never had a problem that could not be recovered. The biggest issue ever was wearing out the boot USB storage within months, but that was eventually solved: I moved to fixed drives as the boot drive, and later to virtualization for the boot disk and OS, so the problem completely went away.


Occasionally a bit flip will corrupt the state of something important and long-running; a reboot will obviously clear this.

Usually, though, it will hit nothing and have no side effects.


You really only need that much RAM if you're going to do live deduplication of large amounts of data. Very few people actually need that; just using compression with lz4 or zstd, depending on your needs, will suffice for just about everyone and perform better.

The ECC argument is probably a 50/50 kind of thing. You can get away without it, and ZFS will do its best to detect and prevent issues, but if the data was flipped before it was given to ZFS then there's nothing anyone can do. You might get some false positives when reading data back if you have flaky RAM, but as long as you have parity or redundancy on the disks, things should still get read correctly even if a false problem is detected. That might mean you want to run a scrub (essentially ZFS's version of fsck) more often to look for potential issues, but it shouldn't fundamentally be a big deal.

If you want 24/7 highly available storage that won't blip out occasionally, you'll probably really want the ECC RAM. But if you're fine with rebooting occasionally, or telling it to repair problems it thinks were there (but weren't, because the disk was fine and the RAM wasn't), then you should be fine. The extra checksums and metadata ZFS keeps can make it really robust even on bad hardware. I had a BIOS update cause massive PCIe bus issues that I didn't notice for a while, and ZFS kept all my data in good condition even though writes were sometimes just never happening because of ASPM causing issues with my controller card.


Others have said good things (ECC is good by itself and has not much to do with ZFS), and it is actually quite easy to check whether you need much RAM for ZFS: start a (Linux) VM with a few hundred megabytes of RAM and run ZFS on it. Of course it will not be as performant as having a lot of RAM, but it will not crash, hang, or be unusable in any way.

Sources:

- https://www.reddit.com/r/DataHoarder/comments/3s7vrd/so_you_...

- https://www.reddit.com/r/homelab/comments/8s6r2r/what_exactl...

- My own tests with around 8 TB of ZFS data in a Linux VM with 256 MB RAM.
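A cheap way to run that kind of experiment, without a VM or dedicating real disks, is a throwaway file-backed pool (a sketch; needs root and the ZFS module loaded, and the paths are arbitrary):

```shell
# Sparse backing files - fine for experiments, never for real data.
truncate -s 1G /tmp/zd0 /tmp/zd1
zpool create testpool mirror /tmp/zd0 /tmp/zd1
zpool status testpool        # play around: snapshots, scrubs, etc.
zpool destroy testpool       # clean up when done
rm /tmp/zd0 /tmp/zd1
```

This is also a low-stakes way to rehearse disk-replacement and scrub procedures before doing them on a real pool.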


As always, it depends on your use-case.

I have several file servers, all using ZFS exclusively, and 10x that number of servers using ZFS as the system FS.

Rule of thumb that I like: 1GB RAM per TB of storage. This seems to give me the best bang for the buck.

For a small (under 20) number of office users, doing general 'office' stuff, using Samba, it's overkill.

For large media shares with heavy editor access, and heavy strains on the network, it's a minimum.

Depends on what the server is serving.

DeDUP is a different story. The RAM is used to store the frequently accessed data. If you are using DeDUP you fill the motherboard with as much RAM as will fit. NO EXCEPTIONS! This may have been the line of thinking that scared you away from it.

I have a 100TB server that is just used for writing data to and is never read from (sequential file back-ups before it's moved to "long term storage"). It has 8GB of RAM, and is barely touched.

I also have a 20TB server with 2TB of RAM, that keeps the RAM maxed out with DeDUP usage.

ECC: It's insurance, and it's worth it.


That's not precisely why dedup needs gobs of RAM. (If you already know this distinction, I apologize, I just want to make sure people reading this do.)

You effectively (unless you use allocation classes) need to keep the entire DDT in RAM all the time if you don't want any write to a dedup-enabled dataset to potentially require blocking on reading the relevant segment from spinning disks into RAM (thus tanking performance even worse than dedup normally does). It's not really related to the mechanisms in the rest of ZFS for keeping {frequently,recently} used data cached in RAM.
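If you're curious how big the DDT actually is on a given pool, zdb can tell you (read-only; the pool name is a placeholder):

```shell
# Print dedup table statistics: entry counts, on-disk and in-core
# size per entry, and a histogram of reference counts.
zdb -DD tank
```

Multiplying the entry count by the reported in-core size per entry gives the RAM you'd need to keep the whole table resident.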


The freenas hardware requirements themselves say "8 GB RAM (ECC recommended but not required)"

https://www.freenas.org/hardware-requirements/

I myself use freenas with 16GB of non-ECC ram.

Of course it is possible to have a bit flip in memory that is then dutifully stored incorrectly by ZFS to disk, but this was a possibility without ZFS as well.

I've actually been waiting for this feature since I first set up my pool. It seemed theoretically possible; we were just waiting for an implementation.


FreeNAS is excellent in many ways. Except that weird gospel their forum people have.

ZFS only needs a lot of RAM if deduplication is enabled. And it shouldn't be for most use cases, or only enabled on one dataset that benefits from it.

Many ZFS installs are fine on 8GB or less.

ECC RAM is better but not required. The idea is to catch memory errors, hence ECC is better.


Does anyone know if this also means a draid can be expanded?


> Data newly written to the ten-disk RAIDz2 has a nominal storage efficiency of 80 percent—eight of every ten sectors are data—but the old expanded data is still written in six-wide stripes, so it still has the old 67 percent storage efficiency.

This makes this feature quite ‘meh’. The whole goal is capacity expansion and you won’t be able to use the new capacity unless you rewrite all existing data, as I understand it.

This feature is mostly relevant for home enthusiasts and I think it doesn’t really bring the desired behavior this user group wants and needs.

> Undergoing a live reshaping can be pretty painful, especially on nearly full arrays; it's entirely possible that such a task might require a week or more, with array performance limited to a quarter or less of normal the entire time.

Not an issue for home users, as they often don't have large workloads, so this process is fast and convenient. Even if it took two days.


...what?

You don't get the increased storage efficiency on existing data, without rewriting it, but the new capacity is certainly available for use.



