This F_FULLFSYNC behaviour has been like this on OSX for as long as I can remember. It ensures that the data in the drive's write buffer has been flushed to stable storage - historically a limitation of fsync that is being accounted for. Are you 1000% sure it does what you expect on other OSes?
Maybe it's an unrealistic expectation for all OSes to behave like Linux.
Maybe Linux fsync is more like F_BARRIERFSYNC than F_FULLFSYNC. You could retry your benchmarks with those.
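(For anyone wanting to test this themselves: a hedged cross-platform sketch in Python - the same pattern applies to the C fcntl(2) call - assuming you want the strongest flush available on each OS:)

```python
import fcntl
import os


def full_sync(fd):
    """Best-effort 'really flush to stable storage' for a file descriptor.

    On macOS, plain fsync() only pushes data to the drive, so we use
    F_FULLFSYNC, which also requests a flush of the drive's write cache.
    On platforms without F_FULLFSYNC (e.g. Linux), fall back to fsync(),
    which there already ends in a device cache flush.
    """
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```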
Also note that 3rd-party drives are known to ignore F_FULLFSYNC, which is why there is an approved list of drives for Mac Pros. This could explain the different figures you are seeing, if your benchmarks are issuing F_FULLFSYNC on those 3rd-party drives.
Last time I checked (which is a while at this point, pre SSD) nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. Working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor page knock to actually flush their cache was one of the things on their deepest NDAs. Apparently ignoring cache flushing was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose marketshare. : \
So, as of about 2014, any difference here not being backed by per manufacturer secret knocks or NDAed, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" for storage appliance vendors to stick on the side of their kernels that they can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious 'if you know to look for it'.
It wouldn't surprise me at all if Apple's NVMe controller is the only drive you can easily get your hands on that actually does the correct thing on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.
Since this is getting updoots: sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power-loss NMI, they generally have enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself is quite a chonker in the watt/ms range being talked about, particularly considering seeks are in the 100ms range to start with - but they also have pretty big electrolytic caps on spinning rust, so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200ms after power loss, I can maybe see them being able to flush the cache. Gods help those HMB drives, though; I don't know how you'd guarantee access to the host memory used for cache on power loss without a full-system approach to what power loss looks like.
On at least one drive I saw, the flush command was instead interpreted as a barrier to commands being committed to the log in controller DRAM, which could cut into parallelization, and therefore throughput, looking like a latency spike but not a flush out of the cache.
The drive controller is internally parallel. The write is just a job queue submission, so the next write hits while it's still processing previous requests.
People have tested this stuff on storage devices with torture tests. Can you point at an example of a modern (directly attached) NVMe drive from a reputable vendor that cheats at this?
FWIW, macOS also has F_BARRIERFSYNC, which is still much slower than full syncs on the competition.
In my benchmarking of some consumer HDDs, back in 2013 or so, the flush time was always what you'd expect based on the drive's RPM. I got no evidence the drive was lying to me. These were all 2.5" drives.
My understanding was that the capacitor thing on HDDs is to ensure the drive completely writes out a whole sector, so it passes the checksum. I only heard the flush-cache thing with respect to enterprise SSDs. But I haven't been staying on top of things.
You definitely weren't testing the cache in a meaningful way if you were hovering over the same track.
WRT the capacitor thing being about a single sector, think about the time spans. You should be able to cut power even to the drive motor and still stay up for 100s of ms. In that time you can seek to a config track and blit out the whole cache. If you're already writing a sector you'll be done in microseconds. The whole track spins around every ~8ms at 7200RPM.
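The back-of-the-envelope numbers are easy to sanity-check:

```python
def revolution_ms(rpm):
    # time for one full platter revolution, in milliseconds:
    # 60,000 ms per minute divided by revolutions per minute
    return 60_000 / rpm


# At 7200 RPM a track passes under the head roughly every 8.3 ms,
# so a couple hundred ms of capacitor hold-up time covers many
# revolutions' worth of writing.
```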
Tangential thinking out loud: this makes me think of a sort of interleaving or striping mechanism that tries to leave a small proportion of every track empty, such that ideal power loss flush scenarios would involve simply waiting for the disk to spin around to the empty/reserved area in the current track. On drives that aren't completely full, it's probably statistically reasonable that for any given track position there's going to be a track with some reserved space very close by, such that the amount of movement/power needed to seek there is smaller.
Of course, this approach describes a completely inverted complexity scenario in terms of sector remapping management, with the size of the associated tables probably being orders of magnitude larger. :<
Now I wonder how much power is needed for flash writes. The chances are an optimal-and-viable strategy would probably involve a bit of multichannel flash on the controller (and some FEC because why not).
Oooh... I just realized things'll get interesting if the non-volatile RAM thing moves beyond the vaporware stage before HDDs become irrelevant. Last-millimeter write caching will basically cease to be a concern.
But thinking about the problem slightly more laterally, I don't understand why nobody's made inline SATA adapters with RAM, batteries and some flash in them. If they intercept all writes they can remember what blocks made it to the disk, then flush anything in the flash at next power on. Surely this could be made both solidly/efficiently and cheaply...?
> But thinking about the problem slightly more laterally, I don't understand why nobody's made inline SATA adapters with RAM, batteries and some flash in them.
Hardware RAID controllers with Battery Backup Units were really popular starting in the mid 90’s until maybe the mid 2010’s? Software caught up on a lot of features, and the batteries failed often and required a lot more maintenance. Supercaps were supposed to replace the batteries, but I think SSDs and software negated a ton of the value add. You can still buy them, but they’re pretty rare to see in the wild.
I've heard of those; they sound mildly interesting to play with, if just to go "huh" at and move on. I get the impression the main reason they developed a... strained reputation was their strong tendency to want to do RAID things (involving custom metadata and other proprietariness) even for single disks, making data recovery scenarios that much more complicated and stressful if that hadn't been turned off. That's my naive projection though; I (obviously) have no real-world experience with these cards, I just knew to steer far away from them (heh)
An inline widget (SATA on both sides) that just implements a write cache and state machine ("push this data to these blocks on next power on") seems so much simpler. You could even have one for each disk and connect to a straightforward RAID/SAS controller. (Hmm, and if you externalize the battery component, you could have one battery connect to several units...)
You are indeed right about the battery/capacitor situation ("you have to open the case?!"), I wouldn't be surprised if the battery level reporting in those RAID cards was far from ideal too lol
With all this being said, a UPS is by far the simplest solution, naturally, but also the most transiently expensive.
Probably more like downvoted because missing the point.
Sure, fsync allows that behavior, but it's also so widely misunderstood that a lot of programs which should do a "full" flush only do an fsync, including benchmarks. In that case the results are not comparable, and presenting them as such is cheating.
But that's not the point!
The point is that on M1 Macs, the SSD's performance when fully flushing to disk is abysmally bad.
And as such, any application that cares about data integrity and does a full flush can expect noticeable performance degradation.
The fact that Apple neither forces frequent full syncs nor at least a full sync when an application is closed doesn't make it better.
Though it is also not surprising, as it's not the first time Apple has set things up under the assumption that their hardware is infallible.
And maybe for desktop-focused high-end designs where most devices sold are battery powered, that is a reasonable design choice.
"And maybe for desktop-focused high-end designs where most devices sold are battery powered, that is a reasonable design choice"
Does the battery last forever? Do they never shut down from overheating, shut down from being too cold, or freeze up? Are they water- and coffee-proof?
Talk to anyone who repairs Macs about how high-end and reliable their designs truly are - they are better than bottom-of-the-barrel craptops, sure, but not particularly amazing, and they have some astounding design flaws.
As the article points out, a lot of those cases can be detected with advance notice (dying battery, overheating - probably even being too cold). In those cases the OS makes sure all the caches are flushed.
Spilled drinks are a viable cause for concern, but if they do enough damage to cause an unexpected shutdown, you've probably got bigger issues than unflushed cache.
On many laptops, even with water damage you can recover your local data fully; not so for Macs (for more reasons than just data loss/corruption due to non-flushing).
Especially if you are already in a bad situation you don't want your OS to make it worse.
The CPU can't possibly get too cold. See for example overclocking performed by cooling the CPU with liquid nitrogen. Condensation is a factor, as is loss of ductility of plastic at low temperatures, making it brittle, plus expansion and contraction of materials, especially when different materials expand to different degrees.
"The CPU can't possibly get too cold" - Untrue. There are plenty of chips with what overclockers like to call "cold bugs".
Sequential logic (flipflops) has a setup time requirement. This means the combinatorial computation between any two connected pairs of flops (output of flop A to input of flop B) has to do its job fast enough such that the input of B stops toggling some amount of time before the next clock edge arrives at the flipflop. Violate that timing, and B will sometimes sample the wrong value, leading to an error.
Setup time is what most people are thinking about when they use LN2 or other exotic forms of cooling. By cooling things down, you usually improve the performance of combinatorial logic, which provides more setup time margin, allowing you to increase clock speed until setup time margin is small again.
But flops also have hold time requirements - their inputs have to remain stable for some amount of time after the clock edge, not just before. It's here where we can run into problems if the circuit is too cold. Imagine a path with relatively little combinatorial logic, and not much wire delay. If you make that path too fast, it might start violating hold time on the destination flop. Boom, shit doesn't work.
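In textbook notation (generic symbols, not tied to any particular process or datasheet), the two constraints are:

```latex
% Setup constraint: the slowest path must settle before the next clock edge
T_{clk} \ge t_{cq} + t_{comb,max} + t_{setup}

% Hold constraint: the fastest path must not change the destination's
% input until the hold window after the edge has passed
t_{cq,min} + t_{comb,min} \ge t_{hold}
```

Cooling shrinks both the max and min combinatorial delays: shrinking the max buys setup margin (hence LN2 overclocking), but shrinking the min can break the hold inequality, and unlike a setup violation, no change of clock frequency can fix that.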
Many phones, laptops, cameras and similar devices only guarantee functionality above 0 degrees Celsius...
Luckily they often operate at lower temperatures too, but not seldom just by hoping they don't actually get cooled that much themselves (because they are e.g. in your pocket).
Incidentally, CPUs do get too cold - not at any reasonable temperature, but sufficiently low temperatures do change the characteristics of semiconductors. Not something to worry about unless you're using liquid nitrogen (or colder).
I've had my phone shut off on me from being out in the Chicago cold for a couple hours. Battery over 50% when I brought it back inside and warmed it up.
I mean the Apple hardware in question is usually a laptop, which has its own very well instrumented battery backup. In most cases the hardware knows well in advance if the battery is gonna run dry.
And yes, the hardware is fallible. But the kind of failure that would cause the device to completely lose power is extremely rare. The OS has many chances to take the hint and flush the cache before powering down.
A simple test would be to see what degree of data loss you incur with a hard power-off.
I think the author did that test for the M1 Mac, but I don't know if they did it with the other laptops.
But then, the M1 Mac is slower when flushing than most SSDs out there, and even some HDDs. I think if most SSDs didn't flush data at all, we would know about it, and I should have run into problems with the few dozen hard resets I've had in the last few years. (And sure, there are probably some SSDs which cheap out on cache flushing in a dangerous way, but most shouldn't, as far as I can tell.)
We’d see data loss only if the power loss or hard reset happened before the data was actually flushed. After the data is accepted into the buffer, there is only a narrow time window in which it could occur. Also, a hard reset on the computer side may not be reflected in the storage device's embedded electronics.
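A minimal version of such a torture test only "acknowledges" a write after the flush call returns, then compares the acknowledged sequence numbers against what's actually on disk after the power cut. A sketch (the record format here is made up for illustration):

```python
import os


def write_acknowledged_marker(path, seq):
    # Append a sequence number and flush before treating it as durable.
    # After an abrupt power-off, re-read the file: every sequence number
    # that was acknowledged (i.e. this function returned) should still
    # be present; otherwise the flush path is lying somewhere.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, f"{seq}\n".encode())
        os.fsync(fd)  # on macOS, substitute fcntl(fd, F_FULLFSYNC)
    finally:
        os.close(fd)
    return seq  # acknowledged only once the flush has returned
```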
It isn't, because otherwise it would be showing the ~same performance with and without sync commands, as I showed in the thread. There is a significant performance loss for every drive, but Apple's is way worse.
There is no real excuse for a single sector write to take ~20ms to flush to NAND, all the while the NAND controller is generating some 10MB/s of DRAM traffic. This is a dumb firmware design issue.
This affects T2 Macs too, which use the same NVMe controller design as M1 Macs.
We've looked at NVMe command traces from running macOS under a transparent hypervisor. We've issued NVMe commands outside of Linux from a bare-metal environment. The 20ms flush penalty is there for Apple's NVMe implementation. It's not some OS thing. And other drives don't have it. And I checked and Apple's NVMe controller is doing 10MB/s of DRAM memory traffic when issued flushes, for some reason (yes, we can get those stats). And we know macOS does not properly flush with just fsync() because it actively loses data on hard shutdowns. We've been fighting this issue for a while now, it's just that it only just hit us yesterday/today that there is no magic in macOS - it just doesn't flush, and doesn't guarantee data persistence, on fsync().
I've just been scanning through Linux kernel code (inc. ext4). Are you sure that it's not issuing a PREFLUSH? What are your barrier options on the mount? I think you will find these are going to be more like F_BARRIERFSYNC.
Those are Linux concepts. What you're looking for is the actual NVMe commands. There's two things: FLUSH (which flushes the whole cache), and a WRITE with the FUA bit set (which basically turns that write into write-through, but does not guarantee anything about other commands). The latter isn't very useful for most cases, since you usually want at least barrier semantics if not a full flush for previously completed writes. And that leaves you with FLUSH. Which is the one that takes 20ms on these drives.
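On Linux, those two device-level mechanisms surface at the syscall level roughly like this (the exact mapping to NVMe FLUSH/FUA depends on the filesystem and kernel, so treat it as a sketch):

```python
import os


def durable_write_via_flush(path, data):
    # write, then fdatasync(): the kernel typically ends this with a
    # device cache FLUSH, persisting previously completed writes too
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, data, 0)
        os.fdatasync(fd)
    finally:
        os.close(fd)


def durable_write_via_fua(path, data):
    # O_DSYNC: on supporting stacks the kernel can issue this write with
    # the FUA bit set instead, forcing only *this* write through the
    # cache, with no guarantee about earlier cached writes
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.pwrite(fd, data, 0)
    finally:
        os.close(fd)
```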
> Those are Linux concepts. What you're looking for is the actual NVMe commands.
I'm not sure what commands are being sent to the NVMe drive. But what you are describing as a flush would be F_BARRIERFSYNC - NOT the F_FULLFSYNC which you've been benchmarking.
Sigh, no. A barrier is not a full flush. A barrier does not guarantee data persistence; it guarantees write ordering. A barrier will not make sure the data hits disk and is not lost on power failure. It just makes sure that, on power failure, you can't end up with subsequent data present but prior data missing. NVMe doesn't even have a concept of barriers in this sense. An OS-level barrier can be faster than a full sync only because it doesn't need to wait for the FLUSH to actually complete; it can just maintain a concept of ordering within the OS and make sure it is preserved with interleaved FLUSH calls.
I don't know why you keep pressing on this issue. macOS has the same performance with F_FULLFSYNC as Linux does with fsync(). Why would they be different things? We're getting the same numbers. This entire thing started because fsync() on these Macs on Linux was dog slow and we couldn't figure out why macOS was fast. Then we found F_FULLFSYNC, which has the same semantics as fsync() on Linux. And now both OSes perform equally slowly on this hardware. They're obviously doing the same thing. And the same thing on Linux on non-Apple SSDs is faster. I'm sure I could install macOS on this x86 iMac again and show you how F_FULLFSYNC on macOS also gives better performance on this WD drive than on the M1, but honestly, I don't have the time for that; the issue has been thoroughly proven already.
Actually, I have a better one that won't waste as much of my time.
Plugs a shitty USB3 flash drive into the M1.
224 IOPS with F_FULLFSYNC. On a shitty flash drive.
58 IOPS with F_FULLFSYNC. On internal NVMe.
Both FAT32.
Are you convinced there's a problem yet?
(I'm pretty sure the USB flash drive has no write cache, so of course it is equally fast/slow with just fsync(), but my point still stands - committing writes to persistent storage is slower on this NVMe controller than on a random USB drive)
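For reference, the kind of micro-benchmark behind IOPS numbers like these is roughly the following sketch (block size and iteration count are arbitrary; this is not the exact tool used here):

```python
import os
import time


def synced_write_iops(path, iterations=50, block=b"\0" * 4096):
    # repeatedly overwrite one block, flushing after every write;
    # the achievable rate is dominated by the cost of the flush
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.monotonic()
        for _ in range(iterations):
            os.pwrite(fd, block, 0)
            os.fsync(fd)  # on macOS, fcntl(fd, F_FULLFSYNC) for real durability
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return iterations / elapsed
```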
It seems to be pretty apples to apples, they're running the same benchmark using equivalent data storage APIs on both systems. What are you thinking might be different? The Linux+WD drive isn't making the data durable? Or that OSX does something stupid which could be the cause of the slowdown rather than the drive? Both seem implausible.
Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually does push to the storage or is it possible that the controller cheats? For example, you mention the USB drive test on a Mac — how can we know that the USB stick controller actually does the full flush?
Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithmic issue, and Apple can fix it with a software update.
MacOS was really just FreeBSD with a fancier UI. Not sure what is the behavior now, but I'm pretty sure FreeBSD behaved almost exactly the same as a power loss rendered my system unbootable over 10 years ago.
I'm sorry, but this is incorrect. NeXTSTEP was the primary foundation for Mac OS X, and the XNU kernel was derived from Mach and, IIRC, 4.4BSD. FreeBSD source was certainly an important jumping-off point for a number of the Unix components of the kernel and CLI userland - there was some code sharing going on for a while (still?) - but large components of the kernel and core frameworks were unique (for better or worse).
4.3 - only Rhapsody incorporated elements from 4.4, but that was the tail end of NeXTSTEP, essentially the initial preview of macOS (it was released as Mac OS X Server 1.0, then forked to Darwin, from which the actual OS X 10.0 would be built; two major pieces missing from Rhapsody were Classic and Carbon, so it really was nextstep with an OS9 skin).
Thanks for the correction - man, has it been a long, long time. I had the Public Beta and then got on the OS X train pretty fast on a good old B&W G3. Even with the slowness, the multitasking still allowed working around it, and having all of Unix right there during the big rush of initial porting was really interesting - good times. I remember calling Apple for help getting Apache compiled and got forwarded right out of the regular call system to some dev whose name I sadly forget, and we worked through it.
Everything is a million times more refined and overall better now but I do have a bit of nostalgia for the community and really getting your hands dirty back then while still having a fairly decent fallback. I haven't actually needed to mess with kernel stuff since 10.5 or so but thinking back makes me wonder about paths not taken.
> so it [Rhapsody] really was nextstep with an OS9 skin
Sorry to be pedantic, but Rhapsody's user interface is modeled after the Mac OS 8 "Platinum" design language. Though 9 also was modeled on Platinum, Rhapsody's interface appears nearly identical to Mac OS 8's except for the Workspace Manager which doesn't exist in 8.
I like that. Fsync() was designed with the block cache in mind. IMO how the underlying hardware handles durability is its own business. I think a hack to issue a “full fsync” when battery is below some threshold is a good compromise.
It's important to read the entire document including the notes, which informs the reader of a pretty clear intent (emphasis mine):
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
This seems consistent with user expectations - fsync() completion should mean data is fully recorded and therefore power-cycle- or crash-safe.
That particular implementation seems inconsistent with the following requirement:
> The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes.
If I wrote that requirement in a classroom programming assignment and you presented me with that code, you'd get a failing grade. Similarly, if I were a product manager and put that in the spec and you submitted the above code, it wouldn't be merged.
> You are quoting the non-normative informative part
Indeed, I am! It is important. Context matters, both in law and in programming. As a legal analogy, if you study Supreme Court rulings, you will find that in addition to examining the text of legislation or regulatory rules, the court frequently looks to legislative history, including Congressional findings and statements by regulators and legislators in order to figure out how to best interpret the law - especially when the text is ambiguous.
> If I wrote that requirement in a classroom programming assignment and you presented me with that code, you'd get a failing grade.
It's a good thing operating systems aren't made up entirely of classroom programming assignments.
Picture an OS which always runs on fully synchronized storage (perhaps a custom Linux or BSD or QNX kernel). If there's no write cache and all writes are synchronous, then fsync() doesn't need to do anything at all; therefore `int fsync(int fd) { return 0; }` is valid, because fsync()'s method is implementation-specific.
This allows you to have no software or hardware write cache and not implement fsync() and still be POSIX-compliant.
> Context matters, both in law and in programming. As a legal analogy, if you study Supreme Court rulings, you will find that in addition to examining the text of legislation or regulatory rules, the court frequently looks to legislative history, including Congressional findings and statements by regulators and legislators in order to figure out how to best interpret the law - especially when the text is ambiguous.
The POSIX specification is not a court of law, and the context is pretty clear: fsync() should do whatever it needs to do to request that pending writes are written to the storage device. In some valid cases, that could be nothing.
> Picture an OS which always runs on fully synchronized storage (perhaps a custom Linux or BSD or QNX kernel). If there's no write cache and all writes are synchronous, then fsync() doesn't need to do anything at all; therefore `int fsync(int fd) { return 0; }` is valid, because fsync()'s method is implementation-specific.
Sure, I'll give you that, in a corner case where all writes are synchronized to storage before completing. However, most modern computers cache writes for performance, and the speed/security tradeoff is the context of this discussion. We wouldn't be having this debate in the first place if computers and storage devices didn't cache writes.
> The POSIX specification is not a court of law
Indeed, it isn't; nor is legislative text (the closest analogy in law). Hence the need for interpretation.
> fsync() should do whatever it needs to do to request that pending writes are written to the storage device
The wording here is quite subtle. Without SIO, fsync is merely a request, returning an error if one occurred. As the informative section points out, this means that the request may be ignored, which is not an error.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted.
Compare this to e.g. the wording for write(2):
> The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes. [yadadada]
This actually specifies that an action needs to be performed. fsync(2) sans SIO is merely a request form that the OS can respond to or not. And because macOS does not define SIO, you have to go out and find out what that particular implementation is actually doing and the answer is: essentially nothing for fsync.
It makes sense that a null implementation is permitted, to cover cases such as the one illustrated above where all writes are always synchronized. However, it violates the spirit of the law (so to speak), as discussed in the rationale, to have a null implementation where writes are not always synchronized (i.e., cached). As another commenter noted, the wording was not intended to give the implementor a get-out-of-jail-free card ("it was merely a request; I didn't actually have to even try to fulfill it").
There’s also the very likely possibility that the storage is lying to the OS, that the data that was accepted and which is in the buffer has been written somewhere durable while it’s actually waiting for an erase to finish or a head to get wherever it needs to be. There are disk controllers with batteries precisely for those situations.
And, if cheating will give better numbers on benchmarks, I’m willing to bet money most manufacturers will cheat.
Since crashes and power failures are out of scope for POSIX, even F_FULLFSYNC's behavior description would of necessity be informative rather than normative.
But, the reality is that all operating systems provide some way to make writes to persistent storage complete, and to wait for them. All of them. It doesn't matter what POSIX says, or that it leaves crashes and power failure out of scope.
POSIX's model is not a get-out-of-jail-free card for actual operating systems.
At least it is also implemented by Windows, which makes apt-get in a Hyper-V VM slower.
It's also unbearably slow for loopback-device-backed Docker containers in the VM, due to the double layer of cache. I just use eatmydata happily, because you can't save a half-finished Docker image anyway.
OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it? I don't have one at hand but IIRC it did.
At least the OSX man page admits to the detail.
The rationale in the POSIX document for a null implementation seems reasonable (or at least plausible), but it does not really seem to apply to general OSX systems at all. So even if they didn't define _POSIX_SYNCHRONIZED_IO it would be against the spirit of the specification.
I'm actually curious why they made fsync do anything at all though.
No problem - sorry if I came off harsh, I thought you were being pedantic :D
TBH, I'm not so sure it's that different. Scanning through the Linux docs, it seems this behaviour can be configured as part of the mount options (e.g. barrier on ext4). At least it's explicit on macOS (with compliant hardware).
> No problem - sorry if I came off harsh, I thought you were being pedantic :D
No, I just did a ctrl+F ctrl+C ctrl+V without thinking enough. No need to apologize though; my reply was actually flippant, and I should have been more respectful of your (correct) point.
> TBH, I'm not so sure it's that different. Scanning through the Linux docs, it seems this behaviour can be configured as part of the mount options (e.g. barrier on ext4). At least it's explicit on macOS (with compliant hardware).
I disagree (unless Linux short-cuts this by default). The reason is in the POSIX rationale:
*RATIONALE*
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
The first paragraph gives the intention of the interface. It's clearly to persist data.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure. The conformance document should identify at least that one configuration exists (and how to obtain that configuration) where this can be assured for at least some files that the user can select to use for critical data. It is not intended that an exhaustive list is required, but rather sufficient information is provided so that if critical data needs to be saved, the user can determine how the system is to be configured to allow the data to be written to non-volatile storage.
Now this gives a rationale for why you might not include it. And lists three examples of where it could be valid to water down the intended semantics. The system can not support it; the functionality is not required because data durability is guaranteed in other ways; the functionality is traded off in cases where major risks have been reduced.
OSX on a consumer Mac doesn't fit those cases.
Linux with that option is violating POSIX even by the letter, because presumably mounting the drive with -o nobarrier does not cause all your applications to be recompiled with the property undefined. But it's not that unreasonable an option; it's clearly not feasible to have two sets of all your software compiled and select one or the other depending on whether your UPS is operational.
Oh yeah, I definitely agree with you on this. If anything, you should be able to pass in flags to reduce resiliency - not have the default be that way. Maybe that's how the actual SIO spec reads (I haven't read it).
The implication (in fact, no, it's explicitly stated) is that this fsync() behaviour on OSX will be a surprise for developers working on cross-platform code or coming from other OSes, and will catch them out.
However, if in fact it's quite common for other OSes to exhibit the same or similar behaviour (BSD for example does this too, which makes sense as OSX has a lot of BSD lineage), that argument of least surprise falls a bit flat.
That's not to say this is good behaviour - I think Linux does this right - but the real issue is the appalling performance for flushing writes.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system.
> fsync() might or might not actually cause data to be written where it is safe from a power failure.
The fsync() function shall request that all data for
the open file descriptor named by fildes is to be
transferred to the storage device associated with the
file described by fildes. The nature of the transfer
is implementation-defined. The fsync() function shall
not return until the system has completed that action
or until an error is detected.
then:
The fsync() function is intended to force a physical
write of data from the buffer cache, and to assure
that after a system crash or other failure that all
data up to the time of the fsync() call is recorded
on the disk. Since the concepts of "buffer cache",
"system crash", "physical write", and "non-volatile
storage" are not defined here, the wording has to be
more abstract.
The only reason to doubt the clarity of the above is that POSIX does not consider crashes and power failures to be in scope. It says so right in the quoted text.
Crashes and power failures are just not part of the POSIX worldview, so in POSIX there can be no need for sync(2) or fsync(2), or fcntl(2) w/ F_FULLFSYNC! Why even bother having those system calls? Why even bother having the spec refer to the concept at all?
Well, the reality is that some allowance must be made for crashes and power failures, and that includes some mechanism for flushing caches all the way to persistent storage. POSIX is a standard that some real-life operating systems aim to meet, but those operating systems have to deal with crashes and power failures because those things happen in real life, and because their users want the operating systems to handle those events as gracefully as possible. Some data loss is always inescapable, but data corruption would be very bad, which is why filesystems and applications try to do things like write-ahead logging and so on.
That is why sync(2), fsync(2), fdatasync(2), and F_FULLFSYNC exist. It's why they [well, some of them] existed in Unix, it's why they still exist in Unix derivatives, it's why they exist in Unix-alike systems, it's why they exist in Windows and other not-remotely-POSIX operating systems, and it's why they exist in POSIX.
If they must exist in POSIX, then we should read the quoted and linked page, and it is pretty clear: "transferred to the storage device" and "intended to force a physical write" can only mean... what that says.
It would be fairly outrageous for an operating system to say that since crashes and power failures are outside the scope of POSIX, the operating system will not provide any way to save data persistently other than to shut down!
> the fsync() function is intended to force a physical write of data from the buffer cache
If they define _POSIX_SYNCHRONIZED_IO, which they don't.
fsync wasn't defined as requiring a flush until version 5 of the spec. It was implemented in BSDs loooong before then. Apple introduced F_FULLFSYNC prior to fsync having that new definition.
I don't disagree with you, but it is what it is. History is a thing. Legacy support is a thing. Apple likely didn't want to change people's expectations of the behaviour on OSX; they have their own implementation after all (which is well documented, which lots of portable software and libs actively use, and which is built into the higher-level APIs that Mac devs consume).
Depends on the definition of "storage device", I guess. If it's physical media, then OS X doesn't. If it's the controller, then OS X does. But since the intent is to have the data reach persistent storage, it has to be the physical media.
My guess is that since people know all of this, they'll just keep working around it as they already do. Newbies to OS X development will get bitten unless they know what to look for.
If you need to run software/servers with any kind of data consistency/reliability on OS X this is definitely something you should be aware of and will be a footgun if you're used to Linux.
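The standard workaround for this footgun (SQLite and other portable databases do a variant of it) can be sketched as follows; this is an illustrative helper, with `flush_to_disk` being a made-up name:

```python
import fcntl
import os

def flush_to_disk(fd: int) -> None:
    """Flush fd as far toward stable storage as the platform allows.

    On macOS, os.fsync() only pushes data to the drive, not through
    the drive's volatile write cache; F_FULLFSYNC additionally asks
    the drive itself to flush.
    """
    if hasattr(fcntl, "F_FULLFSYNC"):  # present on macOS/iOS only
        try:
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return
        except OSError:
            pass  # some filesystems reject F_FULLFSYNC; fall back
    os.fsync(fd)
```

On Linux this degrades to a plain fsync(), which (absent nobarrier-style mount options) already issues a drive cache flush.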
Macs in datacentres are becoming increasingly common for CI, MDM, etc.
I believe it’s at least 5s. Marcan didn’t specify how long it was, but gave an example of at least 5s. That could cause a device to think it’s allowed to do something via MDM but not actually have a record in the database allowing it to do so.
You don't just lose seconds of data. When you drop seconds of writes, that can effectively corrupt any data or metadata that was touched during that time period. Which means you can lose data that had been safe.
And with those hundreds of databases, we’re only learning about this behavior now instead of any of the previous decades where abundant errors would have caused a conversation.
Doesn’t seem like an issue worthy of hundreds of HN comments and upvotes, just people raising a stink over a non-issue.
If, out of a hundred power losses, the system is corrupted once, it's a choice whether you accept that or not. I do not. I want quality, and quality is a system that does not get corrupted.
I've used Macs for years and never been aware of it.
Note: the tweeter couldn't provoke actual problems under any sort of normal usage. To make data loss show up he had to use weird USB hacks. If you know you have a battery and can forcibly shut down the machine 'cleanly' it's not really clear what the need for a hard fsync is.
"Macs in datacentres are becoming increasingly common for CI, MDM, etc."
CI machines are the definition of disposable data. Nobody is running Oracle on macOS and Apple don't care about that market.
These days, best practice for data consistency / reliability in that environment, IIUC, is to write to multiple redundant shards and checksum, not to assume any particular field-pattern spat at the hard drive will make for a reliability guarantee.
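A toy sketch of that approach, with made-up function names and record layout: each replica stores the payload prefixed by its SHA-256, and a reader accepts the first replica whose checksum verifies.

```python
import hashlib
import os

def redundant_write(paths, payload: bytes) -> None:
    """Store payload, prefixed with its SHA-256, in several
    independent locations (illustrative sketch)."""
    record = hashlib.sha256(payload).digest() + payload
    for p in paths:
        fd = os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, record)
            os.fsync(fd)  # still worth asking; replication is the real safety net
        finally:
            os.close(fd)

def redundant_read(paths) -> bytes:
    """Return the first replica whose checksum verifies."""
    for p in paths:
        try:
            record = open(p, "rb").read()
        except OSError:
            continue  # replica missing or unreadable; try the next
        digest, payload = record[:32], record[32:]
        if hashlib.sha256(payload).digest() == digest:
            return payload
    raise IOError("no intact replica found")
```

In a real deployment the "paths" would be independent devices or machines, which is what makes the redundancy meaningful.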
The history is also interesting. It's not that "macOS cheats", but that it sincerely inherited the status quo of many years, then tried to go further by adding F_FULLFSYNC. However, Linux since got better, leaving macOS stuck in the past and everybody surprised. It's a big problem.
Docs [1] suggests that even F_FULLFSYNC might not be enough. Quote:
> Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
When building databases we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` itself isn't safe means that you cannot write a transactional database on a Mac; that is also a surprise to me.
Having a separate syscall is annoying, but workable. Having a scenario where we call flush and cannot ensure that this is the case is BAD.
Note that handling flush failures is expected, but all databases require that flushing successfully will make the data durable.
Without that, there is no way to ensure durable writes, and you might get data loss or data corruption.
Wow. Could this explain why we have a lot of problems with MySQL running on Mac OS with the databases randomly getting totally corrupted and basically needing to be restored from backup each time?
At first glance it seems to make sense: if someone shuts down while there is still unflushed data even though MySQL has issued an fsync(), it could leave the files on disk in a weird state when the power is cut. Am I missing something?
Maybe that's right, maybe it's not; impossible to tell from the snippet. I'm deeply suspicious when they start citing performance numbers on what is ultimately an ordering change, though.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
The best the OS can do is trust the device that the data was, indeed, written to durable storage. Unfortunately, many devices lie about that. If you do an `F_FULLFSYNC`, you can say you did your best, but the data is out of your hands now.
> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, it is also a surprise to me.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
No, not without that. Even with that, you can't have durable writes; not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and, depending on the criticality of the data, possibly quite far).
As soon as you have two database servers you're in much better shape, and many databases like to try to use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1.
So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! better to have two nvme boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.
What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows: The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days, maybe it made some sense to add a system call instead of adding more hardware, but modern systems aren't like that: It is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one, but maybe, if there's a case where fsync() helps today, it's a situation where that isn't true -- but even that is a long way from you need fsync() to have durable writes and avoid data loss or corruption.
> No, not without that. Even with that, you can't have durable writes; Not on a mac, or linux or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC because they do nothing to protect against hardware failure: The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
"The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.
Of course fsync ensures durable writes on systems like Linux with drives that honor FUA. The reliability of the device and stack in question is implied in this and anybody who talks about data integrity understands that. This is how you can calculate and manage error rates of your system.
> "The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.
I think most people understand that there is a huge difference between the sun exploding and a single hardware failure.
If you really don't understand that, I have no idea what to say.
> Of course fsync ensures durable writes on systems like Linux with drives that honor FUA
No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.
What is the difference in the context of your comment? The likelihood of the risk, and nothing else. So what is the exact magic amount of risk that makes one thing durable and another not, and who made you the arbiter of this?
> No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.
It does to anybody who actually understands these definitions. It is durable according to the design (i.e., UBER rates) of your system. That's what it means, that's always what it meant. If you really don't understand that, I have no idea what to say.
> The only thing that can help is writing it more than once.
This just shows a fundamental misunderstanding. You achieve a desired uncorrected error rate by looking at the risks and designing parts and redundancy and error correction appropriately. The reliability of one drive/system might be greater than two less reliable ones, so "writing it more than once" is not only not the only thing that can help, it doesn't necessarily achieve the required durability.
> What is the difference in the context of your comment? The likelihood of the risk, and nothing else. So what is the exact magic amount of risk that makes one thing durable and another not, and who made you the arbiter of this?
What's the difference between the sun exploding and a single machine failing?
I have no idea how to answer that. Maybe it's because many people have seen a single machine fail, but nobody has seen the sun explode? I guess I've never had a need to give it more thought than that.
> It does to anybody who actually understands these definitions. It is durable according to the design (i.e., UBER rates) of your system.
You are wrong about that: Nobody cares if something is "designed to be durable according to the definition in the design". That's just more weasel words. They care what are the risks, how you actually protect against them, and what it costs to do. That's it.
I was asking about the context of the conversation. And I answered it for you. It's the likelihood of the risk. Two computers in two different locations can and do fail.
> You are wrong about that: Nobody cares if something is "designed to be durable according to the definition in the design".
No I'm not, that's what the word means and that's how it's used. That's how it's defined in operating systems, that's how it's defined by disk manufacturers, that's how it's used by people who write databases.
> That's just more weasel words.
No it's not, it's the only sane definition, because all hardware and software is different, and so is everybody's appetite for risk and cost. And you don't know what any of those things are in any given situation.
> They care what are the risks, how you actually protect against them, and what it costs to do. That's it.
You seem to be arguing against yourself here. Lots of people (e.g., personal users) store a lot of their data on a single device for significant periods of time, because that's reasonably durable for their use.
There is a point at which a redundant array of inexpensive and unreliable replicas is more durable than a single drive. Even N in-memory databases spread across the world is more durable than a single one with fsync.
Unfortunately few databases besides maybe blockchains have been engineered with that in mind.
> There is a point at which a redundant array of inexpensive and unreliable replicas is more durable than a single drive. Even N in-memory databases spread across the world is more durable than a single one with fsync.
Unless the failure modes you are concerned about include being cut off from the internet, or your system isn't network-connected in the first place, in which case maybe not, eh?
Anyway surely the point is clear. "Durable" doesn't mean "durable according to the whims of some anonymous denizen of the other side of the internet who is imagining a scenario which is completely irrelevant to what I'm actually doing with my data".
It means that the data is flushed to what your system considers to be durable storage.
Also hardware failures and software bugs can exist. You can talk about durable storage without being some kind of cosmic-ray-denier or anti-backup cultist.
Say you have mirrored devices. Or RAID-5, whatever. Say the devices don't lie about flushing caches. And you fsync(), and then power fails, and on the way back up you find data loss or worse, data corruption. The devices didn't fail. The OS did.
One need not even assume no device failure, since that's the point of RAID: to make up for some not-insignificant device failure rate. We need only assume that not too many devices fail at the same time. A pretty reasonable assumption. One relied upon all over the world, across many data centers.
"but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability"
I believe drives that do have capacitors advertise the fact and complete flush commands immediately, without waiting on the flash. That's the point of this API.
Since neither Macs nor any other laptops have SSDs with capacitors, this point is kind of moot.
I have at various points replaced or upgraded 15 NVMe SSDs in desktops and laptops, and I have not seen a single one. Could you please let me know where I can find a non-server SSD with capacitors large enough for it to flush data in case of a sudden power loss?
Laptop batteries are irrelevant: battery failure, freezing, or cutting power to the circuit board by holding the off button are the failure modes you have to protect against.
> The fact that `F_FULLSYNC` isn't safe means that you cannot write a transactional database on Mac, it is also a surprise to me.
Yeah you can definitely write a transactional database without having to rely on knowing you've flushed data to disk. Not only can you, but you surely have to otherwise you risk data corruption e.g. when there's a power-cut mid-write.
The whole point of transactional flush to disk is that you get confirmation that data is now safe from power loss. You don't get any guarantee because you _called_ flush. The guarantee comes from flush returning.
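In code, that distinction is simply where the durability claim is made: only after the flush call has returned successfully. A minimal, illustrative commit routine (ignoring torn writes, checksums, and directory fsync, with `append_committed` being a made-up name) might look like:

```python
import os

def append_committed(path: str, record: bytes) -> None:
    """Append a record and only report success once fsync() has
    returned. If fsync() raises, the caller must treat the record
    as NOT committed; calling flush alone guarantees nothing.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # the durability point is this call *returning*
    finally:
        os.close(fd)
```

A real write-ahead log also fsyncs the containing directory when the file is first created, so the file's existence itself survives a crash.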
That's no defence. It fails the principle of least surprise. If everyone's experience is that fsync flushes, then why would somebody think to look up the docs for the Mac in case it does things differently?
Drive caches also used to not exist in the past. At that point, behavior was the same as it is on Linux today. It then regressed when drive caches became a thing.
Maybe it not being added to OSes when drive caches came into the picture was arguably a bug, and Linux has been the first OS to fix it properly. macOS instead introduced new, non-buggy behavior, and left the buggy one behind :-)
> Drive caches also used to not exist in the past. At that point, behavior was the same as it is on Linux today. It then regressed when drive caches became a thing.
You mean in the 1980s? There was no era in which Linux was in use and this wasn't a concern for sysadmins and DBAs. This concern has been raised for years: back in the PowerPC era the numbers were lower, but you had the same arguments about whether Apple had made the right trade-offs, or Linux, or Solaris, etc.
Given the extreme rarity of filesystem corruption being a problem these days, one might conclude that the engineers who assumed that batteries covered laptop users, and that anyone who cares about this would be using clustering / UPS, were correct.
The minute storage manufacturers introduced drive caches is the minute this bug became the responsibility of storage manufacturers. IMO it’s not the kernel’s responsibility.
Also re Linux, here's e.g. the PostgreSQL 9.0 documentation saying ext4/zfs + SCSI used the "SYNCHRONIZE CACHE" command with fsync even back then, with an equivalent SATA command being used by the storage stack with SATA-6 and later drives: https://www.postgresql.org/docs/9.0/wal-reliability.html
F_FULLFSYNC is nonstandard. As far as I know there is no standard-compliant way to get data onto stable storage on macOS. That's a bit of a problem. It makes a lot more sense to make the standard-compliant way actually sane.
It isn't. I already replied to you above. A barrier does not guarantee data durability and we already know Linux fsync() == macOS F_FULLFSYNC because they have the same (lack of) performance on the same hardware.
This sounds like Linux fsync() and Linux syncfs() respectively. What you say is that F_FULLFSYNC is the same as Linux fsync(), and your performance numbers back that up. Unfortunately, you would only see a difference between Linux fsync() and Linux syncfs() if you have files being asynchronously written at the same time as the files that are subject to fsync()/syncfs(). fsync() would only touch the chosen files while syncfs() would touch both. If you did not have heavy background file writes and F_FULLFSYNC really were equivalent to syncfs(), you would not be able to tell the difference in your tests.
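For reference, Linux exposes both operations. syncfs(2) is not wrapped by Python's os module, so a ctypes sketch (Linux/glibc only, purely illustrative) of the contrast looks like:

```python
import ctypes
import os

# glibc has wrapped the Linux-only syncfs(2) syscall since 2.14.
_libc = ctypes.CDLL(None, use_errno=True)

def syncfs(fd: int) -> None:
    """Flush the entire filesystem containing fd (Linux syncfs(2)),
    dirty files belonging to other processes included."""
    if _libc.syncfs(fd) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Contrast: os.fsync(fd) flushes only that one file's data and
# metadata, leaving unrelated dirty files alone.
```

The difference only shows up under concurrent background writes, which is exactly why a simple single-file benchmark cannot distinguish the two.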
That said, let’s look at how this actually works on Mac OS. Unfortunately, the apfs driver does not appear to be open source, but the HFS+ driver is. Here are the relevant pieces of code in HFS+:
First, let me start by saying this merits a facepalm. The fsync() operation is operating at the level of the mount point, not the individual file. F_FULLFSYNC and F_BARRIERFSYNC are different, but they both might as well be variants of Linux's syncfs().
For good measure, let us look at how this is done on the MacOS ZFS driver:
The file is properly synced independently of the mountpoint, such that other files being modified in the filesystem are not immediately required to be written out to disk. That said, both F_FULLFSYNC and F_BARRIERFSYNC on MacOS are mapped by the ZFS driver to the same function that implements fsync() on Linux:
It operates on the superblock, which is what MacOS’ HFS+ driver does.
From this, I can conclude:
Linux syncfs() == macOS F_FULLFSYNC on HFS+
Linux fsync() == macOS fsync()/F_FULLFSYNC/F_BARRIERFSYNC on ZFS
Also, MacOS F_BARRIERFSYNC is a weakened Linux syncfs(), and Apple's documentation is very misleading (although maybe not technically wrong). POSIX does allow fsync to be implemented via syncfs (sync in POSIX, but I am saying syncfs from Linux to be less confusing). However, not issuing and waiting for the completion of an IO barrier on fsync is broken behavior, like you claim.
I am not sure how MacOS APFS behaves. I imagine that additional testing that takes into account the nuances in semantics would be able to clarify that. If it behaves like HFS+, it is broken.
Edit: Upon further examination and comparing notes with the MacOS ZFS driver team lead, it seems that HFS+ is syncing more than requested when F_FULLFSYNC is used, but less than the entire filesystem. You are fine treating it as a Linux fsync. It is close enough.
I am sick of this callous and capricious disrespect for users and their data, rampant throughout this wanky industry.
Do lawyers use Apple computers? Do they work on important documents relating to life and death?
Some people have literally been executed because developers couldn't do their job properly. People have been sent to jail for decades because developers fucked up in the British Post Office scandal.
Average people live in a dangerous world: they work with documents about their financial wellbeing, they live in oppressive countries where being gay is punishable by death, they drive 2-ton death machines. Now that we have put computers in places where life and limb depend on them, we are responsible for doing the job properly; that's why we get paid.
These machines are actually low-power enough that you could implement a last-gasp flush mechanism. The Mac Mini already survives 1-2 seconds without AC power (at least if idle). You could plausibly detect AC power being yanked and immediately power down all downstream USB/TB3 devices and the display (on iMacs), freeze all CPUs into idle, and have plenty enough reservoir cap to let NVMe issue a flush.
But they aren't doing that. I tested it on the Mac Mini. It loses several seconds of fsync()ed data on hard shutdown.
This does require a last-gasp indication from the PSU to the rest of the system, so if they don't have that, it's not something they could add in a firmware update.
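The kind of harness behind a data-loss claim like that is simple to sketch (illustrative; the function names and record format here are made up): append fsync()ed sequence numbers until power is cut, then after reboot check which number actually survived.

```python
import os
import struct

def write_sequenced(path: str, count: int) -> None:
    """Append fixed-size sequence numbers, fsync()ing after each one.

    Run with a large count while yanking power: every number whose
    fsync() returned before the cut should survive, if fsync()
    really reached stable storage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for seq in range(count):
            os.write(fd, struct.pack("<Q", seq))
            os.fsync(fd)  # on macOS this does NOT flush the drive cache
    finally:
        os.close(fd)

def last_surviving_seq(path: str) -> int:
    """After reboot: the highest fully-written sequence number, or -1."""
    data = open(path, "rb").read()
    n = len(data) // 8  # ignore a possibly torn trailing record
    if n == 0:
        return -1
    return struct.unpack("<Q", data[(n - 1) * 8 : n * 8])[0]
```

The gap between the last number whose fsync() returned and the last number recovered is the window of silently dropped "durable" writes.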
> The ATX specification requires that the power-good signal ("PWR_OK") go high no sooner than 100 ms after the power rails have stabilized, and remain high for 16 ms after loss of AC power, and fall (to less than 0.4 V) at least 1 ms before the power rails fall out of specification (to 95% of their nominal value).
I don't think that quite works for the purpose. What you'd want is a second signal that goes low as soon as possible after loss of AC power.
My reading here is that PWR_OK going low is an indication that the PSU has stopped providing good power, and the CPU must shut down immediately, or it might miscompute something due to low voltage. At this point you absolutely don't want to do any last-minute writing, you'd be risking corruption.
What you need here is an early warning signal that you can react to while the PSU is still coasting on the internal capacitors.
16ms is just longer than one AC cycle at 60Hz and less than one AC cycle at 50Hz.
I would hazard a guess that 16ms is the practical limit for most consumer hardware (and maybe commercial computing) to detect mains loss.
Of course there is industrial hardware that can detect it quicker than this, but it would add a LOT of cost for arguably little gain, or for something that could be solved in another manner.
> I would hazard a guess that 16ms is the practical limit for most consumer hardware (and maybe commercial computing) to detect mains loss.
Doubtful. 16ms is an awfully long time these days. There's no reason why you couldn't detect power loss much sooner, given a good input signal. The concept also gets used quite often, in the form of SSRs with zero crossing detection. Those are used for dimmers.
The reason is likely related to the awful waveforms produced by some UPSes and inverters:
Unlike a nice sine wave, those spend a good while hovering near zero volts, so the PSU has to be able to tolerate that. Detecting loss of power sooner in this case isn't a question of cost, it's a question of that you don't have a good signal to do the detection on to start with.
> Unlike a nice sine wave, those spend a good while hovering near zero volts, so the PSU has to be able to tolerate that.
That wave chart was atrocious. I wonder if the extra load on the DC-side caps leads to them having lower life expectancy than the ones in a PSU attached to a proper power grid?
Power OK signals are used to prevent latch ups in silicon due to power glitches. The signals will route to power management ICs to ensure a full reset with proper bringing up of the power rails on any power glitch.
Shitty software? My 2018 Mac Mini would crash every single time going to sleep on the last version of Mojave. I'm not alone in this as there's huge threads on MacRumors and Apple's support forum about it. Apple's "fix" was to just update to Catalina which indeed fixes it but doesn't really help if you want to run 32 bit software. Wouldn't surprise me if they did something similar again.
It has started crashing the night after I upgraded to macOS 12.2.0. The latest update to 12.2.1 hasn't fixed it. I'm pretty sure it's not hardware related as I had no issues before the OS upgrade.
Edit: Here's the first line of the crash log (which I'm sending to Apple every time):
panic(cpu 3 caller 0xfffffe0023be8be0): [data.kalloc.16]:
element modified after free (off:0, val:0x0000000000000030, sz:16, ptr:0xfffffe2fffc9bb00)
What would you implement in Asahi? Would you follow Apple's approach and defer flushes, implementing a kernel panic hook and having some kind of F_FULLFSYNC or just keep Linux' current implementation?
We're probably going to have a knob to defer flushes (but still do them, unlike Apple, after a max timeout) that will be on by default on laptops, and make sure panics flush the cache if we can. Also apparently we need to do something for the power button too, as I just tested how macOS handles that. There is a warning before the system shuts down but we need to listen to it. Same with critical battery states.
Then I misunderstood. Do you mean that Apple doesn't implement ANY timeout? So they only flush when the cache is full or when a shutdown routine has started?
They flush the cache when something requests the cache be flushed; I don't know if there is a timeout, because presumably it's not difficult for some random process to issue a FULLFSYNC and flush everything prior as a side-effect (the flush is global). But I've seen at least 5-10 seconds of data loss from drive cache loss on the Mac Mini, so if they do do deferred flushes the timeout is longer than that.
WTF, that is worse than I thought then. That's the dirtiest hack I've read, and it's very low quality for a company like Apple. It's what I'd expect from a OnePlus device, not a full-fledged MacBook.
When do off-the-shelf NVMe controllers flush their internal DRAM buffer? I presume that happened after a timeout, even if the OS does not issue a NVMe flush command.
Does Apple implement the NVMe spec on their controller, i.e. do they indicate "Volatile Write Cache"?
Hmm, as slow as that is, does the controller support VERIFY? Because there is a FUA bit in VERIFY which forces the range to flush as well, so it could be used as a range flush. Depending on how they implement the disk cache, it's possible that's faster than a full cache walk (which is likely what they are doing).
This is one of those things that SCSI was much better at, SYNC CACHE had a range option which could be used to flush say particular files/database tables/objects/whatever to nonvolatile storage. Of course out of the box Linux (and most other OSs) don't track their page/buffer caches closely enough to pull this off, so that fsync(fileno) is closer to sync(). So, few storage systems implemented it properly anyway.
The choice of ignoring flushes vaguely makes sense if you assume the Mac's SSD is in a laptop with a battery. In theory the disk cache is then non-volatile (this assumption is made on various enterprise storage arrays with battery backup as well, although frequently it's a controller setting). But I'm guessing someone just ignored the case of the Mac mini without a battery.
I assumed the barrier was doing something like that, but marcan was able to inspect the actual NVMe commands issued and has confirmed that's not the case.
But that would be awesome, especially with these ever growing cache capacities.
APFS at least has metadata checksums to prevent that. However it does not do data checksums (weird decision...), despite being a CoW fs with snapshotting, similar to ZFS and btrfs.
What confuses me about this is: why are they so slow with F_FULLFSYNC? It's the equivalent of what non-Apple NVMe drives do under, say, Linux, and they manage to be much faster.
The OS does not matter; it's strictly about the drive. macOS on a non-Apple SSD should be equally fast with F_FULLFSYNC.
Indeed, I would very much like to know what on earth the ANS firmware is doing on flushes to make them so hideously slow. We do have the firmware blobs (for both the NVMe/ANS side and the downstream S5C NAND device controllers), so if someone is bored enough they could try to reverse engineer it... it also seems there's a bunch of debug mode options, so maybe we can even get some logs at some point.
Variants of the FSYNC story have been going on for decades now. The framing varies, but typically somebody is benchmarking IO (often in the context of database benchmarking) and discovers a curious variance by OS.
On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
> On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
That depends on the drive having power loss protection, which comes most of the time in the form of a capacitor that powers the drive long enough to guarantee that its buffers are flushed to persistent storage.
Consumer SSDs often do not have that, so flushing is really important there, at least if your data, or avoiding FS corruption, matters to you.
Enterprise SSDs almost always have power loss protection, so there it isn't required for consistency's sake, though in-flight data that hasn't hit the block device yet is naturally not protected by that; most filesystems handle that fine by default, though.
Note that Linux, for example, by default does a periodic flush every 30s, independent of caching/flush settings, so that's normally the upper limit on what you'd lose; depending on the workload that can still be a relatively long time frame.
Those VM tunables are about dirty OS cache, not dirty drive cache. If you fsync() a file on Linux it will be pushed to the drive and (if the drive does not have battery/capacitor-backed cache) flushed from drive cache to stable storage. If you don't fsync() then AIUI all bets are off, but in practice the drive will eventually get around to flushing your data anyway. The OS has one timeout for cache flushes and the drive should have another one, one would hope.
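The classic application-side recipe built on those fsync() semantics is write-to-temp, fsync, rename, then fsync the containing directory, so that either the old or the new version survives a power cut. A sketch (paths are illustrative):

```python
import os

def durable_replace(path, data):
    """Atomically replace `path` with `data`, surviving power loss.

    Write to a temp file, fsync it, rename over the target (atomic
    within a filesystem), then fsync the directory so the rename
    itself is durable. Ordering matters at every step.
    """
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # push the temp file's data to the drive
    finally:
        os.close(fd)
    os.rename(tmp, path)          # readers see old or new, never a mix
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the directory entry durable
    finally:
        os.close(dfd)

durable_replace("/tmp/state.json", b'{"ok": true}')
```

On macOS, per the discussion above, each `os.fsync` here would need to be an F_FULLFSYNC to actually reach stable storage.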
As you noted, Apple's fsync() behavior is defensible if PLP is assumed. Committing through the PLP cache isn't how these drives are meant to operate - hence the poor behavior of F_FULLFSYNC.
But this isn't specific to Macs and iDevices. Some non-PLP drives also struggle with sync writes on FreeBSD [1]. Most enterprises running RDBMS mandate PLP for both performance and reliability. I understand why this is frustrating for porting Linux, but Apple is allowed to make strong assumptions about how their hardware interoperates.
I mean, typical seek time on rust is O(10ms) and these controllers are spending 20ms flushing a few sectors. Obviously rust would do worse if you have the cache full of random writes, though. The problem here is the huge base cost.
Think about what's going on in the controller of any page-addressed SSD.
You have wear leveling trying to keep things from blowing holes in certain physical pages. In certain cell architectures you can only write to pages that have previously been erased. Once you do write the data to the silicon... it's not really written anyway, because the tables and data structures that map that to the virtual table the host sees on boot also have to be written.
It is entirely reasonable that a system that does 100k honest sustained write I/O per second would come to its knees if you're insistent enough to actually want a full, real, power cycle proof, sync.
To do an actual full sync, where it could come back from power off... requires flushing all of those layers. Nothing is optimized to do that. I'm amazed that it can happen 40 times per second.
It's possible that you could speed this up a bit, but somewhere there's an actual non-wear leveled single page of data that tells the drive how to remap things to be useful... I strongly suspect writing that page frequently would eat the drive life up in somewhere between 0.1 and 20 million cycles. After that point, the drive would be toast.
I agree with the other thread that actually flushing is likely to be a very, very well guarded bit of info.
Good question. I just started up a loop doing USB-PD hard reboots on my MBA every 18 seconds (that's about one second into the desktop with autologin on, where it should still be doing stuff in the background). Let's see if it eats itself.
Finding out whether a DFU restore can recover corrupted SSD storage would be an interesting test in and of itself!
But to be honest, if I end up really bricking a machine for science, that will be worth it for the information it gives us. Obviously I'm not trying to destroy my hardware, but I'm very grateful that I can afford it if it happens thanks to all the support I'm getting from folks for the project.
Laptops are fine unless your battery has issues and you get occasional power losses, which seems to be not too uncommon for third-party batteries (which themselves are not too uncommon since Apple will charge you an arm and a leg to replace half your laptop if you have a defective battery).
Bad batteries generally allow for last-gasp handling, and I've definitely seen the SMC throw a fit on some properties a few seconds before shutdown due to the battery being really dead. Not sure if macOS handles this properly, but I'd hope it does, and if it doesn't they could certainly add the feature. It would be quite an extreme case to have a battery failure be so sudden the voltage doesn't drop slowly enough to invoke this.
A fair fraction of the bad batteries I have seen have not behaved like this. Things like immediate power failure on disconnecting AC power, or claiming to be at 30% and then dying, or denying the existence of the battery altogether (two of these have happened to me personally—one at the ripe age of four months rather than due to age—and three or four to other family members). It’s certainly more common for them to just fade fairly rapidly to zero and die there, but it’s by no means rare for them to spontaneously fall over.
We're talking different timescales here. All you need is one second or so to command the NVMe controller to flush, and killing other power consumers in the meantime would buy you more time by reducing load, possibly even giving you several minutes given the way batteries work (they tend to fall over under load when defective/dead). What may visually appear as power suddenly failing isn't necessarily so at the scale of voltage threshold interrupts and PMICs.
What usually happens is battery internal resistance is too high to sustain a given power load, so once load crosses a threshold the system goes into a spiral of doom increasing current as battery voltage decreases and you end up in a shutdown. That's the "30% and suddenly 0% or a shutdown" scenario. But if you catch it before it's too late, you can just stop consuming power and let the NVMe controller flush.
The case I have in mind where it would suddenly die around 30% would happen around that point regardless of load, even asleep, after following a sufficiently typically linear discharge curve up to that point. Maybe the power management system gets a fraction of a second’s notice, I don’t know; but it wasn’t a 30% plummeting to zero over the course of ten or thirty seconds, or even a “30%; no—0%; no—dead” case, which seem to be the much more common failure modes. As for the “pull the AC power and it instantly dies” cases, I’m a layman in battery matters, with no more than high school electronics, but I’d be surprised if there’s enough in there for it to do anything—those are cases where either it literally has no battery to draw on (because it’s electronically dead), or thinks it has a battery but discovers as soon as it tries to draw on it that it effectively doesn’t actually.
If it's literally dying at 30% with no warning, it's either the battery polling being too slow (keep in mind the UI will usually only refresh once a minute or so for these things; the power management system has faster stats), or the charge estimation being way off. There's very little reason for a battery to drop from true 30% SoC to completely dead, without first going into a power draw spiral of doom which you can revert if you stop consuming as much power.
“30% to 0” and “Pull AC and it instantly dies” are typically a combination of load and device temperature. High CPU/GPU usage, high brightness, 3G/LTE usage, and cold temps and the device doesn’t have a chance.
It’s been somewhat fascinating to monitor power usage in this really crude way. TikTok on iOS, for example, uses so much power that it’s the most likely to cause the device to shut off. FB Messenger is in the top 5. Some of Apple’s background processes will also cause it, as will paging memory to disk.
There’s another bit of information that will not surprise many people on HN: high-amperage charging will cause the battery percentage to be “more wrong”. Devices will report 45% or higher and still die as if they were reporting 30%. Charging at 500mA will not only make it “more correct”, but will typically mean that a device will not suddenly die until it’s in the single digits.
Does anyone here run a desktop Mac without a battery backup device?
All of my Macs are either laptops or have a hardware backup device, so it's unlikely a write would be lost due to power failure (unless the backup device itself failed, which could happen).
Where I live the power is quite dirty, so even when power losses are measured in years, I invest in line-filtering UPSes to extend the life of my systems.
I even lost a MBP to a light flickering event with 0 power loss. Fried the charging circuit straight through the original power brick.
Laptops have batteries, so an AC power failure doesn't mean they immediately crash: they just keep running on battery until the battery gets low, at which point the system cleanly hibernates.
As a laptop user I would probably opt to make the same choice as Apple here. I like the idea mentioned to allow a tunable parameter to only allow ever losing 1 second of data.
Although, I also have the seemingly rare opinion here that ECC ram doesn't really matter on a laptop or desktop.
NVMe even allows making queues write-through, so e.g. the kernel/FS driver could access the drive via a safe queue whose writes always reach stable storage.
You can also prioritize queues to lower the chances of important data being lost, though Apple seems to be super aggressive about caching and the drives tend to keep some written data in cache for quite long intervals.
You presumably don't reboot your laptop by connecting a USB-PD gadget that issues a hard reset. A normal OS reboot is fine, that will flush the cache.
The most common situation where this would affect laptops, in my experience so far, would be a broken driver causing a kernel lockup (not a panic) which triggers a watchdog reboot. That situation wouldn't allow for an NVMe flush.
For products like the Mac Mini, which don’t have a battery, does this mean that a loss of mains power will cause data loss? Because brownouts do happen occasionally…
Yes. I've tested yanking the power and can easily see 5 seconds of data loss for data that was fsync()ed (but not full synced). I'm not sure yet if corruption due to reordering is also possible, but it seems likely.
I just tested that. Holding down the power button invokes a (somewhat special) btn_rst kernel panic before it has a chance to invoke a true hardware reset, and kernel panics involve an NVMe driver hook which I'm pretty sure issues a flush. Should be safe.
At least re: this issue. It's still a bad idea, because it's only safe if all software is written following data-integrity and flush rules to the letter, and most software isn't. You're eventually going to run into issues on any OS by doing that, because most software doesn't get this right unless it's a database. And you're still going to lose data that's in the buffer cache; I'm pretty sure that won't get flushed.
See, that's a forced shutdown, a last resort measure; it's using a sledgehammer to tap in a nail. You shouldn't do that as a habit, even if this particular optimization issue wasn't a thing.
I mean I grew up diligently turning off my PC by parking the disk and using the various operating system level shutdown procedures. Nowadays I smack the off button, but that still just triggers the OS shutdown procedure. I don't turn my Mac off as a rule, its sleep mode actually works. ish.
If they're like me: outside of a software update I only reboot when the machine is not responding, at which point hard reboot is faster and more robust. I recognize it's not ideal, but I also don't think it's reasonable for the system to ever get to a point where I should be wanting to restart to "fix" it - and I would think it is a serious bug if doing so ever corrupted the system or lost any "saved" data.
Linux Magic SysRq + R S E I V B key chord will immediately shut down while still properly flushing disk cache and such. A bit annoying to enter, but a handy tool to have in your toolbox.
That's not the right keys and not the right order. You should not flush caches before you've terminated as many processes as possible cleanly. And B reboots at the end rather than shutting down.
REISUB for a somewhat safe EMERGENCY reboot and O instead of B at the end for shutdown.
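To keep the ordering straight, here's the sequence as a small reference table (descriptions paraphrased from my recollection of the kernel's SysRq documentation — verify against your kernel's docs before relying on it):

```python
# The Magic SysRq emergency sequence, in order. Each key is sent as
# Alt+SysRq+<key>, pausing a few seconds between steps.
REISUB = [
    ("r", "unRaw: take keyboard control back from the display server"),
    ("e", "tErminate: send SIGTERM to all processes"),
    ("i", "kIll: send SIGKILL to any stragglers"),
    ("s", "Sync: flush dirty buffers to disk"),
    ("u", "Unmount: remount all filesystems read-only"),
    ("b", "reBoot: immediate reboot"),  # substitute "o" to power Off instead
]

# The point made above: processes are terminated *before* the sync and
# read-only remount, so their final writes make it to disk.
assert [key for key, _ in REISUB] == list("reisub")
```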
> There's zero reason to reboot any modern OS daily.
- I use Arch; I like to avoid accumulating too many major updates between reboots.
- For a time I was facing a bug that resulted in a black screen of death after resuming sleep.
In my use. Yes. I didn’t realize this was the reason until I saw this thread, and now I’ve tested it. Luckily, I don’t do massive data transfers nor do I do any large data work. When I got my M1 Mac Mini, however, I did and had immediate buyer’s remorse. I thought that I/O must be terrible on this thing, and I felt cheated. After the initial stand-up, I wasn’t so angry. For most tasks, it’s faster than my old TR4 1950X.
Sure, but I do not think it's due to their feeling that their own software is inferior. I think much more of that is cost. They needn't pay to develop yet another OS variant, and instead benefit from the open source community and their past contributions to said community.
They wouldn't really need to develop a variant. Plenty of people used to run servers on macOS just configured to be headless. It just doesn't meet the standard anymore.
And they used to have the XServe line of servers and storage. I used XServe RAID storage as inexpensive JBODs for some Sun servers a decade ago, they were quite nice machines.
The funny thing here is that battery-backed enterprise systems are worse off in that manner, because you're much more likely to notice a dying battery that your entire device relies on than the little battery pack hooked up to your RAID array.
Sure, you could write a program that periodically checks the battery state (you'd have to poll, since there's no ACPI notification like with a "device battery") and sends an email to the admin or something. However, that's a tool that doesn't "exist" (as in, there isn't a notable program that does so), which possibly hints that this isn't something sysadmins often do.
The above also requires there to be an interface available from userland, not only in the management firmware or BIOS/UEFI. That exists for HP, but I'm not sure all other OEMs provide one.
To emulate a flushing SSD, the signal really needs to go directly to the SSD firmware so it can decide which is the last OS write it can accept while still having enough power to persist all write and flush requests it has already accepted.
Getting all that right sounds so hard that it's probably better to just have enterprise SSDs include a built-in supercap giving 5 seconds or so of power to do all the necessary flushing; laptop/desktop-grade SSDs only need to offer barriers for data consistency. Laptop and desktop users don't care if they lose the last second of data before a crash, as long as what is on the drive is self-consistent.
I should've been a little clearer; by "enterprise systems" I was referring to RAID controllers and the like. Though yes, I believe enterprise SSDs/NVMes likely have a capacitor or, as one friend put it, an "overkill battery" to use for flushing data.
To be fair though, I sidetracked from the discussion at hand. The issue Marcan described was regarding the OS -> Disk rather than a "power loss situation". The latter does play in with the former, but solving the latter doesn't necessarily solve the former.
Enterprise system have monitoring through the BIOS which will send an email, expose the status via SNMP and other method of monitoring (same as having a faulty fan).
Correct me if I'm wrong, but I wouldn't call the management engine (e.g. HP iLO) the BIOS. While those may support such warnings:
1) Not everyone wants to use iLO or whatever equivalent another OEM provides.
2) Whilst such systems do support sending warnings about system components via email, dashboards, etc. that doesn't mean they'll necessarily warn about a RAID controller's battery being depleted. If I remember correctly, iLO4 doesn't.
3) What about RAID cards like the P420 (*not* the P420i) that either aren't hooked up to a management engine or are from an entirely separate OEM?
>1) Not everyone wants to use iLO or whatever equivalent another OEM provides.
Then you aren't an enterprise because they're absurdly useful for managing dozens/hundreds/thousands of systems.
>3) What about RAID cards like the P420 (not the P420i) that either aren't hooked up to a management engine or are from an entirely separate OEM?
There's a reason enterprises standardize on a common infrastructure from an OEM that supports everything in the box even though you could go on Newegg and build your own systems for thousands of dollars less.
That's the first time I've heard of batteries [for RAID controllers] having an entirely separate port than that which hooks them up to the controller. Is this a "there are some of X" or have I just been out of the loop?
The dirty secret about today's high density NAND is that tPROG is not fast. It's an order of magnitude slower than the heyday of SLC. Now that doesn't really matter for enterprise drives, they complete writes into very fast storage that is made durable one way or another (e.g., flush on power fail), and this small store gets streamed out to the NAND log asynchronously. This is why random single queue depth durable writes can actually be faster than reads on enterprise drives, because random reads have to come from NAND (tREAD is still very fast, just not as fast as writing to DRAM).
Apple may not implement such a durable cache; that's fine, it's not an enterprise device and it's a cost tradeoff. So they might have to flush to NAND on any FUA, and that's slow as we've said, but not 25ms slow. Modern QLC NAND tPROG latency is more like 2.5ms-5ms, which could just about explain the EVO results when you include the OS and SATA stack and drive controller.
There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion. It would have been a deliberate design choice for some reason. One possible reason that comes to mind is that some drives gang a bunch of chips in parallel and you end up with pretty big "logical" pages. Flushing a big logical page on a 4kB write is going to cause a lot of write amp and drive wear, so you might delay for a short period (20ms) to try to pick up other writes and reduce your inefficiency.
Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate. The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion
There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
And then they put this hardware into desktops. Oops :-)
Lots of things about the M1 were rushed and have been fixed along the way. I wouldn't be in the least bit surprised if this were one more of them that gets fixed a couple macOS versions down the line, now that I've made some noise about it.
> Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate.
How are you measuring that and how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.
> The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
Yes. It is clear the hardware was never optimized for it. Because it is so slow. I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice. It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.
NAND, the controller, and the FTL just aren't rocket science such that you'd have hardware that can sustain the rates Apple's can and then, through some crazy unforeseen problem, suddenly go slow. Flushing data out of your cache into the log is the FTL's bread and butter. It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush; it's the same hardware data and control paths, the same data structures in the FTL firmware, and would use most of the same code paths even.
Pull blocks from the buffer in order and build pages, allocate pages in NAND to send them, update forward map, repeat.
powermetrics gives you DRAM bandwidth per SoC block, before and after the system level caches.
> how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.
I mean they're not just being held off. It's doing something, not waiting.
> Yes. It is clear the hardware was never optimized for it.
This is a firmware issue. The controller runs on firmware. I can even tell you where to get it and you can throw it in a decompiler and see if you can find the issue, if you're so inclined :-)
> I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice.
Delaying the update does not explain 10MB/s of memory traffic. That means it's doing something, not waiting.
> It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.
Indeed, thus it's highly likely this is a dumb firmware bug, like the FLUSH implementation being really naive and nobody having cared until now because it wasn't a problem on devices where nothing flushes anyway.
> NAND and the controller and FTL just isn't rocket science that you'd have hardware that can sustain the rates that Apple's can and then through some crazy unforeseen problem this would suddenly go slow.
Yup, it's not rocket science, it's humans writing code. And humans write bad code. Apple engineers write bad code too, just take a look at some parts of XNU ;-)
> Flushing data out of your cache into the log is the FTL's bread and butter.
Full flushes are rare on devices where the cache can be considered persistent anyway because there's a battery and the kernel is set up to flush on panics/emergency situations (which it is). Thus nobody ever ran into the performance problem, thus it never got fixed.
> It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush, it's the same hardware data and control paths, the same data structures in the FTL firmware and would use most of the same code paths even.
The dumbest cache implementation is a big fixed size hash table. That's easy to background flush incrementally on capacity, but then if you want to do a full flush you end up having to do a linear scan even if the cache is mostly empty. And Apple have big SSD caches - on the M1 Max the NVMe carveout is almost 1 gigabyte. Wouldn't surprise me at all if there is some pathological linear scan going on in the case of host flush requests, or some other data structure issue. Or just an outright bug, a cache locality issue, or any other number of things that can kill performance. It's code. Code has bugs and performance issues.
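As a toy illustration of that pathology (purely a sketch of the data-structure argument, not Apple's actual firmware), compare a flush that scans a fixed-size table against one that tracks dirty entries:

```python
CAPACITY = 1 << 20  # fixed number of cache slots

class NaiveCache:
    """Full flush linearly scans every slot, even when mostly empty."""
    def __init__(self):
        self.slots = [None] * CAPACITY

    def write(self, lba, data):
        self.slots[lba % CAPACITY] = (lba, data)

    def flush(self):
        # O(CAPACITY) no matter how little is dirty -- the pathology.
        return sum(1 for entry in self.slots if entry is not None)

class DirtyTrackingCache:
    """Full flush only walks entries written since the last flush."""
    def __init__(self):
        self.slots = {}
        self.dirty = set()

    def write(self, lba, data):
        self.slots[lba] = data
        self.dirty.add(lba)

    def flush(self):
        flushed = len(self.dirty)  # O(dirty), independent of capacity
        self.dirty.clear()
        return flushed
```

With a ~1GB carveout and only a handful of dirty pages per host flush, the naive structure does capacity-proportional work on every FLUSH command while the dirty-tracking one does almost none, which is the shape of the performance gap being speculated about.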
Right, so you don't really know what it's doing at all. That it does something different is expected.
> Indeed, thus it's highly likely this is a dumb firmware bug, like the FLUSH implementation being really naive and nobody having cared until now because it wasn't a problem on devices where nothing flushes anyway.
I don't think that's highly likely at all. I think it's highly unlikely.
> Yup, it's not rocket science, it's humans writing code. And humans write bad code. Apple engineers write bad code too, just take a look at some parts of XNU ;-)
I'm not some Apple apologist. I think their fsync() thing is stupid (although I'm very surprised you didn't know about it and that it took you so long to check the man page; it's an old and well-known issue and I don't even use or program for OSX). The hardware is clearly not very good for the task of a non-battery PC (even on batteries I think it's a questionable choice unless they can flush data in case of an OS crash or low-battery shutdown). I also think their kernel is low-performing and a poor Frankenstein mishmash of useless microkernel bits. So you're not getting me on that one.
> Full flushes are rare on devices where the cache can be considered persistent anyway because there's a battery and the kernel is set up to flush on panics/emergency situations (which it is). Thus nobody ever ran into the performance problem, thus it never got fixed.
I never said the hardware was suitable for this type of operation.
> The dumbest cache implementation is a big fixed size hash table. That's easy to background flush incrementally on capacity, but then if you want to do a full flush you end up having to do a linear scan even if the cache is mostly empty.
I can think of dumber. A linked list you have to search.
This approach is really bad even if you don't have any syncs, because you still want to place LBAs linearly even on NAND; otherwise your read performance on large blocks suffers.
The fact that you can come up with a stupid thing that might explain it isn't a very good argument IMO. Sure, that might be the case; I didn't say it was impossible, just that I didn't think it was likely. You're saying it's certainly the case. I don't think there's enough evidence, at best.
Look, it's just logic. There's a couple of pages in cache. It has to flush them. Finding them and doing that doesn't take 10MB/s of memory traffic and 20ms unless you're doing something stupid. If it were a hardware problem with the underlying storage, it wouldn't be eating DRAM bandwidth. The fact that it's doing that means it's doing something with the data in the DRAM carveout (cache) which is much larger/more complicated than what a good data structure would require to find the data to flush. The bandwidth should be 0.3MB/s plus a negligible bit of overhead for the data structure parts, which is the bandwidth of the data being written (and what you get if you do normal writes without flushing at the same rate). Anything above 1MB/s is suspicious, never mind 10MB/s.
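A quick sanity check of those numbers (all figures taken from the comment above, not independently measured):

```python
# Figures quoted in the parent comment, not measured here.
flushes_per_sec = 40        # observed flush rate
payload_bw = 0.3e6          # bytes/sec of data actually being written
page_size = 4096            # bytes per NAND-visible page

bytes_per_flush = payload_bw / flushes_per_sec   # 7500 bytes
pages_per_flush = bytes_per_flush / page_size    # ~1.8 pages

# Each flush covers roughly two 4 KiB pages of real data, which makes
# ~20ms and ~10MB/s of DRAM traffic per flush look disproportionate
# to the work that actually needs doing.
assert 1 < pages_per_flush < 3
```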
The logic is flawed, though. You don't have the evidence or logic to show it's certainly a bug or due to stupidity or oversight. I also don't know for certain that it's not, which I'll acknowledge.
And if it was a strange forward map structure that takes a lot of time to flush but is fast or small or easy to implement, that actually supports my statement. That it was a deliberate design choice. Not a firmware bug. Gather delay was one example I gave, not an exhaustive list.
> And if it was a strange forward map structure that takes a lot of time to flush but is fast or small or easy to implement, that actually supports my statement. That it was a deliberate design choice.
By that logic, every time a programmer uses an inefficient data structure and introduces pathological performance it's not a bug, it's a "deliberate design choice".
At this point we're arguing semantics. My point is it's slow when it shouldn't be, and it can be made faster. Whether it's a "bug" or not comes down to whether Apple fixes it or not. I consider it a bug in my book because you don't normally design things to be 10-100x slower than the competition. It's too blatant not to be an oversight.
And I've seen Apple make many oversights in the past year and fix them in an update. There is plenty of evidence the platform was rushed and lots of things were full of jank in early macOS 11 that got fixed on the way to 12, and more are still being fixed. This would be one of many and completely in line with history so far. It's why we're requiring 12.1+ firmware as a baseline for Asahi going forward, because many things were fixed and I don't want to deal with the buggy versions.
> By that logic, every time a programmer uses an inefficient data structure and introduces pathological performance it's not a bug, it's a "deliberate design choice".
Not if it's efficient for its primary use cases. Lots of data structures have pathological behavior, including in the Linux kernel.
The details get very complicated and proprietary. NAND wears out as you use it. But it also has a retention time: it gradually loses charge and won't read back if you leave it unpowered for long enough. This is actually where enterprise drives can be spec'd worse than consumer ones. So durability/lifetime is specified as meeting given uncorrected error rates at the given retention period. The physics of NAND are pretty interesting too, as is how that translates into how a controller optimizes these parameters. Temperature at various stages of operation and retention changes properties; time between erase and program does too. You can adjust voltages on read, program, and erase, and those can help you read data out or change the profile of the data. Reading can disturb parts of other pages (similar to rowhammer). Multi-level cells are interesting too: some of them you program in passes, so that's a whole other spanner in the works.
I don't know of a good place that covers all that, but much beyond "read/program/erase + wear + retention" is probably beyond "what every programmer should know".
The way you turn a bunch of NAND chips that have a "read/program/erase" programming model into something that has a read/write model (the flash translation layer, or FTL) is a whole other thing again, though. And all the endurance management and optimization, error correction... Pretty fascinating details, really. The basic idea, though, is that they use the same concepts as a log-structured filesystem; it turns out a log structure with garbage collection is about a perfect fit for turning the program/erase model into a random-write model. That's probably what every programmer should know about that (assuming you know something about LSFs -- garbage collection, write amplification, forward and reverse mapping schemes, etc.).
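A toy FTL along those lines (grossly simplified — no erase blocks, wear leveling, or garbage collection) might look like:

```python
class ToyFTL:
    """Minimal log-structured flash translation layer sketch.

    The host sees a read/write block device; NAND only supports
    program-once pages. Every host write is appended to the log, and a
    forward map tracks where each logical block address (LBA) currently
    lives. Overwritten pages become garbage for a collector to reclaim
    (omitted here).
    """
    def __init__(self, num_pages):
        self.nand = [None] * num_pages  # physical pages, program-once
        self.forward_map = {}           # LBA -> physical page index
        self.write_ptr = 0              # head of the log

    def write(self, lba, data):
        page = self.write_ptr           # always append, never rewrite in place
        self.nand[page] = data
        self.write_ptr += 1
        self.forward_map[lba] = page    # old mapping (if any) is now garbage

    def read(self, lba):
        return self.nand[self.forward_map[lba]]

ftl = ToyFTL(num_pages=16)
ftl.write(5, b"v1")
ftl.write(5, b"v2")   # the overwrite lands on a fresh physical page
assert ftl.read(5) == b"v2"
```

Even this sketch shows why a "flush" is more than dumping data: the forward map itself has to be persisted consistently, which is where the earlier speculation about metadata costs comes in.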
"What every programmer should know" in this context echoes Drepper's framing of that set of things to know, i.e.: yes it's hard, and yes you really should know it; you're a professional programmer. Storage is a little further away than memory, but it's still very important in certain lines of work.
> Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff.
I disagree with this - my Apple is an enterprise device. It's a Macbook Pro, issued by my employer, to do real work. I wouldn't give Apple a pass on this dimension. I get that the "Pro" label doesn't mean what it used to, but these aren't toys either.
Slightly related: if a drive runs with a properly journaled, fully checksummed filesystem, for example zfs or btrfs - does the write-through mode guarantee that you can only lose new data and not corrupt the old?
ZFS is not journaled. CoW eliminates the need for anything like a journal with the exception of synchronous IO, where an intent log is used that can be replayed after a power loss event.
In any case, ZFS should be fine as long as REQ_PREFLUSH is working properly. You can read a little about that here:
No, you won't see corruption on ZFS. Cutting power to the drive is always safe, you can slice a SATA cable with a guillotine if you want, you'll always see a consistent state of the filesystem. ZFS transactions are entirely atomic.
ZFS (and btrfs) is not "journaled", it's copy-on-write.
You won't see corruption of the filesystem itself, but you'll see data corruption as described in the thread. If the writes are delayed, the write ordering can get messed up. Plus, ZFS has the ZIL, which is basically a journal equivalent.
Journals are used to protect filesystem metadata. The ZIL is used only to protect data.
You will not see any data corruption on ZFS as long as the underlying hardware implements REQ_PREFLUSH correctly and the software uses proper POSIX semantics. If no filesystem corruption is happening, then the stuff under ZFS is doing its job correctly and your problem is in userspace.
Following a crash, ZFS returns to a past good state. Any completed synchronous writes or writes protected by a completed fsync will be there. Any of those that did not complete can be expected to be lost (unless it was moments away from returning to userspace) and any non-synchronous IO that occurred in the last several seconds is allowed to disappear.
By default, non-sync IO is flushed to permanent storage every 5 seconds. A past good state is not something that I would call corruption and software is expected to be able to resume from the past under POSIX.
> Of course, in normal usage, this is basically never an issue on laptops; given the right software hooks, they should never run out of power before the OS has a chance to issue a disk flush command
I guess a UPS powerbackup would be useful. Laptops basically have built-in UPS which is perhaps why Apple has gone in that direction. I wonder if their high-end desktops with Apple Silicon will do something different there.
I'm not entirely sure how a UPS works with a computer, nor exactly how this flushing works, but doesn't a UPS run for a while in the event of a power disruption, certainly longer than the several-second delay of this flushing?
I dug a bit further and the NVMe controller is doing about 6.2MB/s of DRAM reads and 10MB/s of DRAM writes while doing a flush loop like this (which it isn't doing with the same traffic sans the flushes). I wonder if it's doing something dumb like linear scanning a cache hash table to find things to flush... or maybe something with bad cache locality?
I'm pretty sure, whatever it is, Apple could fix it in a firmware update.
> fsync() will both flush writes to the drive, and ask it to flush its write cache to stable storage.
Can someone explain what "flushing write cache to stable storage" means? Isn't that the same as "writes to the drive". I am obviously not well versed in this area. Also what is stable storage? Never heard that term before.
SSDs and other storage drives have two layers (or more). The last layer is stable storage (= when you disconnect power no data is lost or corrupted). When you write to such a device your writes are first made in an earlier layer that is more like your computer’s main memory than actual storage (when you lose power your data is gone or corrupted). Only after time or when the cache is full an actual persistent write is made.
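A rough sketch of those layers in code. Hedged: `F_FULLFSYNC` is a macOS-only `fcntl`, so this looks it up defensively and simply stops at plain `fsync()` on other platforms; the file path is just a throwaway for illustration.

```python
import os
import fcntl
import tempfile

# The three durability levels described above, from weakest to strongest.
path = os.path.join(tempfile.mkdtemp(), "durability-demo.txt")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.write(fd, b"hello\n")          # 1. data is in the OS page cache only

os.fsync(fd)                      # 2. data handed to the drive; on macOS it
                                  #    may still sit in the drive's volatile
                                  #    write cache

F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)
if F_FULLFSYNC is not None:       # 3. macOS only: ask the drive to flush its
    fcntl.fcntl(fd, F_FULLFSYNC, 1)   # write cache to stable storage as well

os.close(fd)
```

Only step 3 corresponds to "flushing the write cache to stable storage"; steps 1 and 2 can both still lose data on power cut, depending on the platform and drive.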
It's still not clear why the Apple SSD is so slow. Surely there's more to it. Maybe other SSDs are cheating in firmware? Or maybe it's just a bug in Apple's firmware? I'm really interested to see whether there are any follow-ups on Apple's side.
Since this design is inherited from iDevices, my guess is they never bothered to optimize this command since software on a battery-powered device would almost never need to issue it. It should be something they can improve in firmware.
From my understanding, the thing that’s slow is writing data to “permanent storage” (aka the layer under all the caching).
Some storage tech is just slow at that, and manufacturers muddy the water by rating some (SSDs|Micro SDs|whatever) in GB/s overall when much of those big numbers are a combination of caches and trickery.
I would not be surprised if Apple is using a tech that just has slow write speeds in trade for fast read speeds since most Apple users will be happy with faster read speeds.
I'm not here to defend Apple, but if you have a desktop and you don't want to lose data, then get a UPS. Proper write handling on the disk won't help if you haven't saved your doc in ten minutes.
They do use system RAM as cache, but that has no effect on performance. If anything it should be way faster than the puny RAM cache chips on typical SSDs. It doesn't explain the slow flush perf.
That is only for enterprise SSDs. Consumer SSDs do not have capacitor banks to do a full DRAM flush. Some have capacitor banks to ensure data at rest integrity and some use another mechanism for that, but I'm not aware of any that will guarantee full cache stability.
Honestly I don’t know. The order-of-magnitude performance difference in deferring the flush feels worth it to me if the risk is mitigated to sudden power loss.
I would think when the last of Apple’s hardware moves to ARM they’ll ensure there’s enough onboard battery to ensure the flushes happen reliably across form factors even if there’s a power cut.
If anything, now that the reason for the performance difference has been identified, I’d hope to see numbers for Linux and Windows storage access come up to par with these numbers as they go down this road too (e.g. via the NVME flush toggle mentioned in the article).
Yeah. If the same thing happened on a brandless garbage SSD you purchased from AliExpress, it would clearly be a cheat, plainly malicious and incompetent, but the Apple tag certainly made us believe there is some other explanation.
Trading correctness for performance in a storage stack, without shouting "YOUR DATA IS NOT SAFE WHEN YOU DO THIS" at the users multiple times a day, is benchmark snake oil. Period.
I’m not arguing for or against, I’m just pointing out that trading the possibility of data loss in the few seconds after a power cut for a difference of this magnitude actually makes sense in a lot of use cases.
My point above was that the same “cheat” (to use your word) could be applied to the unbranded SSD too, with similar performance gains.
I’m not giving Apple a pass for low flush performance, I’m saying there’s nothing I can see here that’s uniquely available to Apple that would prevent others from deferring flushes in the same way for similar performance gains - which would make sense in many cases.
Agree, I mean that this fsync behavior is not what makes M1 fast as a platform (which the parent seemed to imply) - it just speeds up the disk part. The CPU is fast on its own.
They just direct people to use Time Machine or iCloud, then look quizzically at you when you have an issue with writing off lost hours of work as a cost of doing business.
“You can lose some of your file changes in case of hard-reboot” is more correct.
I've always taken that as a given truth, and I can tolerate some data loss if power is accidentally cut to my desktop, or if the OS panics (which happens about once a year for me).
If this is the price for a 1000x speed increase, I'm more than happy they implemented it this way.
You can lose some file changes even after asking the OS to make sure they don't get lost, the normal way.
That's a problem. It means e.g. transactional databases (which cannot afford to lose data like that) have a huge performance hit on these machines, since they have to use F_FULLFSYNC. And since that "no really, save my data" feature is not the standard fsync(), it means any portable software compiled for Linux will be safe, but will be unsafe on macOS, by default. That is a significant gotcha.
The question is why do other NVMe manufacturers not have such a performance penalty? 10x is fine; 1000x is not. This is something Apple should fix. It's a firmware problem.
The whole point of a transactional database is that even in the case of a power loss you do not lose data. If your UPS blows up and you lose power, you should not lose data.
The point here is that on the apple systems if you do the correct thing your performance drops to that of spinning disks.
Add a secondary UPS. What will be the next excuse?
It's ridiculous to expect 100% data integrity in case of power loss; the loss might happen in the middle of a command's execution. If the system should be unkillable, it should have an unkillable power source in the first place.
The entire point of modern journaling filesystems and properly designed transactional databases is to ensure 100% data integrity in case of power loss, every time, no matter what. The thinking you have is from the 1990s. We can (and do) do better today.
A properly designed transactional database will only ever "fail ahead". If power fails a transaction that was in the process of committing might commit without an ack, but will never return an ack and then be lost on the next startup. The ack means the data is safe, regardless of what happened afterwards.
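A minimal sketch of that "fail ahead" rule, not taken from any real database's code: the acknowledgement is produced only after the commit record is durable, so a crash can lose an unacknowledged commit but never an acknowledged one. The function and file names are hypothetical.

```python
import os
import tempfile

def commit(log_fd: int, record: bytes) -> bool:
    """Append a commit record and only ack once it is durable."""
    os.write(log_fd, record)   # append the commit record to the log
    os.fsync(log_fd)           # make it durable (F_FULLFSYNC on macOS)
    return True                # only now may the client be told "committed"

# Usage: if power fails before commit() returns, the transaction may or may
# not be on disk, but the client never saw an ack, so nothing acked is lost.
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
acked = commit(log_fd, b"txn-1: x=42\n")
os.close(log_fd)
```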
That comment is about the semantics of OS APIs; filesystems are designed not to corrupt themselves in case of hard shutdown, and this is true as long as the underlying storage is well-behaved (e.g. honors flush requests). Databases on macOS already use F_FULLFSYNC (if they noticed this issue) to provide those guarantees. On Linux they just use fsync().
>Add a secondary UPS. What will be the next excuse?
You shouldn't have to add a secondary UPS at all, period, and still get that.
Databases are designed that way (for integrity under sudden power loss) - the OS just needs to provide a standard call for the sync that they can use.
Now, fsync not guaranteeing a write is one thing -- it's common in other OSes, and even Linux used to behave like that.
The non-committal fullsync, on the other hand (and the slow speed), is problematic, and wanting to run a DB on your Mac Mini without two UPSes is hardly some bizarro use case; dismissing it is just excusing Apple.
Not to mention that two UPSes won't solve the problem if you're not there to shut down the computer gracefully as they themselves are depleted (e.g. at night) during a power loss.
Since no Mac device has two power supplies, adding a second UPS means chaining, which will only increase the risk of something going wrong in the chain.
Nobody expects 100% data integrity on power loss. What is expected is that data that was fsynced has 100% integrity once that system call returns. This guarantee is also relied on when moving files across the network: the file gets deleted on the sender once the receiver says the fsync has completed. This means you could lose entire files when moving things over the network onto a Mac.
For what it's worth, adding an external UPS to a Mac laptop counts as two, and in fact you can add one per Type C port + MagSafe, so you can have up to 5 battery backups for the 2021 Macbook Pro line (internal + MagSafe + 3 x Type C).
I never considered Mac laptops, as they are not vulnerable to data loss in case of a power outage. What is the situation with Mac Minis, which don't have a battery? Are there multiple redundant power inputs the Mac Mini can switch between without turning off?
There aren't; you'd have to hack that in. It should be possible, though, since I think the entire system is powered from a single primary 12V rail internally.
Doesn't sound like something (most) Mac users would do. And I am also quite honoured that marcan_42 replied to my mere comment. Keep doing what you are doing, if you feel you are doing great doing so.
This means the world would be full of complaints from macOS users, but for some reason, we only know about this detail because of that “shocking” Twitter thread.
#!/usr/bin/python
import os, sys, time, datetime

t = datetime.datetime.now().isoformat()
print(t)
for i in range(5):
    time.sleep(1)
    print(4 - i)
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT)
os.lseek(fd, 0, 0)
os.write(fd, b"test: " + t.encode("ascii") + b"\n")
os.fsync(fd)
print("done!")
time.sleep(100)
Run that on a Mac Mini. Do it a couple times. Remember the timestamp of the last one. Let it count down, then pull the plug within a few seconds after "done!" shows up. Boot up again. The file contents will have reverted to a prior point in time.
This isn't some hypothetical thing, this is a trivial test you can do. fsync() on macOS does not guarantee data is on stable storage. And this is actually well documented.
Then if you want to see the performance problem, make it a loop instead and use `fcntl.fcntl(fd, fcntl.F_FULLFSYNC, 1)`. You'll get 40-odd IOPS, but at least your data won't disappear after power loss.
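A hedged sketch of that benchmark loop, for anyone who wants to reproduce the IOPS figure. The fall-back to plain `fsync()` on non-macOS platforms is my addition so the snippet runs anywhere; the function name and iteration count are arbitrary.

```python
import os
import time
import fcntl
import tempfile

def flush_iops(path: str, n: int = 50) -> float:
    """Time n small writes, each followed by a full flush where available."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    full = getattr(fcntl, "F_FULLFSYNC", None)
    start = time.monotonic()
    for _ in range(n):
        os.pwrite(fd, b"x" * 512, 0)     # rewrite the same 512-byte block
        if full is not None:
            fcntl.fcntl(fd, full, 1)     # macOS: flush the drive cache too
        else:
            os.fsync(fd)                 # elsewhere: plain fsync
    elapsed = time.monotonic() - start
    os.close(fd)
    return n / elapsed

rate = flush_iops(os.path.join(tempfile.mkdtemp(), "bench.dat"))
```

On the Mac Mini discussed in the thread, a loop like this with F_FULLFSYNC reportedly lands around 40-odd IOPS; on other hardware the number will differ wildly.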
Yanked power to macmini.lan after the rsync completed, then turned it on again
marcan@raider:~/tmp -$ ls file.txt
ls: cannot access 'file.txt': No such file or directory
marcan@raider:~/tmp 2$ ssh macmini.lan
Last login: Thu Feb 17 23:45:26 2022 from 192.168.3.10
marcan@Mini-M1-2020 ~ % ls file.txt
ls: file.txt: No such file or directory
At least I’m not as rude as you. You’ll need to be persistent to accomplish your results with the Asahi Linux project - good luck with that.
I had no doubts that you can lose your file if you are moving it and some very lucky power outage hits. I was not “pretending” it's not real.
What I have doubts about is that it's a real concern for 99.93% of users. As we've found here, it's not even a rare situation on other kinds of OS, so users would definitely have noticed it.
It is theoretically possible, of course. In practice, it's just too rare to consider.
But still, I hope this topic will be noticed by Apple and that they fix the low performance of fullsync. Also, I hope they will not make fullsync the default behavior; it isn't worth the risks (and those for whom it is should use some Linux for sure).
You're missing the point. Anything wanting data integrity will now have to use F_FULLFSYNC, which is slow on this firmware. You might say "you shouldn't care about data integrity to this degree", but databases do, and lots of developers run databases on their machines, and now they'll be real slow. Maybe they'll add some config options like `i_promise_i_dont_care_about_data_integrity = 1`, but come on.
I guess I am old but the assumption I live by is that if power is suddenly cut from a computer - no matter desktop or laptop - it can damage the FS and/or cause data loss.
For any mission critical stuff, I have it behind a UPS.
At least your thinking is old. Modern filesystems and databases are designed to prevent data loss in that scenario.
The last time I saw a modern filesystem eat itself on sudden power loss was when I was evaluating btrfs in a datacenter setting, and that absolutely told me it was not a reliable FS and we went with something else. I've never seen it happen with ext4 or XFS (configured properly) in over a decade, assuming the underlying storage is well-behaved.
OTOH, I've seen cases of e.g. data in files being replaced by zeroes and applications crashing due to that (it's pretty common that zsh complains about .zsh_history being corrupted after a crash due to a trailing block of zeroes). This happens when filesystems are mounted with metadata journaling but no data journaling. If you use data journaling (or a filesystem designed to inherently avoid this, e.g. COW cases), that situation can't happen either. Most databases would be designed to gracefully handle this kind of situation without requiring systemwide data journaling though. That's a tradeoff that is available to the user depending on their specific use case and whether the applications are designed with that in mind or not.
I've been using Macs (both desktop and laptops) since I have memory. I've had the M1 since launch day, and I use it all day, both for work and personal use.
Why has this never happened to me? Why don't I know anyone who has had this problem? Why is nobody complaining, as happened with the previous-gen keyboards?
I think we might be missing something in this analysis. I don't think Apple engineers are idiots.
Most people don't unplug their Mac Mini in the middle of working, and most users who do lose data after that happens would just think it's normal and not realize there is an underlying problem and modern OSes aren't supposed to do that.
I've seen APFS filesystems eat themselves in production (and had to do data recovery), twice. Apple don't have a perfect data integrity track record.
On laptop, you would get data loss / corruption on sudden power loss. This is rare. With "flush to storage device's RAM", even a kernel panic would not lose data if you let the storage device flush to flash without power loss.
POSIX spec says no: https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...