> This old myth that mmap is the fast and efficient way to do IO just won't die....

TheCondor · on May 2, 2019

It’s not a myth at all, mmap is faster, you save on straight copies of data and the sys-calls to do it. It should be faster in nearly all circumstances, faster by at least a copy. In exchange you pick up a lot of complexity dealing with faults and you potentially put stress on the VM system. If you are doing to ‘O’ part of I/O then mmap starts to be really complex, fast. rg is kind of a special case, it’s not writing, it’s going to do mostly (maybe only, I assume it backtracks on matches) sequential reads of mostly static files, it’s really the easy case, its not clear that madvise would help and your brand is speed so saving on those copies is worth it. What might be interesting, on certain memory constrained systems you can slide a smaller map space through the file rather than mapping the whole thing; it’s been a while since I looked at it all but mapping the smaller pieces gives huge hints to the vmm and it would probably slow down rg incrementally but speed up overall system performance.

BeeOnRope · on May 2, 2019

I agree with you it's faster on the systems I've tried (modern x86 systems with local drives, mostly). The drive type or IO speed doesn't matter much because it's cached IO we are interested in: if actual IO needs to occur that is the the time that will dominate, not memcpy to/from user space (except perhaps with very, very fast devices, i.e., in the 5 GB/s range).

That said, the case isn't as obvious as you make it: you apparently save on copies and explicit system calls, but mmap replaces those with page table manipulation and "hidden" system calls (i.e., page faults).

These page faults have as much per-call overhead as regular system calls, and so if mmap actually faulted in every page, I'm pretty sure it would actually be slower than read(), since read with a 16K buffer (for example) would make only 25% as many syscalls as mmap bringing in every 4K page.

On modern Linux, by default, mmap doesn't fault in every page, due to "faultaround" which tries to map in additional nearby pages every time it faults (16 by default), so the number of faults is 1/16th what you'd expect if it faulted on every page. You can avoid additional mapping on access with MAP_POPULATE or madvise (? maybe) calls, but then this introduces the same kind of window management problem as read: you lose the abstraction of the entire file just mapped into memory.

Beyond that, mmap has to do "per page" work to map the file: adjusting VM and OS structures to map the page into the process address space, and then undoing that work on munmap (which is the more expensive half since it includes TLB invalidation, and possibly a cross-core shootdown). You'd thing that this work would be much faster than copying 4 KiB of memory, but it isn't - and on some systems with small pages sizes and/or slow TLB invalidations it can be slower overall.

burntsushi · on May 2, 2019

Yes... I know it's not a myth. :-) I was responding to someone who was saying that it was a myth.

> It should be faster in nearly all circumstances

As my previous comment showed, that's definitely not true. If you're searching a whole bunch of small files in a short period of time, then it appears that the overhead of memory mapping leads to a significant performance regression when compared to standard `read` calls.

> it’s really the easy case

I know. :-) That's why ripgrep has both modes. It chooses between them based on the predicted workload. It uses memory maps automatically if it's searching a file or two, but otherwise falls back to standard read calls.

Moreover, if ripgrep aborts once in a while because of a SIGBUS, then it's usually not a big deal. It's fairly rare for it to happen. And if it does happen to you a lot or you never want it to happen, then you just need to `alias rg="rg --no-mmap"`.

TheCondor · on May 2, 2019

I love ripgrep, btw, great work.

I was pondering this some more in the shower, the mmap for rg case is also sort of naturally cache oblivious, copies will consume hardware cache for the write and while there is a ton of hardware for cache on modern hardware, it’s a noticeable cost on some tests. If you’re searching through something big, then it’d be like doubling hardware cache which is probably really noticeable on smaller devices.

The small files case is interesting, copying the data is faster than patching up the page table tree, I bet there is a strong correlation to the hardware cache size vs the average file size in that case. The files probably need to be N pages in size for it to be worth it, might be an interesting heuristic to use.

zzzcpan · on May 2, 2019

You can surely win some laptop benchmarks by mmaping some files on certain close to the metal filesystems.

But for general production case mmap shouldn't even be considered a solution to the syscall and memory copy overhead problem. If that overhead is too big for you, other approaches work better, like buffering, application level caching, etc.

hyc_symas · on May 3, 2019

Wow, this is so wrong in so many ways.

Buffering and application level caching mean you're wasting memory, and also wasting code space and CPU time because you're duplicating work that the OS already does.

manwe150 · on May 2, 2019

Did you do each of these after a clean reboot, or are we looking at possible caching effects from the kernel? If any part was in cache, then we might be just comparing shared memory against IPC, and that's an obvious performance win, but not really what's intended to be examined here.

The first numbers seem to imply that it takes equally long for pread to copy bytes from memory as it does to fetch them from the disk. For a quick back-of-the-napkin attempt at checking this, lets assume that disk IO accounts for 100% of this workload, and that local memory is one order of magnitude faster. In that case, I would expect the difference for an optimized implementation to be at most 10%.

I do think it is true that there are scenarios where the file mmap is faster, or that certain operations on each kernel might fall off a cliff. I just find it hard to believe that `mmap` must be as much faster as shown here in a typical situation (e.g. after a clean reboot, doing about the same amount of work, issuing optimal syscalls, with the OS/kernel not doing anything foolish).

burntsushi · on May 2, 2019

Yes, the file is in cache, and that was my intent. That's why my `time` output says `faults 0` for both runs. That is, no page faults occurred. Everything is in RAM.

That is indeed a very common case for ripgrep, where you might execute many searches against the same corpus repeatedly. Optimizing that use case is important.

For cases where the file isn't cached, then it's much less interesting, because you're going to just be bottlenecked on disk I/O for the most part.

> then we might be just comparing shared memory against IPC, and that's an obvious performance win, but not really what's intended to be examined here.

Please don't take my comment out of context. I was specifically responding to this fairly broad sweeping claim with actual data:

> This old myth that mmap is the fast and efficient way to do IO just won't die.

You might think the fact that this isn't a myth is "obvious," but clearly, not everyone does. The right remedy to that is data, not back of the napkin theorizing. :-)

If you want to try your own benchmarks in your own environment, then you can: https://github.com/BurntSushi/ripgrep/

On Linux at least, you do not need to do a clean reboot to measure something without cache. You can drop the file cache with `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`.