Hacker News

> I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

You're reaching the limits of my understanding, but my level of knowledge is this: the store may have a reciprocal throughput of 2, but it only occupies two ops (from double-pumping a single one) over those two cycles, while the CPU pipeline can handle doing 10. For the store in particular, nothing depends on it completing, so it can be "thrown into the wind", so to speak. Here's my approximation of the pipeline of a single thread, where dashes separate ops:

LOADU.0 - LOADU.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - CMP.0 - CMP.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - STORE.0 - STORE.1 - [start again, because nothing is dependent on STORE completing]

So that's 10 ops and 12 empty slots, which could be filled by running 1.2 more iterations of the loop in parallel.

I do want to know why clang isn't using the masked load/store. If it's willing to do it on a dot-product, it should do it here as well. It makes me want to figure out what is blocking it (usually some guarantee that 99.9% of developers don't know they're making).



The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

With out-of-order execution, the layout of instructions in the source just doesn't matter at all - the CPU will hold multiple iterations of the loop in the reorder buffer, and assign execution units from multiple iterations.

e.g. see: https://uica.uops.info/?code=vmovdqu64%20zmm3%2C%20zmmword%2...

(Click Run, then Open Trace.) That's Tiger Lake, not Zen 4, but it still shows how instructions from multiple iterations execute in parallel. Zen 4's double-pumping doesn't change the big picture; it essentially means each zmm instruction is split into two ymm ones (they might not even need to be on the same port, i.e. double-pumping is really the wrong term, but whatever).


> The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.

That tool is nifty, but I couldn't really see how it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:

Unrolled 4x: uiCA 6.00, llvm-mca 6.20 (≈1.50 / 1.55 cycles per original iteration)

Regular: uiCA 2.00, llvm-mca 2.40

llvm-mca pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still shows a meaningful gain as well.


Sorry, mixed things up - store has a limit of one 32-byte store per cycle (load is what has two 32-byte/cycle). Of additional note is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (esp. with it pointing at L3; or it might not be important, I don't know how exactly it interacts with things).

The uiCA link was just to show how out-of-order execution works; the exact timings don't apply, as they're for Tiger Lake, not Zen 4. My assertion is that your diagram of spaces between instructions is completely meaningless under OoO execution: those empty spaces will be filled with instructions from another iteration of the loop or from other surrounding code.

Clang is extremely unroll-happy in general; from its perspective, unrolling is ~free to do, has basically zero downsides (other than code size, but no one's benchmarking that too much), and can indeed sometimes improve things slightly.


Is there a limit to how much the OoO execution buffer can handle? I don't want to dig up the benchmarks again, but I remember for very long calculations (like computing a bunch of exp(x[i]), which boil down to about 40 AVX512 instructions), the unrolling+reordering was very effective.


Zen 4 has two 32-entry schedulers for SIMD, which I think are the relevant buffers here; one iteration of a 40-instruction loop would mostly exhaust them, whereas the 5-instruction body here should easily fit many iterations. (Many of my numbers in this thread, including these, come from https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...)

I think, at least. Now I'm at the limit of my knowledge; it took me a bit to figure out whether the reorder buffer or the scheduler is the relevant structure here, and I'm still not too sure. Either way, 32 is the smallest of the relevant buffer sizes on Zen 4, and it still fits six iterations of the 5-instruction loop, which it should happily reorder. (And the per-iteration scalar overhead goes to separate schedulers.)


On the clang masked load/store tail: there are some flags that change its behavior, but it seems nothing produces just a masked tail, and what does exist still has work to do. Here's a magic incantation I found by digging through LLVM's source code that makes it emit a single body for the entire loop: https://godbolt.org/z/1Kb6qTx31. It's hilariously inefficient though, and likely meant for SVE, which is intentionally designed for things like this.



