Hacker News

> I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

You're reaching the limits of my understanding, but my level of knowledge is this: the store may have a reciprocal throughput of 2, but it only occupies two ops (from double-pumping a single one) over those two cycles, while the CPU pipeline can handle doing 10. For the store in particular, nothing depends on it completing, so it can be "thrown into the wind", so to speak. Here's my approximation of the pipeline of a single thread, where dashes separate ops:

LOADU.0 - LOADU.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - CMP.0 - CMP.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - STORE.0 - STORE.1 - [start again, because nothing is dependent on STORE completing]

So that's 10 ops and 12 empty slots, which could be filled by running 1.2 more iterations of the loop in parallel.

I do want to know why clang isn't using the masked load/store. If it's willing to do it on a dot-product, it should do it here as well. It makes me want to figure out what is blocking it (usually some guarantee that 99.9% of developers don't know they're making).



The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

With out-of-order execution, the layout of instructions in the source just doesn't matter at all - the CPU will hold multiple iterations of the loop in the reorder buffer, and assign execution units from multiple iterations.

e.g. see: https://uica.uops.info/?code=vmovdqu64%20zmm3%2C%20zmmword%2...

(Click Run, then Open Trace.) That's Tiger Lake, not Zen 4, but it still shows how instructions from multiple iterations execute in parallel. Zen 4's double-pumping doesn't change the big picture; it essentially means each zmm instruction is split into two ymm ones (they might not even need to be on the same port, i.e. double-pumping is really the wrong term, but whatever).


> The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.

That tool is nifty, but I couldn't really see how it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:

Unrolled 4x: uiCA 6.00, llvm-mca 6.20 (≈1.50 / 1.55 cycles per original iteration)

Regular: uiCA 2.00, llvm-mca 2.40

llvm-mca pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still shows a meaningful gain as well.


Sorry, mixed things up - store has a limit of one 32-byte store per cycle (load is what has two 32-byte/cycle). Of additional note is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (esp. with it pointing at L3; or it might not be important, I don't know how exactly it interacts with things).

The uiCA link was just to show how out-of-order execution works; the exact timings don't apply, as they're for Tiger Lake, not Zen 4. My assertion is that your diagram of spaces between instructions is completely meaningless under OoO execution: those empty spaces will be filled with instructions from another iteration of the loop or from other surrounding code.

Clang is extremely unroll-happy in general; from its perspective, unrolling is ~free to do, has basically zero downsides (other than code size, but no one's benchmarking that too much), and can indeed sometimes improve things slightly.


Is there a limit to how much the OoO execution buffer can handle? I don't want to dig up the benchmarks again, but I remember for very long calculations (like computing a bunch of exp(x[i]), which boil down to about 40 AVX512 instructions), the unrolling+reordering was very effective.


Zen 4 has two 32-entry schedulers for SIMD, which I think are the relevant buffers here; one iteration of a 40-instruction loop would mostly exhaust them, whereas the 5-instruction body here should easily fit many iterations. (Many of my numbers in this thread, including these, come from https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...)

I think, at least. Now I'm at the limit of my knowledge; it took me a bit to figure out whether the reorder buffer or the scheduler is the relevant structure here, and I'm still not too sure. Either way, 32 is the smallest of the relevant buffer sizes on Zen 4, and it still fits six iterations of the 5-instruction loop, which it should happily reorder. (And the per-iteration scalar overhead goes to separate schedulers.)


On the clang masked load/store tail: there are some flags that change its behavior, but it seems nothing produces just a masked tail, and what does exist still has work to do. Here's a magic incantation I found by digging through LLVM's source code that makes it emit a single body for the entire loop: https://godbolt.org/z/1Kb6qTx31. It's hilariously inefficient though, and likely meant for SVE, which is intentionally designed for things like this.



