
...there are modern integrated GPUs that don't use virtual memory for all global resources? I assumed that was a given, if only so an MMU can enforce security when accessing RAM.


I'm not privy to the security mechanisms of the integrated GPUs I've worked with. However, I have a decent amount of experience, mostly with tile-based GPUs, and I've never hit TLB misses as a reason a GPU is going slow. On the CPU? Sure. But unless they get a lot more specific about what these optimizations are, it's hard to piece together what's going on here.

GPUs don't like to read around in memory (neither do CPUs, for that matter, but they tend to branch and be a lot less predictable), so optimizing reads/writes is something you will do regardless. The painful parts of TBRs usually (but not universally) have to do with draw-call setup and render target switching. I remember some of the early iPhones had painfully low draw-call limits tied to the overhead of dispatching the calls themselves, as opposed to triangle count or fill rate. It's fairly easy to put together synthetic benchmarks to shake that out.
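A back-of-the-envelope model of the kind of draw-call limit described above. Every number here is an illustrative assumption, not a measurement from any real device:

```python
# Back-of-the-envelope model of a per-frame draw-call budget.
# All costs are illustrative assumptions, not measured numbers.

FRAME_BUDGET_US = 16_667    # one frame at 60 fps, in microseconds
PER_DRAW_OVERHEAD_US = 50   # assumed CPU/driver cost to dispatch one draw

def max_draw_calls(frame_budget_us: int, per_draw_us: int,
                   other_work_us: int = 0) -> int:
    """How many draws fit in a frame once other work is subtracted."""
    return (frame_budget_us - other_work_us) // per_draw_us

# With a 50 us dispatch cost and nothing else in the frame, ~333 draws.
print(max_draw_calls(FRAME_BUDGET_US, PER_DRAW_OVERHEAD_US))  # 333
```

The point of a synthetic benchmark is to measure the per-draw cost empirically and back the limit out the same way.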


Well, the claim is also that it isn't generally an issue on the M1's architecture until you scale up to 32 cores or more [1], which is beefier than any current GPU outside of AMD's and NVIDIA's offerings. So it's completely believable that the GPUs you worked on used virtual memory for everything without it ever becoming a performance issue.

(Alternatively: it looks like the M1 Pro only has 24 MB of L3, so actual cache misses likely become an issue before TLB misses, but the Max/Ultra have more cache, so that could invert.)
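The cache-vs-TLB question comes down to "TLB reach" arithmetic. A sketch with assumed numbers: Apple silicon does use 16 KB pages, but the TLB entry count here is invented purely to show the comparison:

```python
# Rough "TLB reach vs last-level cache" comparison.
# The page size matches Apple silicon; the TLB entry count is a
# made-up figure for illustration only.

PAGE_SIZE = 16 * 1024          # 16 KB pages
TLB_ENTRIES = 2048             # hypothetical unified L2 TLB entry count
L3_SIZE = 24 * 1024 * 1024     # the 24 MB M1 Pro cache mentioned above

tlb_reach = PAGE_SIZE * TLB_ENTRIES   # memory coverable without a TLB miss
print(tlb_reach // (1024 * 1024), "MB reach vs",
      L3_SIZE // (1024 * 1024), "MB L3")
```

With these numbers the TLB covers 32 MB, more than the 24 MB cache, so cache misses would bite first; a larger working set or smaller TLB flips the inequality.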

[1] https://twitter.com/VadimYuryev/status/1514295707481501700


The claim is that if you don't "optimize" for TBDR then you are gated by the TLB, and that software written for immediate-mode rendering isn't optimized this way.

However, draw-call batching and minimizing render target (and state!) switching are bread-and-butter optimizations you'd make for an immediate-mode renderer as well. You can get away with more draw calls, and there are specific things you can do on a per-architecture basis, but none of them involve TLBs to the best of my knowledge. If someone wants to spill the beans on how the M1 differs here, I'm all ears, but it just doesn't line up with most GPU architectures I've seen.
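The batching optimization above amounts to sorting draws so that draws sharing a render target and pipeline state land next to each other. A minimal sketch with made-up draw records (the field names and states are invented for illustration):

```python
# Sketch: sort draws by (render target, pipeline state) so that
# expensive state switches happen once per group instead of per draw.
from itertools import groupby

draws = [
    {"target": "shadow", "pipeline": "depth",  "mesh": "tree"},
    {"target": "main",   "pipeline": "opaque", "mesh": "rock"},
    {"target": "shadow", "pipeline": "depth",  "mesh": "rock"},
    {"target": "main",   "pipeline": "opaque", "mesh": "tree"},
]

def state_switches(draw_list):
    """Count transitions between distinct (target, pipeline) states."""
    key = lambda d: (d["target"], d["pipeline"])
    return sum(1 for _ in groupby(draw_list, key=key)) - 1

unsorted_switches = state_switches(draws)  # 3 switches as submitted
batched = sorted(draws, key=lambda d: (d["target"], d["pipeline"]))
batched_switches = state_switches(batched)  # 1 switch after sorting
print(unsorted_switches, batched_switches)
```

The same sort helps an immediate-mode renderer too, which is the point: nothing about it is TBDR- or TLB-specific.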

The big wins between TBR and IMR usually come from using fewer render targets [1]: that increases the total size of each tile, which drives efficiencies in dispatching the tiles.

[1] https://developer.qualcomm.com/sites/default/files/docs/adre...


I don't know exactly how modern GPUs work, but you could enforce security on physical addresses with a significantly less demanding circuit than a TLB: hand out big swaths of memory so the bounds checks are guaranteed to fit in fixed memory logic, with no buffers or caches needed.
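The scheme described is essentially base/limit protection: one comparator pair per allowed region instead of per-page translation state. A sketch with an invented aperture layout:

```python
# Sketch of bounds checking on physical addresses with base/limit
# pairs instead of per-page translation -- far less state than a TLB.
# The aperture layout is invented for illustration.

APERTURES = [
    (0x1000_0000, 0x2000_0000),  # (base, limit) regions this GPU
    (0x8000_0000, 0x8010_0000),  # context is allowed to touch
]

def access_allowed(phys_addr: int) -> bool:
    """One comparator pair per aperture; no page tables, no misses."""
    return any(base <= phys_addr < limit for base, limit in APERTURES)

print(access_allowed(0x1800_0000))  # True: inside the first aperture
print(access_allowed(0x3000_0000))  # False: outside every aperture
```

The cost is coarse granularity: you get isolation, but none of the remapping or overcommit that full virtual memory provides.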



