Instead of needing atomic maps, would it work here to give each task its own map and then merge those after the parallel processing has ended? I have absolutely no experience with CUDA, so no idea how applicable that approach would be. However, it seemed pretty practical and effective when I tried a parallel implementation of this challenge on plain CPUs.
This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.
The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.
I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll be interesting to see how much speedup (or not!) this optimization can bring.
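For anyone curious what that would look like, here's a minimal sketch of the per-block-map idea, assuming the stats array is keyed by a small integer bucket index and holds min/max/sum/count (the `NUM_BUCKETS` size, field layout, and kernel signature are all hypothetical, not the blog's actual code):

```
#include <cuda_runtime.h>
#include <climits>

constexpr int NUM_BUCKETS = 1024;  // assumed map size; must keep the
                                   // shared array under the ~100 KB limit

struct Stats {
    int min;                 // temperatures assumed pre-scaled to ints
    int max;
    long long sum;
    unsigned int count;
};

__global__ void aggregate(const int *keys, const int *temps, int n,
                          Stats *global_stats)
{
    // One private copy of the map per thread block, in shared memory.
    // Here: 1024 * 24 bytes = 24 KB, comfortably within the limit.
    __shared__ Stats local[NUM_BUCKETS];

    for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x)
        local[i] = {INT_MAX, INT_MIN, 0, 0};
    __syncthreads();

    // Atomics now contend only within this block, in fast shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int k = keys[i], t = temps[i];
        atomicMin(&local[k].min, t);
        atomicMax(&local[k].max, t);
        // atomicAdd has no signed long long overload; two's-complement
        // addition via the unsigned variant gives the same bits.
        atomicAdd(reinterpret_cast<unsigned long long *>(&local[k].sum),
                  static_cast<unsigned long long>(static_cast<long long>(t)));
        atomicAdd(&local[k].count, 1u);
    }
    __syncthreads();

    // Merge step: one global atomic per bucket per block,
    // instead of one per input row.
    for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x) {
        if (local[i].count == 0) continue;
        atomicMin(&global_stats[i].min, local[i].min);
        atomicMax(&global_stats[i].max, local[i].max);
        atomicAdd(reinterpret_cast<unsigned long long *>(&global_stats[i].sum),
                  static_cast<unsigned long long>(local[i].sum));
        atomicAdd(&global_stats[i].count, local[i].count);
    }
}
```

This is essentially the CPU "map per task, merge at the end" scheme, except the merge also uses atomics because many blocks run concurrently; the win is that the vast majority of atomic traffic moves from global to shared memory.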