Instead of needing atomic maps, would it work here to give each task its own map and then merge those after the parallel processing has ended? I have absolutely no experience with CUDA, so no idea how applicable that approach would be. However, it seemed pretty practical and effective when I tried a parallel implementation of this challenge on plain CPUs.
This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.
The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.
I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll be interesting to see how much speedup (or not!) this optimization can bring.
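For anyone curious what that would look like, here's a minimal sketch of the per-block-map idea, assuming the stats array is keyed by a small integer bucket index and holds min/max/sum/count (the `NUM_BUCKETS` size, field layout, and kernel signature are all hypothetical, not the blog's actual code):

```
#include <cuda_runtime.h>
#include <climits>

constexpr int NUM_BUCKETS = 1024;  // assumed map size; must keep the
                                   // shared array under the ~100 KB limit

struct Stats {
    int min;                 // temperatures assumed pre-scaled to ints
    int max;
    long long sum;
    unsigned int count;
};

__global__ void aggregate(const int *keys, const int *temps, int n,
                          Stats *global_stats)
{
    // One private copy of the map per thread block, in shared memory.
    // Here: 1024 * 24 bytes = 24 KB, comfortably within the limit.
    __shared__ Stats local[NUM_BUCKETS];

    for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x)
        local[i] = {INT_MAX, INT_MIN, 0, 0};
    __syncthreads();

    // Atomics now contend only within this block, in fast shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int k = keys[i], t = temps[i];
        atomicMin(&local[k].min, t);
        atomicMax(&local[k].max, t);
        // atomicAdd has no signed long long overload; two's-complement
        // addition via the unsigned variant gives the same bits.
        atomicAdd(reinterpret_cast<unsigned long long *>(&local[k].sum),
                  static_cast<unsigned long long>(static_cast<long long>(t)));
        atomicAdd(&local[k].count, 1u);
    }
    __syncthreads();

    // Merge step: one global atomic per bucket per block,
    // instead of one per input row.
    for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x) {
        if (local[i].count == 0) continue;
        atomicMin(&global_stats[i].min, local[i].min);
        atomicMax(&global_stats[i].max, local[i].max);
        atomicAdd(reinterpret_cast<unsigned long long *>(&global_stats[i].sum),
                  static_cast<unsigned long long>(local[i].sum));
        atomicAdd(&global_stats[i].count, local[i].count);
    }
}
```

This is essentially the CPU "map per task, merge at the end" scheme, except the merge also uses atomics because many blocks run concurrently; the win is that the vast majority of atomic traffic moves from global to shared memory.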