Hacker News

Instead of requiring atomic maps, would it work here to give each task its own map and then merge those after the parallel processing has ended? I have absolutely no experience with CUDA, so no idea how applicable that approach would be. However, it seemed pretty practical and effective when I tried a parallel implementation of this challenge on plain CPUs.
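A minimal CPU sketch of that idea, assuming a simple count-per-key aggregation (the function name, slicing scheme, and worker count are illustrative, not from the original implementation): each worker fills a private map with no locks or atomics, and the partial maps are merged sequentially after all workers have joined.

```cpp
#include <map>
#include <string>
#include <thread>
#include <vector>

std::map<std::string, long> count_parallel(
        const std::vector<std::string>& keys, int num_workers) {
    // One private map per worker: no synchronization needed while counting.
    std::vector<std::map<std::string, long>> partials(num_workers);
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker handles a disjoint strided slice of the input.
            for (std::size_t i = w; i < keys.size(); i += num_workers)
                ++partials[w][keys[i]];
        });
    }
    for (auto& t : workers) t.join();

    // Sequential merge after the parallel phase has ended.
    std::map<std::string, long> merged;
    for (const auto& part : partials)
        for (const auto& [k, v] : part)
            merged[k] += v;
    return merged;
}
```

The merge cost is proportional to the number of distinct keys times the number of workers, which is usually negligible next to scanning the input itself.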


This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.

The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.

I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll be interesting to see how much speedup (or not!) this optimization can bring.
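A hedged sketch of what such a per-block map could look like, using a plain integer histogram rather than the blog's actual stats layout (NUM_KEYS, the kernel name, and the pre-parsed `keys` array are all assumptions for illustration): each block accumulates into shared memory with cheap shared-memory atomics, then merges into global memory once per block.

```cuda
#include <cstdint>

constexpr int NUM_KEYS = 512;  // assumed key count; must keep the array under ~100KB

__global__ void block_local_histogram(const int* keys, int n, int* global_counts) {
    __shared__ int local[NUM_KEYS];  // per-block private map in shared memory

    // Cooperatively zero the shared-memory array.
    for (int k = threadIdx.x; k < NUM_KEYS; k += blockDim.x)
        local[k] = 0;
    __syncthreads();

    // Shared-memory atomics contend only within one block,
    // so they are much cheaper than global atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[keys[i]], 1);
    __syncthreads();

    // One merge into the global array per block, not per input element.
    for (int k = threadIdx.x; k < NUM_KEYS; k += blockDim.x)
        if (local[k] != 0)
            atomicAdd(&global_counts[k], local[k]);
}
```

The win comes from replacing per-element global atomics with per-element shared atomics plus a single per-block merge pass.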


Looks like TFA already uses partitioning per thread.


You can definitely have thread local storage



