> The drawback, however, is that these implementations only get that fast because ChaCha is embarrassingly parallel.
That is my point exactly. Why would you design, in this day and age, a generator that doesn't vectorize well? It leaves most of the CPU's throughput unused. Even with parallel streams, large integer multiplication and variable-length rotation are SIMD-killers. Regarding the 1.5 KB of data, I suspect you can get away with less than that if you specialize to this application, but note that this is still around half the state size of mt19937 (about 2.5 KB).
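To make the contrast concrete, here is a rough sketch in C. The first function is the ChaCha quarter round run over several independent blocks at once: it uses only 32-bit adds, XORs and rotations by fixed constants, so each line maps directly onto SIMD lanes. The second is a PCG32-style step (XSH-RR), which is my assumption about the kind of generator under discussion: it needs a 64-bit multiply and a rotation whose amount depends on the state. The names `LANES`, `chacha_quarter_round_lanes` and `pcg32_step` are made up for this illustration.

```c
#include <stdint.h>

#define LANES 4  /* hypothetical lane count; 4 x 32 bits = one 128-bit vector */

/* Rotate left by a fixed amount k (1..31): cheap in SIMD. */
static inline uint32_t rotl32(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

/* ChaCha quarter round applied to LANES independent blocks.
 * Every statement is the same operation in every lane, with
 * constant rotation counts, so a compiler can vectorize the loop. */
static void chacha_quarter_round_lanes(uint32_t a[LANES], uint32_t b[LANES],
                                       uint32_t c[LANES], uint32_t d[LANES]) {
    for (int i = 0; i < LANES; i++) {
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl32(d[i], 16);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 12);
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl32(d[i], 8);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 7);
    }
}

/* PCG32-style step (XSH-RR output), assumed here as the comparison point.
 * The 64-bit multiply and the data-dependent rotation are the two
 * operations that map poorly onto older SIMD instruction sets. */
static uint32_t pcg32_step(uint64_t *state, uint64_t inc) {
    uint64_t old = *state;
    *state = old * 6364136223846793005ULL + inc;      /* 64-bit LCG update */
    uint32_t xorshifted = (uint32_t)(((old >> 18u) ^ old) >> 27u);
    uint32_t rot = (uint32_t)(old >> 59u);            /* rotation from state */
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
}
```

With optimization enabled I would expect the first loop to compile down to a handful of vector instructions, but that is worth checking in the generated assembly for your own compiler and target.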