Pshufb is an old trick. An oldie but a goodie from the SSSE3 days.
All hail modern improvements!! pdep, pext, vpcompressb, vpexpandb, and especially vpermi2b.
Forget about a 16-byte lookup table with pshufb. Try a 128-byte lookup with vpermi2b (AVX512-VBMI).
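To make the table-size difference concrete, here is a rough scalar model of both lookups in plain C. It is only a sketch of the per-byte selection rule, not the real intrinsics, and the function names pshufb_model and vpermi2b_model are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 128-bit pshufb: the low 4 bits of each index byte pick
   one of 16 table bytes; a set high bit forces that output byte to zero. */
static void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                         uint8_t out[16]) {
    for (size_t i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Scalar model of a 512-bit byte-granular two-source permute: the low 7 bits
   of each index address a 128-byte table split across two 64-byte registers.
   Bit 6 picks the register, bits 5:0 pick the byte inside it. */
static void vpermi2b_model(const uint8_t lo[64], const uint8_t hi[64],
                           const uint8_t idx[64], uint8_t out[64]) {
    for (size_t i = 0; i < 64; i++) {
        uint8_t j = idx[i] & 0x7F;
        out[i] = (j & 0x40) ? hi[j & 0x3F] : lo[j & 0x3F];
    }
}
```

So one instruction goes from 16 lookups into a 16-entry table to 64 lookups into a 128-entry table.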
--------
I find that pext/pdep on 64 bits and vpcompressb/vpexpandb on 512 bits are also more flexible than pshufb (!!!!) in practice, if a bit slower since the work is spread over multiple instructions.
Supercomputer programs are written with a gather-compute-scatter pattern. Gather your data, compute on the data, and finally place the data where needed.
Pext and vpcompressb are gather operations.
Pdep and vpexpandb are scatter.
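A small sketch of that gather-compute-scatter shape on a single 64-bit word, using portable scalar stand-ins for pext and pdep so it runs anywhere. On BMI2 hardware the real `_pext_u64` / `_pdep_u64` intrinsics from `<immintrin.h>` would replace the two models; pext_model, pdep_model and increment_field are hypothetical names for this example only.

```c
#include <stdint.h>

/* Scalar model of pext: collect the bits of src selected by mask and pack
   them contiguously into the low bits of the result (gather). */
static uint64_t pext_model(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;   /* lowest set bit still in mask */
        if (src & lowest) out |= bit;
        mask &= mask - 1;                 /* clear that bit */
    }
    return out;
}

/* Scalar model of pdep: spread the low bits of src out to the positions
   selected by mask (scatter). */
static uint64_t pdep_model(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;
        if (src & bit) out |= lowest;
        mask &= mask - 1;
    }
    return out;
}

/* Gather-compute-scatter: increment a bit field whose bits live at arbitrary
   (possibly non-contiguous) positions given by mask. The field wraps around
   on overflow because pdep only consumes as many bits as the mask has. */
static uint64_t increment_field(uint64_t word, uint64_t mask) {
    uint64_t field = pext_model(word, mask);           /* gather  */
    field += 1;                                        /* compute */
    return (word & ~mask) | pdep_model(field, mask);   /* scatter */
}
```

For example, increment_field(0xB4, 0xF0) pulls out the nibble 0xB, bumps it to 0xC, and puts it back to give 0xC4.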
Pshufb is only a gather and thus fails to match the supercomputer pattern. GPUs provide both directions: a bpermute (backwards-permute) that pulls data the way pshufb does, and a forward permute that pushes (scatters) each lane's result to a chosen slot. That forward permute is really the missing link IMO.
Hopefully Intel will provide a forward permute of their own eventually....
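For anyone who hasn't seen the two directions side by side, here is a scalar sketch of a "pull" permute (each output lane reads from an indexed source lane) versus a "push" permute (each source lane writes to an indexed output lane). The function names are hypothetical, and the collision rule in the push version (the highest-numbered writer wins, unwritten lanes are zero) is an assumption of this model rather than a statement about any particular hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Pull / gather-style permute: lane i reads src[idx[i]]. */
static void pull_permute_model(const uint32_t *src, const uint32_t *idx,
                               uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = src[idx[i] % n];
}

/* Push / scatter-style permute: lane i writes its value to out[idx[i]].
   Assumption of this model: lanes nobody writes stay 0, and when two lanes
   target the same slot the later (higher-numbered) lane wins. */
static void push_permute_model(const uint32_t *src, const uint32_t *idx,
                               uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++)
        out[idx[i] % n] = src[i];
}
```

When idx is a true permutation, pushing with the same idx undoes a pull, which is what makes the pair a natural fit for gather-compute-scatter.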