Pshufb is an old trick. An oldie but a goodie from the SSSE3 days.

All hail modern improvements!! pdep, pext, vpcompressb, vpexpandb, and especially vpermi2b.

Forget about a 16-byte lookup table with pshufb. Try a 128-byte lookup with vpermi2b (AVX512-VBMI).
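To make the size difference concrete, here is a rough scalar model of the two lookups in Python. It is only a sketch of the documented per-byte semantics, not real SIMD code; the function names and the `squares` table are made up for illustration.

```python
def pshufb(table, idx):
    """Model of one 128-bit lane of pshufb: 16-byte table, 16 index bytes."""
    assert len(table) == 16 and len(idx) == 16
    # Index bit 7 set -> output byte is zero; otherwise the low 4 bits pick a table byte.
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in idx)

def vpermi2b(table_lo, table_hi, idx):
    """Model of 512-bit vpermi2b: two 64-byte tables, 64 index bytes."""
    assert len(table_lo) == 64 and len(table_hi) == 64 and len(idx) == 64
    # Bit 6 of each index picks the table, bits 5:0 pick the byte within it.
    table = table_lo + table_hi
    return bytes(table[i & 0x7F] for i in idx)

# Hypothetical 128-entry table: square of the index, modulo 256.
squares = bytes((n * n) & 0xFF for n in range(128))
idx = bytes(range(127, 63, -1))  # 64 indices: 127 down to 64
out = vpermi2b(squares[:64], squares[64:], idx)
```

So one instruction covers a 7-bit index space instead of a 4-bit one, which is the whole appeal.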

--------

I find that pext/pdep on 64-bit registers and vpcompressb/vpexpandb on 512-bit registers are also more flexible than pshufb (!!!!) in practice, if a bit slower since the work is spread over multiple instructions.
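For anyone who hasn't used them, a bit-by-bit reference model of pext and pdep in Python (a sketch of the semantics only; the hardware does this in a single instruction, and the `width` parameter is just a convenience of this model):

```python
def pext(src, mask, width=64):
    """Gather the bits of src selected by mask into the low bits of the result."""
    result, out_pos = 0, 0
    for bit in range(width):
        if (mask >> bit) & 1:
            result |= ((src >> bit) & 1) << out_pos
            out_pos += 1
    return result

def pdep(src, mask, width=64):
    """Scatter the low bits of src to the bit positions selected by mask."""
    result, in_pos = 0, 0
    for bit in range(width):
        if (mask >> bit) & 1:
            result |= ((src >> in_pos) & 1) << bit
            in_pos += 1
    return result

packed = pext(0xABCD, 0xF0F0)    # pulls out the nibbles A and C
spread = pdep(packed, 0xF0F0)    # puts them back where they came from
```

With the same mask, pdep undoes pext on the selected bits, which is what makes the pair useful together.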

Supercomputer programs are written with a gather-compute-scatter pattern. Gather your data, compute on the data, and finally place the data where needed.

Pext and vpcompressb are gather operations.

Pdep and vpexpandb are scatter.
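A toy version of the full gather-compute-scatter round trip, using Python models of the byte compress and expand operations. This is a sketch under assumptions: the helper names and the "bump every digit" computation are invented for the example, and the vectors here are 9 bytes instead of 64.

```python
def compress_bytes(src, mask):
    """Model of vpcompressb (zeroing form): pack the selected bytes to the front."""
    packed = [b for b, keep in zip(src, mask) if keep]
    return bytes(packed + [0] * (len(src) - len(packed)))

def expand_bytes(packed, mask, passthrough):
    """Model of vpexpandb (merging form): drop consecutive packed bytes into the selected slots."""
    it = iter(packed)
    return bytes(next(it) if keep else old for keep, old in zip(mask, passthrough))

data = b"a1b22c333"
mask = [0x30 <= ch <= 0x39 for ch in data]   # select the ASCII digits

digits = compress_bytes(data, mask)          # gather: digits packed at the front
count = sum(mask)
bumped = bytes(d + 1 for d in digits[:count])  # compute: '1' -> '2', '2' -> '3', ...
result = expand_bytes(bumped, mask, data)    # scatter: back to the original positions
```

The compute step only ever sees dense data, and the same mask drives both the gather and the scatter.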

Pshufb is only a gather and thus fails to match the supercomputer pattern. GPUs provide both directions: on AMD, bpermute (backwards-permute, ds_bpermute_b32) is the gather, and the forward permute (ds_permute_b32) scatters results back to a destination lane. That scatter is really the missing link IMO.

Hopefully Intel will provide a scatter permute of their own eventually....
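The difference between a pull-style and a push-style permute, as a small Python sketch. The function names are hypothetical, and the collision rule in the scatter (later lane wins) is an assumption of this model rather than a statement about any particular hardware.

```python
def gather_permute(src, idx):
    """Pull: lane i reads from lane idx[i] (what pshufb does within a lane)."""
    return [src[i] for i in idx]

def scatter_permute(src, idx, fill=0):
    """Push: lane i writes its value to lane idx[i]."""
    out = [fill] * len(src)
    for lane, dest in enumerate(idx):
        out[dest] = src[lane]  # assumed rule: on a collision the later lane wins
    return out

src = [10, 20, 30, 40]
p = [2, 0, 3, 1]
pulled = gather_permute(src, p)
restored = scatter_permute(pulled, p)
```

With a gather only, undoing a permutation means computing the inverse index vector first; a scatter with the same indices does it directly.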


