I did a quick run under a profiler, and on my AVX2 laptop the slowest part (>50% of the time) was matrix multiplication (sgemm).
In the current version of GGML, if OpenBLAS is enabled, it converts the FP16 matrices to FP32 before calling sgemm.
If OpenBLAS is disabled, on the AVX2 platform it converts FP16 to FP32 on every FMA operation, which is even worse (the same elements get converted repeatedly). With that change, both ggml_vec_dot_f16 and ggml_vec_dot_f32 took first place in the profiler.
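To make the repeated-conversion cost concrete, here is a toy sketch in C. The `fp16_to_fp32` decode below is a hypothetical simplified version (normal numbers only) that I wrote for illustration; GGML itself uses hardware F16C conversion or a lookup table. The point is that the hot dot-product loop pays for two conversions per FMA:

```c
#include <stdint.h>

/* Hypothetical minimal FP16 -> FP32 decode (normal numbers and zero
   only), just to illustrate the cost; GGML uses hardware F16C or a
   table-based conversion instead. */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                              /* +/- zero */
    } else {
        /* rebias exponent: 15 (FP16) -> 127 (FP32), shift mantissa */
        bits = sign | ((exp + 112) << 23) | (man << 13);
    }
    union { uint32_t u; float f; } v = { bits };
    return v.f;
}

/* Per-element conversion inside the hot loop: every multiply-add
   pays for two FP16 -> FP32 conversions, mirroring the non-BLAS
   AVX2 path. Converting each row to FP32 once up front (the
   OpenBLAS path) amortizes this cost across the whole sgemm. */
static float dot_convert_inside(const uint16_t *a, const uint16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += fp16_to_fp32(a[i]) * fp16_to_fp32(b[i]);
    return acc;
}
```

In a real sgemm each matrix element participates in many FMAs, so converting inside the loop multiplies the conversion cost by the inner dimension.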
But I agree that, in theory, and only with AVX512-BF16, BF16 (not exactly FP16, but similar) will be fast thanks to the VDPBF16PS instruction. That implementation is not there yet.
I saw some discussion on llama.cpp that, theoretically, implementing matmul separately for each quantization format should be much faster, since it can skip the conversion step entirely. In practice, though, it's quite difficult to beat the various BLAS libraries, which are heavily optimized.
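The idea can be sketched like this. Below is a toy Q8-style dot product I made up for illustration (the block size and layout are assumptions; GGML's real quantized blocks are larger and laid out differently): the integer work stays in int32, and only one FP32 multiply per block is needed, so no full dequantized copy of the matrix is ever materialized.

```c
#include <stdint.h>

/* Toy Q8-style block: 4 int8 values sharing one FP32 scale.
   (Illustrative only -- not GGML's actual block format.) */
#define QK 4
typedef struct {
    float  scale;
    int8_t q[QK];
} block_q8;

/* Dot product computed directly on quantized blocks: the per-block
   sum is pure int32 arithmetic, and the scales are applied once per
   block, so the conversion-to-FP32 step for whole matrices is
   skipped entirely. */
static float vec_dot_q8(const block_q8 *a, const block_q8 *b, int nblocks) {
    float acc = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t isum = 0;
        for (int j = 0; j < QK; j++)
            isum += (int32_t)a[i].q[j] * (int32_t)b[i].q[j];
        acc += a[i].scale * b[i].scale * (float)isum;
    }
    return acc;
}
```

The hard part is that a generic BLAS sgemm gets cache blocking, register tiling, and vectorization essentially for free, and a hand-written kernel per quantization format has to reproduce all of that to win.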