I did a quick run under a profiler, and on my AVX2 laptop the slowest part (>50% of the time) was matrix multiplication (sgemm).
In the current version of GGML, if OpenBLAS is enabled, it converts the FP16 matrices to FP32 before calling sgemm.
If OpenBLAS is disabled, on the AVX2 platform it converts FP16 to FP32 on every FMA operation, which is even worse (the same elements get converted repeatedly). With that change, both ggml_vec_dot_f16 and ggml_vec_dot_f32 took first place in the profiler.
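To make the repeated-conversion cost concrete, here is a toy sketch in C. The `fp16_to_fp32` decode below is a hypothetical simplified version (normal numbers only) that I wrote for illustration; GGML itself uses hardware F16C conversion or a lookup table. The point is that the hot dot-product loop pays for two conversions per FMA:

```c
#include <stdint.h>

/* Hypothetical minimal FP16 -> FP32 decode (normal numbers and zero
   only), just to illustrate the cost; GGML uses hardware F16C or a
   table-based conversion instead. */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                              /* +/- zero */
    } else {
        /* rebias exponent: 15 (FP16) -> 127 (FP32), shift mantissa */
        bits = sign | ((exp + 112) << 23) | (man << 13);
    }
    union { uint32_t u; float f; } v = { bits };
    return v.f;
}

/* Per-element conversion inside the hot loop: every multiply-add
   pays for two FP16 -> FP32 conversions, mirroring the non-BLAS
   AVX2 path. Converting each row to FP32 once up front (the
   OpenBLAS path) amortizes this cost across the whole sgemm. */
static float dot_convert_inside(const uint16_t *a, const uint16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += fp16_to_fp32(a[i]) * fp16_to_fp32(b[i]);
    return acc;
}
```

In a real sgemm each matrix element participates in many FMAs, so converting inside the loop multiplies the conversion cost by the inner dimension.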
But I agree that, in theory, and only with AVX512-BF16, BF16 (not exactly FP16, but similar) will be fast thanks to the VDPBF16PS instruction. That implementation is not there yet.
I saw some discussion on llama.cpp that, theoretically, implementing matmul separately for each quantization format should be much faster, since it can skip the conversion step entirely. In practice, though, it's quite difficult to beat the various BLAS libraries, which are heavily optimized.
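The idea can be sketched like this. Below is a toy Q8-style dot product I made up for illustration (the block size and layout are assumptions; GGML's real quantized blocks are larger and laid out differently): the integer work stays in int32, and only one FP32 multiply per block is needed, so no full dequantized copy of the matrix is ever materialized.

```c
#include <stdint.h>

/* Toy Q8-style block: 4 int8 values sharing one FP32 scale.
   (Illustrative only -- not GGML's actual block format.) */
#define QK 4
typedef struct {
    float  scale;
    int8_t q[QK];
} block_q8;

/* Dot product computed directly on quantized blocks: the per-block
   sum is pure int32 arithmetic, and the scales are applied once per
   block, so the conversion-to-FP32 step for whole matrices is
   skipped entirely. */
static float vec_dot_q8(const block_q8 *a, const block_q8 *b, int nblocks) {
    float acc = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t isum = 0;
        for (int j = 0; j < QK; j++)
            isum += (int32_t)a[i].q[j] * (int32_t)b[i].q[j];
        acc += a[i].scale * b[i].scale * (float)isum;
    }
    return acc;
}
```

The hard part is that a generic BLAS sgemm gets cache blocking, register tiling, and vectorization essentially for free, and a hand-written kernel per quantization format has to reproduce all of that to win.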