Hacker News

I think MKL actually fixed Zen performance. That is, the workaround no longer makes any difference because it is no longer needed.

Odd. I am trying on my 3700X and it is definitely not using AVX, FMA or AVX2 code paths. Intel MKL 2020 update 2:

     ldd  ~/git/sticker2/target/release/sticker2  | grep mkl_intel
     libmkl_intel_lp64.so => /nix/store/jpjwkkv1dqk4nn8swjzr5qqzp0dpzk2f-mkl-2020.2.254/lib/libmkl_intel_lp64.so (0x00007fe786862000)
I checked the executed instructions with perf, and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 no longer enables AVX2 support as it used to.


Comparing OpenBLAS and MKL with `peakflops` in Julia, there's definitely an advantage for MKL:

    julia> using LinearAlgebra

    julia> BLAS.vendor()
    :openblas64

    julia> BLAS.set_num_threads(1)

    julia> peakflops()
    3.9023447970402664e10


    julia> using LinearAlgebra
    
    julia> BLAS.vendor()
    :mkl
    
    julia> BLAS.set_num_threads(1)
    
    julia> peakflops()
    4.8113846984735275e10
That's close to the ~50 GFLOP/s I saw in @celrod's benchmarks.


The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine unless I overrode the mkl_serv_intel_cpu_true function to always return true. However, that was with PyTorch.
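For anyone who wants to reproduce that override: the usual trick is an LD_PRELOAD shim whose definition of mkl_serv_intel_cpu_true resolves before MKL's own. A sketch (the symbol name is the one mentioned above; the file and program names are placeholders):

```c
/* fakeintel.c -- build as a shared object and preload it:
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 *   LD_PRELOAD=./libfakeintel.so ./your-mkl-program
 *
 * MKL calls this function to decide whether it is running on a
 * genuine Intel CPU; returning 1 unconditionally makes it select
 * the optimized code paths on AMD as well. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```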

I have now also compiled the ACE DGEMM benchmark and linked it against MKL with iomp:

    $ ./mt-dgemm 1000 | grep GFLOP
    GFLOP/s rate:         69.124168 GF/s
The most-used function is:

    mt-dgemm  libmkl_def.so       [.] mkl_blas_def_dgemm_kernel_zen
So, it is clearly using a Zen GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that causes PyTorch to end up on the slow SSE code path.


Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.

Conclusion: MKL now detects Zen, but it currently implements a Zen code path only for dgemm, not for sgemm. To get good performance for sgemm, you still have to fake being an Intel CPU.

Edit, longer description: https://github.com/pytorch/builder/issues/504


Hmm.

FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.

It's disappointing that MKL doesn't use optimized code paths on the 3700X.

I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.


It seems I have found the issue. We were both right. MKL now uses a Zen-optimized kernel for dgemm, but not (yet?) for sgemm. More details:

https://github.com/pytorch/builder/issues/504



