Hacker News

I think MKL actually fixed Zen performance. That is, the workaround no longer makes any difference because it is no longer needed.

Odd. I am trying on my 3700X and it is definitely not using AVX, FMA or AVX2 code paths. Intel MKL 2020 update 2:

     ldd  ~/git/sticker2/target/release/sticker2  | grep mkl_intel
     libmkl_intel_lp64.so => /nix/store/jpjwkkv1dqk4nn8swjzr5qqzp0dpzk2f-mkl-2020.2.254/lib/libmkl_intel_lp64.so (0x00007fe786862000)
I checked the executed instructions with perf, and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 no longer enables AVX2 support as it used to.


Comparing OpenBLAS and MKL with `peakflops` in Julia, there's definitely an advantage for MKL:

    julia> using LinearAlgebra

    julia> BLAS.vendor()
    :openblas64

    julia> BLAS.set_num_threads(1)

    julia> peakflops()
    3.9023447970402664e10


    julia> using LinearAlgebra
    
    julia> BLAS.vendor()
    :mkl
    
    julia> BLAS.set_num_threads(1)
    
    julia> peakflops()
    4.8113846984735275e10
That's close to the ~50 GFLOP/s I saw in @celrod's benchmarks.


The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine unless I overrode the mkl_serv_intel_cpu_true function to always return true. However, that was with PyTorch.
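For anyone who wants to reproduce that override: the usual trick is an LD_PRELOAD shim whose definition of mkl_serv_intel_cpu_true resolves before MKL's own. A sketch (the symbol name is the one mentioned above; the file and program names are placeholders):

```c
/* fakeintel.c -- build as a shared object and preload it:
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 *   LD_PRELOAD=./libfakeintel.so ./your-mkl-program
 *
 * MKL calls this function to decide whether it is running on a
 * genuine Intel CPU; returning 1 unconditionally makes it select
 * the optimized code paths on AMD as well. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```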

I have now also compiled the ACE DGEMM benchmark and linked it against MKL with iomp:

    $ ./mt-dgemm 1000 | grep GFLOP
    GFLOP/s rate:         69.124168 GF/s
The most-used function is:

    mt-dgemm  libmkl_def.so       [.] mkl_blas_def_dgemm_kernel_zen
So, it is clearly using a Zen GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that causes PyTorch to end up on the slow SSE code path.


Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.

Conclusion: MKL now detects Zen, but it currently implements a Zen code path only for dgemm, not for sgemm. To get good performance for sgemm, you still have to fake being an Intel CPU.

Edit, longer description: https://github.com/pytorch/builder/issues/504


Hmm.

FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.

It's disappointing that MKL doesn't use optimized code paths on the 3700X.

I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.


It seems I have found the issue. We were both right. MKL now uses a Zen-optimized kernel for dgemm, but not (yet?) for sgemm. More details:

https://github.com/pytorch/builder/issues/504



