Fixing Intel compiler’s unfair CPU dispatcher (Part 2/2)

Testing the effect of the patch on performance

Shoubhik R Maiti
CodeX

--

In the previous part of this post, I described a way of patching binary files compiled with Intel compilers, or linked to Intel MKL, to circumvent the unfair CPU dispatcher. The patch allows SIMD instructions and other optimizations to be used on AMD processors as well.

In this post, I will run benchmarks to see how much performance improves due to the patch. Please note that SIMD instructions only improve performance for numerical applications; if a program does not do heavy calculations, there is no benefit in patching it. Also note that binary patching is inherently risky, and there is always a possibility of breaking the software.

In all of the benchmarks, I used the Intel C/C++ compiler, Intel Fortran compiler and Intel MKL version 2021.6.0. The compiled executables were run on an AMD Ryzen 7 5800H processor on a Windows system.

Benchmark #1: Matrix multiplication with Fortran

Fortran provides an intrinsic function matmul() which multiplies two matrices. For this benchmark, I used a Fortran code that multiplies two square matrices of different sizes, filled with random real numbers (REAL*8, equivalent to C double on x64). The matrix multiplication is repeated 100 times, and the total run time is measured.

Please note that in this case, I am not using any BLAS routines from Intel MKL. I am using the intrinsic function to test how well the compiler can vectorize heavy numerical calculations. The source code can be found here. I compiled it with -QaxCORE-AVX2, as my laptop supports AVX2. Note that the default codepath on Windows is SSE2.
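For reference, the benchmark follows this pattern (a minimal sketch, not the exact linked source; the matrix size and repeat count shown are illustrative):

program matmul_bench
    implicit none
    integer, parameter :: n = 3000, nrep = 100
    real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(kind=8) :: t0, t1
    integer :: i
    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)    ! fill with random REAL*8 values
    call random_number(b)
    call cpu_time(t0)
    do i = 1, nrep
        c = matmul(a, b)     ! intrinsic matrix multiplication
    end do
    call cpu_time(t1)
    print *, c(1,1)          ! keep the result live so the loop is not optimized away
    print *, 'Total time (s):', t1 - t0
end program matmul_bench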

Result of matrix multiplication with Fortran

The results for the original and patched versions of the software are quite different. In the original (i.e. unpatched) version, the matrix multiplication uses SSE2 instructions. When the CPU dispatcher is patched to not discriminate against AMD processors, it runs the faster AVX2 codepath.

SSE2 XMM registers are 128 bits long, so they can work on 2 REAL*8 numbers (64 bits each) in one instruction, whereas AVX2 YMM registers are 256 bits long, so they can work on 4 REAL*8 numbers. AVX2 should therefore be roughly twice as efficient as SSE2, and that is what the plot shows. The difference between SSE2 and AVX2 is especially important for large matrix sizes (e.g. 3000x3000). On Linux you would see an even larger difference, because the default codepath there is plain x86, i.e. no SIMD instructions.

Benchmark #2: Matrix multiplication with Intel MKL

Intel MKL provides BLAS routines, among which DGEMM (double-precision general matrix multiplication) is very commonly used. In this benchmark, I used a Fortran code that calls DGEMM to multiply two square matrices of a fixed size, filled with random numbers. The matrix multiplication is repeated 100 times, and the run time is measured.

This time I used Intel MKL to see whether MKL uses the dispatcher to run a slower codepath on AMD. Intel claims to have solved the performance issues on non-Intel processors in version 2020.3 of MKL. The source code can be found here.
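The benchmark has the same structure as before, but with the matmul() call replaced by a call to MKL's DGEMM routine (again a minimal sketch; the matrix size is illustrative):

program dgemm_bench
    implicit none
    integer, parameter :: n = 3000, nrep = 100
    real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(kind=8) :: t0, t1
    integer :: i
    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    call cpu_time(t0)
    do i = 1, nrep
        ! C := 1.0*A*B + 0.0*C, all matrices n x n
        call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    end do
    call cpu_time(t1)
    print *, c(1,1)          ! keep the result live
    print *, 'Total time (s):', t1 - t0
end program dgemm_bench

Linking against MKL (e.g. with the -Qmkl compiler option on Windows) pulls in the DGEMM implementation.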

In this case, there does not seem to be much of a difference between the patched and unpatched versions of the software. Patching the CPUID checker does seem to improve performance by a very small degree. So it seems that Intel has indeed fixed MKL to use the correct vector instructions for non-Intel processors.

Another thing to notice is that Intel MKL's DGEMM routine is faster than Fortran's intrinsic matmul(). For example, DGEMM on a 3000x3000 matrix is about 25% faster than matmul() with AVX2. Developers of math libraries optimize their code heavily (with cache blocking, hand-tuned kernels and threading, for instance) to push efficiency as far as possible.

Benchmark #3: C++ dot product

The dot product of two arrays is an operation where you multiply corresponding elements of the arrays and then sum the results. I used a C++ code to calculate the dot product of two arrays. Because each step is a multiplication followed by an addition, FMA (fused multiply-add) instructions are very useful here.

In this case, I compiled it with -arch:pentium -QaxCORE-AVX2. I set the baseline codepath to plain x86 (old Pentium processors had no SIMD instructions), so that the difference between the default codepath and the SIMD-vectorized codepath is more apparent. Please note that the amount of calculation to be done in this case is quite small (compared to matrix multiplication, for example), so the speed difference from FMA vectorization will be small when expressed in seconds. The source code can be found here.
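The linked benchmark is written in C++; the sketch below renders the same pattern in Fortran to show the hot loop (the array length and repeat count are illustrative):

program dot_bench
    implicit none
    integer, parameter :: n = 10000000, nrep = 100
    real(kind=8), allocatable :: a(:), b(:)
    real(kind=8) :: s, t0, t1
    integer :: i, j
    allocate(a(n), b(n))
    call random_number(a)
    call random_number(b)
    s = 0.0d0
    call cpu_time(t0)
    do j = 1, nrep
        do i = 1, n
            s = s + a(i)*b(i)    ! multiply-then-add: the compiler can fuse this into FMA
        end do
    end do
    call cpu_time(t1)
    print *, s                   ! keep the result live
    print *, 'Total time (s):', t1 - t0
end program dot_bench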

Results of the array dot product test

As you can see, the FMA instructions do provide a slight boost, but the amount of calculation being done is not heavy enough to show a large difference in time, as mentioned above. In real scientific applications, there would be heavy calculations and SIMD would become important. (As an aside, modern scientific software increasingly uses GPU acceleration, which is also a form of SIMD, except that the SIMD “registers” are in the GPU rather than the CPU, and the number of “cores” and the size of the “registers” are much larger than what is available on a CPU.)

Other uses of patching

Enabling the Intel compiler's automatic CPU dispatch on AMD and other non-Intel processors is one benefit of patching. Another case where patching can be necessary is when the software is compiled with the -Qx flag. If this flag is used, the resulting binary won't even run on non-Intel systems.

Instead, you would see a fatal error message at startup saying that the program is not built to run on your processor.

In the unfortunate event that you get a binary compiled in this manner, patching the CPUID checker will remove this message and allow your code to run.

After patching, the error message is gone and the software runs perfectly well. (Note that if your CPU cannot run AVX2 and the binary was compiled for AVX2, you will still see an error message.)

Conclusion

The upshot of all of this is that you can take advantage of the Intel C++/Fortran compilers' auto-vectorization and automatic CPU dispatch on AMD and other non-Intel processors very easily. All you need is a Python installation.

One problem with this approach is that you still need different modifications for different non-Intel CPUs. The “fix” does not disable Intel's CPU dispatch; it just changes the CPUID vendor string that the dispatcher compares against. So if you patch the program and replace GenuineIntel with AuthenticAMD, the program now runs correctly only on AMD processors; the dispatcher will still not work fairly on a VIA Nano, for example.

Additionally, there are risks to patching a binary in this manner. Some software has legitimate reasons for checking the CPUID vendor string, for example to manage threading properly. (Many games do this because different CPUs need different thread scheduling to perform best.) And as I showed above, Intel MKL already performs well on AMD processors without any patch, so patching it is unnecessary.

Of course, Intel might obfuscate this CPU-checking code in the future to prevent patching. Intel may also do away with the CPU vendor check entirely and use only a fair dispatcher that works on CPUs of all brands, but that seems unlikely, given that Intel wants to provide the best performance on its own CPUs.

Thanks for reading! Please feel free to leave comments or questions in the responses.

The python script and benchmark data can be found here: https://github.com/shoubhikraj/intel-cpu-patch
