Fixing Intel compiler’s unfair CPU dispatcher (Part 1/2)

A simple patch that improves performance on AMD processors

Shoubhik R Maiti
CodeX

--

If you write code in C, C++ or Fortran, chances are you have heard of the Intel C/C++ and Fortran compilers. In scientific computing, Intel compilers are often preferred over GNU or LLVM compilers because they provide a significant boost when it comes to number crunching. For example, this benchmark shows the Intel C++ compiler outperforming various compiler vendors, running up to 50% faster than g++ in numerical computations. This is primarily because the Intel compiler is generally excellent at producing optimized machine code from source code.

An additional reason Intel compilers are used is that they have many compiler extensions (i.e. features that are not declared in the programming language standard) which provide greater control in coding. However, many programmers consider it bad practice to use non-standard compiler extensions, since such code is not portable. Intel compilers also work equally well on all platforms (Linux, macOS and Windows), but the same cannot be said for GNU, LLVM or MSVC compilers. In fact, many PC video games are compiled with the Intel C++ compiler. I am a researcher in computational chemistry; most software in my field does heavy numerical calculations, and as such most of it is compiled with Intel for distribution.

Most software that does numerical calculations uses linear algebra heavily (e.g. matrix multiplication, diagonalization etc.). These operations are used so regularly that optimized libraries are available for fast linear algebra, built around two standard interfaces: BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage). Common examples of such libraries include OpenBLAS, BLIS and Intel MKL. Here again, Intel's MKL (Math Kernel Library) usually performs the best, and many software packages (e.g. MATLAB) use Intel MKL for their linear algebra operations. These libraries run calculations fast because they use SIMD vector instructions, multithreading, cache optimization and so on.

One major problem with Intel's compiler and the MKL library is that the compiled software is optimized only for Intel CPUs. During compilation, the Intel compiler adds a little extra code that checks the vendor string from CPUID. If the vendor string is “GenuineIntel” (i.e. an Intel processor), the software uses the optimized code path, with SIMD instructions. If the vendor string is “AuthenticAMD” (i.e. an AMD processor) or anything else, the software runs the unoptimized path. (Intel seems to have fixed this for MKL in a recent update.)

In this post, I will write about why and how this CPU dispatch happens, and how you can patch a compiled software to “fix” the issue.

TL;DR: Skip to the How to fix section if you want to avoid the boring details and explanations. Also read the next part of this blog, where I will benchmark how much performance gain is possible from this patching.

A brief history

This unfair CPU vendor checking has been known for a long time. AMD wrote about the issue as far back as 2005, and the Danish computer scientist Agner Fog described it in detail on his blog in 2009:

Unfortunately, software compiled with the Intel compiler or the Intel function libraries has inferior performance on AMD and VIA processors. The reason is that the compiler or library can make multiple versions of a piece of code, each optimized for a certain processor and instruction set, for example SSE2, SSE3, etc. The system includes a function that detects which type of CPU it is running on and chooses the optimal code path for that CPU. This is called a CPU dispatcher. However, the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string. If the vendor string says “GenuineIntel” then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version.

The CPU dispatching is unfair, because the code does not check for the capabilities of the CPU (which would be perfectly reasonable), but rather it checks whether the CPU is Intel and does not execute code which would run perfectly on any non-Intel processor. Most modern AMD processors implement the same SIMD instructions that are available on Intel, so there is no reason to do this. Intel also did not choose to reveal this shortcoming of their compiler and libraries to the public.

This behaviour of the Intel compiler was one of the complaints in the lawsuit AMD filed against Intel in 2005; the two companies reached a legal settlement in 2009. Intel also reached a Federal Trade Commission settlement in 2010, which stipulated that Intel has to publicly disclose the fact that its software discriminates between Intel and non-Intel CPUs. This disclaimer is now found as a short sentence at the very end of Intel's page about performance:

Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.

In any case, Intel compilers and the Intel MKL library continue to show subpar performance on non-Intel processors, even now (May 2022). Since AMD is the other major supplier of processors for consumer laptops and desktops, this affects AMD the most. Legally, Intel is allowed to do this, because its software is not discriminating against AMD specifically, but only checking for processors of its own brand. Personally, I feel that Intel's behaviour is ethically questionable: while it does not specifically single out AMD processors, it is the functional equivalent of discrimination while staying within the constraints of the law. It clearly biases benchmark results in Intel's favour, and benchmark results inform many consumers' buying decisions.

This post, however, is not about discussing the ethics of big companies! So let's go into the details of what kind of optimizations are used and how Intel's CPU dispatch works.

What is SIMD?

The main difference between the optimized and unoptimized machine code in software compiled with the Intel compiler or MKL is the use of vector instructions, or SIMD (Single Instruction, Multiple Data). When you run a program, it sends instructions to your CPU, which fetches and executes them cycle by cycle.

In older processors, each instruction could only perform one arithmetic operation, e.g. one addition or subtraction. Modern processors, however, can perform the same operation on multiple pieces of data in one instruction. This is called SIMD, and it gives large performance benefits when number crunching is required.

It is easier to understand by looking at a concrete example in C/C++, where you add the elements of two arrays to produce a third array:

int a[100]; // contains some numbers
int b[100]; // contains some numbers
int c[100]; // empty array
// Now add a and b together
for (int i = 0; i < 100; i++) {
    c[i] = a[i] + b[i]; // each turn of the loop does one addition
}

Now if this type of code is compiled without SIMD, then when the loop executes, each instruction cycle will perform one addition i.e. it will add two integers.

What if you use SIMD? Most modern Intel and AMD processors implement a form of SIMD called AVX2, which has 256-bit registers. In C/C++, int is usually 32-bit, so each register can hold 8 integers. An AVX2 addition instruction (vpaddd) then adds all 8 of those integers at once; essentially, 8 passes of the for loop happen in one instruction. Since each cycle of the CPU takes the same time regardless of whether you use SIMD (the time is determined by the clock speed), using SIMD gives a massive speed boost when dealing with numbers.

There are different varieties of SIMD, with modern vector instruction sets supporting larger registers (i.e. more data) and a wider range of operations. The SIMD instruction sets include SSE, SSE2, SSE3, SSE4.1, SSE4.2, FMA, AVX, AVX2 and AVX-512, in rough order of release and increasing capability. Processors that support newer instruction sets almost always support the older ones as well. Most consumer CPUs implement up to AVX2, with AVX-512 only available on some high-end Intel processors.

Assembly code

When source code in C/C++ or Fortran is compiled, it is converted to machine code (binary), which is not human readable. The only way to understand it is to disassemble the binary into assembly language. In assembly, CPU instructions are denoted by names such as mov, cmp or vpaddd (mentioned before). Even then, assembly is quite difficult to read, because humans don't think in machine instructions!

On Windows, a binary (like *.obj or *.exe) can be disassembled by running dumpbin /disasm file.obj > output.txt from the Visual Studio command line. On Linux and macOS (and with mingw on Windows as well), binaries can be disassembled with objdump -D -Mintel file.o > output.txt. Note that the -Mintel flag chooses Intel syntax for the disassembly, which is the default on Windows; the syntax is just a personal preference.

A short section of the disassembly of a binary in Intel syntax, showing AVX-512 instructions

I will use assembly code in the next section to show how the CPU dispatch works. I will also explain what each instruction means when it comes up.

Intel compiler’s CPU dispatch

The type of CPU dispatch the Intel compiler uses depends on the compiler flags used. I am working on Windows, so the flags I use will be slightly different from Linux or macOS; however, I will mention the flags for those platforms as well.

Let’s take the following program in C++ to add two arrays filled with random numbers:

#include <array>
#include <cstdlib>
#include <ctime>
#include <algorithm>
#include <iostream>

void calc() {
    constexpr long long LEN_OF_ARR = 20000; // the size of array
    std::array<int, LEN_OF_ARR> a;
    std::array<int, LEN_OF_ARR> b;
    std::array<int, LEN_OF_ARR> c;
    std::srand( (unsigned int)std::time(nullptr) );
    // fill a with random numbers
    std::generate(a.begin(), a.end(), std::rand);
    // fill b with random numbers
    std::generate(b.begin(), b.end(), std::rand);
    for (size_t i = 0; i < LEN_OF_ARR; i++) {
        c[i] = a[i] + b[i]; // sum numbers
    }
    std::cout << "First number of vector sum of arrays: " << c[0];
}

// the calculation must be in a separate function, not in main()
int main() {
    calc();
}

Save this as test.cpp, then open the Intel compiler command line on Windows (on Linux, source the Intel setvars script) and run:

icl -arch:CORE-AVX2 -O3 test.cpp

Early versions of the Intel C++ compiler used no SIMD (i.e. plain x86) instructions if no compiler flags were given. The current version on Windows uses SSE2 instructions by default for all processors, including AMD (likely because SSE2 is required by the Windows x64 kernel, so it is not possible to run 64-bit Windows without it). On Linux, the default is plain x86 (i.e. no vectorization). The -arch:CORE-AVX2 flag (-march=core-avx2 on Linux) tells the compiler to use the AVX2 instruction set by default. However, this means the software won't run at all on any processor without AVX2, which is obviously not what software developers want. On my laptop the AVX2 build runs, but if I compile with -arch:COMMON-AVX512 (-march=common-avx512 on Linux), the program does not run because my processor does not support AVX-512. (There is another flag, -Qx or -x, which performs extra optimizations only for Intel CPUs; the resulting executable does not even run on AMD CPUs, it stops with an error message.)

Therefore, what many software developers do instead is take advantage of the compiler's multiple dispatch, using the -Qax flag:

icl -QaxCORE-AVX2,COMMON-AVX512 -arch:SSE2 -Ob1 -O3 test.cpp

The -arch flag determines the default codepath. The extra -QaxCORE-AVX2,COMMON-AVX512 flag (-axCORE-AVX2,COMMON-AVX512 on Linux) tells the compiler to generate additional dispatch pathways, one with AVX2 instructions and one with AVX-512 instructions. The compiler writes different versions of the same code and adds a dispatcher that selects at runtime which version to run, based on the capabilities of the processor (on Intel CPUs). The -Ob1 flag (-inline-level=1) prevents the compiler from inlining the function calc(); I will explain why that is necessary later.

This dispatcher first checks whether the CPU is Intel (“GenuineIntel” CPUID). If that check succeeds, the code then checks the capabilities of the processor (e.g. can it run AVX2 instructions?) and runs the best version. If the CPU is non-Intel, it does not check capabilities at all and simply runs the default codepath.

How does the dispatcher work?

First, disassemble the object file (*.obj on Windows) from the multiple dispatch compilation (i.e. with the -QaxCORE-AVX2,COMMON-AVX512 and -Ob1 flags). On Linux, you may need to add the -c flag to the compilation to get the object file (*.o).

It is easier to read the disassembly of the object file, because linking strips the function and section names, so they are no longer present in the executable file.

The disassembly should start at the main function. If not, look for the section starting with “main:”. Let's look at the first part:

main:
0000000000000000: 48 83 EC 28 sub rsp,28h
0000000000000004: B9 03 00 00 00 mov ecx,3
0000000000000009: 33 D2 xor edx,edx
000000000000000B: E8 00 00 00 00 call __intel_new_feature_proc_init
0000000000000010: 0F AE 5C 24 20 stmxcsr dword ptr [rsp+20h]
0000000000000015: 81 4C 24 20 40 80 or dword ptr [rsp+20h],8040h
00 00
000000000000001D: 0F AE 54 24 20 ldmxcsr dword ptr [rsp+20h]
0000000000000022: E8 00 00 00 00 call ?calc@@YAXXZ
0000000000000027: 33 C0 xor eax,eax
0000000000000029: 48 83 C4 28 add rsp,28h
000000000000002D: C3 ret
000000000000002E: 66 90 nop

When the main program starts, it immediately calls the function __intel_new_feature_proc_init(), which calls another function, __intel_cpu_features_init(). This is the function that reads the CPUID, and I believe it is the main culprit behind the unfair dispatching; I will describe it in detail later. After that function is called, the program goes through some instructions and calls the function calc() (here the name is mangled because of C++). Let's look at the assembly of that function now.

?calc@@YAXXZ (void __cdecl calc(void)):
0000000000000000: 48 83 EC 08 sub rsp,8
0000000000000004: 48 BA FF 97 9D 18 mov rdx,4189D97FFh
04 00 00 00
000000000000000E: 48 8B 05 00 00 00 mov rax,qword ptr [__intel_cpu_feature_indicator]
00
0000000000000015: 48 23 C2 and rax,rdx
0000000000000018: 48 3B C2 cmp rax,rdx
000000000000001B: 75 09 jne 0000000000000026
000000000000001D: 48 83 C4 08 add rsp,8
0000000000000021: E9 00 00 00 00 jmp ?calc@@YAXXZ.h ; Jump to AVX-512 version of calc
0000000000000026: 8B 05 00 00 00 00 mov eax,dword ptr [__intel_cpu_feature_indicator]
000000000000002C: 25 FF 97 9D 00 and eax,9D97FFh
0000000000000031: 3D FF 97 9D 00 cmp eax,9D97FFh
0000000000000036: 75 09 jne 0000000000000041
0000000000000038: 48 83 C4 08 add rsp,8
000000000000003C: E9 00 00 00 00 jmp ?calc@@YAXXZ.V ; Jump to AVX2 version of calc
0000000000000041: F6 05 00 00 00 00 test byte ptr [__intel_cpu_feature_indicator],1
01
0000000000000048: 74 09 je 0000000000000053
000000000000004A: 48 83 C4 08 add rsp,8
000000000000004E: E9 00 00 00 00 jmp ?calc@@YAXXZ.A ; Jump to SSE2 version of calc
0000000000000053: E8 00 00 00 00 call __intel_cpu_features_init
0000000000000058: EB B4 jmp 000000000000000E
000000000000005A: 66 0F 1F 44 00 00 nop word ptr [rax+rax]

After the function starts executing, rdx is set to a constant (4189D97FFh). Then the value of __intel_cpu_feature_indicator is loaded; I think this variable is set by the function __intel_cpu_features_init(). The and instruction keeps only the bits common to rax and rdx and puts the result in rax. Then rax and rdx are compared; they are equal only if the indicator contained every bit of the constant. I suspect the feature indicator is an integer with particular bits set, each bit denoting the presence of a certain feature. So this part of the code checks whether the CPU can run the highest instruction set available in the executable (here AVX-512). If yes, the code jumps (jmp) to ?calc@@YAXXZ.h. If not, it jumps (jne) to line 26, where the same check is done with a different constant, this time for the lower instruction set (here AVX2). If that passes, it jumps to ?calc@@YAXXZ.V; if not, it goes to line 41 and tests whether bit 0 of the indicator is set. If yes, it proceeds to ?calc@@YAXXZ.A; if not, it calls the CPU feature checker and starts the loop again.

Intel's feature indicator reports the processor capabilities only for Intel CPUs. For other CPUs, __intel_cpu_feature_indicator is always set to 1, so execution always ends up on the ?calc@@YAXXZ.A path.
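The and/cmp pattern in the assembly is a standard bitmask test, and it can be sketched in C++ as follows. The mask constants are taken straight from the disassembly above; what each individual bit means is my own guess.

```cpp
#include <cstdint>

// Sketch of the feature-indicator test seen in the disassembly.
constexpr std::uint64_t AVX512_MASK = 0x4189D97FFull; // the rdx constant in the first check
constexpr std::uint64_t AVX2_MASK   = 0x9D97FFull;    // the mask in the second check

bool path_supported(std::uint64_t indicator, std::uint64_t mask) {
    // "and rax,rdx ; cmp rax,rdx" is true only if every bit of mask is set
    return (indicator & mask) == mask;
}
```

With the indicator pinned to 1 on non-Intel CPUs, both mask tests fail and only the bit-0 test (the SSE2 default path) succeeds.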

Let’s look at the ?calc@@YAXXZ.h part:

?calc@@YAXXZ.h (?calc@@YAXXZ.h):
0000000000000000: B8 78 9F 24 00 mov eax,249F78h
.... (some lines)
000000000000007B: 62 C1 7C 48 10 64 vmovups zmm20,zmmword ptr [r13+rax*4+80h]
85 02
0000000000000083: 62 C1 7C 48 10 74 vmovups zmm22,zmmword ptr [r13+rax*4+0C0h]
85 03
000000000000008B: 62 C1 7D 40 FE 8C vpaddd zmm17,zmm16,zmmword ptr [r13+rax*4+0C3500h]
85 00 35 0C 00
0000000000000096: 62 C1 6D 40 FE 9C vpaddd zmm19,zmm18,zmmword ptr [r13+rax*4+0C3540h]
....

The 512-bit zmm registers indicate that this is AVX-512 code. The vmovups instructions load integers from memory into registers, and the vpaddd instructions then add 16 integers each in one instruction. This should translate to a large performance gain; however, integer addition is inherently so fast that you need a huge number of additions in your program before you see a noticeable difference.

What about ?calc@@YAXXZ.V ? Here you can see the 256-bit ymm registers for AVX2 code:

?calc@@YAXXZ.V
.... (some lines)
000000000000007E: C4 C1 7E 6F 64 85 vmovdqu ymm4,ymmword ptr [r13+rax*4+40h]
40
0000000000000085: C4 C1 7D FE 8C 85 vpaddd ymm1,ymm0,ymmword ptr [r13+rax*4+0C3500h]
00 35 0C 00
000000000000008F: C4 C1 6D FE 9C 85 vpaddd ymm3,ymm2,ymmword ptr [r13+rax*4+0C3520h]
....

The ?calc@@YAXXZ.A part should contain the slowest code, with SSE2, and that is what we find:

?calc@@YAXXZ.A
.... (some lines)
000000000000006C: 66 0F 6F 54 84 40 movdqa xmm2,xmmword ptr [rsp+rax*4+40h]
0000000000000072: 66 0F 6F 5C 84 50 movdqa xmm3,xmmword ptr [rsp+rax*4+50h]
0000000000000078: 66 0F FE 84 84 20 paddd xmm0,xmmword ptr [rsp+rax*4+0C3520h]
35 0C 00
....

The 128-bit xmm registers indicate SSE code and the movdqa and paddd instructions load the integers and perform the addition respectively.

So the dispatcher would work perfectly for AMD processors, were it not for the fact that __intel_cpu_feature_indicator is set to 1 for them. The value of that variable is set by a function, void __intel_cpu_features_init() (the name changes depending on the version of the compiler and MKL), that is inserted into the program during linking.

A CPU dispatcher like the one described above is also present in the Intel MKL library, so when you link to MKL, the dispatcher is automatically put into the software.

The function __intel_cpu_features_init() can be examined in assembly by compiling with the /debug:full linker flag and then disassembling the executable. This is needed because, by default, the Windows linker strips function names, so they cannot be recovered from the binary. I am not sure what happens on Linux, but I believe no extra flags are necessary.

This is how the function __intel_cpu_features_init() works:

__intel_cpu_features_init:
000000014001C7E0: B8 01 00 00 00 mov eax,1
000000014001C7E5: E9 16 00 00 00 jmp 000000014001C800
....
000000014001C867: 0F A2 cpuid
000000014001C869: 89 84 24 A0 00 00 mov dword ptr [rsp+0A0h],eax
00
000000014001C870: 89 9C 24 A4 00 00 mov dword ptr [rsp+0A4h],ebx
00
000000014001C877: 89 8C 24 A8 00 00 mov dword ptr [rsp+0A8h],ecx
00
000000014001C87E: 89 94 24 AC 00 00 mov dword ptr [rsp+0ACh],edx
00
....
000000014001D364: 81 BC 24 A4 00 00 cmp dword ptr [rsp+0A4h],756E6547h # "Genu"
00 47 65 6E 75
000000014001D36F: 75 90 jne 000000014001D301
000000014001D371: 81 BC 24 AC 00 00 cmp dword ptr [rsp+0ACh],49656E69h # "ineI"
00 69 6E 65 49
000000014001D37C: 75 83 jne 000000014001D301
000000014001D37E: 81 BC 24 A8 00 00 cmp dword ptr [rsp+0A8h],6C65746Eh # "ntel"
00 6E 74 65 6C
000000014001D389: 0F 85 72 FF FF FF jne 000000014001D301

After the function is called, the cpuid instruction is used to get the CPU vendor string, which is stored in memory. The crucial part consists of three pairs of cmp and jne instructions. The first cmp compares the first 4 bytes of the vendor string (at rsp+0A4h) with the constant 756E6547h. Most consumer computers are little-endian, so this constant is the byte sequence 47 65 6E 75, i.e. the character string Genu. If the values are not equal, the code jumps (jne) to another location, which eventually sets __intel_cpu_feature_indicator to 1. The next pair checks whether the second 4 bytes of the vendor string equal ineI, and the last pair compares the third 4 bytes with ntel, performing the same jump on any mismatch.

Effectively, the function checks whether the CPU vendor string matches GenuineIntel. If all three checks succeed, the code continues as normal and actually reports the vector instructions your CPU supports via __intel_cpu_feature_indicator.

You may have noticed that I put the code that does the actual calculation in a separate function called from main(), and disabled inlining (i.e. replacing a function call with the body of the function). This forces the Intel C++ compiler to make multiple dispatch copies of calc(). If you put the calculation directly in main(), the compiler makes dispatch copies of main() instead, and the CPUID check happens before main() is called, somewhere else in the software. That makes the binary difficult to patch (as I will explain in the next section); in fact, I have been unable to figure out where exactly the check happens when main() itself is dispatched, and my patch does not work in that case. So put your calculations in a separate function/subroutine etc., not in the main section of your code.

How to fix the unfair dispatcher?

If you have access to the source code, one way to fix the dispatcher is to add your own __intel_cpu_features_init() function to the source before compiling, so that it replaces Intel's CPU checker. This approach is described in detail by Agner Fog in his blog and his C++ optimization manual. However, source code is not always available, so this is of limited use. Many commercial software packages, like MATLAB, use the Intel MKL library, and their performance suffers on AMD because of this, although Intel seems to have fixed the MKL performance issue on AMD in a recent update.

Before 2020, Intel had an undocumented environment variable, MKL_DEBUG_CPU_TYPE, which could be set to avoid the unfair dispatching (setting it to 5 forced the AVX2 code path on AMD). But once news of this feature spread, Intel removed it in the next update.

Another simple way of patching, available without source code, is to target the cmp instructions that check the CPU vendor ID. Even in the compiled binary, the character string constants Genu, ineI and ntel are present. If we replace them with Auth, enti and cAMD, the CPU check will look for AMD processors instead, and the dispatcher will work correctly on AMD.
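The core of the idea fits in a few lines. This is only a minimal sketch operating on an in-memory copy of the binary; a real patcher should first verify that the three constants sit inside neighbouring cmp instructions, because a blind replace like this can also hit unrelated strings (any "Intel" text in the binary contains "ntel").

```cpp
#include <string>

// Rewrite the Intel vendor-string constants in a binary image so that the
// vendor check matches "AuthenticAMD" instead of "GenuineIntel".
void patch_vendor(std::string& image) {
    const char* from[3] = {"Genu", "ineI", "ntel"};
    const char* to[3]   = {"Auth", "enti", "cAMD"};
    for (int i = 0; i < 3; ++i) {
        // replace every occurrence of each 4-byte constant
        for (std::size_t p = 0;
             (p = image.find(from[i], p)) != std::string::npos; p += 4) {
            image.replace(p, 4, to[i]);
        }
    }
}
```

Applied to a buffer containing the literal string GenuineIntel, the three replacements turn it into AuthenticAMD, which is exactly what the three patched cmp instructions will then accept.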

This type of patching can be done with a hex editor by searching for the bytes in the file. I have made a simple Python script that does this automatically. The script looks for three cmp instructions close to each other containing the three character string constants of the Intel CPUID. If it finds such a pattern, it replaces the constants with Auth, enti, cAMD (i.e. AuthenticAMD). The patched code should then run the optimized paths on AMD. There is a patcher available here that works on a similar principle, but it only works on Linux, and it does not check for varying orders of the comparisons against the string constants (which newer versions of the Intel compiler and MKL use). My Python script runs on both Windows and Linux without any issue, and it checks for all variations in the order of the constants. However, it is slightly less safe, as it does not do many checks before patching.

The Python script can be found here in my GitHub repo. Please keep in mind that editing binary files and executables is always dangerous, so always keep a backup of the file you are modifying. The script replaces all instances it can find of the Intel CPUID check (GenuineIntel) with AuthenticAMD. I have tested it on my computer, and it works quite well. However, there is no guarantee that this kind of modification is safe. Some software has an anti-patching mechanism and stops working if you modify the executable. Software may also have legitimate reasons for checking whether the CPU is Intel (e.g. for choosing a threading strategy). Therefore, patching an executable with my Python script should be done with extreme caution. Nevertheless, patching the CPU dispatcher will show some performance gains, especially for numerical computations.

To patch a binary file, first run the python script:

python find_intel_replace.py myfile.exe

The script only runs on *.exe and *.dll files; I added this check to ensure it does not accidentally modify a file that is not a binary. On this first run, the script goes through the binary and reports the locations where it finds possible Intel CPUID checks, but it does not modify the file.

After confirming that the file contains Intel CPUID checks, you can modify it by adding the --force argument:

python find_intel_replace.py myfile.exe --force

To reiterate: do not use the script on a system DLL or EXE used by the OS, because that might break your computer. Even for user-installed software, make backups!

In the next part of this blog, I will discuss how much performance gain can be expected on AMD with this patch, and other possible uses.

Thanks for reading! Please feel free to leave comments or questions in the responses.

PhD student in computational chemistry. Interested in theoretical chemistry, programming and data science.