
Exclusive: Why Apple M1 Single “Core” Comparisons Are Fundamentally Flawed (With Benchmarks)

I have something pretty exciting for our readers today; something that almost everyone appears to have missed in the clamor for Apple M1 benchmark comparisons. What if I told you that pretty much all of the single-core benchmark comparisons between the Apple M1 and modern x86 processors you see online are fundamentally flawed (assuming the intent is to see which core is the fastest)? Because you see, most single “core” benchmarks out there do not fully saturate a modern x86 core – but they likely do saturate the M1.

Why x86 “single-core” benchmarks do not indicate actual single-core performance when comparing with a non-SMT architecture like the Apple M1

Our story begins with an industry dominated by x86 processors. Almost all x86 processors on the market today (with the exception of some older families that have the feature deliberately disabled) use an SMT implementation in their architecture. Enthusiasts will know this feature as HyperThreading (on Intel processors), although AMD has its own SMT implementation as well. You see, modern x86 cores are very wide, and a single thread in Windows is usually not enough to saturate the core and utilize all of its resources. This is why each core exposes two hardware threads through which it receives its workload. Here is a technical explanation:

It is worth noting that SMT philosophy is embedded in the design. The decode to uOP, and subsequent optimizations for scheduling through retirement (including intermediate issues such as instruction dependencies, pipeline bubbles and flushing, etc.), are a large part of why x86 embraced SMT. RISC load/store architectures simply have less front-end decoding complexity, versus decoupled CISC, and thus are able to obtain better Instruction per Thread, per clock. This is why dispatching multiple threads is required to maximize the performance of a single core (in x86).

-A friendly architect who wishes to not be named.

The “single-core” option in Cinebench (and pretty much all other benchmarks) is more accurately a “single thread” configuration.

Now here’s where the x86-dominated industry part comes in. Modern benchmarks, when run in “single-core” mode, actually put the entire load on a single thread. When you are comparing across SMT-based architectures, as is usually the case, that is an apples-to-apples comparison (ahem) because both cores are being similarly handicapped. However, when you are talking about a completely different, non-SMT architecture, it becomes a different story altogether. Unlike x86, Apple’s M1 is not SMT-based and needs only one thread to saturate the core (or at least that’s what Apple believes by virtue of its design philosophy).

By now, our regular readers will have started to see the problem. When you run a “single core” benchmark on an Apple M1, it uses every thread the core has (there is only one), but when you run the same benchmark on a modern x86 CPU, it uses only one of the two threads associated with the core. Keep in mind, however, that “half” is a bit misleading, because the SMT speedup is usually in the 20-30% range rather than anything close to double. Now there are two possible ways to deal with this problem and put the results on a more even footing.

The first method would be to turn off SMT so each core has only one thread associated with it, just like Apple. Unfortunately, this would be unfair to said processor because modern x86 processors are fundamentally designed to be used with SMT. It would not change anything either: there is virtually no difference between single-thread results with HT on and HT off.

The second method, then, would involve allowing the benchmark to utilize both threads associated with a single core. For the purposes of our tests, we used Thread 0 and Thread 1 (both of which report to Core 0), configured Cinebench to use only two threads in multi-core mode, and applied the aforementioned affinity through Task Manager. The results were enlightening, to say the least.
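For readers who want to reproduce the affinity setup outside Task Manager, here is a minimal Python sketch of the idea, not the procedure we actually used. The helper names `affinity_mask` and `pin_current_process` are hypothetical, and the sketch assumes logical CPUs 0 and 1 are SMT siblings on Core 0, as they were on our test systems; other operating systems may number siblings differently. `os.sched_setaffinity` is only available on Linux, so the pinning helper simply reports failure elsewhere.

```python
import os


def affinity_mask(logical_cpus):
    """Return a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU indices must be non-negative")
        mask |= 1 << cpu
    return mask


def pin_current_process(logical_cpus):
    """Pin this process to the given logical CPUs; return True if applied."""
    if hasattr(os, "sched_setaffinity"):  # present on Linux only
        os.sched_setaffinity(0, set(logical_cpus))  # pid 0 = current process
        return True
    return False  # no affinity call available on this platform


# Assumption: logical CPUs 0 and 1 are the two threads of physical core 0.
SMT_SIBLINGS = [0, 1]

print(hex(affinity_mask(SMT_SIBLINGS)))  # 0x3
```

The mask is just the two lowest bits set, which is the same selection as ticking only CPU 0 and CPU 1 in an affinity dialog. A benchmark limited to two worker threads and pinned this way can only ever run on one physical core.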

We ran some Intel and AMD samples to quickly verify our theory.

We saw an improvement of between 20% and 30% in “single-core” results when allowing x86 SMT-based processors to utilize the second thread associated with the same core. For those interested, Geekbench also saw an average improvement of 20-25% with the same technique. You can check out our verified 9980XE comparison here. A big shout out to Joel Hruska over at Extremetech for running the Ryzen 4800U benchmark for us while being on vacation! Based on our limited sample set, pretty much all of the current-generation high-efficiency x86 processors (read: mobility) would beat the Apple M1’s original single-core/single-thread score. We also threw in an older-generation desktop processor, the 9980XE, for good measure, and it saw similar gains.

Way forward: benchmark vendors need to move to SMT-enabled single-core tests to ensure better saturation of x86 cores when comparing across architectures

The speedup appears to have two components: a minimum gain that comes from the full resources of the core being utilized, and a variable part that depends on the boost behavior of the core. Considering that one of the major advantages of x86 over ARM-based CPUs is clock speed, SMT support becomes all the more important for delivering a clearer picture of true core performance. It goes without saying that this flaw does not impact multi-core results. Those are still valid, as Cinebench (and pretty much every other benchmark) utilizes all available threads for those.

Now here is the moment of truth that all of you have been waiting for. If you remember our original benchmark comparison, we showed you how Intel’s Tiger Lake platform actually outperformed the Apple M1 in single-core/single-thread results. We were excited to see what would happen once we allowed the processor to use two threads and, not surprisingly, it was in a league of its own. Compared to its original score of 1510, it saw a speedup of 19%, which comfortably puts it over the Apple M1 by a very wide margin.

Considering we had seen speedups of between 20% and 30%, we used the low end of that range (20%) as a conservative figure to estimate the rough performance of CPUs that we did not have lying around. Fair warning though: for scores marked with an asterisk, you should almost certainly wait for verified scores. These are included just to give you a rough idea of the position they will occupy.
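The arithmetic behind these figures is simple enough to sketch in a few lines of Python. The function name `project_smt_score` is ours, invented for illustration, and the score of 1400 is a made-up placeholder rather than a result for any real CPU. Only the 1510 score and the 19% and 20% figures come from this article.

```python
def project_smt_score(single_thread_score, speedup):
    """Scale a single-thread score by a fractional SMT speedup (0.20 = 20%)."""
    if speedup < 0:
        raise ValueError("speedup must be non-negative")
    return single_thread_score * (1.0 + speedup)


# Tiger Lake: original single-thread score of 1510 with the measured 19% speedup.
tiger_lake_two_threads = project_smt_score(1510, 0.19)
print(round(tiger_lake_two_threads))  # 1797

# Asterisked estimates: a flat 20% applied to a known single-thread score.
# 1400 is a hypothetical placeholder, not a real benchmark result.
hypothetical_estimate = project_smt_score(1400, 0.20)
print(round(hypothetical_estimate))  # 1680
```

A flat 20% is a blunt instrument, since the real gain depends on each chip's boost behavior, which is why the estimated scores should be treated as placeholders until verified runs are available.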

Wrapping up: One thing is clear though. Benchmark vendors, at the very least, need to add a testing mode that allows both threads associated with a single core to be utilized. This is very important to ensure full core saturation and a more even comparison between SMT-based and non-SMT architectures. Since x86 cores are not fully saturated with a single thread, comparing them against non-SMT architectures on single-thread results is not an apples-to-apples comparison.

We had to wrestle with our Cinebench R23 program to get it to accept the load: threads had to be locked to 2, and the affinity had to be set after initiating the run but before the benchmark actually started, then reapplied after the first pass. A cleaner execution would almost certainly be welcome. It would also allow us to test cores working at their full potential. A final shout out to software engineer qjvar for helping me out with this piece and confirming our hypothesis on an architectural level.
