CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion
Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein
TL;DR
The paper analyzes the MPI-only CloverLeaf mini-app from SPEChpc 2021 on Intel Ice Lake SP and Sapphire Rapids CPUs and uncovers performance breakdowns at prime process counts that are tied to write-allocate (WA) evasion. It develops first-principles memory-traffic models for the 22 hotspot loops, validates them with microbenchmarks and full-node measurements, and shows that SpecI2M, Intel's WA evasion feature, activates near memory-bandwidth saturation to reduce WA traffic, with a significant dependence on inner-loop length and data access patterns. By combining non-temporal stores, loop reorganizations, and SpecI2M, the authors achieve lower code balance and better performance, although the prime-number effect remains only partially explained and is more pronounced on Sapphire Rapids. The findings highlight the relevance of WA evasion mechanisms for memory-bound streaming codes and call for more transparent vendor documentation to enable precise quantitative modeling across architectures.
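To make the notion of code balance and WA traffic concrete, the following is a minimal sketch of a per-iteration traffic count for a streaming loop. It is not the authors' model: the function name `code_balance` and the parameter `wa_evaded_fraction` are hypothetical, and the sketch assumes double-precision data, that every stream is transferred to or from memory exactly once per iteration, and that written arrays are not also read.

```python
def code_balance(n_read, n_write, wa_evaded_fraction=0.0, word_bytes=8):
    """Bytes of memory traffic per loop iteration (illustrative model).

    n_read:  number of array streams that are only read
    n_write: number of array streams that are only written
    wa_evaded_fraction: assumed share of write-allocate transfers that
        the hardware (or non-temporal stores) manages to avoid, 0.0 to 1.0
    """
    reads = n_read * word_bytes
    writes = n_write * word_bytes
    # A write-allocate reads the target cache line before it is overwritten,
    # adding one extra read stream per written array unless it is evaded.
    write_allocates = n_write * word_bytes * (1.0 - wa_evaded_fraction)
    return reads + writes + write_allocates


# Copy loop a[i] = b[i]: one read stream, one write stream.
print(code_balance(1, 1))        # full write-allocate traffic
print(code_balance(1, 1, 1.0))   # write-allocates fully evaded
```

Under these assumptions a plain copy loop moves 24 bytes per iteration with write-allocates and 16 bytes when they are fully evaded, which is why WA evasion matters for memory-bound code.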
Abstract
In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.
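The link between prime process counts and short inner loops can be illustrated with a small sketch. It is not CloverLeaf's actual decomposition routine: the helper names, the grid width `nx=15360`, and the rule that the larger factor cuts the inner (x) dimension are assumptions chosen for illustration. The point is only that a prime count admits no factorization other than 1 × p, so the decomposition becomes one-dimensional.

```python
import math


def near_square_factors(n):
    """Return (a, b) with a * b == n, a <= b, and a as close to sqrt(n) as possible."""
    a = math.isqrt(n)
    while n % a != 0:
        a -= 1
    return a, n // a


def inner_loop_length(nprocs, nx=15360):
    """Approximate inner-loop length per process (hypothetical model).

    Assumption: the larger factor of the process grid cuts the inner (x)
    dimension of an nx-wide grid.
    """
    _, large = near_square_factors(nprocs)
    return nx // large


for p in (64, 71, 72):
    print(p, near_square_factors(p), inner_loop_length(p))
```

With these assumptions, 71 processes yield a 1 × 71 process grid and an inner-loop length of 216, while the neighboring count 72 yields 8 × 9 and an inner-loop length of 1706.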
