CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein

TL;DR

The paper analyzes the MPI-only CloverLeaf mini-app from SPEChpc 2021 on Intel Ice Lake SP and Sapphire Rapids CPUs and uncovers performance breakdowns at prime process counts that it traces to the behavior of write-allocate (WA) evasion. It develops first-principles memory-traffic models for the 22 hotspot loops, validates them with microbenchmarks and full-node measurements, and shows that SpecI2M activates near memory-bandwidth saturation to reduce WA traffic, with significant dependence on inner-loop length and data access patterns. By combining non-temporal stores, loop reorganizations, and SpecI2M, the authors achieve lower code balance and better performance, though the prime-number effect remains only partially explained and is more pronounced on Sapphire Rapids. The findings highlight the relevance of WA evasion mechanisms for memory-bound streaming codes and call for more transparent vendor details to enable precise quantitative modeling across architectures.

Abstract

In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.
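The link between prime process counts, one-dimensional decomposition, and short inner loops can be illustrated with a minimal sketch. It is not taken from the paper or from the CloverLeaf source: the grid size `NX`, the helper names, and the assumption that the larger factor of the rank count splits the inner dimension are all hypothetical choices made for illustration. The second helper shows the textbook effect of write-allocate on a store stream, where every stored cache line is first read from memory unless the transfer is evaded.

```python
import math

def most_square_factors(n_ranks):
    """Return (a, b) with a * b == n_ranks, a <= b, and a as large as possible."""
    a = math.isqrt(n_ranks)
    while n_ranks % a:
        a -= 1
    return a, n_ranks // a

def local_inner_length(nx, n_ranks):
    """Inner-loop length per rank.

    Assumption (hypothetical): the larger factor splits the inner (x) dimension.
    """
    _, px = most_square_factors(n_ranks)
    return nx // px

def store_traffic_bytes(n_elems, elem_bytes=8, write_allocate=True):
    """Memory traffic caused by storing n_elems elements.

    With write-allocate, each line is read once and written back once,
    so the traffic doubles compared to an evaded write-allocate.
    """
    factor = 2 if write_allocate else 1
    return factor * n_elems * elem_bytes

NX = 15360  # hypothetical global grid extent in x

for ranks in (70, 71, 72):
    a, b = most_square_factors(ranks)
    print(ranks, (a, b), local_inner_length(NX, ranks))
```

Under these assumptions, a prime rank count such as 71 admits only the factorization 1 × 71, so the local inner loop is several times shorter than for the neighboring counts 70 and 72.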

Paper Structure

This paper contains 17 sections, 2 equations, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Visualization of layer conditions (LCs) with a 2D 4-point stencil (right-hand side shown only). Shaded areas are cached elements. Top: LC broken, the cache is not large enough to hold three successive grid rows. Three out of four accesses are cache misses. Bottom: LC satisfied, the cache can fit at least three rows. Only one of four accesses is a miss.
  • Figure 2: Speedup of the CloverLeaf mini-app versus number of MPI processes on an Intel Ice Lake SP server with compact pinning. The dotted gray line marks the end of the first ccNUMA domain. The data points represent the median of ten separate runs; error bars are omitted because variations are negligible (maximum deviation of 2.5 %).
  • Figure 3: gprofng profile of a 72-rank run of CloverLeaf, showing the exclusive runtime of each function. The output was limited to the ten most time-consuming functions.
  • Figure 4: Code balance of the loops inside the hotspot functions of CloverLeaf on an Intel Ice Lake SP server. The dotted gray line indicates the end of the first ccNUMA domain. The data points represent the median of ten separate runs; error bars are omitted because fluctuations are negligible (3.6 %). Note the different y-axis scales of the two subfigures.
  • Figure 5: Relative distribution of code execution and MPI time for different numbers of ranks.
  • ...and 7 more figures
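
The layer condition described in the caption of Figure 1 can be written as a simple capacity check. The following is a minimal sketch, assuming double-precision elements and a cache that is fully available to the stencil rows; the function names and the 1 MiB cache size are hypothetical and not taken from the paper.

```python
def layer_condition_satisfied(row_len, cache_bytes, rows=3, elem_bytes=8):
    """True if `rows` successive grid rows of `row_len` elements fit in the cache."""
    return rows * row_len * elem_bytes <= cache_bytes

def misses_per_update(lc_ok):
    """Cache misses among the four right-hand-side accesses of the 2D 4-point stencil.

    Per the Figure 1 caption: one miss if the layer condition holds, three if it is broken.
    """
    return 1 if lc_ok else 3

CACHE_BYTES = 1024 * 1024  # hypothetical cache size of 1 MiB

for row_len in (1706, 100_000):
    ok = layer_condition_satisfied(row_len, CACHE_BYTES)
    print(row_len, ok, misses_per_update(ok))
```

Practical models often apply a safety factor to the cache size because other data competes for the same capacity; that refinement is left out here.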