Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa

Jan Laukemann; Georg Hager; Gerhard Wellein

TL;DR

This work analyzes the performance of the Grace, Sapphire Rapids, and Genoa CPUs and creates an accurate in-core performance model for their microarchitectures (Neoverse V2, Golden Cove, and Zen 4, respectively), extending the OSACA tool and comparing it with LLVM-MCA.

Abstract

With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities and up- and downsides of a single core, we extend our comparison by a variety of microbenchmarks and the capabilities of a full node. The "write-allocate (WA) evasion" feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.

Paper Structure (7 sections, 4 figures, 3 tables)

Introduction
Motivation
Brief overview of the in-core port models
Testbed and experimental methodology
Architectural Analysis
Case Study: Write-Allocate Evasion
Conclusion

Figures (4)

Figure 1: Arm Neoverse V2 core block diagram and port model, compiled from Arm's Software Optimization Guide [NeoverseSOG].
Figure 2: Sustained CPU clock frequency for arithmetic-heavy code on GCS, SPR, and Genoa across one chip. If no ISA extension is specified, the architecture could sustain the same frequency for all supported ISA extensions.
Figure 3: Relative prediction error of 416 test blocks for LLVM-MCA and OSACA. Bars right of the red dotted line indicate a prediction faster than the actual measurement while bars left of the line indicate a slower prediction.
Figure 4: Ratio of actual memory traffic to stored data volume vs. number of cores for a store-only benchmark loop (working set 40 GB). A value of $1.0$ indicates perfect WA evasion, while a value of $2.0$ indicates full WA traffic. The variants labeled "NT stores" use non-temporal store instructions, while the others use standard stores.
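
The metric plotted in Figure 4 can be illustrated with a minimal Python sketch. The function names `wa_traffic_ratio` and `wa_evasion_fraction` are hypothetical and not taken from the paper, and the byte counts in the usage lines are illustrative placeholders rather than measured results. The second function is only one possible way to rescale the ratio, assuming it lies between the two limits named in the caption.

```python
def wa_traffic_ratio(memory_traffic_bytes: float, stored_bytes: float) -> float:
    """Ratio of measured memory traffic to the data volume written by the loop.

    1.0 means only the stored data moved (perfect WA evasion);
    2.0 means every stored cache line was also read first (full WA traffic).
    """
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return memory_traffic_bytes / stored_bytes


def wa_evasion_fraction(ratio: float) -> float:
    """Hypothetical helper: rescale the ratio to the share of write-allocate
    traffic that was avoided (ratio 1.0 -> 1.0, ratio 2.0 -> 0.0).
    Values outside [1.0, 2.0] are clamped."""
    return min(1.0, max(0.0, 2.0 - ratio))


# Illustrative numbers only (a 40 GB store volume, as in the caption's working set).
stored = 40e9
full_wa = wa_traffic_ratio(80e9, stored)   # every line read before being written
perfect = wa_traffic_ratio(40e9, stored)   # no write-allocate reads at all
partial = wa_traffic_ratio(60e9, stored)   # half of the lines triggered a read
```

Under these assumptions, `full_wa`, `perfect`, and `partial` correspond to the upper limit, the lower limit, and the midpoint of the range described in the caption.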
