Performance bottlenecks detection through microarchitectural sensitivity
Hugo Pompougnac, Alban Dutilleul, Christophe Guillon, Nicolas Derumigny, Fabrice Rastello
TL;DR
The paper addresses the challenge of identifying performance bottlenecks in modern out-of-order (OoO) CPUs, where Performance Monitoring Unit (PMU) counters alone often fail to pinpoint the causal source of a slowdown. It introduces Gus, a sensitivity-oriented analyzer that combines dynamic binary instrumentation with an abstract, resource-centric CPU model. By simulating execution and perturbing the modeled microarchitectural resources, Gus can identify which resources causally limit performance. Throughput estimation and instruction-level sensitivity analyses are demonstrated on PolyBench-based microbenchmarks, where Gus is reported to reach state-of-the-art accuracy and to give richer bottleneck insights than existing tools. The approach provides a practical framework for automatic bottleneck discovery and for guiding performance optimization on complex CPU architectures.
Abstract
Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to make the most of hardware resources. We provide an in-depth overview of performance bottlenecks in recent OoO microarchitectures and describe the difficulties of detecting them. Techniques that measure resource utilization can offer a good understanding of a program's execution but, due to the constraints inherent to the Performance Monitoring Units (PMUs) of CPUs, do not provide the relevant metrics for every use case. Another approach is to rely on a performance model to simulate the CPU behavior. Such a model makes it possible to implement any new microarchitecture-related metric. Within this framework, we advocate for implementing modeled resources as parameters that can be varied at will to reveal performance bottlenecks. This allows a generalization of bottleneck analysis that we call sensitivity analysis. We present Gus, a novel performance analysis tool that combines the advantages of sensitivity analysis and dynamic binary instrumentation within a resource-centric CPU model. We evaluate the impact of sensitivity on bottleneck analysis over a set of high-performance computing kernels.
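The idea of treating modeled resources as parameters that can be varied can be illustrated with a minimal sketch. This is not Gus's model: it assumes a deliberately crude abstraction in which execution time is bounded by the busiest resource, and the resource names, demand figures, and the functions `estimated_cycles` and `sensitivity` are all hypothetical. It only shows the principle that a resource is a bottleneck if increasing its capacity reduces the estimated time.

```python
# Toy sensitivity analysis over a hypothetical resource-centric model.
# Assumption: estimated time = max over resources of (demand / capacity).
# This is an illustrative simplification, not the model used by Gus.

def estimated_cycles(demand, capacity):
    """Return the cycle estimate: the most loaded resource bounds execution."""
    return max(demand[r] / capacity[r] for r in demand)


def sensitivity(demand, capacity, factor=2.0):
    """Scale each resource's capacity in turn and report the resulting speedup."""
    baseline = estimated_cycles(demand, capacity)
    speedups = {}
    for resource in capacity:
        perturbed = dict(capacity)            # copy, so only one resource changes
        perturbed[resource] = capacity[resource] * factor
        speedups[resource] = baseline / estimated_cycles(demand, perturbed)
    return speedups


# Hypothetical kernel: work units requested from each resource.
demand = {"frontend": 400.0, "alu_ports": 300.0, "load_ports": 800.0}
# Hypothetical machine: work units each resource can serve per cycle.
capacity = {"frontend": 4.0, "alu_ports": 3.0, "load_ports": 2.0}

result = sensitivity(demand, capacity)
for resource, speedup in result.items():
    print(f"{resource}: x{speedup:.2f}")
```

In this made-up configuration only doubling `load_ports` changes the estimate, so it is the one resource the sketch would flag as a bottleneck. A speedup of 1.0 for the other resources means the estimate is insensitive to them at this operating point.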
