Performance Debugging through Microarchitectural Sensitivity and Causality Analysis

Alban Dutilleul, Hugo Pompougnac, Nicolas Derumigny, Gabriel Rodriguez, Valentin Trophime, Christophe Guillon, Fabrice Rastello

TL;DR

This work tackles the challenge of diagnosing performance bottlenecks in modern out-of-order CPUs by moving beyond analyses based purely on resource counters to a framework driven by sensitivity and causality. It introduces Gus, a portable, abstract, resource-centric simulator fed by dynamic binary instrumentation. Gus performs sensitivity analysis to identify bottleneck resources, and causality analysis to attribute execution time to specific instructions along dependencies. The approach is validated against a large suite of PolyBench kernels across Intel and Arm microarchitectures, showing superior accuracy and speed compared to cycle-level simulators, and is demonstrated through a case study on a correlation kernel that guides practical optimizations. Overall, Gus provides a practical, scalable method to pinpoint not only where bottlenecks occur but also why they occur, enabling targeted code and microarchitectural optimizations with broad hardware compatibility.

Abstract

Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to fully exploit the performance offered by hardware resources. Current performance debugging approaches rely either on measuring resource utilization, in order to estimate which parts of a CPU induce performance limitations, or on code-based analysis deriving bottleneck information from capacity/throughput models. These approaches are limited by instrumental and methodological precision, present portability constraints across different microarchitectures, and often offer factual information about resource constraints, but not causal hints about how to solve them. This paper presents a novel performance debugging and analysis tool that implements a resource-centric CPU model driven by dynamic binary instrumentation and is capable of detecting complex bottlenecks caused by an interplay of hardware and software factors. Bottlenecks are detected through sensitivity-based analysis, a form of model parameterization that uses differential analysis to reveal constrained resources. The tool also implements a new technique, which we call causality analysis, that propagates constraints to pinpoint how each instruction contributes to the overall execution time. To evaluate our analysis tool, we considered the set of high-performance computing kernels obtained by applying a wide range of transformations to the PolyBench benchmark suite and measured the precision on a few Intel and Arm microarchitectures. We also took one of the benchmarks (correlation) as an illustrative example of how our tool's bottleneck analysis can be used to optimize code.
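
The idea behind sensitivity-based analysis can be sketched with a toy model. The code below is not the tool's actual CPU model: it assumes, for illustration only, that execution time is set by the most saturated resource, and every resource name and number in it is hypothetical. It re-runs the model with each resource's capacity scaled in turn and reports the resulting speedup; a resource whose scaling shortens the run is a candidate bottleneck.

```python
from typing import Dict


def simulated_cycles(work: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Toy throughput model (an assumption): cycles per iteration are
    determined by the resource with the highest demand/capacity ratio."""
    return max(work[r] / capacity[r] for r in work)


def sensitivity(work: Dict[str, float],
                capacity: Dict[str, float],
                factor: float = 2.0) -> Dict[str, float]:
    """Differential analysis: scale one resource's capacity at a time and
    record the speedup relative to the unmodified baseline."""
    baseline = simulated_cycles(work, capacity)
    speedups = {}
    for resource in capacity:
        boosted = dict(capacity)            # copy, so the baseline stays intact
        boosted[resource] = capacity[resource] * factor
        speedups[resource] = baseline / simulated_cycles(work, boosted)
    return speedups


# Hypothetical per-iteration demand and per-cycle capacity of three resources.
work = {"load_ports": 2.0, "fma_ports": 2.0, "frontend": 5.0}
capacity = {"load_ports": 2.0, "fma_ports": 2.0, "frontend": 4.0}

result = sensitivity(work, capacity)
bottleneck = max(result, key=result.get)
print(result, bottleneck)
```

In this toy setting only the front end yields a speedup above 1 when its capacity is doubled, so it is the resource the analysis would flag. A resource can be heavily used and still show no sensitivity, which is the distinction from utilization-based approaches that the abstract draws.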

Paper Structure

This paper contains 37 sections, 2 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example in pseudo-asm code of a kernel computing ymm0 = ymm1*ymm3 + ymm1*ymm2 iteratively. Integer operations and branches supporting pointer arithmetic and loop iteration have been removed for clarity. ymm1 and ymm2 are loaded from two non-overlapping memory arrays. ymm3 is constant. In version (b), the vmovaps to ymm2 is hoisted out of the inner loop.
  • Figure 2: Simplified view of a pipelined OoO CPU core.
  • Figure 3: The four original formulas of TMA L1. On recent Intel microarchitectures, these are directly provided by ad-hoc PMC events.
  • Figure 4: Port occupancy over time during the execution of version (b) of the inner loop from Fig. 1. Subindices indicate the iteration to which each instruction corresponds and, in the case of vfmadds, whether it is the first or the second FMA in the loop.
  • Figure 5: Untransformed correlation kernel.
  • ...and 1 more figures