CesASMe and Staticdeps: static detection of memory-carried dependencies for code analyzers

Théophile Bastian, Hugo Pompougnac, Alban Dutilleul, Fabrice Rastello

TL;DR

This work tackles the challenge of statically predicting kernel throughput by exposing memory-carried dependencies as a key source of imprecision in code analyzers. It introduces CesASMe, a benchmarking framework that generates in-context, L1-resident microbenchmarks, lifts block-level predictions to kernel-level throughput, and compares them to hardware measurements. To address dependency blind spots, it proposes staticdeps, a heuristic that statically detects memory-carried dependencies across loop iterations and enhances analyzers such as uiCA, yielding significant accuracy improvements. The evaluation across thousands of microbenchmarks demonstrates that memory dependencies are a major bottleneck for static predictors, and that incorporating staticdeps into existing models can notably tighten prediction errors and better guide performance-oriented optimizations. Collectively, CesASMe and staticdeps provide a practical methodology and toolchain for robust evaluation and improvement of static throughput analyzers in real-world benchmarking contexts.

Abstract

A variety of code analyzers, such as IACA, uiCA, llvm-mca or Ithemal, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of a basic block. Facing this diversity, evaluating their strengths and weaknesses is important to guide both their usage and their enhancement. We present CesASMe, a fully-tooled solution to evaluate code analyzers on C-level benchmarks, composed of a benchmark derivation procedure that feeds an evaluation harness. We conclude that memory-carried data dependencies are a major source of imprecision for these tools. We tackle this issue with staticdeps, a static analyzer extracting memory-carried data dependencies, including across loop iterations, from an assembly basic block. We integrate its output into uiCA, a state-of-the-art code analyzer, to evaluate staticdeps' impact on a code analyzer's precision through CesASMe.
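
As a rough illustration of how block-level predictions might be lifted to a kernel-level figure and compared with a measurement, here is a minimal sketch. It assumes the lifting is an occurrence-weighted sum of per-block cycle predictions and that the error is a signed relative error; the text above does not spell out either formula, and the names `lift_to_kernel` and `relative_error` are hypothetical.

```python
def lift_to_kernel(block_cycles, block_occurrences):
    """Combine per-basic-block predictions into one kernel-level prediction.

    Assumption (not stated in the text above): the kernel cost is the sum of
    each block's predicted cycles per execution, weighted by how many times
    that block was executed.
    """
    return sum(block_cycles[block] * count
               for block, count in block_occurrences.items())


def relative_error(predicted, measured):
    """Signed relative error of a prediction against a hardware measurement."""
    return (predicted - measured) / measured


# Hypothetical numbers, for illustration only.
predicted = lift_to_kernel(
    block_cycles={"loop_body": 2.0, "epilogue": 5.0},
    block_occurrences={"loop_body": 100, "epilogue": 10},
)
error = relative_error(predicted, measured=200.0)
```

With these made-up inputs the analyzer would overestimate the kernel's cycle count, giving a positive relative error.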

Paper Structure (28 sections, 2 theorems, 6 equations, 5 figures, 4 tables)

This paper contains 28 sections, 2 theorems, 6 equations, 5 figures, 4 tables.

Key Result

Theorem 1

A dependency between two instructions that are separated by at least $R$ other $\mu$OPs can be ignored.
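
A minimal sketch of how this result could be used to prune a dependency list. It assumes the distance between two instructions is the number of $\mu$OPs strictly between them (Definition 1 is not reproduced here) and treats $R$ as a plain parameter; `Dependency`, `uops_between` and `prune_long_distance` are hypothetical names, not part of staticdeps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    producer: int  # index of the instruction producing the value
    consumer: int  # index of the later instruction consuming it


def uops_between(uop_counts, producer, consumer):
    """Count the µOPs of instructions strictly between producer and consumer.

    Assumption: this stands in for the paper's notion of distance.
    """
    return sum(uop_counts[producer + 1:consumer])


def prune_long_distance(deps, uop_counts, R):
    """Keep only dependencies separated by fewer than R µOPs."""
    return [d for d in deps
            if uops_between(uop_counts, d.producer, d.consumer) < R]


# Hypothetical instruction stream: µOP count of each instruction, in order.
uop_counts = [1, 2, 1, 3, 1]
deps = [Dependency(0, 1), Dependency(0, 4)]
kept = prune_long_distance(deps, uop_counts, R=4)
```

Here the dependency from instruction 0 to instruction 4 spans six intervening $\mu$OPs, so it is dropped for `R=4`, while the adjacent pair is kept.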

Figures (5)

  • Figure 1: Our analysis and measurement environment.
  • Figure 2: Relative error distribution w.r.t. perf
  • Figure 3: Statistical distribution of relative errors
  • Figure 4: Statistical distribution of relative errors, with and without pruning the rows that are latency-bound through memory-carried dependencies (llvm-mca outliers trimmed)
  • Figure 5: Statistical distribution of relative errors of uiCA, with and without staticdeps hints, with and without pruning the rows that are latency-bound through memory-carried dependencies

Theorems & Definitions (3)

  • Definition 1: Distance between instructions
  • Theorem 1: Long distance dependencies
  • Lemma 1: Distance of in-flight $\mu$OPs