A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications

Lukas Trümper, Philipp Schaad, Berke Ates, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler

TL;DR

This work addresses the robustness of automatic loop scheduling in complex applications, where different loop permutations of the same computation can perform very differently. It introduces a priori loop nest normalization, built on maximal loop fission and stride minimization, which maps loop nests with different memory-access patterns to a canonical form that a single optimization recipe can handle. The daisy auto-scheduler combines this normalization with a similarity-based transfer-tuning database, enabling cross-variant and cross-language optimization (C, Python, Fortran). It reports speedups of up to $2.97\times$ over Polly and $7.03\times$ over the Tiramisu scheduler, plus a $10\%$ speedup on CLOUDSC. By making performance less sensitive to how a computation is written, the approach eases optimization across large code bases, and it is demonstrated on both synthetic benchmarks and a production-scale cloud-physics simulation.

Abstract

The same computations are often expressed differently across software projects and programming languages. In particular, how computations involving loops are expressed varies due to the many possibilities to permute and compose loops. Since each variant may have unique performance properties, automatic approaches to loop scheduling must support many different optimization recipes. In this paper, we propose a priori loop nest normalization to align loop nests and reduce the variation before the optimization. Specifically, we define and apply normalization criteria, mapping loop nests with different memory access patterns to the same canonical form. Since the memory access pattern is susceptible to loop variations and critical for performance, this normalization allows many loop nests to be optimized by the same optimization recipe. To evaluate our approach, we apply the normalization with optimizations designed for only the canonical form, improving the performance of many different loop nest variants. Across multiple implementations of 15 benchmarks using different languages, we outperform a baseline compiler in C on average by a factor of $21.13$, state-of-the-art auto-schedulers such as Polly and the Tiramisu auto-scheduler by $2.31$ and $2.89$, as well as performance-oriented Python-based frameworks such as NumPy, Numba, and DaCe by $9.04$, $3.92$, and $1.47$. Furthermore, we apply the concept to the CLOUDSC cloud microphysics scheme, an actively used component of the Integrated Forecasting System, achieving a 10% speedup over the highly-tuned Fortran code.
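
To make the idea of a canonical form concrete, the following is a minimal sketch of a stride-based loop ordering for row-major arrays. It is not the paper's algorithm or daisy's implementation: the function names `access_strides` and `normalize_order`, the representation of an access as a `(shape, index_vars)` pair, and the tie-breaking rule are all assumptions made for illustration.

```python
def access_strides(shape, index_vars):
    """Stride (in elements) each loop variable induces on a row-major array.

    shape      -- array extents, e.g. (N, M)
    index_vars -- loop variable indexing each dimension, e.g. ("i", "j") for A[i][j]
    """
    strides = {}
    step = 1
    # Walk dimensions from the last (contiguous) one to the first.
    for extent, var in zip(reversed(shape), reversed(index_vars)):
        strides[var] = strides.get(var, 0) + step
        step *= extent
    return strides


def normalize_order(loop_vars, accesses):
    """Return loops ordered outermost to innermost, smallest total stride innermost.

    Hypothetical helper: sums the strides over all accesses and sorts on the sum.
    """
    total = {v: 0 for v in loop_vars}
    for shape, index_vars in accesses:
        for var, stride in access_strides(shape, index_vars).items():
            total[var] += stride
    # Largest stride outermost; ties broken by name so the result is deterministic.
    return sorted(loop_vars, key=lambda v: (-total[v], v))


# Two variants of the same computation over A[i][j], written with different loop orders.
accesses = [((4, 8), ("i", "j"))]
variant_a = normalize_order(["i", "j"], accesses)
variant_b = normalize_order(["j", "i"], accesses)
print(variant_a, variant_b)
```

Both variants end up with the same loop order, so an optimization written for that order applies to either. A real normalizer would also have to check that the permutation preserves data dependences, which this sketch ignores.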

Paper Structure (34 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Characterization of loop nests.
  • Figure 2: Loop nest code samples subject to normalization.
  • Figure 3: Lifting a symbolic representation of loop nests with high-level information from source code translated to LLVM IR.
  • Figure 4: The normalization pipeline in two steps: Maximal loop fission and stride minimization.
  • Figure 5: Comparison of our model with state-of-the-art auto-scheduling methods and the icc compiler. The runtime is expressed relative to the runtime of the A variant of the benchmarks using daisy. Hence, a lower value is better. The implementation of the Tiramisu scheduler could not be applied to some of the benchmarks successfully. We mark those with X.
  • ...and 6 more figures
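
The Figure 4 caption names loop fission as the first normalization step. As a rough illustration, assuming the usual compiler meaning of loop fission (splitting one loop into several when its statements do not depend on each other), the sketch below shows a fused loop and its fissioned equivalent. The functions `fused` and `fissioned` are hypothetical examples, not code from the paper.

```python
def fused(a, b):
    """One loop carrying two independent statements."""
    n = len(a)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):
        x[i] = 2.0 * a[i]  # statement S1
        y[i] = b[i] + 1.0  # statement S2, reads and writes nothing S1 touches
    return x, y


def fissioned(a, b):
    """The same computation with each statement in its own loop."""
    n = len(a)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):
        x[i] = 2.0 * a[i]  # S1 only
    for i in range(n):
        y[i] = b[i] + 1.0  # S2 only
    return x, y


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(fused(a, b) == fissioned(a, b))
```

After fission, each loop touches fewer arrays, so a loop order can be chosen for each one separately. The split is only legal here because S1 and S2 share no data; statements linked by a dependence cycle would have to stay in the same loop.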