Inductive Loop Analysis for Practical HPC Application Optimization

Philipp Schaad, Tal Ben-Nun, Patrick Iff, Torsten Hoefler

TL;DR

HPC applications hinge on complex loop nests with non-affine, strided data accesses that challenge traditional polyhedral optimizers. The paper introduces symbolic, inductive loop optimization (SILO), a framework that models data accesses and dependencies as functions of loop strides to unlock parallelism and reduce data movement. Key contributions include a formal inductive analysis of loops, consumer-producer and dependency-elimination techniques, a memory-schedule pass with automatic software prefetching and pointer incrementation, and a prototype implementation that achieves speedups of up to 12× on real atmospheric modeling workloads. SILO complements existing HPC toolchains by exposing latent parallelism and providing fine-grained memory optimizations that are difficult for affine-based methods, with demonstrated gains on atmospheric kernels, NPBench, and matrix-multiplication workloads.

Abstract

Scientific computing applications heavily rely on multi-level loop nests operating on multidimensional arrays. This presents multiple optimization opportunities from exploiting parallelism to reducing data movement through prefetching and improved register usage. HPC frameworks often delegate fine-grained data movement optimization to compilers, but their low-level representations hamper analysis of common patterns, such as strided data accesses and loop-carried dependencies. In this paper, we introduce symbolic, inductive loop optimization (SILO), a novel technique that models data accesses and dependencies as functions of loop nest strides. This abstraction enables the automatic parallelization of sequentially-dependent loops, as well as data movement optimizations including software prefetching and pointer incrementation to reduce register spills. We demonstrate SILO on fundamental kernels from scientific applications with a focus on atmospheric models and numerical solvers, achieving up to 12$\times$ speedup over the state of the art.
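To make the idea of "dependencies as functions of loop nest strides" concrete, the following is a toy sketch and not the paper's algorithm. It assumes a hypothetical loop `for i in range(0, n, stride): A[i] = A[i - read_offset] + 1` and asks whether a read-after-write (RAW) dependence is carried between iterations. The function names and the loop shape are illustrative assumptions; the point is only that the answer depends on how the offset relates to the stride.

```python
def raw_distance_bruteforce(stride, read_offset, n):
    """Simulate the hypothetical loop
           for i in range(0, n, stride): A[i] = A[i - read_offset] + 1
       and return the iteration distance of the first loop-carried RAW
       dependence, or None if no iteration reads a previously written element."""
    written_at = {}  # element index -> iteration number that wrote it
    for it, i in enumerate(range(0, n, stride)):
        src = i - read_offset
        if src in written_at:
            return it - written_at[src]
        written_at[i] = it
    return None


def raw_distance_closed_form(stride, read_offset, n):
    """Same question answered from the stride alone (assumes stride > 0)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if n <= 0 or read_offset <= 0 or read_offset % stride != 0:
        return None  # reads never land on an element written earlier
    last_index = ((n - 1) // stride) * stride  # largest i the loop visits
    if read_offset > last_index:
        return None  # the matching write would lie outside the loop range
    return read_offset // stride


if __name__ == "__main__":
    for stride, offset in [(2, 1), (2, 2), (3, 6), (4, 6)]:
        print(stride, offset, raw_distance_closed_form(stride, offset, 20))
```

In this sketch, a stride of 2 with a read offset of 1 carries no RAW dependence, so the iterations could run in parallel, whereas the same offset with a stride of 1 would serialize the loop. An analysis that treats the stride as an opaque value cannot tell these cases apart.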

Paper Structure

This paper contains 30 sections, 5 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Variable loop strides cause missed optimization opportunities, even in modern polyhedral compilers.
  • Figure 2: An overview of the SILO architecture. Parallelism is improved through optimization passes in tandem with HPC frameworks' optimizations, while more fine-grained memory schedule optimizations are performed during lowering of the IR.
  • Figure 3: Loop-carried dependencies preventing parallelization of the k-loop. The dataflow graph of the i-loop body shows that reads from A are dominated by a write within the same iteration, potentially allowing for privatization.
  • Figure 4: Resolved write-after-write (WAW) and write-after-read (WAR) dependencies enable DOACROSS parallelization with synchronization if only read-after-write (RAW) dependencies prevent naive parallelization.
  • Figure 5: Generating prefetch hints prepares prefetch unit for unexpected strides.
  • ...and 4 more figures