Inductive Loop Analysis for Practical HPC Application Optimization
Philipp Schaad, Tal Ben-Nun, Patrick Iff, Torsten Hoefler
TL;DR
HPC applications hinge on complex loop nests with non-affine, strided data accesses that challenge traditional polyhedral optimizers. The paper introduces symbolic, inductive loop optimization (SILO), a framework that models data accesses and dependencies as functions of loop strides to expose parallelism and reduce data movement. Key contributions include a formal inductive analysis of loops, consumer-producer and dependency-elimination techniques, a memory-schedule pass with automatic software prefetching and pointer incrementation, and a prototype implementation that achieves speedups of up to 12× on real atmospheric modeling workloads. SILO complements existing HPC toolchains by exposing latent parallelism and applying fine-grained memory optimizations that are difficult for affine-based methods, with demonstrated gains on atmospheric kernels, NPBench, and matrix-multiplication workloads.
Abstract
Scientific computing applications heavily rely on multi-level loop nests operating on multidimensional arrays. This presents multiple optimization opportunities, from exploiting parallelism to reducing data movement through prefetching and improved register usage. HPC frameworks often delegate fine-grained data movement optimization to compilers, but their low-level representations hamper analysis of common patterns, such as strided data accesses and loop-carried dependencies. In this paper, we introduce symbolic, inductive loop optimization (SILO), a novel technique that models data accesses and dependencies as functions of loop nest strides. This abstraction enables the automatic parallelization of sequentially dependent loops, as well as data movement optimizations including software prefetching and pointer incrementation to reduce register spills. We demonstrate SILO on fundamental kernels from scientific applications with a focus on atmospheric models and numerical solvers, achieving up to 12× speedup over the state of the art.
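To make "pointer incrementation" concrete, the following is a minimal, hypothetical sketch and not SILO's implementation. It assumes a flat buffer accessed through a 2D strided view; the function names, and the use of an integer offset as a stand-in for a pointer, are illustrative choices that do not come from the paper. The first version recomputes the full offset from the loop indices on every access, while the second carries a running offset and advances it by the loop stride.

```python
def sum_indexed(data, n_i, n_j, stride_i, stride_j):
    """Recompute the offset i*stride_i + j*stride_j on every access."""
    total = 0.0
    for i in range(n_i):
        for j in range(n_j):
            total += data[i * stride_i + j * stride_j]
    return total


def sum_incremented(data, n_i, n_j, stride_i, stride_j):
    """Carry a running offset and advance it by the loop stride instead."""
    total = 0.0
    row = 0  # offset of element (i, 0); stands in for a pointer (assumption)
    for _ in range(n_i):
        off = row
        for _ in range(n_j):
            total += data[off]
            off += stride_j  # step to the next column
        row += stride_i      # step to the next row
    return total


# Hypothetical 3x4 strided view over a flat buffer of 24 elements.
data = [float(x) for x in range(24)]
assert sum_indexed(data, 3, 4, 8, 2) == sum_incremented(data, 3, 4, 8, 2)
```

Both functions visit the same elements in the same order. In compiled code, the second form replaces per-access index arithmetic with a single addition per iteration, which is the kind of transformation the abstract associates with reduced register pressure.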
