Tightening I/O Lower Bounds through the Hourglass Dependency Pattern

Lionel Eyraud-Dubois; Guillaume Iooss; Julien Langou; Fabrice Rastello

TL;DR

The paper targets data movement lower bounds (I/O complexity) for linear algebra kernels by introducing the hourglass dependency pattern. It adapts the $K$-partitioning method to exploit this pattern, yielding tighter, parametric lower bounds for MGS, A2V, V2Q, GEBD2, and GEHD2, and shows tiled upper-bound algorithms that asymptotically match these bounds. The results are automated in the IOLB tool, enabling automatic derivation of I/O bounds for new kernels exhibiting hourglass patterns. This work sharpens the understanding of memory-bound behavior in core linear algebra routines and provides practical guidance for schedulers and tiling strategies to minimize data movement. The findings have broad relevance for performance and energy efficiency in high-performance computing workloads involving QR factorizations and related reductions.

Abstract

When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itself. The objective of I/O complexity analysis is to compute, for a given program, its minimal I/O requirement among all valid schedules. We consider a sequential execution model with two memories, an infinite one, and a small one of size S on which the computations retrieve and produce data. The I/O is the number of reads and writes between the two memories. We identify a common "hourglass pattern" in the dependency graphs of several common linear algebra kernels. Using the properties of this pattern, we mathematically prove tighter lower bounds on their I/O complexity, which improves the previous state-of-the-art bound by a parametric ratio. This proof was integrated inside the IOLB automatic lower bound derivation tool.

Paper Structure (43 sections, 9 theorems, 41 equations, 9 figures)

Introduction
Contributions
Outline
Background - I/O complexity and the K-partitioning method
Memory model and I/O complexity
CDAG and red-white pebble game
$K$-partitioning method
Upper bound on the size of a $K$-bounded set
The hourglass pattern
Intuition of the hourglass pattern
Intuition
Running example
Consequences of the hourglass pattern
Hourglass pattern - formal definition
Preliminary notations
...and 28 more sections

Key Result

Theorem 1

Let $S$ be the size of the small memory, and for any $T>0$ let $U$ be the maximal size of an $(S+T)$-partition. Let $V$ be the set of nodes of the CDAG of the program. Then, a lower bound on the number $Q$ of data movements of the program is:
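
The displayed inequality did not survive in this extract. As a hedged sketch only, the usual segment-counting argument behind $(S+T)$-partitioning gives a bound of the following shape. It assumes that every node of $V$ is computed exactly once, that a schedule is cut into segments of $T$ loads each, and that the nodes computed in one segment form an $(S+T)$-bounded set of size at most $U$. At least $\lceil |V|/U \rceil$ segments are then needed, and all but possibly the last one perform $T$ loads:

$$
Q \;\ge\; T \cdot \left( \left\lceil \frac{|V|}{U} \right\rceil - 1 \right)
$$

The exact rounding and the treatment of input nodes in the paper's own statement may differ from this sketch.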

Figures (9)

Figure 1: Modified Gram-Schmidt - Right-Looking (from Polybench [polybench]). The input matrix $A$ is of size $M \times N$, and the outputs of the algorithm are the matrices $Q$ (the orthonormalized column vector basis) and $R$ such that $A=QR$. The usual right-looking Gram-Schmidt reuses the matrix $A$ instead of defining a new matrix $Q$. $SR$ and $SU$ are the labels of the two statements that update $R$ and $A$, respectively.
Figure 2: Shape of an hourglass pattern, inside the dependence graph. A node is an instance of a statement of the program, and an edge is a data dependency between two nodes. The $t$ dimension is an external loop surrounding the hourglass.
Figure 3: QR Householder computation - Part A2V (LAPACK routine GEQR2).
Figure 4: Summary of the new asymptotic data movement lower bounds.
Figure 5: Data movement lower bounds (with constants) automatically derived by IOLB [Olivry_pldi20] without/with hourglass detection. In GEHD2's new bound, a new parameter $M$ is introduced, corresponding to the place where we split the outer loop. Depending on $S$ and $N$, it can be instantiated with a different parametric expression (cf. the section on the GEHD2 lower bound).
...and 4 more figures
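
The kernel described in the Figure 1 caption can be sketched as follows. This is a minimal pure-Python illustration, not the Polybench source: the function name `mgs_right_looking`, the list-of-lists layout, and the separate `Q` array are assumptions made for readability. The comments mark the two statements named in the caption, $SR$ (a reduction over the rows into one entry of $R$) and $SU$ (an update of a whole column of $A$ using that single entry), which is presumably the narrow-then-wide dependency shape the hourglass refers to.

```python
import math


def mgs_right_looking(A):
    """Right-looking Modified Gram-Schmidt on an M x N list-of-lists.

    Returns (Q, R) with A = Q R. Assumes the columns of A are
    linearly independent, so no R[k][k] is zero.
    """
    M, N = len(A), len(A[0])
    A = [row[:] for row in A]            # work on a copy of the input
    Q = [[0.0] * N for _ in range(M)]
    R = [[0.0] * N for _ in range(N)]
    for k in range(N):
        # Normalize column k of the (already updated) matrix A.
        nrm = sum(A[i][k] ** 2 for i in range(M))
        R[k][k] = math.sqrt(nrm)
        for i in range(M):
            Q[i][k] = A[i][k] / R[k][k]
        for j in range(k + 1, N):
            # SR: reduce M products into the single value R[k][j].
            for i in range(M):
                R[k][j] += Q[i][k] * A[i][j]
            # SU: use that single value to update all M entries of column j.
            for i in range(M):
                A[i][j] -= Q[i][k] * R[k][j]
    return Q, R
```

Calling `mgs_right_looking([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])` returns a `Q` with orthonormal columns and an upper-triangular `R` whose product reproduces the input up to rounding.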

Theorems & Definitions (12)

Theorem 1: $(S+T)$-partitioning I/O lower bound [Elango15]
Theorem 2: Brascamp-Lieb theorem [Christ13]
Lemma 3: Structure of $E'$
Lemma 4: Bounds on the size of some of the projections of $I'$
Theorem 5: Lower bounds for MGS
Theorem 6: Lower bounds for HH - part A2V
Theorem 7: Lower bound for HH - Part V2Q
...and 2 more
