On Fine-Grained I/O Complexity of Attention Backward Passes
Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang
TL;DR
This work addresses the I/O bottlenecks of attention in long-context transformers by analyzing data movement between a fast cache and large memory using the red-blue pebble game. It derives a tight backward-pass I/O bound that scales as $\Theta\left(\min\left\{\frac{n^2 d^2 + n d^3}{M}, \frac{n^2 d + n d^2}{\sqrt{M}}\right\}\right)$ with a crossover at $M=\Theta(d^2)$, and shows FlashAttention is optimal in the large-cache regime while introducing a new efficient small-cache algorithm. The paper also provides fine-grained lower bounds for sparse attention and integrates these results with existing forward-pass analyses to yield a complete I/O complexity picture for attention. These findings offer practical guidance for designing hardware-aware, memory-efficient training and inference pipelines for large language models.
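As a rough illustration of the bound above, the following sketch evaluates the two terms inside the minimum and reports which one is smaller. The function names `backward_io_bound` and `active_regime` are hypothetical, the constants hidden by $\Theta(\cdot)$ are dropped, and $M$ is assumed to be counted in the same units (matrix entries) as $n$ and $d$, so the numbers indicate scaling only, not measured I/O.

```python
import math


def backward_io_terms(n: int, d: int, M: int) -> tuple[float, float]:
    """Return the two terms inside the min, with Theta-constants dropped."""
    large_cache_term = (n**2 * d**2 + n * d**3) / M
    small_cache_term = (n**2 * d + n * d**2) / math.sqrt(M)
    return large_cache_term, small_cache_term


def backward_io_bound(n: int, d: int, M: int) -> float:
    """Evaluate min{(n^2 d^2 + n d^3)/M, (n^2 d + n d^2)/sqrt(M)}."""
    return min(backward_io_terms(n, d, M))


def active_regime(n: int, d: int, M: int) -> str:
    """Name the term that attains the minimum for the given sizes."""
    large_cache_term, small_cache_term = backward_io_terms(n, d, M)
    return "large-cache" if large_cache_term <= small_cache_term else "small-cache"


if __name__ == "__main__":
    n, d = 4096, 64  # illustrative sequence length and head dimension
    for M in (d * d // 4, d * d, 4 * d * d):
        print(M, active_regime(n, d, M), backward_io_bound(n, d, M))
```

The ratio of the first term to the second is $d/\sqrt{M}$, so with constants dropped the two terms coincide exactly at $M = d^2$, which is the crossover stated above.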
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in processing long-context information. However, the quadratic scaling of attention computation with sequence length creates significant efficiency bottlenecks and motivates I/O-aware algorithms. In this work, we systematically study the I/O complexity of attention mechanisms, focusing on the backward pass in both the small-cache and large-cache settings. Using the red-blue pebble game framework, we establish tight I/O complexity bounds across all cache sizes. We confirm that FlashAttention, one of the current industry standards, is optimal in the large-cache setting for both the forward and backward passes. For the small-cache setting, we introduce a new algorithm that outperforms existing methods and attains the tight bound. We further extend the analysis to sparse attention, establishing fine-grained lower bounds for both the forward and backward passes across all cache sizes. Together, these results complete the theoretical picture of I/O complexity in attention mechanisms and offer guidance for designing efficient LLM training and inference systems.
