MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano

TL;DR

This work tackles the quadratic $\Theta(N^2 d)$ time and $\Theta(N^2)$ space bottleneck of softmax attention in Transformers. It introduces MonarchAttention, which replaces dense attention with a sub-quadratic Monarch-matrix approximation by optimizing the variational form of softmax under a Monarch-structure constraint, achieving $\Theta(N \sqrt{N} d)$ time and $\Theta(N d)$ memory. The method is zero-shot transferable and hardware-friendly, offering substantial wall-clock speedups on modern GPUs while preserving accuracy across vision and language tasks. Across ViT, RoBERTa, BART, DiT, and GraphGPS, MonarchAttention demonstrates competitive performance with significant reductions in attention FLOPs, enabling longer sequences and faster training/inference.
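To make the gap between the two complexity classes concrete, the following minimal sketch compares the leading-order terms $N^2 d$ and $N\sqrt{N}d$ at a few sequence lengths. The function names and the head dimension `d = 64` are illustrative assumptions, and all constant factors are dropped, so the ratios are only asymptotic indicators and not predictions of measured wall-clock speedups.

```python
import math


def dense_attention_cost(n: int, d: int) -> float:
    """Leading-order term of dense softmax attention, N^2 * d (constants dropped)."""
    return float(n * n * d)


def monarch_attention_cost(n: int, d: int) -> float:
    """Leading-order term quoted for MonarchAttention, N * sqrt(N) * d (constants dropped)."""
    return n * math.sqrt(n) * d


def asymptotic_ratio(n: int, d: int) -> float:
    """Ratio of the two leading-order terms; algebraically this is sqrt(N)."""
    return dense_attention_cost(n, d) / monarch_attention_cost(n, d)


if __name__ == "__main__":
    d = 64  # hypothetical head dimension, chosen only for illustration
    for n in (256, 4096, 16384):
        print(f"N={n:>6}: N^2 d / (N sqrt(N) d) = {asymptotic_ratio(n, d):.0f}")
```

The ratio does not depend on `d` and grows as $\sqrt{N}$, which is why the benefit of a sub-quadratic method becomes more pronounced as sequences get longer.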

Abstract

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention, a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the Transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
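The "variational form of softmax" mentioned in the abstract can be illustrated with the standard identity $\mathrm{softmax}(z) = \arg\max_{a \in \Delta} \langle a, z \rangle + H(a)$, where $\Delta$ is the probability simplex and $H$ is the Shannon entropy; the maximum value equals $\log \sum_i e^{z_i}$. The sketch below checks this identity numerically for a single unconstrained row. It is not the paper's algorithm: the Monarch-structure constraint is not implemented here, and all function names are hypothetical.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_sum_exp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def objective(a, z):
    """Variational objective <a, z> + H(a), using the convention 0 * log(0) = 0."""
    inner = sum(ai * zi for ai, zi in zip(a, z))
    entropy = -sum(ai * math.log(ai) for ai in a if ai > 0.0)
    return inner + entropy


def random_simplex_point(rng, n):
    """Uniform sample from the probability simplex via normalized exponentials."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # one row of attention logits
a_star = softmax(z)
best = objective(a_star, z)

# No other feasible point should score higher than softmax(z).
smallest_gap = min(
    best - objective(random_simplex_point(rng, len(z)), z) for _ in range(1000)
)

if __name__ == "__main__":
    print(f"objective at softmax(z): {best:.6f}")
    print(f"log-sum-exp of z:        {log_sum_exp(z):.6f}")
    print(f"smallest gap to random feasible points: {smallest_gap:.6f}")
```

The gap between the optimum and any other simplex point is the KL divergence from that point to $\mathrm{softmax}(z)$, so it is never negative. Restricting the feasible set to a structured family, as the abstract describes for Monarch matrices, turns this identity into a constrained optimization problem.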

Paper Structure

This paper contains 45 sections, 36 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Approximation of softmax attention via MonarchAttention. By directly optimizing the softmax variational objective constrained to Monarch matrices, MonarchAttention yields accurate zero-shot approximation to softmax attention compared to other hardware-friendly, efficient attention baselines. Attention maps are extracted from RoBERTa on the SQuAD dataset (see the experiments section).
  • Figure 2: Zero-shot conversion of attention layers for image classification and question answering. We vary hyperparameters for various baselines to evaluate the model quality vs. compute tradeoff. Left: Top-5 accuracy vs. total attention FLOPs across all layers for ViT on ImageNet. Right: F1 score vs. total attention FLOPs across all layers for RoBERTa on SQuAD.
  • Figure 3: Zero-shot conversion of attention layers for long sequence summarization. We vary the sequence length of the text to be summarized to evaluate the model quality vs. compute tradeoff. We report recall-based ROUGE-1 and ROUGE-L scores vs. total attention FLOPs across all layers for BART on BookSum-chapters.
  • Figure 4: Python-like code for MonarchAttention. Each kernel materializes all intermediate arrays in SRAM to reduce data movement.
  • Figure 5: Visual quality of generated images for zero-shot conversion of attention layers. Example images generated by DiT with softmax (left), MonarchAttention (middle), and Nyströmformer (right). Only the first half of the attention layers of DiT are replaced.
  • ...and 3 more figures