Flash Window Attention: speedup the attention computation for Swin Transformer

Zhendong Zhang

TL;DR

This work tackles the high computational cost of attention in high-resolution vision models by adapting flash-style attention to the window attention used in the Swin Transformer. It introduces Flash Window Attention, which tiles the computation along the feature dimension and keeps the entire (short) attention matrix on-chip, enabling efficient forward and backward passes. The approach improves attention computation efficiency by up to 300% and end-to-end runtime efficiency by up to 30% on GPUs, demonstrated with a Triton/PyTorch implementation on an RTX 4090. While effective for typical window sizes, the report acknowledges limitations for very large windows and points to future work on broader window patterns.
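To make the idea of tiling along the feature dimension concrete, here is a minimal pure-Python sketch, not the authors' Triton kernel. It assumes a single window with queries, keys, and values given as `L x C` nested lists; the function names and the `chunk` parameter are hypothetical. The score matrix `Q K^T` is a sum over feature indices, so it can be accumulated one feature chunk at a time while the small `L x L` matrix stays resident (on a GPU, this is the part that would live in on-chip memory). The output can then be produced one feature chunk at a time as well.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    result = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def attention_reference(q, k, v):
    """Plain attention for one window: softmax(Q K^T / sqrt(C)) V."""
    length, channels = len(q), len(q[0])
    scale = 1.0 / math.sqrt(channels)
    scores = [
        [scale * sum(q[i][c] * k[j][c] for c in range(channels)) for j in range(length)]
        for i in range(length)
    ]
    probs = softmax_rows(scores)
    return [
        [sum(probs[i][j] * v[j][c] for j in range(length)) for c in range(channels)]
        for i in range(length)
    ]


def attention_feature_tiled(q, k, v, chunk=2):
    """Same result as attention_reference, but Q, K, V are read in feature chunks.

    `chunk` (hypothetical parameter) is the number of feature columns
    processed per step.
    """
    length, channels = len(q), len(q[0])
    scale = 1.0 / math.sqrt(channels)

    # Phase 1: accumulate the full L x L score matrix chunk by chunk.
    scores = [[0.0] * length for _ in range(length)]
    for start in range(0, channels, chunk):
        stop = min(start + chunk, channels)
        for i in range(length):
            for j in range(length):
                scores[i][j] += sum(q[i][c] * k[j][c] for c in range(start, stop))

    probs = softmax_rows([[scale * s for s in row] for row in scores])

    # Phase 2: write the output one feature chunk at a time.
    out = [[0.0] * channels for _ in range(length)]
    for start in range(0, channels, chunk):
        stop = min(start + chunk, channels)
        for i in range(length):
            for c in range(start, stop):
                out[i][c] = sum(probs[i][j] * v[j][c] for j in range(length))
    return out
```

Because the sequence length inside a window is short, the `L x L` matrix is small enough to hold in full, so no online-softmax rescaling across sequence tiles is needed in this sketch.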

Abstract

To cope with the large number of pixels in high-resolution images, the Swin Transformer introduces window attention. This mechanism divides an image into non-overlapping windows and restricts attention computation to within each window, significantly improving computational efficiency. To optimize this process further, one might consider replacing standard attention with flash attention, which has proven more efficient in language models. However, a direct substitution is ineffective: flash attention is designed for long sequences, whereas window attention deals with short sequences but must process many of them in parallel. In this report, we present an optimized solution called Flash Window Attention, tailored specifically to window attention. Flash Window Attention improves attention computation efficiency by up to 300% and end-to-end runtime efficiency by up to 30%. Our code is available online.
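The window partition described above can be illustrated with a short pure-Python sketch. It assumes the feature map is an `H x W` grid stored as a list of rows and that both sides are divisible by the window size; `window_partition` is a hypothetical helper name, not a function from the paper's code.

```python
def window_partition(grid, window):
    """Split an H x W grid into non-overlapping window x window tiles.

    Returns one flattened sequence (length window * window) per tile,
    ordered left to right, then top to bottom.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid sides must be divisible by the window size")

    sequences = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            sequences.append(
                [
                    grid[row][col]
                    for row in range(top, top + window)
                    for col in range(left, left + window)
                ]
            )
    return sequences


# A 4 x 4 grid of token ids split into four 2 x 2 windows.
grid = [[4 * row + col for col in range(4)] for row in range(4)]
windows = window_partition(grid, 2)
print(windows)  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Attention is then computed independently inside each returned sequence, which is why the workload consists of many short sequences rather than one long one.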

Paper Structure (10 sections, 4 equations, 5 figures, 2 algorithms)

Introduction
Methodology
Problem Formulation
Tiling along Feature Dimension
Implementation and Analysis
Benchmark
Attention Computation
End-to-End Running
Compare with Flash Attention
Discussion

Figures (5)

Figure 1: Comparison of forward attention computation. Left: $C=64$; right: $C=256$. Bars show running time; lines show memory usage. Note that $batch$ denotes the number of sequences after window partition, not the batch size.
Figure 2: Comparison of forward-backward attention computation. Left: $C=64$; right: $C=256$. Bars show running time; lines show memory usage.
Figure 3: Comparison of end-to-end running time of the Swin Transformer.
Figure 4: Comparison of forward attention computation. Left: $C=64$; right: $C=256$. Bars show running time; lines show memory usage. Note that $batch$ denotes the number of sequences after window partition, not the batch size.
Figure 5: Comparison of forward-backward attention computation. Left: $C=64$; right: $C=256$. Bars show running time; lines show memory usage.
