Hilbert-Guided Block-Sparse Local Attention

Yunge Li, Lanyu Xu

TL;DR

The paper tackles the high computational cost of global self-attention on high-resolution images by introducing Hilbert-guided local attention patterns. Reordering image tokens along a Hilbert curve makes the tokens of a 2D window or neighborhood contiguous in the 1D sequence, which increases the number of empty blocks and reduces the number of partial blocks in the attention mask, so block-sparse kernels can skip more work. The paper presents two end-to-end models, the Hilbert Window Transformer (HWT) and the Hilbert Neighborhood Transformer (HNT), built around Hilbert Window Attention (HWA), Hilbert Slide Attention (HSA), and Hilbert Neighborhood Attention (HNA). HWA and HSA accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively, and the end-to-end models achieve speedups with minimal accuracy loss on ImageNet. The approach is hardware-agnostic through FlexAttention, and the code is available for replication and extension.

Abstract

The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images. The code is available at https://github.com/Yunge6666/Hilbert-Local-Attention.
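The reordering step described in the abstract can be illustrated with a minimal sketch. It uses the standard iterative distance-to-coordinate conversion for a Hilbert curve on a power-of-two grid; the function names (`hilbert_d2xy`, `hilbert_permutation`, `windows_1d`) are hypothetical and are not taken from the authors' repository, and the grid size and window length are chosen only for illustration.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    Assumes n is a power of two. Standard iterative conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_permutation(n):
    """perm[d] is the row-major index of the d-th token along the curve."""
    coords = (hilbert_d2xy(n, d) for d in range(n * n))
    return [y * n + x for x, y in coords]


def windows_1d(perm, window_len):
    """Cut the reordered 1D sequence into non-overlapping windows."""
    return [perm[i:i + window_len] for i in range(0, len(perm), window_len)]


# Illustrative setting: a 4x4 feature map and windows of 4 tokens.
perm = hilbert_permutation(4)
windows = windows_1d(perm, 4)
print(windows)
# [[0, 1, 5, 4], [8, 12, 13, 9], [10, 14, 15, 11], [7, 6, 2, 3]]
```

In this small case each run of four consecutive tokens in the Hilbert order covers exactly one $2\times2$ patch of the original grid, which is the contiguity property that the block-sparse argument relies on.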

Paper Structure

This paper contains 17 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Speedup from block sparsity. When the sequence length is fixed, a higher ratio of empty blocks leads to faster computation, with the effect especially pronounced at high sparsity ($>80\%$). Additionally, longer sequences yield greater speedups.
  • Figure 2: Row-major sequence vs. Hilbert-curve sequence
  • Figure 3: Local attention patterns. (1) In this case, the feature map is assumed to be $4\times4$, so there are $16$ tokens. The window size is $2\times2$, and $b_q = b_k = 4$. (2) Hilbert reordering not only preserves spatial locality but also increases the proportion of empty blocks in Hilbert local attention patterns, thereby reducing computational and memory-access overhead. This lets block-sparse attention accelerate local attention computation further.
  • Figure 4: The architecture of Hilbert Window Transformer. After the token sequence is reordered according to the Hilbert curve, windows are constructed on the 1D sequence for window attention. Cross-window interaction is achieved by shifting these windows along the 1D sequence. Although the windows formed in the Hilbert-ordered sequence may correspond to irregular shapes in the original 2D space, the tokens within each window remain spatially adjacent in the 2D image.
  • Figure 5: The architecture of Hilbert Neighborhood Transformer. Hilbert reordering enables the extraction of neighborhoods on the 1D token sequence while preserving spatial proximity. As a result, the 2D neighborhood attention can be converted into a 1D neighborhood attention.
  • ...and 4 more figures
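
The setting in the Figure 3 caption ($4\times4$ feature map, $2\times2$ windows, $b_q = b_k = 4$) is small enough to check by hand. The sketch below classifies every (query block, key block) pair of the window-attention mask as full, partial, or empty under row-major and Hilbert orderings. It is an illustrative reconstruction under those stated assumptions, not the authors' code; `hilbert_order`, `window_id`, and `classify_blocks` are hypothetical helper names.

```python
def hilbert_order(n):
    """Row-major token indices listed in Hilbert-curve order (n a power of two)."""
    order = []
    for d in range(n * n):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        order.append(y * n + x)
    return order


def window_id(token, n, w):
    """Index of the w x w window that contains a row-major token."""
    row, col = divmod(token, n)
    return (row // w) * (n // w) + col // w


def classify_blocks(order, n, w, block):
    """Count full, partial, and empty blocks of the window-attention mask.

    order[p] is the original token placed at sequence position p.
    A query may attend to a key only if both lie in the same 2D window.
    """
    counts = {"full": 0, "partial": 0, "empty": 0}
    n_blocks = len(order) // block
    for qb in range(n_blocks):
        q_tokens = order[qb * block:(qb + 1) * block]
        for kb in range(n_blocks):
            k_tokens = order[kb * block:(kb + 1) * block]
            allowed = sum(
                window_id(q, n, w) == window_id(k, n, w)
                for q in q_tokens for k in k_tokens
            )
            if allowed == 0:
                counts["empty"] += 1
            elif allowed == block * block:
                counts["full"] += 1
            else:
                counts["partial"] += 1
    return counts


n, w, block = 4, 2, 4
row_major = classify_blocks(list(range(n * n)), n, w, block)
hilbert = classify_blocks(hilbert_order(n), n, w, block)
print(row_major)  # {'full': 0, 'partial': 8, 'empty': 8}
print(hilbert)    # {'full': 4, 'partial': 0, 'empty': 12}
```

Under these assumptions, row-major ordering leaves 8 of the 16 blocks partially filled, while Hilbert ordering yields 4 full blocks and 12 empty ones. A block-sparse kernel can skip empty blocks entirely and needs no per-element masking on full blocks, which matches the caption's point about reduced compute and memory access.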