Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention

Dongheon Lee, Seokju Yun, Jaegyun Im, Youngmin Ro

TL;DR

This paper proposes Rank-factorized Implicit Neural Bias (RIB), an alternative to RPB that enables FlashAttention in SR Transformers, and introduces a convolutional local attention and a cyclic window strategy to fully leverage the long-range interactions enabled by RIB and FlashAttention.

Abstract

Recent Super-Resolution (SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representational capacity. However, most SR Transformers rely heavily on relative positional bias (RPB), which prevents them from leveraging hardware-efficient attention kernels such as FlashAttention. This limitation imposes a prohibitive computational burden during both training and inference, severely restricting attempts to scale SR Transformers by enlarging the training patch size or the self-attention window. Consequently, unlike other domains that actively exploit the inherent scalability of Transformers, SR Transformers remain heavily focused on effectively utilizing limited receptive fields. In this paper, we propose Rank-factorized Implicit Neural Bias (RIB), an alternative to RPB that enables FlashAttention in SR Transformers. Specifically, RIB approximates positional bias using low-rank implicit neural representations and concatenates them with pixel content tokens in a channel-wise manner, turning the element-wise bias addition in attention score computation into a dot-product operation. Further, we introduce a convolutional local attention and a cyclic window strategy to fully leverage the advantages of long-range interactions enabled by RIB and FlashAttention. We enlarge the window size up to $96\times96$ while jointly scaling the training patch size and the dataset size, maximizing the benefits of Transformers in the SR task. As a result, our network achieves 35.63 dB PSNR on Urban100 ($\times$2), while reducing training and inference time by 2.1$\times$ and 2.9$\times$, respectively, compared to the RPB-based SR Transformer (PFT).
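
The step of turning bias addition into a dot product can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: the positional factors `Uq` and `Uk` are random stand-ins for whatever the implicit neural representation would predict, and the sizes `N`, `D`, and `r` are arbitrary. It relies only on the identity $\mathbf{Q}\mathbf{K}^{\top}/\sqrt{D} + \mathbf{U}_q\mathbf{U}_k^{\top} = [\mathbf{Q}/\sqrt{D},\ \mathbf{U}_q]\,[\mathbf{K},\ \mathbf{U}_k]^{\top}$, which holds for any bias of rank at most $r$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D, r = 64, 30, 8  # tokens per window, head dim, bias rank (illustrative values)

Q = rng.standard_normal((N, D))
K = rng.standard_normal((N, D))
V = rng.standard_normal((N, D))

# Hypothetical low-rank positional factors (random stand-ins, assumption).
Uq = rng.standard_normal((N, r))
Uk = rng.standard_normal((N, r))
B = Uq @ Uk.T  # rank-r positional bias of shape (N, N)

scale = 1.0 / np.sqrt(D)

# Reference: explicit element-wise bias addition to the attention scores.
out_ref = softmax(scale * (Q @ K.T) + B) @ V

# Same scores from one dot product over channel-wise concatenated tokens.
Q_cat = np.concatenate([scale * Q, Uq], axis=1)  # (N, D + r)
K_cat = np.concatenate([K, Uk], axis=1)          # (N, D + r)
out_cat = softmax(Q_cat @ K_cat.T) @ V

max_abs_diff = np.abs(out_ref - out_cat).max()
```

Because the concatenated form never materializes the $N\times N$ bias matrix, it fits the interface of a fused attention kernel that accepts only queries, keys, and values. If such a kernel applies its own softmax scaling, the concatenated factors would need to be rescaled to compensate.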

Paper Structure

This paper contains 28 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Overview of our contributions. We enable FlashAttention for Super-Resolution (SR) Transformers by replacing the commonly used relative positional bias (RPB) with our proposed Rank-factorized Implicit Neural Bias (RIB). The resulting efficiency allows us to scale up the SR Transformer significantly: the self-attention window size and training patch size grow up to $96\times96$, and the training dataset grows from DF2K to DFLIP. Consequently, our network achieves remarkable performance while drastically reducing training/inference costs.
  • Figure 2: Preliminaries and motivation. (a) Overview of the general SR architecture. (b) Comparison of the attention probability matrix $\mathbf{P}=\mathrm{SoftMax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{D})$ with (right) and without (left) RoPE (RoPEViT). For intuition, $\mathbf{P}$ is computed in a toy setting where $\mathbf{Q}=\mathbf{K}$ is obtained by randomly sampling two orthogonal 32-dimensional vectors and spatially tiling them to form a $32\times32$ feature map. Typical SR Transformers (SwinIR, SRFormer, HAT) set $D$ to 30. With RoPE, similarity between repeated patterns becomes unstable and often attenuates as the spatial offset increases, or vice versa, due to phase-wrapping effects, as highlighted by the red arrow.
  • Figure 3: Overall illustration of the proposed Rank-factorized Implicit Neural Bias (RIB).
  • Figure 4: Visualization of the positional score matrix ($\mathbf{S}_{\mathrm{p}}$) and the relative positional bias table ($\mathbf{R}_{\Delta}$) calculated from the second block of SST. We average the biases that correspond to the same 2D distance, since the positional bias predicted by our RIB is not guaranteed to be identical across equal relative offsets.
  • Figure 5: Comparisons of (a) visual results and (b) local attribution maps (LAM). Since padding/cropping for windowing inside the networks hinders accurate comparison, we separately compare SST-L ($512\times512$ patch) and SST-L+ ($768\times768$ patch).
  • ...and 2 more figures
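
The toy setting described in the Figure 2 caption can be reproduced for the case without RoPE. The sketch below is a minimal illustration under assumptions the caption does not state: the two vectors are normalized to unit length, and they are tiled in a checkerboard layout. It computes $\mathbf{P}=\mathrm{SoftMax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{D})$ with $\mathbf{Q}=\mathbf{K}$ and checks that, without any positional encoding, every repetition of the same pattern receives the same attention probability regardless of spatial offset.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W = 32, 32, 32  # vector dimension and feature-map size from the caption

# Two random orthonormal vectors via QR (unit norm is an assumption).
basis, _ = np.linalg.qr(rng.standard_normal((D, 2)))
a, b = basis[:, 0], basis[:, 1]

# Assumed checkerboard tiling of the two vectors over the H x W grid.
yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
pattern = (yy + xx) % 2                         # (H, W), values in {0, 1}
feat = np.where(pattern[..., None] == 0, a, b)  # (H, W, D)

# Flatten to tokens and compute P = SoftMax(Q K^T / sqrt(D)) with Q = K.
X = feat.reshape(H * W, D)
logits = X @ X.T / np.sqrt(D)
logits = logits - logits.max(axis=1, keepdims=True)
P = np.exp(logits)
P = P / P.sum(axis=1, keepdims=True)

# Attention of the first token to same-pattern vs. other-pattern tokens.
labels = pattern.reshape(-1)
same = P[0, labels == labels[0]]
other = P[0, labels != labels[0]]
```

Here `same` is constant across all offsets and larger than every entry of `other`, which corresponds to the offset-independent similarity the caption contrasts with the RoPE case.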