LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu

TL;DR

This work introduces LinearSR, which the authors present as the first framework to robustly apply linear attention to high-fidelity image super-resolution. It combines a Diffusion Transformer backbone with a stable two-stage fine-tuning strategy (ESGF), an SNR-driven four-expert MoE, and TAG guidance built on a "precision-over-volume" principle, achieving state-of-the-art perceptual quality at linear time complexity. The approach delivers a 1-NFE forward time of $0.036$ seconds for 1024×1024 outputs and a competitive total inference time of $0.830$ seconds, outperforming many quadratic-attention baselines on perceptual metrics while maintaining fidelity. By providing a reproducible, efficient baseline, LinearSR enables further speedups via distillation and offers a practical paradigm for scalable, photorealistic diffusion-based SR.

Abstract

Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity, $O(N^2)$, creates a major computational bottleneck. Linear Attention offers an $O(N)$ solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. The resulting LinearSR model delivers state-of-the-art perceptual quality together with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
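To make the complexity claim concrete, the following is a minimal NumPy sketch of generic kernelized linear attention, not the LinearSR implementation. It assumes a ReLU-style feature map (the `feature_map` helper and the small `1e-6` offset are illustrative choices, not taken from the paper). The idea is that reordering the matrix products avoids ever building the N×N attention matrix, so the cost grows linearly in the number of tokens N.

```python
import numpy as np


def feature_map(x):
    # Assumed non-negative feature map (ReLU plus a small offset so that
    # the normalizer below is never zero). Hypothetical choice for this sketch.
    return np.maximum(x, 0.0) + 1e-6


def linear_attention(q, k, v):
    """O(N) form: compute K^T V once, then apply it to every query."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v              # (d, d_v), cost O(N * d * d_v)
    z = k.sum(axis=0)         # (d,), shared normalizer term
    return (q @ kv) / (q @ z)[:, None]


def quadratic_reference(q, k, v):
    """Same computation in O(N^2) form, materializing the N x N matrix."""
    q, k = feature_map(q), feature_map(k)
    a = q @ k.T               # (N, N) attention matrix
    return (a / a.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
n_tokens, dim = 256, 32
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))

out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of multiplication differs, which is what turns the quadratic cost into a linear one.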

Paper Structure

This paper contains 36 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: LinearSR enables high-fidelity super-resolution at a linear computational cost. Left: LinearSR produces high-fidelity visual results, restoring fine details and textures. Right: The plots highlight the dramatic efficiency advantage of our Linear Attention. As input size grows, its cost in time and GFLOPs scales linearly, versus the quadratic growth of vanilla attention.
  • Figure 2: The Integrated LinearSR Framework. This figure illustrates how our contributions synergize: the tag-guided Mixture of Experts (MoE) architecture (a), built upon an efficient linear attention backbone (b), is made stable and effective by our Early-Stopping Guided Fine-tuning (ESGF) strategy (c), which initiates fine-tuning at the critical "knee point" to maximize performance.
  • Figure 3: Justification for ESGF through Instability Analysis. (a) Representative feature maps from the same linear attention layer reveal a stark structural degradation from the knee-point to a later unstable peak. (b) The training dynamics confirm this phenomenon is universal, with PSNR and LPIPS metrics exhibiting the characteristic "Plateau and Oscillation Phase" post-knee-point.
  • Figure 4: Hierarchical log-SNR bisection defines operational boundaries for 4-expert MoE.
  • Figure 5: Qualitative comparison with state-of-the-art methods. Our LinearSR consistently restores intricate textures and realistic details, outperforming competing methods across diverse real-world degradations. This is particularly evident in its ability to reconstruct the flower's delicate stamens and petal textures, as well as the axolotl's complex skin pattern and sharp eye.
  • ...and 4 more figures
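The caption of Figure 4 describes the expert boundaries only as a "hierarchical log-SNR bisection". One plausible reading can be sketched as follows, under loudly stated assumptions: the noise schedule `log_snr` (a rectified-flow style interpolation), the clipping range `t_min`/`t_max`, and the choice of splitting each interval at its log-SNR midpoint are all hypothetical and are not taken from the paper. The sketch splits the log-SNR range in half, then splits each half again, which yields three boundaries and therefore four expert intervals.

```python
import math
from bisect import bisect_right


def log_snr(t):
    # Assumed schedule: x_t = (1 - t) * x_0 + t * noise, so SNR = ((1 - t) / t)^2.
    return 2.0 * math.log((1.0 - t) / t)


def t_at_log_snr(target, lo=1e-4, hi=1.0 - 1e-4, iters=60):
    # Numerically invert log_snr, which is strictly decreasing in t.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_snr(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expert_boundaries(t_min=1e-3, t_max=1.0 - 1e-3, depth=2):
    # Start from the two ends of the log-SNR range and insert the midpoint
    # of every interval `depth` times (depth=2 gives 4 intervals).
    edges = [log_snr(t_min), log_snr(t_max)]
    for _ in range(depth):
        refined = []
        for a, b in zip(edges, edges[1:]):
            refined += [a, 0.5 * (a + b)]
        refined.append(edges[-1])
        edges = refined
    # Interior edges, mapped back to timesteps (ascending in t).
    return [t_at_log_snr(s) for s in edges[1:-1]]


def route(t, boundaries):
    # Index of the expert responsible for timestep t.
    return bisect_right(boundaries, t)


boundaries = expert_boundaries()
```

With this symmetric schedule the middle boundary falls at t = 0.5, and `route(t, boundaries)` returns an index from 0 to 3. The actual boundaries used by LinearSR depend on its own schedule and are given in the paper, not here.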