A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift

Sanath Budakegowdanadoddi Nagaraju, Brian Bernhard Moser, Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Andreas Dengel

TL;DR

The paper tackles high-fidelity image super-resolution with transformer architectures that suffer from quadratic attention costs and coarse patch embeddings. It introduces TaylorIR, combining pixel-wise 1×1 patch embeddings with TaylorShift, a Taylor-series-based attention that approximates full token interactions with near-linear complexity. Integrating TaylorIR into SwinIR yields TaylorSwinIR, enabling large window sizes (e.g., 48×48) with substantially reduced VRAM usage and improved PSNR/SSIM on standard benchmarks. Across DIV2K and five SR datasets, TaylorSwinIR consistently outperforms prior SR transformers, delivering better detail preservation and efficiency, and it remains a plug-and-play component for existing architectures.
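The scaling pressure behind this design can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names `tokens_per_window` and `attention_matrix_entries` are hypothetical, and the 8×8 window is included purely as a smaller comparison point. It counts the tokens produced by 1×1 patch embeddings inside a square window and the number of entries a full pairwise attention matrix would need.

```python
def tokens_per_window(window_px: int, patch_px: int = 1) -> int:
    """Tokens in a square window of window_px pixels per side,
    split into square patches of patch_px pixels per side."""
    if window_px % patch_px != 0:
        raise ValueError("window size must be a multiple of the patch size")
    side = window_px // patch_px
    return side * side


def attention_matrix_entries(num_tokens: int) -> int:
    """Entries in a full token-by-token attention matrix (quadratic in tokens)."""
    return num_tokens * num_tokens


# With 1x1 patches every pixel is a token, so a 48x48 window holds 2304 tokens.
for window in (8, 48):
    n = tokens_per_window(window)
    print(f"{window}x{window} window: {n} tokens, "
          f"{attention_matrix_entries(n)} attention entries")
# 8x8 window: 64 tokens, 4096 attention entries
# 48x48 window: 2304 tokens, 5308416 attention entries
```

Going from an 8×8 to a 48×48 window multiplies the token count by 36 and the size of an explicit attention matrix by 1296 per head and window, which is why an attention variant that avoids materializing that matrix matters here.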

Abstract

Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
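To illustrate how a Taylor-series-based attention can keep full token interactions at a cost linear in the number of tokens, here is a minimal NumPy sketch. It assumes a generic second-order Taylor approximation of the exponential, 1 + s + s²/2, in place of softmax's exp(s); the function names are hypothetical, and the sketch omits the scaling, normalization, and multi-head details of the actual TaylorShift implementation. The key identity is (q·k)² = ⟨q⊗q, k⊗k⟩, which lets the sums over keys be computed once and reused for every query.

```python
import numpy as np


def taylor_attention_direct(q, k, v):
    """Reference version: builds the full N x N weight matrix (quadratic in N)."""
    s = q @ k.T                        # (N, N) dot-product scores
    w = 1.0 + s + 0.5 * s**2           # 2nd-order Taylor expansion of exp(s); always >= 0.5
    return (w @ v) / w.sum(axis=1, keepdims=True)


def taylor_attention_linear(q, k, v):
    """Same output without the N x N matrix; cost grows linearly with N."""
    n, d = q.shape
    # Second-order feature maps: row i holds the flattened outer product of row i with itself.
    q2 = np.einsum("ni,nj->nij", q, q).reshape(n, d * d)
    k2 = np.einsum("ni,nj->nij", k, k).reshape(n, d * d)
    # Appending a column of ones makes the last output column accumulate the normalizer.
    v1 = np.concatenate([v, np.ones((n, 1))], axis=1)
    out = (
        v1.sum(axis=0, keepdims=True)  # constant term: sum_j v_j
        + q @ (k.T @ v1)               # linear term:   sum_j (q.k_j) v_j
        + 0.5 * q2 @ (k2.T @ v1)       # squared term:  sum_j (q.k_j)^2 v_j / 2
    )
    return out[:, :-1] / out[:, -1:]


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 64, 8))  # 64 tokens, 8 channels each
print(np.allclose(taylor_attention_direct(q, k, v),
                  taylor_attention_linear(q, k, v)))
```

The linear form trades the N×N matrix for d²-sized intermediate features, so it pays off when the token count N is large relative to the per-head channel dimension d, as with pixel-wise tokens in large windows.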

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of TaylorIR’s impact on image SR. Using 1×1 patch embeddings, TaylorIR models the image at pixel-level resolution. TaylorShift (Nauen et al., 2024) replaces standard self-attention with a Taylor-series-based alternative that maintains full token interaction while reducing memory load.
  • Figure 2: TaylorIR has two pieces: (left) pixel-wise ($1{\times}1$) patch embedding and (right) TaylorShift attention in place of windowed softmax attention. Together, they enable long-range context with lower memory and stable runtime.
  • Figure 3: Visual comparison on $2{\times}$ SR. Red boxes in the HR images mark the regions shown for comparison. On (left) Urban100 image img_044, TaylorSwinIR achieves 44.87 dB / 0.9921 versus SwinIR’s 44.13 dB / 0.9905. On (right) Urban100 image img_009, TaylorSwinIR achieves 42.65 dB / 0.9834 versus SwinIR’s 42.20 dB / 0.9834.
  • Figure 4: Visual comparison on $3{\times}$ SR. Yellow and red boxes in the HR images indicate the regions shown for comparison. On (left) Manga109 image AosugiruHaru, TaylorSwinIR achieves 44.30 dB / 0.9898 versus SwinIR’s 44.15 dB / 0.9896. On (right) image HaruichibanNoFukukoro, TaylorSwinIR attains 42.79 dB / 0.9872 compared to SwinIR’s 42.47 dB / 0.9870.
  • Figure 5: Visual comparison on $4{\times}$ SR. Red boxes in the HR images indicate the cropped regions for comparison. On (left) Urban100 image img_090, TaylorSwinIR achieves 40.54 dB / 0.9836 versus SwinIR’s 40.34 dB / 0.9832. On (right) image img_081, TaylorSwinIR reaches 39.81 dB / 0.9807 compared to SwinIR’s 39.80 dB / 0.9802.
  • ...and 2 more figures