A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Sanath Budakegowdanadoddi Nagaraju, Brian Bernhard Moser, Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Andreas Dengel
TL;DR
The paper targets high-fidelity image super-resolution (SR) with transformers, whose quadratic attention cost and coarse patch embeddings limit scalability and pixel-level fidelity. It introduces TaylorIR, which combines pixel-wise 1×1 patch embeddings with TaylorShift, a Taylor-series-based attention mechanism that retains full token interactions at near-linear complexity. Integrating TaylorIR into SwinIR yields TaylorSwinIR, which supports large window sizes (e.g., 48×48) with substantially lower VRAM usage and improved PSNR/SSIM on standard benchmarks. Across DIV2K and five SR datasets, TaylorSwinIR consistently outperforms prior SR transformers in both detail preservation and efficiency, and TaylorIR remains a plug-and-play component for existing architectures.
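To make the cost of pixel-wise embeddings concrete, the following is a minimal NumPy sketch of how the token count changes with patch size. The helper `patch_tokens` and the 4×4 comparison patch are hypothetical choices for illustration, not taken from the paper; only the 48×48 window size comes from the text above.

```python
import numpy as np

def patch_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tokens.

    Hypothetical helper for illustration; returns an array of shape
    (num_tokens, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    # Separate the patch grid from the pixels inside each patch.
    tokens = image.reshape(h // patch, patch, w // patch, patch, c)
    tokens = tokens.transpose(0, 2, 1, 3, 4)
    return tokens.reshape(-1, patch * patch * c)

# A 48x48 RGB window, matching the window size mentioned in the text.
image = np.arange(48 * 48 * 3, dtype=np.float32).reshape(48, 48, 3)

coarse = patch_tokens(image, 4)  # assumed 4x4 patches: 144 tokens of dim 48
fine = patch_tokens(image, 1)    # 1x1 patches: 2304 tokens of dim 3

# Standard self-attention stores one score per token pair.
coarse_scores = coarse.shape[0] ** 2
fine_scores = fine.shape[0] ** 2
print(coarse.shape, fine.shape, fine_scores // coarse_scores)
```

With 1×1 patches every pixel is its own token, so a quadratic attention matrix over the window grows by a factor of 256 relative to the assumed 4×4 patches. This is the growth that motivates pairing pixel-wise embeddings with a near-linear attention mechanism.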
Abstract
Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
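The idea of a Taylor-series-based attention with near-linear complexity can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a second-order Taylor expansion of the exponential, exp(s) ≈ 1 + s + s²/2, in place of softmax, and omits any scaling, normalization, or temperature terms the actual TaylorShift method may use. The function names are hypothetical. The key observation is that (q·k)² = (q⊗q)·(k⊗k), so the sums over keys can be computed once and reused for every query.

```python
import numpy as np

def taylor_attention_direct(q, k, v):
    """Quadratic-cost reference: builds the full N x N weight matrix."""
    s = q @ k.T                               # (N, N) scores
    w = 1.0 + s + 0.5 * s**2                  # 2nd-order Taylor of exp(s), always > 0
    return (w @ v) / w.sum(axis=1, keepdims=True)

def taylor_attention_linear(q, k, v):
    """Same result without forming the N x N matrix (linear in N)."""
    n, d = q.shape
    # Append a column of ones so the normalizer is computed alongside the output.
    v1 = np.concatenate([v, np.ones((n, 1))], axis=1)      # (N, dv + 1)
    # Zeroth-order term: sum_j v1_j, shared by all queries.
    t0 = v1.sum(axis=0, keepdims=True)                     # (1, dv + 1)
    # First-order term: q_i^T (sum_j k_j v1_j^T).
    t1 = q @ (k.T @ v1)                                    # (N, dv + 1)
    # Second-order term via flattened outer products q_i (x) q_i and k_j (x) k_j.
    qq = np.einsum("na,nb->nab", q, q).reshape(n, d * d)   # (N, d^2)
    kk = np.einsum("na,nb->nab", k, k).reshape(n, d * d)   # (N, d^2)
    t2 = 0.5 * (qq @ (kk.T @ v1))                          # (N, dv + 1)
    out = t0 + t1 + t2
    return out[:, :-1] / out[:, -1:]                       # divide by the normalizer

rng = np.random.default_rng(0)
n, d = 64, 4
q, k, v = (0.5 * rng.standard_normal((n, d)) for _ in range(3))

direct = taylor_attention_direct(q, k, v)
linear = taylor_attention_linear(q, k, v)
print(np.allclose(direct, linear))
```

Under these assumptions the direct form costs O(N²d) time and O(N²) memory, while the factored form costs O(N d² d_v) and never materializes the N×N matrix. The trade-off favors the factored form when the number of tokens N is much larger than d², as is the case for pixel-wise tokens in large windows.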
