Plug-and-play linear attention with provable guarantees for training-free image restoration

Srinivasan Kidambi, Karthik Palaniappan, Pravin Nair

TL;DR

This work tackles the quadratic complexity of multi-head self-attention in vision transformers for image restoration by introducing PnP-Nystra, a training-free Nyström-based linear-attention module that plugs into pretrained window-based models. It recasts attention as a kernel operation with an exponential kernel and applies a generalized Nyström approximation using a small set of landmarks, achieving linear-time and linear-memory complexity with provable error guarantees. The method demonstrates strong, task-agnostic performance across dehazing, denoising, deblurring, and super-resolution, delivering $1.8$–$3.6\times$ GPU and $1.8$–$7\times$ CPU speedups with minimal quality degradation relative to the original pretrained models, and it outperforms other training-free linear-attention baselines. The approach is poised to enable real-time and resource-constrained deployment of high-performing restoration transformers and suggests future extensions to global attention, diffusion architectures, and video restoration.

Abstract

Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers $1.8$--$3.6\times$ speedups on an NVIDIA RTX 4090 GPU and $1.8$--$7\times$ speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model's outputs.
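
To make the kernel view concrete, the following is a minimal NumPy sketch of a Nyström-style approximation of softmax attention for a single head. It is not the authors' implementation: the function names, the segment-mean landmark rule, and the use of `np.linalg.pinv` are assumptions made for illustration, and the summary above does not specify how PnP-Nystra selects landmarks or inverts the landmark block. The sketch writes attention as a row-normalized exponential kernel and replaces the $N \times N$ kernel matrix with a product of $N \times m$, $m \times m$, and $m \times N$ factors built from $m$ landmarks.

```python
import numpy as np

def exact_attention(Q, K, V):
    """Softmax attention written as a row-normalized exponential kernel."""
    d = Q.shape[-1]
    G = np.exp(Q @ K.T / np.sqrt(d))                 # N x N kernel matrix
    return (G @ V) / G.sum(axis=1, keepdims=True)

def segment_means(X, m):
    """Hypothetical landmark rule: average m contiguous groups of tokens."""
    N, d = X.shape
    assert N % m == 0, "this sketch assumes m divides N"
    return X.reshape(m, N // m, d).mean(axis=1)

def nystrom_attention(Q, K, V, m):
    """Approximate attention with m landmarks; no N x N matrix is formed."""
    d = Q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    Q_l, K_l = segment_means(Q, m), segment_means(K, m)
    F = np.exp(Q @ K_l.T * scale)                    # N x m
    A = np.exp(Q_l @ K_l.T * scale)                  # m x m landmark block
    B = np.exp(Q_l @ K.T * scale)                    # m x N
    A_pinv = np.linalg.pinv(A)
    # Multiply right to left so every intermediate has at most N x m entries.
    num = F @ (A_pinv @ (B @ V))                     # approximates G @ V
    den = F @ (A_pinv @ B.sum(axis=1, keepdims=True))  # approximates row sums of G
    return num / den

# Illustrative sizes only; they are not taken from the paper.
rng = np.random.default_rng(0)
N, d, m = 64, 8, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
approx = nystrom_attention(Q, K, V, m)
exact = exact_attention(Q, K, V)
print("max abs deviation:", np.abs(approx - exact).max())
```

With a fixed number of landmarks $m$, every matrix product above costs time and memory linear in $N$. When every token is its own landmark ($m = N$), the three factors coincide with the full kernel matrix and the pseudoinverse identity $\mathbf{G}\mathbf{G}^{+}\mathbf{G} = \mathbf{G}$ recovers exact attention up to floating-point error. For $m < N$, the deviation depends on the data and on the landmark rule, so the printed value should be read as a diagnostic rather than a guarantee.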

Paper Structure

This paper contains 11 sections, 2 theorems, 15 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Lemma 2.1

Assume $\mathbf{G}_A$ is nonsingular. Then, where $\sigma_k(\cdot)$ denotes the $k$-th singular value, and the constant $C$ depends quadratically on $\sigma_1(\widetilde{\mathbf{G}})$.

Figures (4)

  • Figure 1: Variation of the top $50$ singular values of attention maps ($N = 32^2$) from SwinIR and Uformer, averaged over all heads and layers. The steep decay within the first $20$ singular values highlights the low-rank structure of the attention matrices.
  • Figure 2: Single-image dehazing. Top: hazy input and restored outputs. Bottom: ground truth and pixel-wise error maps (scaled to $[0, 255]$). PnP-Nystra matches the pretrained model closely while producing cleaner dehazing than other training-free linear attention baselines, as evidenced by the metrics (PSNR in dB, SSIM) and by the reconstruction of trees and houses in the zoomed area.
  • Figure 3: Inference time vs. token count $N$. Unlike MHSA, whose inference time grows quadratically with $N$, PnP-Nystra scales linearly.
  • Figure 4: Attention map visualization for Uformer-B: (a) original model and (b) PnP-Nystra. Both maps show the same salient structures, with strong responses along object boundaries.
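
The scaling contrast described for Figure 3 can be illustrated with a rough operation count. The sketch below is a back-of-the-envelope cost model, not a reproduction of the paper's timings: the function names are hypothetical, the head dimension `d = 32` and landmark count `m = 16` are illustrative values not taken from the source, the value dimension is assumed equal to `d`, and the $m^3$ term for the pseudoinverse is an order-of-magnitude assumption.

```python
def mhsa_cost(N, d):
    """Multiply-adds for exact attention in one head: Q K^T, then the product with V."""
    return 2 * N * N * d

def nystrom_cost(N, d, m):
    """Multiply-adds for a landmark factorization F A^+ B V with m landmarks."""
    kernels = 2 * N * m * d + m * m * d    # build F (N x m), B (m x N), A (m x m)
    pinv = m ** 3                          # assumed order of cost for the m x m pseudoinverse
    apply = 2 * N * m * d + m * m * d      # B V, then A^+ (B V), then F (...)
    return kernels + pinv + apply

# Doubling N quadruples the exact cost but only doubles the N-dependent Nystrom terms.
for N in (1024, 2048, 4096):
    print(N, mhsa_cost(N, 32), nystrom_cost(N, 32, 16))
```

Operation counts are only a proxy for wall-clock time, which also depends on memory traffic and kernel launch overheads, so measured speedups need not match the ratio of these counts.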

Theorems & Definitions (2)

  • Lemma 2.1: Spectral norm error bound
  • Corollary 2.2: Exact recovery