Reciprocal Attention Mixing Transformer for Lightweight Image Restoration

Haram Choi, Cheolwoong Na, Jihyeon Oh, Seungjae Lee, Jinseop Kim, Subeen Choe, Jeongmin Lee, Taehoon Kim, Jihoon Yang

TL;DR

RAMiT targets lightweight image restoration by computing bi-dimensional (spatial and channel) self-attention in parallel, so that local and global context are fused, with the two attentions helping each other reciprocally and a hierarchical attention mixer on top. The core components are the Dimensional Reciprocal Attention Mixing Transformer (D-RAMiT) blocks and the Hierarchical Reciprocal Attention Mixing (H-RAMi) layer, both made efficient with MobileNet-inspired convolution variants (MobiVari). Across five tasks (super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining), RAMiT achieves state-of-the-art results among lightweight models while requiring fewer parameters and computations. The reciprocal attention mechanism may also extend to broader low-level vision applications.

Abstract

Although many recent works have advanced the image restoration (IR) field, they often suffer from an excessive number of parameters. Another issue is that most Transformer-based IR methods focus on either local or global features alone, leading to limited receptive fields or deficient parameter issues. To address these problems, we propose a lightweight IR network, the Reciprocal Attention Mixing Transformer (RAMiT). It employs our proposed dimensional reciprocal attention mixing Transformer (D-RAMiT) blocks, which compute bi-dimensional (spatial and channel) self-attentions in parallel with different numbers of heads. The bi-dimensional attentions compensate for each other's drawbacks and are then mixed. Additionally, we introduce a hierarchical reciprocal attention mixing (H-RAMi) layer that compensates for pixel-level information losses and utilizes semantic information while maintaining an efficient hierarchical structure. Furthermore, we revisit and modify MobileNet V1 and V2 to attach efficient convolutions to our proposed components. Experimental results demonstrate that RAMiT achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining. Code is available at https://github.com/rami0205/RAMiT.
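The difference between the two attention dimensions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits learned query/key/value projections, the reciprocal helper, and MobiVari; the head counts, the scaling factors, and the plain concatenation used for mixing are all assumptions chosen for illustration. It only shows that spatial attention builds a token-by-token map inside a window, while channel attention builds a channel-by-channel map whose size does not depend on the number of tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x, heads):
    """x: (N, C) tokens of one window. Attention map per head is (N, N)."""
    n, c = x.shape
    d = c // heads
    # Assumption: q = k = v = x (no learned projections in this sketch).
    q = k = v = x.reshape(n, heads, d).transpose(1, 0, 2)   # (heads, N, d)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # (heads, N, N)
    out = attn @ v                                          # (heads, N, d)
    return out.transpose(1, 0, 2).reshape(n, c)

def channel_self_attention(x, heads):
    """x: (N, C). Attention map per head is (d, d) over channels."""
    n, c = x.shape
    d = c // heads
    q = k = v = x.reshape(n, heads, d).transpose(1, 2, 0)   # (heads, d, N)
    # Assumption: fixed 1/sqrt(N) scaling stands in for any learned temperature.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(n))   # (heads, d, d)
    out = attn @ v                                          # (heads, d, N)
    return out.transpose(2, 0, 1).reshape(n, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))          # 16 tokens, 8 channels (hypothetical sizes)
sp = spatial_self_attention(x, heads=2)   # hypothetical head count for the spatial branch
ch = channel_self_attention(x, heads=4)   # a different head count for the channel branch
mixed = np.concatenate([sp, ch], axis=1)  # stand-in for the mixing step
print(sp.shape, ch.shape, mixed.shape)
```

Both branches read the same input and can run independently, which is what makes the parallel arrangement possible; in the paper the two outputs are mixed by a MobileNet-style module rather than simply concatenated.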

Paper Structure

This paper contains 23 sections, 3 equations, 18 figures, and 13 tables.

Figures (18)

  • Figure 1: The importance of locality and global dependency in image restoration tasks. (Blue boxes) Local features are informative enough to recover most parts, meaning that the contribution of locally adjacent pixels is crucial. (Red boxes) Some areas seem more challenging due to high levels of distortion (blurring, noise, darkness, or obstruction). They require global dependency, which can often be detected in repeated patterns or textures distributed throughout the entire image.
  • Figure 2: Overall architecture of RAMiT. (a) The sizes indicate the output dimensions of each component. The operation $I_{LQ}+I_{res}$ is omitted for super-resolution tasks, where $I_{RC}$ is equal to $I_{res}\in\mathbb{R}^{3\times rH \times rW}$ ($r$: the upscale factor). (b) Different numbers of heads ($L_{sp}, L_{ch}$) are assigned to each self-attention (SA) module. By being multiplied with the value of its counterpart, each SA helps the other (white arrows, optional depending on the task). The bi-dimensional attentions are mixed by our MobileNet variant, MobiVari. (c) H-RAMi mixes the hierarchical attentions resulting from the last block of each stage. This module upsamples and concatenates the multi-scale attentions before MobiVari enhances and mixes them. (d) Our bottleneck adopts the SCDP bottleneck of NGswin (Choi et al., 2023).
  • Figure 3: (a) The depth of the red areas indicates the extent to which the regions contribute to recovering a red box of an input. D-RAMiT utilizes both local and global dependencies, meaningfully expanding the receptive field compared to the pure SPSA (see the Appendix). (b) Our bi-dimensional self-attention schemes help each other to further boost image restoration performance.
  • Figure 4: Impacts of H-RAMi. (a) A ground-truth high-quality image. (b), (c) The feature maps after stage $4$ and H-RAMi. (d) Element-wise product of (b) and (c) (recall Fig. 2a). (b), (c), (d) are obtained by max-pooling along the channel dimension and standardization. More examples are in the Appendix.
  • Figure 5: Visual comparisons of multiple lightweight image restoration tasks. LQ: Low-Quality input. HQ: High-Quality target. (1st row) Super-Resolution. (2nd row) Denoising. (3rd row) Low-Light Enhancement. (4th row) Deraining. More results are provided in the Appendix.
  • ...and 13 more figures
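
The Figure 2 caption notes that the attentions are mixed by a MobileNet variant. The exact MobiVari design is not described in this summary, so the following sketch only shows the general reason MobileNet V1-style convolutions are attractive for lightweight models: a depthwise separable convolution needs far fewer weights than a standard convolution of the same shape. The channel counts and kernel size are hypothetical, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    # One k x k kernel per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3     # hypothetical layer shape, not taken from the paper
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, round(std / dws, 2))  # 36864 4672 7.89
```

For this shape the separable form uses roughly one eighth of the weights, and the saving grows with the kernel size and the number of output channels.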