Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang

TL;DR

This work addresses redundancy in standard Transformer multi-head attention for image restoration with two components: HMHA, which ranks channels by similarity and then assigns heads to hierarchical subspaces of varying sizes, and QKCU, which modulates attention through intra- and inter-layer schemes. Combined as HINT (Hierarchical multI-head atteNtion driven Transformer), these components yield diverse learners and richer head interactions, improving restoration quality with manageable complexity. Across 12 benchmarks and 5 tasks (low-light enhancement, dehazing, desnowing, denoising, deraining), HINT delivers superior PSNR/SSIM and robust performance on real-world data while maintaining competitive efficiency. The results highlight the practical value of promoting diverse attention heads and cache-based head collaboration for high-quality image restoration with Transformer architectures.

Abstract

Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently on uniformly split subspaces, which triggers a redundancy issue that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes that contain different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.
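To make the idea of heads learning from subspaces of varying sizes more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It reorders channels by a similarity score and then cuts them into unequal head subspaces. The similarity criterion (cosine similarity to the mean channel), the head sizes `[1, 1, 2, 4]`, and the function names `rerank_channels` and `hierarchical_split` are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def rerank_channels(x):
    """Reorder channels by cosine similarity to the mean channel.

    x has shape (C, N): C channels, N flattened spatial positions.
    The similarity criterion is an assumption for this sketch.
    """
    ref = x.mean(axis=0)  # reference signal of shape (N,)
    eps = 1e-8            # avoids division by zero for all-zero channels
    sims = (x @ ref) / (np.linalg.norm(x, axis=1) * np.linalg.norm(ref) + eps)
    order = np.argsort(-sims, kind="stable")  # most similar channels first
    return x[order], order


def hierarchical_split(x, sizes):
    """Split channels into consecutive head subspaces of the given sizes."""
    if sum(sizes) != x.shape[0]:
        raise ValueError("head sizes must sum to the number of channels")
    bounds = np.cumsum(sizes)[:-1]  # split points between heads
    return np.split(x, bounds, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 16))  # C = 8 channels, N = 16 positions

    ranked, order = rerank_channels(features)

    # Vanilla MHA: 4 heads, each with a subspace of the same size C' = C / h = 2.
    uniform_heads = hierarchical_split(ranked, [2, 2, 2, 2])

    # Hierarchical variant: 4 heads with subspaces of different sizes (hypothetical).
    hier_heads = hierarchical_split(ranked, [1, 1, 2, 4])

    print("uniform head sizes:     ", [h.shape[0] for h in uniform_heads])
    print("hierarchical head sizes:", [h.shape[0] for h in hier_heads])
```

Each head would then run its own attention computation on its subspace. The only structural difference from the uniform split is that the subspaces differ in size and are formed after reordering, so different heads see different groupings of channels.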

Paper Structure

This paper contains 12 sections, 9 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison between the vanilla MHA [zamir2022restormer, iccv2021_swinIR, retinexformer] (left) and the proposed HMHA equipped with the QKCU module (right) on the low-light enhancement task. The standard MHA assigns $h$ heads to subspaces of the same size ($C'$), and each head performs attention calculation independently. As a result, these heads tend to focus on the same regions (red boxes) and neglect the restoration of some degraded areas (yellow boxes), leading to an unsatisfactory output with lost details and blur. In contrast, HMHA applies a reranking operation before a hierarchical subspace split, which encourages the model to learn diverse representative features. QKCU enhances interactions between heads through intra- and inter-layer schemes, modulating the features predicted in HMHA and leading to better outputs.
  • Figure 2: Illustration of the proposed Hierarchical multI-head atteNtion driven Transformer model (HINT). (a) Overall architecture of the proposed HINT. (b) Hierarchical Multi-Head Attention (HMHA) mechanism.
  • Figure 3: Query-Key Cache Updating mechanism.
  • Figure 4: Qualitative results on LOL-v2 [yang2021sparse] for low-light enhancement. The top case is from the synthetic subset, whereas the bottom one is from the real subset. Compared to other techniques, HINT generates vivid images without introducing noticeable color distortion.
  • Figure 5: Qualitative results on Snow100K [liu2018desnownet] for snow removal. HINT produces a clear result, while the images generated by the other approaches retain noticeable snow artifacts.
  • ...and 5 more figures