CFAT: Unleashing Triangular Windows for Image Super-resolution

Abhisek Ray, Gaurav Kumar, Maheshkumar H. Kolekar

TL;DR

This work introduces CFAT, a Composite Fusion Attention Transformer for image super-resolution that blends non-overlapping triangular and rectangular window self-attention with a channel-based global attention mechanism. The architecture deploys Dense Window Attention Blocks and Sparse Window Attention Blocks, augmented by Overlapping Cross Fusion Attention and Channel-Wise Attention, enabling richer local and global feature interactions while mitigating boundary distortions common to rectangular-window designs. Extensive ablations demonstrate the benefits of the triangular window, optimal hyperparameters, and the synergy between dense and sparse attentions, achieving up to 0.7 dB improvements over strong SOTA SR models. The results across DIV2K/Flickr2K training and five standard SR benchmarks show CFAT’s clear superiority in PSNR/SSIM, with competitive computational cost, suggesting practical impact for high-quality, efficient image upscaling in real-world pipelines.
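To put the quoted 0.7 dB figure in perspective, the sketch below converts a PSNR gain into the corresponding reduction in mean squared error, using the standard definition PSNR = 10 log10(MAX^2 / MSE). The function names and the baseline MSE of 100 are illustrative assumptions and do not come from the paper.

```python
import math


def psnr_from_mse(mse: float, max_value: float = 255.0) -> float:
    """PSNR in dB for a given mean squared error and peak pixel value."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.7 dB PSNR gain corresponds to the MSE dropping to about 85% of its
# previous value, regardless of the starting point.
ratio = mse_ratio_for_gain(0.7)

# Hypothetical baseline MSE, chosen only to check the relationship.
baseline_mse = 100.0
gain = psnr_from_mse(baseline_mse * ratio) - psnr_from_mse(baseline_mse)
print(f"MSE ratio: {ratio:.3f}, recovered gain: {gain:.2f} dB")
```

Because PSNR is logarithmic, the same 0.7 dB gain implies the same relative error reduction whatever the baseline quality.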

Abstract

Transformer-based models have revolutionized the field of image super-resolution (SR) by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted-window technique is now common practice in transformer-based super-resolution models, where it improves the quality and robustness of image upscaling. However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose a non-overlapping triangular window technique that works synchronously with the rectangular one to mitigate boundary-level distortion and give the model access to more unique shifting modes. In this paper, we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result, CFAT enables attention mechanisms to be activated on more image pixels and captures long-range, multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures.
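As a rough illustration of what a non-overlapping triangular window partition can look like, the sketch below splits a square window into four triangles along its two diagonals and gathers the pixels of each triangle as a token group. This is an assumption made for illustration only: the abstract does not specify the exact partition geometry, the tie-breaking for pixels on the diagonals, or any of the names used here (`triangular_masks`, `window`, `tokens_per_triangle`).

```python
import numpy as np


def triangular_masks(size: int) -> np.ndarray:
    """Split a size x size window into four non-overlapping triangles.

    Returns a boolean array of shape (4, size, size) ordered as
    (top, right, bottom, left). Pixels on the diagonals are assigned by an
    arbitrary tie-breaking rule so that every pixel lands in exactly one
    triangle.
    """
    rows, cols = np.indices((size, size))
    above_main = cols > rows              # strictly above the main diagonal
    above_anti = rows + cols < size - 1   # strictly above the anti-diagonal
    top = above_main & above_anti
    right = above_main & ~above_anti
    bottom = ~above_main & ~above_anti
    left = ~above_main & above_anti
    return np.stack([top, right, bottom, left])


# Hypothetical feature window: 32 x 32 pixels with 8 channels.
rng = np.random.default_rng(0)
window = rng.standard_normal((32, 32, 8))

masks = triangular_masks(32)
# Boolean indexing flattens each triangle into a (num_pixels, channels)
# token matrix, over which local self-attention could then be computed.
tokens_per_triangle = [window[m] for m in masks]
print([t.shape for t in tokens_per_triangle])
```

With this tie-breaking rule the four triangles do not contain exactly equal pixel counts. A real implementation would need a convention for that, for example padding or masking inside the attention computation.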

Paper Structure (38 sections, 11 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Proposed CFAT vs. other SOTA models. RW/TW: Rectangular/Triangular Window, MSA: Multi-Head Self-Attention, (D): Dense, (SD): Shifted Dense, (S): Sparse, (O): Overlapping.
  • Figure 2: The overall architecture of CFAT with all internal units.
  • Figure 3: A rectangular and a triangular patch in a $32 \times 32$ window.
  • Figure 4: Shifting modes of rectangular and triangular windows in a $64 \times 64$ image patch.
  • Figure 5: Visual Comparison of CFAT with other state-of-the-art methods.
  • ...and 4 more figures