Partial Large Kernel CNNs for Efficient Super-Resolution

Dongheon Lee, Seokju Yun, Youngmin Ro

TL;DR

This paper shows that convolutional networks can outperform Transformers in super-resolution when compared on direct efficiency metrics such as latency and GPU memory use, challenging the assumption that Transformers are always more efficient. It introduces Partial Large Kernel CNNs for Efficient SR (PLKSR), a CNN architecture that brings Transformer-like long-range handling to CNNs through three modules: Partial Large Kernel Convolution (PLKC), a Double Convolutional Channel Mixer (DCCM), and an Element-wise Attention (EA) module. PLKSR achieves state-of-the-art PSNR on four SR datasets at $\times$4 scale with substantial latency and memory reductions relative to strong Transformer baselines, and its tiny variant maintains high performance with minimal resource demands, including on mobile. The work demonstrates that carefully designed large-receptive-field CNNs can match or surpass MHSA-based SR models in both accuracy and practicality, potentially revitalizing CNNs for SR on edge devices.

Abstract

Recently, in the super-resolution (SR) domain, Transformers have outperformed CNNs with fewer FLOPs and fewer parameters, since they can handle long-range dependencies and adaptively adjust weights based on the instance. In this paper, we demonstrate that CNNs, although they receive less attention in the current SR domain, surpass Transformers in direct efficiency measures. By incorporating the advantages of Transformers into CNNs, we aim to achieve both computational efficiency and enhanced performance. However, using a large kernel in the SR domain, which mainly processes large images, incurs a large computational overhead. To overcome this, we propose novel approaches to employing the large kernel, which can reduce latency by 86\% compared to the naive large kernel, and leverage an Element-wise Attention module to imitate instance-dependent weights. As a result, we introduce Partial Large Kernel CNNs for Efficient Super-Resolution (PLKSR), which achieves state-of-the-art performance on four datasets at a scale of $\times$4, with reductions of 68.1\% in latency and 80.2\% in maximum GPU memory occupancy compared to SRFormer-light.
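
To give a rough sense of why restricting the large kernel to part of the channels helps, the sketch below counts multiply-accumulate operations (MACs) for a dense large-kernel convolution applied to all channels versus only a subset. The channel counts, kernel size, and feature-map size are illustrative assumptions and are not stated in the abstract. MAC counts are also only an indirect proxy: the 86\% figure above is a measured latency and should not be expected to match this arithmetic.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
    """MACs for a dense, stride-1, same-padded kernel x kernel convolution."""
    return height * width * in_ch * out_ch * kernel * kernel


# Illustrative assumptions (hypothetical values, not from the abstract).
HEIGHT, WIDTH = 360, 640   # feature-map size
TOTAL_CH = 64              # channels in the feature map
PARTIAL_CH = 16            # channels that receive the large kernel
KERNEL = 17                # large kernel size

# Naive approach: large kernel over every channel.
full = conv_macs(HEIGHT, WIDTH, TOTAL_CH, TOTAL_CH, KERNEL)

# Partial approach: large kernel over a channel subset only;
# the remaining channels are passed through unchanged (zero MACs).
partial = conv_macs(HEIGHT, WIDTH, PARTIAL_CH, PARTIAL_CH, KERNEL)

ratio = partial / full
print(f"full:    {full:,} MACs")
print(f"partial: {partial:,} MACs")
print(f"partial / full = {ratio:.4f}")
```

Because both input and output channel counts shrink, the operation count scales with the square of the channel fraction, which is why even a modest channel subset removes most of the arithmetic.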

Paper Structure (35 sections, 9 equations, 9 figures, 11 tables)

This paper contains 35 sections, 9 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of performance, latency, and Maximum GPU memory Occupancy (MGO) on the BSD100$\times$2 dataset. Our PLKSR performs the best when compared to the SOTA SR methods, with 68.1\% lower latency and 80.2\% lower MGO than SRFormer-light. All metrics are measured by restoring an HD (1280$\times$720) image using an RTX4090 GPU at FP16 precision.
  • Figure 2: Analysis of convolution operation. (a) demonstrates the change in latency of convolution operation as the channel and kernel size change, and (b) demonstrates the latency, MGO, and parameters of using a large kernel directly and using successive small kernels (w/ and w/o GELU) with the same receptive field as the large kernel when computing convolution on 16 channels. All metrics are measured by processing a feature map with a size of 640$\times$360 using RTX4090 GPU at FP16 precision.
  • Figure 3: Overview of our architecture. The PLK (Partial Large Kernel) Block, the main block of PLKSR, consists of three modules: DCCM (Double Conv Channel Mixer), PLKC (Partial Large Kernel Conv), and EA (Element-wise Attention).
  • Figure 4: $\Delta$Log amplitude of Fourier-transformed MHSA feature maps of SRFormer-light$\times$2 and large/small kernel feature maps of PLKSR$\times$2. We visualize diagonal values after the center of each Fourier-transformed feature map, following previous research [HowDoViTWorks].
  • Figure 5: Feature map activation visualization. The large/small kernel feature maps of the last PLK block from PLKSR$\times$2 are averaged channel-wise and normalized for visualization.
  • ...and 4 more figures
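
The Figure 2 caption compares a single large kernel against successive small kernels with the same receptive field. A minimal sketch of that relationship follows, assuming stride-1 convolutions without dilation, a 17$\times$17 large kernel, and 3$\times$3 small kernels (the kernel sizes are illustrative assumptions; only the 16-channel setting comes from the caption). Biases are ignored in the weight counts.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1, undilated kernel x kernel convs."""
    return layers * (kernel - 1) + 1


def layers_to_match(large_kernel, small_kernel=3):
    """Number of stacked small convs whose receptive field equals one large conv."""
    span, step = large_kernel - 1, small_kernel - 1
    if span % step != 0:
        raise ValueError("receptive fields cannot be matched exactly")
    return span // step


def dense_conv_weights(channels, kernel, layers=1):
    """Weight count (no bias) for `layers` dense convs with equal in/out channels."""
    return layers * channels * channels * kernel * kernel


LARGE_KERNEL = 17   # assumed for illustration
SMALL_KERNEL = 3    # assumed for illustration
CHANNELS = 16       # channel count mentioned in the Figure 2 caption

n_layers = layers_to_match(LARGE_KERNEL, SMALL_KERNEL)
rf = stacked_receptive_field(SMALL_KERNEL, n_layers)
large_weights = dense_conv_weights(CHANNELS, LARGE_KERNEL)
small_weights = dense_conv_weights(CHANNELS, SMALL_KERNEL, n_layers)

print(f"{n_layers} stacked {SMALL_KERNEL}x{SMALL_KERNEL} convs -> receptive field {rf}")
print(f"weights: large kernel {large_weights:,}, stacked small kernels {small_weights:,}")
```

Under these assumptions the stacked small kernels use fewer weights, but they run as several sequential layers, so weight count alone does not predict the latency and memory measurements the figure reports.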