CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution

Jikai Wang, Huan Zheng, Jianbing Shen

TL;DR

CubeFormer addresses limited feature diversity in lightweight image super-resolution by introducing cube attention, which extends 2D attention into 3D cubes to enable exhaustive spatial–channel interactions. It employs cascaded Cube Transformer Groups with intra- and inter-cube transformer blocks to model local and global information, and offers CubeFormer-lite for efficiency via channel splitting. The model is trained with a joint spatial and frequency reconstruction loss to preserve texture and high-frequency details. Across standard benchmarks, CubeFormer achieves state-of-the-art performance for lightweight SR and demonstrates improved texture detail over prior methods, with CubeFormer-lite offering strong efficiency gains.
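As a rough illustration of what a joint spatial and frequency reconstruction loss can look like, the sketch below adds an L1 term on pixels to a weighted L1 term on 2D FFT coefficients. The summary does not give the exact formulation, so the choice of L1 in both domains, the function name `joint_reconstruction_loss`, and the `freq_weight` value are all assumptions made for illustration.

```python
import numpy as np

def joint_reconstruction_loss(sr, hr, freq_weight=0.05):
    """Pixel-space L1 plus weighted L1 on 2D FFT coefficients.

    sr, hr: arrays of shape (C, H, W) holding the super-resolved and
    ground-truth images. freq_weight is a hypothetical value, not taken
    from the paper.
    """
    # Spatial term: mean absolute pixel error.
    spatial = np.mean(np.abs(sr - hr))
    # Frequency term: FFT over the two spatial axes; the magnitude of the
    # complex difference penalizes amplitude and phase errors together.
    sr_f = np.fft.fft2(sr, axes=(-2, -1))
    hr_f = np.fft.fft2(hr, axes=(-2, -1))
    frequency = np.mean(np.abs(sr_f - hr_f))
    return spatial + freq_weight * frequency
```

Under these assumptions, the frequency term weights errors by their spectral content rather than pixel by pixel, which is the usual motivation for pairing it with a pixel loss when high-frequency texture matters.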

Abstract

Lightweight image super-resolution (SR) methods aim to increase the resolution and restore the details of an image using a lightweight neural network. However, current lightweight SR methods still suffer from inferior performance and visually unpleasant details. Our analysis reveals that these methods are hindered by constrained feature diversity, which adversely impacts feature representation and detail recovery. To address this issue, we propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness through holistic information aggregation. Specifically, we introduce cube attention, which expands 2D attention to 3D space, facilitating exhaustive information interactions, encouraging comprehensive information extraction, and promoting feature variety. In addition, we use block and grid sampling strategies to construct intra-cube transformer blocks (Intra-CTB) and inter-cube transformer blocks (Inter-CTB), which perform local and global modeling, respectively. Extensive experiments show that CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks. Our source code and models will be publicly available.
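The block and grid sampling strategies can be pictured with a small NumPy sketch. It assumes a single feature map of shape (C, H, W) whose dimensions are divisible by the cube size, and it shows only how elements are grouped, not the attention computation itself. The function names and the cube size are hypothetical; the paper's actual partitioning may differ in detail.

```python
import numpy as np

def block_partition(x, cube):
    """Group x of shape (C, H, W) into contiguous 3D cubes (local neighborhoods).

    Assumes C, H, W are divisible by the cube dimensions (c, h, w).
    Returns an array of shape (num_cubes, c, h, w).
    """
    C, H, W = x.shape
    c, h, w = cube
    x = x.reshape(C // c, c, H // h, h, W // w, w)
    # Bring the three cube-index axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, c, h, w)

def grid_partition(x, cube):
    """Group x of shape (C, H, W) into strided 3D cubes (elements spread globally).

    Each group gathers elements sampled at strides (C // c, H // h, W // w),
    so one group spans the whole feature volume.
    """
    C, H, W = x.shape
    c, h, w = cube
    x = x.reshape(c, C // c, h, H // h, w, W // w)
    # The stride-offset axes index the group; the remaining axes form the cube.
    x = x.transpose(1, 3, 5, 0, 2, 4)
    return x.reshape(-1, c, h, w)
```

For a 4×4×4 volume and a 2×2×2 cube, the first block cube equals `x[0:2, 0:2, 0:2]` while the first grid cube equals `x[0::2, 0::2, 0::2]`. Attention restricted to the former mixes neighboring positions and channels; attention restricted to the latter mixes distant ones.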

Paper Structure

This paper contains 12 sections, 14 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An overview of SwinIR [liang2021swinir], SwinIR-CA [zamir2022restormer], Omni-SR [wang2023omni], and the proposed CubeFormer. We visualize the feature maps extracted by the backbone network and restored HR images of these methods. Notably, SwinIR, SwinIR-CA, and Omni-SR capture fewer low-level details in their feature maps, leading to diminished texture clarity in the resulting HR images. In contrast, CubeFormer exhibits an enhanced capacity for learning informative representations, effectively enriching fine details in features and delivering HR images of superior quality.
  • Figure 2: Overall architecture of CubeFormer. We utilize a convolutional layer to extract shallow features from the LR image and a series of Cube Transformer Groups (CTGs) for deep feature extraction. As shown in the bottom-left sub-figure, CTG incorporates the proposed Intra-CTB and Inter-CTB, channel shuffle, MBConv block, and enhanced spatial attention (ESA) block. In detail, Intra/Inter-CTB inherits the structure of the vanilla ViT and replaces self-attention with intra/inter cube attention as the bottom-right sub-figure illustrates.
  • Figure 3: Illustration of intra-cube and inter-cube attention. The inputs are fed into convolutional layers to generate the query, key, and value. These components are then divided into 3D cubes and 3D cube grids by the block and grid strategies, respectively, and the resulting partitions are used to compute the affinity matrix.
  • Figure 4: Illustration of the lite version of Cube Transformer Group (CTG-lite). In CTG-lite, an additional channel splitting operation is introduced following the channel shuffle, partitioning the features into two groups with half the original channel count.
  • Figure 5: PSNR vs. the number of model parameters of different lightweight SR methods on the Urban100 dataset at ×4 scale. Both the proposed CubeFormer and CubeFormer-lite achieve the best performance at their respective parameter scales.
  • ...and 1 more figures
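The channel shuffle followed by channel splitting described for CTG-lite in Figure 4 can be sketched as follows. This is a minimal illustration that assumes a ShuffleNet-style shuffle with two groups and a single feature map of shape (C, H, W) with even C. The function names and the group count are hypothetical, and how each half is processed afterwards is not covered here.

```python
import numpy as np

def channel_shuffle(x, groups=2):
    """Interleave channels across groups (ShuffleNet-style).

    x: array of shape (C, H, W) with C divisible by `groups`.
    The group count is an assumption, not taken from the paper.
    """
    C, H, W = x.shape
    x = x.reshape(groups, C // groups, H, W)
    # Swap the group axis and the per-group channel axis, then flatten back.
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(C, H, W)

def shuffle_and_split(x, groups=2):
    """Shuffle channels, then split into two halves of C // 2 channels each."""
    shuffled = channel_shuffle(x, groups)
    first, second = np.split(shuffled, 2, axis=0)
    return first, second
```

With four channels labeled 0 to 3, the shuffle reorders them to 0, 2, 1, 3, so the two halves hold channels (0, 2) and (1, 3). Each half therefore draws from both of the original channel groups, at half the original channel count.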