Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding

Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei

TL;DR

This work tackles burst image super-resolution by introducing a Multi-Cross Attention (MCA) encoder that fuses overlapping cross-window and cross-frame cues to better capture sub-pixel information across $N$ frames. It pairs MCA with a Multi-Scan State-Space Module (MS-SSM) in the decoder, using DCN-assisted alignment and a cross-frame attention mechanism to robustly fuse multi-frame features with linear complexity. Extensive experiments on synthetic and real datasets, plus ISO 12233 resolution tests, show state-of-the-art or competitive performance in PSNR/SSIM/LPIPS and superior fine-detail/texture reconstruction with fewer artifacts. The approach demonstrates strong practical potential for burst photography and real-world imaging where alignment and high-frequency detail are critical.

Abstract

Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent methods have increasingly adopted Transformers over convolutional neural networks (CNNs) in super-resolution tasks, due to their superior ability to capture both local and global context. However, most existing approaches still rely on fixed and narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.
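The claim that spatially shifted frames carry complementary sub-pixel information can be illustrated with a toy 1-D example. This is only an idealized sketch, not the method proposed in the paper: it assumes a noise-free signal, a downsampling factor of 2, and a known shift of exactly one high-resolution pixel between the two frames. The names `hr`, `frame_a`, `frame_b`, and `recon` are illustrative.

```python
import numpy as np

# Toy 1-D "high-resolution" signal (assumed, for illustration only).
hr = np.arange(8, dtype=float) ** 2
scale = 2

# Two low-resolution frames: the second is shifted by one HR pixel,
# i.e. half a pixel on the low-resolution grid.
frame_a = hr[0::scale]
frame_b = hr[1::scale]

# With known shifts, interleaving the frames recovers every HR sample.
recon = np.empty_like(hr)
recon[0::scale] = frame_a
recon[1::scale] = frame_b

# A single frame can only be upsampled by guessing the missing samples,
# here with nearest-neighbour repetition.
single = np.repeat(frame_a, scale)

print(np.array_equal(recon, hr))   # multi-frame reconstruction
print(np.array_equal(single, hr))  # single-frame upsampling
```

In real bursts the shifts are unknown and the frames are noisy, which is why alignment and learned feature aggregation matter.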

Paper Structure

This paper contains 16 sections, 12 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Compared with the current SOTA method without DCN, our method shows better reconstruction results for real scenes.
  • Figure 2: An overview of the proposed method. The network takes a burst of degraded RAW images as input and outputs a clean, high-quality sRGB image. First, all RAW inputs are upsampled into an 'RGGB' format using PixelShuffle and then expanded to 3 channels using a $3\times3$ convolution. These are subsequently fed into the optical flow estimation module to compute multi-scale flows between each frame and the reference. Meanwhile, features are extracted directly from the RAW inputs: each RAW image is first projected into the feature space using a $3\times3$ convolution and then fed into the designed set of $N$ encoders. The resulting features are pixel-shuffled and aligned using the estimated optical flows. The aligned features are then processed by $N$ decoders for feature fusion, followed by residual upsampling to reconstruct the high-resolution image.
  • Figure 3: Details of the proposed Encoder. Parallel Cross-Window Attention (CWA) and Cross-Frame Attention (CFA) are integrated to form a Multi-Cross Attention mechanism, with their detailed structures illustrated in panels (b) and (c), respectively.
  • Figure 4: Details of the proposed Decoder (a). The core pipeline is shown in (b). The designed Residual Mamba Block with Multi-Scan State-Space Module is shown in (c).
  • Figure 5: Visual comparison results on synthetic datasets bhat_deep_2021. The odd rows show the ground truth and the reference image from the input low-resolution images, along with the corresponding zoomed regions. The even rows show the results of state-of-the-art methods.
  • ...and 8 more figures
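
The PixelShuffle step mentioned in the Figure 2 caption can be sketched in NumPy. This is a hedged illustration rather than the authors' code: it assumes each RAW frame arrives as four packed half-resolution channels ordered R, G, G, B, and it follows the channel-to-space layout used by common deep-learning PixelShuffle operators. The function name `pixel_shuffle` and the toy array `packed` are hypothetical.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r).

    Output pixel (h*r + i, w*r + j) of channel c is taken from
    input channel c*r*r + i*r + j at position (h, w).
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)   # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Toy packed RAW frame (assumed layout): channels R=1, G=2, G=3, B=4,
# each of size 2x2.
packed = np.stack([np.full((2, 2), v) for v in (1, 2, 3, 4)])

# Unpack to a single-channel 4x4 mosaic; every 2x2 cell holds the
# four packed values in row-major order.
mosaic = pixel_shuffle(packed, 2)
print(mosaic[0])
```

Under these assumptions, each 2×2 cell of the mosaic contains the four packed values in row-major order, which matches an RGGB Bayer tile; the subsequent $3\times3$ convolution described in the caption would then expand this single channel to 3 channels.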