Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding
Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei
TL;DR
This work tackles burst image super-resolution with a Multi-Cross Attention (MCA) encoder that combines overlapping cross-window attention and cross-frame attention to better capture sub-pixel information across $N$ frames. In the decoder, a Multi-Scan State-Space Module (MS-SSM) complements MCA, using deformable convolution (DCN) assisted alignment and a cross-frame attention mechanism to fuse multi-frame features robustly with linear complexity. Experiments on synthetic and real datasets, plus ISO 12233 resolution chart tests, show state-of-the-art or competitive PSNR/SSIM/LPIPS scores and better reconstruction of fine detail and texture with fewer artifacts. The approach is well suited to burst photography and other real-world imaging settings where alignment and high-frequency detail are critical.
Abstract
Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent super-resolution methods have increasingly adopted Transformers over convolutional neural networks (CNNs) because of their superior ability to capture both local and global context. However, most existing approaches still rely on fixed, narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address it, we propose a novel feature extractor that incorporates two newly designed attention mechanisms, overlapping cross-window attention and cross-frame attention, which enable more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-Scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.
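The claim that spatially shifted frames jointly carry sub-pixel information can be illustrated with a toy example. The sketch below is not the proposed method; it is a classical interleaving idea under strong assumptions: the shifts are known exactly, each shift is one high-resolution pixel (half a low-resolution pixel), and there is no noise or blur. The helper names `downsample` and `interleave` are hypothetical and chosen for illustration only.

```python
# Toy illustration only (assumed setup, not the paper's pipeline):
# four low-resolution (LR) frames, each sampling the same scene at a
# different known offset, together contain every high-resolution (HR) sample.

def downsample(hr, dy, dx, s=2):
    """Keep every s-th pixel of hr, starting at offset (dy, dx)."""
    return [row[dx::s] for row in hr[dy::s]]

def interleave(frames, s=2):
    """Place each LR frame back on the HR grid at its known offset."""
    any_frame = next(iter(frames.values()))
    h, w = len(any_frame), len(any_frame[0])
    hr = [[None] * (w * s) for _ in range(h * s)]
    for (dy, dx), lr in frames.items():
        for i, row in enumerate(lr):
            for j, value in enumerate(row):
                hr[i * s + dy][j * s + dx] = value
    return hr

# A 4x4 "scene" with distinct pixel values 0..15.
scene = [[4 * r + c for c in range(4)] for r in range(4)]

# A burst of four 2x2 frames, one per sub-pixel offset.
burst = {(dy, dx): downsample(scene, dy, dx)
         for dy in range(2) for dx in range(2)}

single = burst[(0, 0)]        # one frame alone keeps only a quarter of the samples
restored = interleave(burst)  # all four frames together recover the full grid
```

In real bursts the shifts are unknown and non-integer, and frames are noisy, which is why alignment and learned feature aggregation, the focus of this work, matter in practice.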
