Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution

Xinyi Liu, Feiyu Tan, Qi Xie, Qian Zhao, Deyu Meng

TL;DR

This paper tackles Burst Image Super-Resolution (BISR) with a focus on alignment, proposing an equivariant-convolution (Eq-CNN) framework that keeps transformations consistent between the image and feature domains. The rotation-translation transformation of each frame is learned in the image domain, and its inverse is then applied in the feature domain to align the frames. Alignment is complemented by a reconstruction module that uses Multi-Dconv Head Transposed Attention (MDTA) for cross-frame interaction and implicit neural representation (INR)-based multi-scale upsampling. The contributions are (i) an Eq-CNN-based alignment module with explicit image-domain supervision, (ii) a reconstruction pipeline using MDTA and INR, and (iii) a theoretical justification that bounds the impact of discretized transformations on feature-domain alignment. Experiments on SyntheticBurst and BurstSR report state-of-the-art PSNR/SSIM with favorable model efficiency, along with better detail preservation and artifact suppression for real-world burst imaging.

Abstract

Burst image processing (BIP), which captures and integrates multiple frames into a single high-quality image, is widely used in consumer cameras. As a typical BIP task, Burst Image Super-Resolution (BISR) has achieved notable progress through deep learning in recent years. Existing BISR methods typically involve three key stages: alignment, upsampling, and fusion, often in varying orders and implementations. Among these stages, alignment is particularly critical for ensuring accurate feature matching and further reconstruction. However, existing methods often rely on techniques such as deformable convolutions and optical flow to realize alignment, which either focus only on local transformations or lack theoretical grounding, thereby limiting their performance. To alleviate these issues, we propose a novel framework for BISR, featuring an equivariant convolution-based alignment, ensuring consistent transformations between the image and feature domains. This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain in a theoretically sound way, effectively improving alignment accuracy. Additionally, we design an effective reconstruction module with advanced deep architectures for upsampling and fusion to obtain the final BISR result. Extensive experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.

Paper Structure

This paper contains 18 sections, 2 theorems, 14 equations, 7 figures, 2 tables.

Key Result

Theorem 1

For an image $I_0$ of size $H \times W \times C$, a rotation-translation Eq-CNN $g(\cdot)$ with discretized angles, and a rotation-translation transformation $f_j(\cdot)$, under certain assumptions, the following result holds: where $t,p,h, C_1, C_2$ are constants.
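
As a rough orientation (a sketch in this summary's notation, not the theorem's statement), write $I_j = f_j(I_0)$ for the $j$-th frame and $Z_j = g(I_j)$ for its features. If $g$ were exactly equivariant to $f_j$, where $f_j$ acting on features is understood as the corresponding feature-domain transformation (an assumption of this sketch), alignment in the feature domain would be exact:

$$
Z_j = g\big(f_j(I_0)\big) = f_j\big(g(I_0)\big) = f_j(Z_0)
\quad\Longrightarrow\quad
f_j^{-1}(Z_j) = Z_0 .
$$

With discretized angles the middle equality can hold only approximately. According to the TL;DR, the theoretical analysis bounds the resulting deviation between $f_j^{-1}(Z_j)$ and $Z_0$.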

Figures (7)

  • Figure 1: Illustration of transformation consistency in vanilla (V-CNN) and equivariant (Eq-CNN) convolutional networks. $f_1$ denotes a transformation (a rotation in this example) and $g$ is a CNN that extracts features from images. Let $I_1=f_1(I_0)$ be the image obtained by applying $f_1$ to $I_0$, and let $Z_0$ and $Z_1$ be the features extracted from $I_0$ and $I_1$, respectively. We expect $Z_1$ to be close to $f_1(Z_0)$, the transformed version of $Z_0$, so that $Z_1$ can be aligned to $Z_0$ in the feature domain by applying the inverse transformation $f_1^{-1}$, which can be learned with explicit supervision in the image domain. The right box compares the error between $f_1^{-1}(Z_1)$ and $Z_0$, showing that the Eq-CNN achieves this goal more effectively than the V-CNN.
  • Figure 2: Overview of our proposed method. The top row shows the whole workflow. The bottom left shows the detailed equivariant convolution layers of ENet. The bottom right shows the process of feature alignment by the predicted transformation, as described in the alignment section.
  • Figure 3: Visual comparison of ×4 BISR on the SyntheticBurst dataset.
  • Figure 4: Visual comparison of ×4 BISR on the BurstSR dataset.
  • Figure 5: Visual results of the ablation study for ×4 BISR on SyntheticBurst. Settings (a)-(d) are defined in the ablation table and the ablation section.
  • ...and 2 more figures
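
The comparison described in the Figure 1 caption can be reproduced in miniature. The sketch below is not the paper's Eq-CNN: as a simplifying assumption it uses a single convolution, restricts $f_1$ to a 90° rotation, and obtains equivariance by averaging a kernel over its four rotations. The helper `conv2d_circular` and all variable names are hypothetical. The code measures the error between $f_1^{-1}(Z_1)$ and $Z_0$ for a generic ("vanilla") kernel and for the symmetrized one.

```python
import numpy as np

def conv2d_circular(image, kernel):
    """Cross-correlate a 2-D image with an odd-sized kernel, using wrap-around padding."""
    kh, kw = kernel.shape
    out = np.zeros_like(image, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            # shifted[y, x] == image[y + dy - kh//2, x + dx - kw//2] (indices modulo size)
            shifted = np.roll(image, shift=(kh // 2 - dy, kw // 2 - dx), axis=(0, 1))
            out += kernel[dy, dx] * shifted
    return out

rng = np.random.default_rng(0)
I0 = rng.standard_normal((16, 16))   # reference frame (square, so rot90 keeps its shape)
I1 = np.rot90(I0)                    # I1 = f1(I0), with f1 a 90-degree rotation

# "Vanilla" filter: an arbitrary 3x3 kernel with no symmetry constraint.
k_vanilla = rng.standard_normal((3, 3))
# Rotation-symmetric filter: the average of the kernel over its four 90-degree rotations.
k_eq = sum(np.rot90(k_vanilla, r) for r in range(4)) / 4.0

def alignment_error(kernel):
    """Max absolute error between f1^{-1}(Z1) and Z0 for features from `kernel`."""
    Z0 = conv2d_circular(I0, kernel)
    Z1 = conv2d_circular(I1, kernel)
    Z1_aligned = np.rot90(Z1, k=-1)  # apply the inverse rotation in the feature domain
    return np.abs(Z1_aligned - Z0).max()

err_vanilla = alignment_error(k_vanilla)
err_eq = alignment_error(k_eq)
print(f"vanilla kernel error:   {err_vanilla:.3e}")
print(f"symmetric kernel error: {err_eq:.3e}")
```

With the symmetrized kernel the aligned features match $Z_0$ up to floating-point round-off, while the unconstrained kernel leaves a clearly non-zero residual. A rotation-invariant kernel is only the simplest way to obtain this property; equivariant networks in general use richer filter parameterizations and, for arbitrary angles, the equality becomes approximate.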

Theorems & Definitions (2)

  • Theorem 1
  • Corollary 1