SGDFormer: One-stage Transformer-based Architecture for Cross-Spectral Stereo Image Guided Denoising

Runmin Zhang, Zhu Yu, Zehua Sheng, Jiacheng Ying, Si-Yuan Cao, Shu-Jie Chen, Bailin Yang, Junwei Li, Hui-Liang Shen

TL;DR

SGDFormer addresses cross-spectral stereo image guided denoising by eliminating explicit alignment and directly modeling stereo correspondence within a unified transformer framework. The core innovations are the Noise-Robust Cross-Attention (NRCA) and Spatially Variant Feature Fusion (SVFF), which together enable robust long-range correspondence and adaptive fusion under spectral and noise variations. Empirical results on synthetic and real-world datasets show state-of-the-art PSNR/SSIM/LPIPS gains with favorable compute cost, and ablations validate the contribution of NRCA and SVFF. The approach also demonstrates potential for unaligned guided restoration tasks, such as guided depth super-resolution, highlighting practical impact for mobile and cross-modal imaging pipelines.

Abstract

Cross-spectral image guided denoising has shown great potential in recovering clean images with rich details, for example by using a near-infrared image to guide the denoising of a visible one. A feasible and economical way to obtain such image pairs is to employ a stereo system, which is widely used on mobile devices. Current works attempt to generate an aligned guidance image to handle the disparity between the two images. However, due to occlusion, spectral differences and noise degradation, the aligned guidance image generally exhibits ghosting and artifacts, leading to an unsatisfactory denoised result. To address this issue, we propose a one-stage transformer-based architecture, named SGDFormer, for cross-spectral Stereo image Guided Denoising. The architecture integrates the correspondence modeling and feature fusion of stereo images into a unified network. Our transformer block contains a noise-robust cross-attention (NRCA) module and a spatially variant feature fusion (SVFF) module. The NRCA module captures the long-range correspondence between the two images in a coarse-to-fine manner to alleviate the interference of noise. The SVFF module further enhances salient structures and suppresses harmful artifacts by dynamically selecting useful information. Thanks to this design, SGDFormer can restore artifact-free images with fine structures and achieves state-of-the-art performance on various datasets. Additionally, SGDFormer can be extended to other unaligned cross-modal guided restoration tasks such as guided depth super-resolution.
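
To make the coarse-to-fine idea in the abstract more concrete, the following is a minimal NumPy sketch of one plausible reading, not the authors' NRCA implementation. It assumes 1-D feature rows of shape (W, C), average pooling as the aggregation step, and nearest-neighbour propagation of the coarse attention map to full resolution. The function names and the pooling factor `s` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_rows(feat, s):
    """Average every s consecutive positions: (W, C) -> (W // s, C)."""
    w, c = feat.shape
    return feat.reshape(w // s, s, c).mean(axis=1)


def coarse_to_fine_cross_attention(target, guide, s=4):
    """Illustrative sketch only (assumed design, not the paper's module).

    target: (W, C) noisy target features, used as queries.
    guide:  (W, C) guidance features, used as keys and values.
    """
    w, c = target.shape
    if guide.shape != (w, c) or w % s != 0:
        raise ValueError("inputs must share shape (W, C) with W divisible by s")

    # 1) Feature aggregation: pooling averages out part of the noise.
    q_coarse = avg_pool_rows(target, s)
    k_coarse = avg_pool_rows(guide, s)

    # 2) Coarse attention map, shape (W // s, W // s).
    attn_coarse = softmax(q_coarse @ k_coarse.T / np.sqrt(c))

    # 3) Propagation: every fine query inherits the coarse row of its block,
    #    and each coarse key weight is split evenly over its s fine keys,
    #    so every row still sums to 1.
    attn_fine = np.repeat(np.repeat(attn_coarse, s, axis=0), s, axis=1) / s

    # 4) Aligned guidance features: attention-weighted sum of guidance values.
    return attn_fine @ guide, attn_fine


rng = np.random.default_rng(0)
target_row = rng.normal(size=(16, 8))
guide_row = rng.normal(size=(16, 8))
aligned, attn = coarse_to_fine_cross_attention(target_row, guide_row, s=4)
print(aligned.shape, attn.shape)
```

Computing similarity on pooled features is one simple way to keep the attention map from being dominated by per-pixel noise; the actual module may aggregate and propagate differently.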

Paper Structure (19 sections, 11 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: Comparison between the previous state-of-the-art approach SANet [SANet_CVPR23] and our SGDFormer. SANet separates stereo guided denoising into two steps: aligned guidance image generation and guided denoising. The latter step has to tolerate the undesired guidance (structure map) estimated by the former, which generally leads to an unsatisfactory denoised image. In contrast, our SGDFormer integrates the correspondence modeling and feature fusion of the two images into a one-stage architecture. In this way, the information in the guidance image is preserved to the greatest extent, so that noise is removed effectively while fine structures are restored.
  • Figure 2: The overall architecture of our stereo guided denoising network SGDFormer.
  • Figure 3: Illustration of (a) the noise-robust cross-attention (NRCA) module, which consists of (b) feature aggregation, coarse attention map computing, (c) attention map propagation, and aligned guidance feature generation.
  • Figure 4: Visualization of aligned guidance feature maps. Compared to the vanilla cross-attention, the proposed noise-robust cross-attention can generate aligned guidance features with more salient structures under a high noise level.
  • Figure 5: Comparison of different feature fusion strategies. (a) Add. (b) Concat. (c) Attention. (d) Our proposed spatially variant feature fusion (SVFF) module.
  • ...and 9 more figures
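
The four fusion strategies named in the Figure 5 caption can be contrasted with a small NumPy sketch. These are generic textbook forms written for illustration, not the paper's SVFF module: the gate weights `w_gate`, the projection `w_proj`, and the residual form `t + gate * g` are all assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_add(t, g):
    """(a) Add: every pixel and channel of the guidance is trusted equally."""
    return t + g


def fuse_concat(t, g, w_proj):
    """(b) Concat: stack channels, then project 2C -> C with w_proj (2C, C)."""
    return np.concatenate([t, g], axis=-1) @ w_proj


def fuse_channel_attention(t, g, w_gate):
    """(c) Attention (assumed channel-wise): one gate per channel, shared by all pixels."""
    pooled = np.concatenate([t, g], axis=-1).mean(axis=(0, 1))  # (2C,)
    gate = sigmoid(pooled @ w_gate)                              # (C,)
    return t + gate * g


def fuse_spatially_variant(t, g, w_gate):
    """(d) Spatially variant (assumed form): a separate gate at every pixel and channel."""
    gate = sigmoid(np.concatenate([t, g], axis=-1) @ w_gate)     # (H, W, C)
    return t + gate * g


rng = np.random.default_rng(0)
h, w, c = 4, 6, 8
t = rng.normal(size=(h, w, c))   # target (noisy image) features
g = rng.normal(size=(h, w, c))   # aligned guidance features
w_gate = rng.normal(size=(2 * c, c))
w_proj = rng.normal(size=(2 * c, c))

for fused in (
    fuse_add(t, g),
    fuse_concat(t, g, w_proj),
    fuse_channel_attention(t, g, w_gate),
    fuse_spatially_variant(t, g, w_gate),
):
    print(fused.shape)
```

The point of the comparison is that only the last variant can lower the guidance contribution at individual locations, for example around occlusions, while leaving it high elsewhere.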