SGDFormer: One-stage Transformer-based Architecture for Cross-Spectral Stereo Image Guided Denoising
Runmin Zhang, Zhu Yu, Zehua Sheng, Jiacheng Ying, Si-Yuan Cao, Shu-Jie Chen, Bailin Yang, Junwei Li, Hui-Liang Shen
TL;DR
SGDFormer addresses cross-spectral stereo image guided denoising by eliminating explicit alignment and directly modeling stereo correspondence within a unified transformer framework. The core innovations are the Noise-Robust Cross-Attention (NRCA) and Spatially Variant Feature Fusion (SVFF) modules, which together enable robust long-range correspondence and adaptive fusion under spectral and noise variations. Empirical results on synthetic and real-world datasets show state-of-the-art PSNR, SSIM, and LPIPS at a favorable compute cost, and ablations validate the contribution of both NRCA and SVFF. The approach also shows potential for other unaligned guided restoration tasks, such as guided depth super-resolution, which makes it relevant to mobile and cross-modal imaging pipelines.
Abstract
Cross-spectral image guided denoising has shown great potential in recovering clean images with rich details, for example by using a near-infrared image to guide the denoising of a visible one. A feasible and economical way to obtain such image pairs is to employ a stereo system, which is widely used on mobile devices. Current works attempt to generate an aligned guidance image to handle the disparity between the two images. However, due to occlusion, spectral differences, and noise degradation, the aligned guidance image generally exhibits ghosting and artifacts, leading to an unsatisfactory denoised result. To address this issue, we propose a one-stage transformer-based architecture, named SGDFormer, for cross-spectral Stereo image Guided Denoising. The architecture integrates the correspondence modeling and feature fusion of stereo images into a unified network. Our transformer block contains a noise-robust cross-attention (NRCA) module and a spatially variant feature fusion (SVFF) module. The NRCA module captures the long-range correspondence between the two images in a coarse-to-fine manner to alleviate the interference of noise. The SVFF module further enhances salient structures and suppresses harmful artifacts by dynamically selecting useful information. Thanks to this design, SGDFormer can restore artifact-free images with fine structures and achieves state-of-the-art performance on various datasets. Additionally, SGDFormer can be extended to handle other unaligned cross-modal guided restoration tasks such as guided depth super-resolution.
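To make the two ideas in the abstract concrete, the following NumPy sketch shows one plausible reading of coarse-to-fine cross-attention followed by per-pixel gated fusion. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: attention restricted to image rows (as in a rectified stereo pair), 2x average pooling for the coarse level, a log-prior bias to pass coarse attention to the fine level, a sigmoid gate for fusion, and all function and parameter names (`row_cross_attention`, `coarse_to_fine_attention`, `spatially_variant_fusion`, `w_gate`).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_cross_attention(target, guide):
    """Each target pixel attends to every guide pixel on the same row.

    target, guide: (H, W, C) feature maps.
    Returns guide features gathered per target pixel (H, W, C)
    and the attention weights (H, W, W).
    """
    c = target.shape[-1]
    scores = np.einsum("hwc,hvc->hwv", target, guide) / np.sqrt(c)
    attn = softmax(scores, axis=-1)
    return np.einsum("hwv,hvc->hwc", attn, guide), attn

def avg_pool2(x):
    # 2x2 average pooling; assumes H and W are even.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def coarse_to_fine_attention(target, guide):
    """Assumed coarse-to-fine scheme: attention computed on pooled
    (less noisy) features acts as a prior on full-resolution attention."""
    _, coarse = row_cross_attention(avg_pool2(target), avg_pool2(guide))
    # Nearest-neighbour upsampling of (H/2, W/2, W/2) to (H, W, W).
    prior = coarse.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
    c = target.shape[-1]
    scores = np.einsum("hwc,hvc->hwv", target, guide) / np.sqrt(c)
    attn = softmax(scores + np.log(prior + 1e-8), axis=-1)
    return np.einsum("hwv,hvc->hwc", attn, guide), attn

def spatially_variant_fusion(target, gathered, w_gate):
    """Assumed fusion: a per-pixel sigmoid gate decides how much of the
    gathered guidance to keep. w_gate is a hypothetical (2C, 1) weight."""
    gate_in = np.concatenate([target, gathered], axis=-1)  # (H, W, 2C)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))       # (H, W, 1)
    return gate * gathered + (1.0 - gate) * target, gate

# Usage on random features standing in for noisy-visible and NIR features.
rng = np.random.default_rng(0)
noisy_feat = rng.normal(size=(4, 8, 3))
guide_feat = rng.normal(size=(4, 8, 3))
gathered, attn = coarse_to_fine_attention(noisy_feat, guide_feat)
fused, gate = spatially_variant_fusion(noisy_feat, gathered,
                                       rng.normal(size=(6, 1)))
```

In this sketch no warped guidance image is ever produced: correspondence is expressed only as attention weights over guide features, and the gate can fall back to the target features wherever the gathered guidance is unreliable, for example in occluded regions.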
