DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration

Huiyun Cao, Yuan Shi, Bin Xia, Xiaoyu Jin, Wenming Yang

TL;DR

DiffStereo applies diffusion models to stereo image restoration by running diffusion in a compressed latent space that preserves high-frequency texture. The method learns Latent High-Frequency Representations (LHFR) of high-quality (HQ) stereo pairs via a Latent Representation Extraction Network (LREN), then uses a diffusion model (DM) to estimate the LHFR from degraded inputs. The estimated LHFR guides a transformer-based stereo image restoration network (SIRN) through a depth-aware fusion scheme. Training has two stages (Stage One: LREN and SIRN; Stage Two: DM estimation of the LHFR). The authors report higher reconstruction fidelity and better perceptual quality than state-of-the-art methods on stereo super-resolution (SR), deblurring, and low-light enhancement, with a lower computational burden and fewer diffusion artifacts. The work argues for combining diffusion-based priors with long-range transformer modeling for stereo texture recovery, which is relevant to 3D vision pipelines that depend on preserved texture and detail.

Abstract

Diffusion models (DMs) have achieved promising performance in image restoration but have not yet been explored for stereo images. Applying DMs to stereo image restoration faces several challenges. The need to reconstruct two images increases the DM's computational cost. Additionally, existing latent DMs usually focus on semantic information and discard high-frequency details as redundancy during latent compression, even though these details are precisely what matters for image restoration. To address these problems, we propose a high-frequency aware diffusion model, DiffStereo, for stereo image restoration, as the first attempt at using a DM in this domain. Specifically, DiffStereo first learns latent high-frequency representations (LHFR) of HQ images. A DM is then trained in the learned space to estimate the LHFR for stereo images, which are fused into a transformer-based stereo image restoration network to provide beneficial high-frequency information from the corresponding HQ images. The resolution of the LHFR is kept the same as that of the input images, which protects the inherent texture from distortion, while compression along the channel dimension alleviates the computational burden of the DM. Furthermore, we devise a position encoding scheme for integrating the LHFR into the restoration network, enabling distinctive guidance at different depths of the network. Comprehensive experiments verify that, by combining a generative DM with a transformer, DiffStereo achieves both higher reconstruction accuracy and better perceptual quality on stereo super-resolution, deblurring, and low-light enhancement compared with state-of-the-art methods.
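The abstract does not spell out the position encoding scheme, so the following is only a minimal sketch of the general idea, assuming a standard sinusoidal encoding indexed by block depth. The function names `depth_encoding` and `guidance_for_block` are hypothetical, and a single pixel's LHFR is represented as a plain list of channel values rather than a tensor. It shows how the same LHFR can yield a different guidance signal at each depth of the restoration network.

```python
import math


def depth_encoding(depth, num_channels, base=10000.0):
    """Sinusoidal code for a block index, one value per channel.

    Assumption: this mirrors the usual transformer positional encoding,
    with the block depth playing the role of the position.
    """
    code = []
    for c in range(num_channels):
        # Channels are paired: even index uses sin, odd index uses cos.
        freq = base ** (-((c // 2) * 2) / num_channels)
        angle = depth * freq
        code.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
    return code


def guidance_for_block(lhfr_pixel, depth):
    """Add the depth code to one pixel's LHFR channel vector (hypothetical fusion input)."""
    code = depth_encoding(depth, len(lhfr_pixel))
    return [value + offset for value, offset in zip(lhfr_pixel, code)]


if __name__ == "__main__":
    lhfr_pixel = [0.2, -0.1, 0.4, 0.0]  # made-up channel values for illustration
    for depth in range(3):
        print(depth, guidance_for_block(lhfr_pixel, depth))
```

An additive encoding is only one possible design; the paper's scheme may combine the depth information with the LHFR differently.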

Paper Structure

This paper contains 23 sections, 17 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: An overview of DiffStereo in training stage one. The latent representation extraction network (LREN) learns a compressed latent space that preserves high-frequency details, such as structural information and texture, in HR stereo images and obtains the LHFR of the left and right views. The LHFR are then fused into the stereo image restoration network (SIRN) to assist texture recovery.
  • Figure 2: The architecture of our proposed components: (a) channel interaction block (CIB), (b) position encoding scheme, (c) channel interaction layer (CIL) in CIB.
  • Figure 3: An overview of DiffStereo in training stage two. The DM learns to estimate the LHFR extracted by pretrained LREN, whose parameters are frozen in stage two. During inference, the DM estimates LHFR from pure Gaussian noise under the guidance of LR stereo images.
  • Figure 4: Visual comparisons for ×4 SR by different methods on the Flickr1024 and Middlebury datasets.
  • Figure 5: Visual comparisons for low-light enhancement by different methods.
  • ...and 1 more figures
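
The Figure 3 caption describes inference as estimating the LHFR from pure Gaussian noise under the guidance of the LR stereo images. As a rough illustration, the sketch below runs standard DDPM ancestral sampling with a conditioning input. It is not the paper's sampler: the linear noise schedule, the step count, and the names `sample_latent` and `oracle_noise` are assumptions, latents are flat lists of floats, and the noise predictor is a toy stand-in for a trained network.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / max(num_steps - 1, 1)
    return [beta_start + i * step for i in range(num_steps)]


def sample_latent(predict_noise, condition, num_steps=50, seed=0):
    """Reverse diffusion: start from Gaussian noise, denoise step by step."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Pure Gaussian noise with the same size as the conditioning signal.
    z = [rng.gauss(0.0, 1.0) for _ in condition]
    for t in reversed(range(num_steps)):
        eps = predict_noise(z, t, condition, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(zi - coef * ei) / math.sqrt(alphas[t]) for zi, ei in zip(z, eps)]
        if t > 0:
            # Re-inject noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            z = mean
    return z


def oracle_noise(z_t, t, condition, alpha_bars):
    """Toy predictor that treats the condition itself as the clean latent.

    A trained model would instead infer the noise from z_t, t, and
    features of the degraded stereo pair.
    """
    scale = math.sqrt(alpha_bars[t])
    denom = math.sqrt(1.0 - alpha_bars[t])
    return [(zi - scale * ci) / denom for zi, ci in zip(z_t, condition)]


if __name__ == "__main__":
    condition = [0.5, -1.0, 2.0]  # made-up stand-in for LR-derived guidance
    print(sample_latent(oracle_noise, condition, num_steps=20, seed=1))
```

Because the toy predictor already knows the target, the sampler recovers the conditioning vector; this only checks the sampling loop and says nothing about how well a learned predictor would perform.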