DiffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Yuanbo Zhou, Xinlin Zhang, Wei Deng, Tao Wang, Tao Tan, Qinquan Gao, Tong Tong

TL;DR

DiffSteISR tackles real-world stereo image super-resolution (Real-SteISR) by leveraging diffusion priors from pre-trained text-to-image models to recover textures while enforcing cross-view texture and semantic consistency. It introduces a stereo-specific conditioning framework with a Stereo Semantic Extractor, a Stereo Omni Attention ControlNet, and a Time-aware Stereo Cross Attention with a Temperature Adapter to drive synchronized left-right reconstruction via a Dual-UNet and VAE decoder. The method shows competitive performance on synthetic and real datasets, balancing no-reference realism and disparity consistency better than prior diffusion-model-based and GAN-based approaches, and it receives favorable qualitative and user-study feedback. This work advances practical Real-SteISR by integrating diffusion priors with stereo-aware conditioning to produce natural textures and coherent stereo pairs, offering a solid foundation for future diffusion-guided stereo vision tasks.

Abstract

We introduce DiffSteISR, a pioneering framework for reconstructing real-world stereo images. DiffSteISR utilizes the powerful prior knowledge embedded in a pre-trained text-to-image model to efficiently recover the texture details lost in low-resolution stereo images. Specifically, DiffSteISR implements a time-aware stereo cross attention with temperature adapter (TASCATA) to guide the diffusion process, ensuring that the generated left and right views exhibit high texture consistency, thereby reducing the disparity error between the super-resolved images and the ground truth (GT) images. Additionally, a stereo omni attention control network (SOA ControlNet) is proposed to enhance the consistency of super-resolved images with GT images in the pixel, perceptual, and distribution spaces. Finally, DiffSteISR incorporates a stereo semantic extractor (SSE) to capture viewpoint-specific soft semantic information and shared hard tag semantic information, thereby improving the semantic accuracy and consistency of the generated left and right images. Extensive experimental results demonstrate that DiffSteISR accurately reconstructs natural and precise textures from low-resolution stereo images while maintaining high semantic and texture consistency between the left and right views.
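
The abstract does not spell out how TASCATA is computed, so the following is only an illustrative sketch of the general idea of stereo cross attention with a temperature, not the authors' implementation. It assumes rectified views, so each left-view pixel attends only to right-view pixels in the same row, and it assumes a simple linear temperature schedule over diffusion steps. The function names, the schedule, and its bounds (`t_min`, `t_max`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stereo_cross_attention(left, right, temperature=1.0):
    """Row-wise cross attention from the left view to the right view.

    left, right: (H, W, C) feature maps of rectified stereo views.
    Returns the right-view features aggregated for each left pixel,
    shape (H, W, C), and the attention weights, shape (H, W, W).
    """
    c = left.shape[-1]
    # Similarity between every left pixel i and right pixel j in the same row h.
    scores = np.einsum("hic,hjc->hij", left, right) / np.sqrt(c)
    # A lower temperature sharpens the attention, a higher one flattens it.
    attn = softmax(scores / temperature, axis=-1)
    fused = np.einsum("hij,hjc->hic", attn, right)
    return fused, attn

def temperature_for_step(t, num_steps, t_min=0.5, t_max=2.0):
    """Hypothetical linear schedule (assumption, not from the paper):
    t = 0 maps to t_min and t = num_steps - 1 maps to t_max."""
    frac = t / max(num_steps - 1, 1)
    return t_min + (t_max - t_min) * frac

rng = np.random.default_rng(0)
left = rng.normal(size=(4, 8, 16))
right = rng.normal(size=(4, 8, 16))
tau = temperature_for_step(t=10, num_steps=50)
fused, attn = stereo_cross_attention(left, right, temperature=tau)
print(fused.shape, attn.shape)
```

Making the temperature depend on the diffusion step is one plausible reading of "time-aware": noisy early steps could use softer matching and later steps sharper matching, but the schedule the paper actually uses is not stated here.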

Paper Structure

This paper contains 19 sections, 7 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: The visual results of the state-of-the-art Real-ISR methods based on diffusion models for processing stereo images.
  • Figure 2: The framework of the proposed method consists of five parts: the stereo semantic extractor, the Tag Encoder, the SOA ControlNet, the Dual-UNet, and the VAE Decoder.
  • Figure 3: The architecture of the stereo semantic extractor consists of an image encoder, a tag head, and a tag merging module.
  • Figure 4: The architecture diagram of the stereo omni attention control network.
  • Figure 5: The architecture of the time-aware stereo cross attention with temperature adapter.
  • ...and 7 more figures