StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models

Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, Siavash Arjomand Bigdeli

TL;DR

StereoDiffusion is a training-free approach to stereo image generation that directly manipulates the latent space of a Stable Diffusion model. It applies a Stereo Pixel Shift, derived from a disparity map, during early denoising and enforces left-right coherence through Symmetric Pixel Shift Masking Denoise and Self-Attention modifications. This enables end-to-end text-to-stereo (T2SI), depth-to-stereo (D2SI), and image-to-stereo (I2SI) generation without fine-tuning. The method reports state-of-the-art quantitative results on Middlebury and KITTI, along with favorable user evaluations, and it is simple to integrate and fast to run. Because the disparity input comes from depth models, the approach is flexible and well suited to stereo content generation for XR/VR applications, but its quality is limited by depth accuracy and it can show inpainting artifacts.

Abstract

The demand for stereo images increases as manufacturers launch more XR devices. To meet this demand, we introduce StereoDiffusion, a method that, unlike traditional inpainting pipelines, is training-free, remarkably straightforward to use, and seamlessly integrated into the original Stable Diffusion model. Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs, without the need for fine-tuning model weights or any post-processing of images. Using the original input to generate a left image and estimate a disparity map for it, we generate the latent vector for the right image through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification methods to align the right-side image with the left-side image. Moreover, our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
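The core idea of shifting a left-view latent by a disparity map can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `stereo_pixel_shift`, the `scale` parameter, the rounding to integer offsets, and the rule that larger disparities (nearer content) overwrite smaller ones are all assumptions made for illustration. The disparity map is also assumed to be already resized to the latent resolution.

```python
import numpy as np

def stereo_pixel_shift(latent, disparity, scale=1.0):
    """Forward-warp a (C, H, W) latent horizontally by a per-pixel disparity.

    Hypothetical helper, not the paper's code. Returns the shifted latent
    and a boolean (H, W) mask that is True wherever a value was written.
    Positions left False are the holes a later step would have to fill.
    """
    c, h, w = latent.shape
    shifted = np.zeros_like(latent)
    filled = np.zeros((h, w), dtype=bool)
    # Largest disparity written to each target so far (assumed occlusion rule:
    # content with larger disparity is nearer and wins conflicts).
    best = np.full((h, w), -np.inf)
    offsets = np.rint(disparity * scale).astype(int)
    for y in range(h):
        for x in range(w):
            tx = x - offsets[y, x]  # content moves left in the right view
            if 0 <= tx < w and disparity[y, x] > best[y, tx]:
                shifted[:, y, tx] = latent[:, y, x]
                best[y, tx] = disparity[y, x]
                filled[y, tx] = True
    return shifted, filled

# Toy example: one channel, one row, the two rightmost pixels are "near".
latent = np.arange(4, dtype=float).reshape(1, 1, 4)
disparity = np.array([[0.0, 0.0, 1.0, 1.0]])
right_latent, filled = stereo_pixel_shift(latent, disparity)
```

In the toy example the last column receives no value, which is the kind of disoccluded region that the masking and denoising steps described in the abstract would need to handle.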

Paper Structure

This paper contains 20 sections, 14 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Our method takes one of three types of user input and generates a stereo image. Accepted user inputs: (a) a photo, (b) a text prompt, or (c) a user's image as a depth map and a prompt. We use a latent diffusion model pretrained on either images (a, b) or depth maps (c).
  • Figure 2: The pipeline of StereoDiffusion. The process starts from random noise, which is denoised to generate a stereo image pair. The Stereo Pixel Shift operation is defined by the paper's Stereo Pixel Shift equation. The disparity map used to generate stereo image pairs can be obtained from depth models such as DPT (Ranftl et al., 2021) or MiDaS (Ranftl et al., 2022). The pipeline shows only the Unidirectional Self-Attention operation, which aligns the right-side image with the left-side image and satisfies general needs. Bidirectional Self-Attention, being a mutual operation, would be represented by bidirectional arrows in the image. The orange box depicts the concept of Symmetric Pixel Shift Masking Denoise, which is explained in the section on that method. The cross-attention part of the sampling process is omitted for brevity.
  • Figure 3: Comparing the outcomes of applying the stereo shift at different denoising steps reveals that the optimal configuration varies between images. Applying the shift too early can alter the content significantly, while applying it too late can leave noticeable artifacts in the images.
  • Figure 4: The same image generated using different methods. The rows present, respectively, the image with the lowest (worst) SSIM score, the image closest to the average SSIM score, and the image with the highest (best) SSIM score generated using our method. The other methods are represented solely by their results on these specific images, which do not necessarily reflect the best, average, or worst SSIM scores achievable by those methods. We do this to facilitate a direct comparison of the effects of each method on the same image. We also provide LPIPS scores for reference and close-ups of the images generated by the primary benchmark methods for inspection of details.
  • Figure 5: Ablation examples on Middlebury (top) and KITTI (bottom). In the images, 'P' and 'G' denote whether the image was guided by a pseudo disparity map or a ground-truth disparity map, respectively. 'A', 'S', and 'D' indicate the use of Attention layers modification, Symmetric Pixel Shift Masking Denoise, and the Deblur technique, respectively. The lower scores associated with ground-truth disparity maps on Middlebury may be attributed to their generally higher precision and complexity. This added detail can make pixel shift operations during image generation more intricate and sensitive. Our Stereo Pixel Shift operation is executed within a smaller latent space (64×64), where fine structures, such as those around tree trunks and leaves, might be lost. In contrast, disparity maps generated by depth estimation models, with their lower precision, are more conducive to Pixel Shift in the latent space without sacrificing image detail.
  • ...and 7 more figures
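
The Unidirectional Self-Attention mentioned in the Figure 2 caption can be pictured with a minimal NumPy sketch. It assumes one plausible reading of the caption: the right view keeps its own queries but reads keys and values from the left view, so the right image is pulled toward the left image while the left image is unaffected. The function names, the single-head layout, and the (tokens, dim) shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention on (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def unidirectional_self_attention(q_left, k_left, v_left, q_right):
    """Hypothetical sketch: the right view attends to the left view.

    The left branch is ordinary self-attention. The right branch reuses
    the left keys and values, which is the assumed alignment mechanism.
    """
    out_left = attention(q_left, k_left, v_left)
    out_right = attention(q_right, k_left, v_left)
    return out_left, out_right

rng = np.random.default_rng(0)
q_l, k_l, v_l = (rng.normal(size=(5, 8)) for _ in range(3))
q_r = rng.normal(size=(5, 8))
out_l, out_r = unidirectional_self_attention(q_l, k_l, v_l, q_r)
```

Under this reading, a bidirectional variant would additionally let the left queries read from the right keys and values, which matches the caption's description of it as a mutual operation.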