StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

Jian Shi, Qian Wang, Zhenyu Li, Ramzi Idoughi, Peter Wonka

TL;DR

StereoCrafter-Zero addresses the challenge of zero-shot stereo video generation from a single image by leveraging video diffusion priors without requiring paired training data. The method introduces a noisy restart for stereo-aware latent initialization, an iterative refinement loop to harmonize latents, and dissolved depth maps that retain only low-frequency structural depth to improve temporal and inter-view coherence. Quantitative metrics (e.g., CLIP-based scores for semantic and view-consistency) and user studies demonstrate superior depth consistency and temporal stability compared with baselines, with robustness across diffusion models. The approach enables high-quality, immersive stereo videos with reduced dependence on precise depth maps, and its code is available to facilitate broader adoption and extension.

Abstract

Generating high-quality stereo videos that mimic human binocular vision requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains challenging because temporal and spatial coherence must be maintained between the left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Our key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. In addition, we propose the use of dissolved depth maps to streamline latent space operations by reducing high-frequency depth information. Our comprehensive evaluations, including quantitative metrics and user studies, demonstrate that StereoCrafter-Zero produces high-quality stereo videos with enhanced depth consistency and temporal smoothness, even when depth estimates are imperfect. Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation and enabling more immersive visual experiences. Our code is available at https://github.com/shijianjian/StereoCrafter-Zero.
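
To make the idea of a dissolved depth map more concrete, the sketch below suppresses high-frequency depth detail with a simple box blur. This is only an illustration under stated assumptions: the abstract says that high-frequency depth information is reduced, but it does not specify the filter, so the box filter, the function name `dissolve_depth`, and the `radius` parameter are all hypothetical. A practical implementation would likely operate on tensors rather than nested lists.

```python
def dissolve_depth(depth, radius=2):
    """Box-blur a 2D depth map (a list of rows) to suppress fine detail.

    Assumption: a box filter stands in for whatever low-pass operation the
    paper actually uses; `radius` is a hypothetical parameter.
    """
    height, width = len(depth), len(depth[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Average over the window, clipped at the map borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [depth[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            out[y][x] = sum(window) / len(window)
    return out


# A sharp depth edge: near (1.0) on the left half, far (0.0) on the right half.
depth = [[1.0] * 4 + [0.0] * 4 for _ in range(4)]
smooth = dissolve_depth(depth, radius=1)
```

After filtering, the flat regions keep their depth values while the sharp edge becomes a gradual ramp, which is the sense in which only low-frequency structure is retained.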

Paper Structure

This paper contains 26 sections, 12 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: With just a single image and an associated text prompt (left), our method generates compelling stereo video sequences. The anaglyph visualization (right) offers an intuitive way to perceive the stereoscopic effects, showcasing the power of our approach.
  • Figure 2: An overview of the StereoCrafter-Zero pipeline. Top: Our method is based on two main components: (1) Noisy Restart for a robust initial latent estimation (described in the noisy restart section) and (2) Iterative Refinement for the latent refinement (described in the iterative refinement section) during the sampling step. Bottom: The proposed pipeline takes a conditioning image and text prompt as input, generating both left and right views that produce a strong stereoscopic effect.
  • Figure 3: Illustration of the noisy restart strategy. At selected steps, we replace the target view sampling with a warped source view. Occluded/disoccluded areas are then filled using the non-warped source view latent, and noise is injected into the latent space. Subsequent iterations update the latents with values from the preceding iteration, while preserving the non-occluded regions.
  • Figure 4: Abrupt border handling. (Left) Images with noticeable abrupt artifacts along the right edge. (Right) Images with border artifacts effectively removed using our method.
  • Figure 5: Illustration of the iterative refinement strategy. This process iteratively refines occluded areas within the predicted $x_0$. The initial iteration uses the standard sampling process. Starting from the second iteration, a new predicted latent is computed and rescaled before each subsequent step.
  • ...and 16 more figures
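
The Figure 3 caption describes three operations: warping the source view, filling occluded/disoccluded areas from the non-warped source latent, and injecting noise. The sketch below walks through them on a single row of latent values. It is a minimal illustration and not the authors' implementation: the names `warp_row` and `noisy_restart_row`, the integer disparities, the `noise_std` parameter, and the choice to add noise at every position (rather than only in the filled holes) are all assumptions. It also ignores depth ordering when two source positions land on the same target position.

```python
import random


def warp_row(src_row, disparity_row):
    """Forward-warp one latent row by integer per-position disparity.

    Returns the warped row and a mask of positions that received a value.
    """
    width = len(src_row)
    warped = [0.0] * width
    filled = [False] * width
    for x in range(width):
        tx = x - disparity_row[x]  # shift toward the target view
        if 0 <= tx < width:
            warped[tx] = src_row[x]  # later writes win; no depth ordering
            filled[tx] = True
    return warped, filled


def noisy_restart_row(src_row, disparity_row, noise_std=0.5, rng=None):
    """Build a target-view latent row from a source-view latent row."""
    rng = rng or random.Random(0)
    warped, filled = warp_row(src_row, disparity_row)
    out = []
    for x, value in enumerate(warped):
        if not filled[x]:
            # Hole left by the warp: fall back to the non-warped source latent.
            value = src_row[x]
        # Inject Gaussian noise (assumed here to apply to every position).
        out.append(value + rng.gauss(0.0, noise_std))
    return out


source = [1.0, 2.0, 3.0, 4.0, 5.0]
disparity = [1, 1, 1, 1, 1]
target_init = noisy_restart_row(source, disparity, noise_std=0.5)
```

With a uniform disparity of 1, every value shifts one position to the left and the last position becomes a hole, which the sketch fills from the source row before adding noise.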