SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang

TL;DR

This work tackles the challenge of generating high-quality 3D stereoscopic videos without camera pose estimation or training on stereo data. It introduces a pose-free, training-free pipeline that starts from a monocular video produced by a video generation model, warps it to multiple views along a stereoscopic baseline using estimated video depth, and fills the disoccluded regions through a novel frame-matrix denoising inpainting process, complemented by a disocclusion boundary re-injection mechanism. The frame matrix representation enforces spatial and temporal coherence simultaneously, producing semantically consistent left and right views across time. Experiments on videos from multiple generative models show improvements over previous methods. The approach offers a practical path to 3D content from monocular video generation models without scene optimization or model fine-tuning, and the authors state that code will be released.

Abstract

Video generation models have demonstrated great capabilities in producing impressive monocular videos; however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora [4], Lumiere [2], WALT [8], and Zeroscope [42]. The experiments demonstrate that our method significantly improves over previous methods. The code will be released at https://daipengwa.github.io/SVG_ProjectPage.
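
The warping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a rectified pinhole setup in which a pixel shifts horizontally by a disparity inversely proportional to its depth, and the names `warp_to_view` and `shift_scale` are hypothetical. Pixels that receive no source pixel are marked as disoccluded (mask value True is a convention chosen here).

```python
import numpy as np

def warp_to_view(frame, depth, shift_scale):
    """Forward-warp one frame horizontally and report disoccluded pixels.

    frame:       (H, W, C) image or latent array.
    depth:       (H, W) strictly positive depth map.
    shift_scale: hypothetical signed constant (baseline times focal length);
                 the per-pixel shift is shift_scale / depth.
    Returns (warped, mask), where mask is True wherever no source pixel landed.
    """
    h, w = depth.shape
    disparity = shift_scale / depth
    warped = np.zeros_like(frame)
    zbuf = np.full((h, w), np.inf)  # nearest depth seen so far per target pixel
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + disparity).astype(int)  # target column for each source pixel
    valid = (xt >= 0) & (xt < w)
    for y, x, x_new in zip(ys[valid], xs[valid], xt[valid]):
        # Z-buffer test: a closer surface overwrites a farther one.
        if depth[y, x] < zbuf[y, x_new]:
            zbuf[y, x_new] = depth[y, x]
            warped[y, x_new] = frame[y, x]
    mask = np.isinf(zbuf)  # untouched target pixels are disoccluded
    return warped, mask
```

Calling this once per frame and per target view with different `shift_scale` values would yield the warped frames and masks that the inpainting stage then has to fill.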

Paper Structure (27 sections, 7 equations, 14 figures, 3 tables, 1 algorithm)

Figures (14)

  • Figure 1: Overview -- Top: Given a text prompt, our method first uses a video generation model to generate a monocular video, which is warped (using estimated depth) into pre-defined camera views to form a frame matrix with disocclusion masks $M$. Then, the disoccluded regions are inpainted by denoising the frame sequences within the frame matrix. After denoising, we select the leftmost and the rightmost columns and decode them to obtain a 3D stereoscopic video. Bottom: Details of denoising the frame matrix. We initialize the latent matrix $\mathbf{z}_T$ as a random noise map. For each noise level, we extend the resampling mechanism [karnewar2023holodiffusion, lugmayr2022repaint] to alternately denoise temporal (column) sequences and spatial (row) sequences $N$ times. Each time, row or column sequences are denoised and inpainted (see Fig. 2). By denoising along both spatial and temporal directions, we obtain an inpainted latent $\mathbf{z}_0$ which can be decoded into temporally smooth and semantically consistent sequences.
  • Figure 2: Denoising Inpainting. This figure visualizes the operations in the purple box of Fig. 1. (a) We re-inject the generated content from a denoised latent $\widetilde{\mathbf{z}}_0$ to update $\mathbf{z}_0^{known}$ and reduce its feature corruption on the disocclusion boundary. (b) A noisy latent $\mathbf{z}_t$ is denoised to $\mathbf{z}_{t-1}^{unknown}$. We take its disoccluded region and combine it with the unoccluded region of $\mathbf{z}_0^{known}$.
  • Figure 3: Qualitative comparisons. The first row shows left-view images. The video inpainting methods E2FGVI and ProPainter tend to generate blurry content in disoccluded regions, such as the knight's arm and the corgi's face. RoDynRF lacks generative ability, so the content on the right side of the corgi case is poor. DynIBaR's results contain artifacts, and it requires camera poses as inputs, which fails in some scenarios. In contrast, our method takes advantage of video generation models and is pose-free, and thus generates high-quality content in different scenarios.
  • Figure 4: Semantically consistent content generation. The reference frames are warped into the target view with disoccluded regions set to black. Without the frame matrix, the generated content does not match the reference, such as the book and the horse's face. With the frame matrix, the inpainted content is more semantically reasonable.
  • Figure 5: Disocclusion Boundary Re-injection. Without disocclusion boundary re-injection, the inpainted images usually contain artifacts. The bottom-left corner shows the warped image.
  • ...and 9 more figures
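
The denoising schedule described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the paper's algorithm: `toy_denoise` is a hypothetical stand-in for one reverse step of a video diffusion model (here just neighbour averaging), `add_noise` is a simplified forward process, the resampling step is reduced to adding a little noise, and the boundary re-injection of Figure 2(a) is omitted. What it does show is the control flow: for each noise level, alternate between column (temporal) and row (spatial) sequences $N$ times, and after each pass keep the generated values only in disoccluded regions while restoring the known latent elsewhere.

```python
import numpy as np

def toy_denoise(seq, t):
    """Hypothetical stand-in for one reverse-diffusion step on a (length, dim) sequence."""
    padded = np.pad(seq, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + seq + padded[2:]) / 3.0  # smooth along the sequence axis

def add_noise(z0, t, T, rng):
    """Toy forward process (an assumption): noise level grows linearly with t."""
    sigma = t / T
    return np.sqrt(1.0 - sigma**2) * z0 + sigma * rng.standard_normal(z0.shape)

def denoise_frame_matrix(z_known, mask, T=10, N=2, seed=0):
    """z_known: (frames, views, dim) latents of the warped frames.
    mask: boolean array of the same shape, True where a value is disoccluded.
    Returns the inpainted latent matrix z_0."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(z_known.shape)  # z_T: random initialisation
    for t in range(T, 0, -1):
        for n in range(N):
            z_prev = np.empty_like(z)
            if n % 2 == 0:
                # Temporal pass: each column is one view across all timestamps.
                for v in range(z.shape[1]):
                    z_prev[:, v] = toy_denoise(z[:, v], t)
            else:
                # Spatial pass: each row is one timestamp across all views.
                for f in range(z.shape[0]):
                    z_prev[f] = toy_denoise(z[f], t)
            # Keep generated values in disoccluded regions, known values elsewhere.
            known = add_noise(z_known, t - 1, T, rng)
            z_prev = np.where(mask, z_prev, known)
            if n < N - 1:
                # Simplified resampling: perturb back toward level t and repeat.
                z = z_prev + (1.0 / T) * rng.standard_normal(z.shape)
            else:
                z = z_prev
    return z
```

In this sketch the stereo pair would correspond to the outer columns of the result, `z[:, 0]` and `z[:, -1]`, which a real pipeline would then pass through the video model's decoder.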