Restereo: Diffusion stereo video generation and restoration
Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Christian Theobalt, Cengiz Oztireli, Gurprit Singh
TL;DR
Restereo tackles stereo video generation from low-quality monocular inputs by training a single diffusion model to perform both left-right generation and restoration. The key idea is to degrade training data and condition generation on warped masks to enforce cross-view consistency, enabling simultaneous left and right view enhancement. Training occurs on synthetic Kubric data with ground-truth depth, and inference uses two branches with shared weights, achieving improved view and temporal consistency over prior training-free and training-based methods. The approach demonstrates strong qualitative and quantitative gains, and is applicable to real-world, low-resolution videos with modest data requirements. A public release of the pipeline and synthetic data is anticipated upon acceptance.
Abstract
Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.
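To make the two conditioning signals mentioned in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single grayscale frame stored as nested lists, non-negative integer disparities already derived from depth, a simple box-blur degradation, and a mask convention where 1 marks pixels that received no source pixel after warping. The function names `degrade` and `forward_warp` are hypothetical and chosen only for this example; the degradations and warping used in the paper may differ.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale frame as rows of pixel values


def degrade(img: Image, factor: int = 2) -> Image:
    """Simulate a low-quality input (assumed degradation, not the paper's):
    average each factor x factor block, then write the mean back to every
    pixel of the block (box downsample + nearest-neighbor upsample)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, factor):
        for x0 in range(0, w, factor):
            ys = range(y0, min(y0 + factor, h))
            xs = range(x0, min(x0 + factor, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def forward_warp(left: Image, disparity: List[List[int]]) -> Tuple[Image, List[List[int]]]:
    """Warp the left view toward the right view by shifting each pixel
    left by its (non-negative, integer) disparity.

    Returns the warped frame and a mask where 1 marks holes, i.e. target
    pixels that no source pixel landed on and that a generator would
    need to fill in."""
    h, w = len(left), len(left[0])
    warped = [[0.0] * w for _ in range(h)]
    # Largest disparity written so far per target pixel; -1 means "empty".
    best = [[-1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = disparity[y][x]
            xt = x - d
            # When two pixels collide, the one with larger disparity
            # (closer to the camera) wins.
            if 0 <= xt < w and d > best[y][xt]:
                warped[y][xt] = left[y][x]
                best[y][xt] = d
    mask = [[1 if best[y][x] < 0 else 0 for x in range(w)] for y in range(h)]
    return warped, mask


if __name__ == "__main__":
    left = [[10.0, 20.0, 30.0, 40.0]]
    disp = [[0, 0, 1, 1]]  # the right half of the row is closer to the camera
    warped, mask = forward_warp(degrade(left, factor=1), disp)
    print(warped)  # [[10.0, 30.0, 40.0, 0.0]]
    print(mask)    # [[0, 0, 0, 1]]
```

In this toy row, the foreground pixels shift left and overwrite one background pixel, while the rightmost target position receives nothing and is flagged in the mask. A model conditioned on both the warped frame and such a mask can tell which regions come from the (possibly degraded) input and which must be synthesized.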
