Restereo: Diffusion stereo video generation and restoration

Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Christian Theobalt, Cengiz Oztireli, Gurprit Singh

TL;DR

Restereo tackles stereo video generation from low-quality monocular inputs by training a single diffusion model to perform both left-right generation and restoration. The key idea is to degrade training data and condition generation on warped masks to enforce cross-view consistency, enabling simultaneous left and right view enhancement. Training occurs on synthetic Kubric data with ground-truth depth, and inference uses two branches with shared weights, achieving improved view and temporal consistency over prior training-free and training-based methods. The approach demonstrates strong qualitative and quantitative gains, and is applicable to real-world, low-resolution videos with modest data requirements. A public release of the pipeline and synthetic data is anticipated upon acceptance.

Abstract

Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.

Paper Structure

This paper contains 33 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Training and inference pipeline of our method. We fine-tune the diffusion U-Net for both the left-to-right and left-to-left generation and restoration branches. During training, we randomly sample a branch; left-to-right requires depth maps, forward warping, and the right-view target video. For left-to-left, no warping is required, and we use a zero mask as the condition and the left-view video as the target. Both branches require data augmentation/degradation during training. During inference, we run both branches as well as the decoder to generate videos for both views, without using the yellow boxes. Note that the two U-Nets share the same weights, and that the CLIP \cite{radford2021learning} features of $\bm{z}'^{(w)}$ and $\bm{z}'^{(l)}$ are also part of the conditional input to the U-Net but are omitted for simplicity. Details are discussed in \ref{sec:training} and \ref{sec:inference}.
  • Figure 2: Data augmentation with degradations is key to restoration from low-resolution input. Our right-view output with augmentation contains sharper details around the edges than the one without augmentation. Input is from \cite{pixabay2025}, degraded to $320 \times 160$ following \ref{eq:updown}.
  • Figure 3: With histogram matching between the left and right views, the color histograms of our outputs are better matched than without it. The difference is most visible in the red box region, where the output gets darker without histogram matching. Input is from \cite{pixabay2025}, degraded to $320 \times 160$ following \ref{eq:updown}.
  • Figure 4: Stereo generation comparisons between StereoDiffusion \cite{wang2024stereodiffusion}, StereoCrafter \cite{zhao2024stereocrafter}, and Ours. Our method shows sharper details across different scenes, highlighted in the zoom-in insets. Inputs are from \cite{pixabay2025}, degraded to $320 \times 160$ following \ref{eq:updown}.
  • Figure 5: Stereo generation and restoration comparisons between StereoCrafter \cite{zhao2024stereocrafter} with FMA-Net \cite{youk2024fma}, StereoCrafter with Real-ESRGAN \cite{wang2021real}, and Ours. Our method shows better temporal consistency and image quality than the others, highlighted in the zoom-in insets. The input video is generated in Kubric \cite{greff2022kubric} and degraded to $256 \times 128$ following \ref{eq:updown}.
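
The two conditioning branches described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `forward_warp` and `make_condition` are hypothetical, the disparity is assumed to be already given in pixels (the paper starts from depth maps, and that conversion is not shown here), pixels are splatted to the nearest integer column, and the mask convention (1 marks a hole to be filled, 0 marks a pixel copied from the left view) is an assumption chosen so that the left-to-left branch naturally gets a zero mask.

```python
import numpy as np


def forward_warp(left, disparity):
    """Forward-warp one left-view frame toward the right view.

    left:      (H, W, C) array, the left-view frame.
    disparity: (H, W) array of non-negative horizontal shifts in pixels
               (assumed to be precomputed from depth).
    Returns (warped, hole): the warped frame and an (H, W) mask where
    1 marks target pixels that received no source pixel.
    """
    h, w = disparity.shape
    warped = np.zeros_like(left)
    hole = np.ones((h, w), dtype=np.float32)
    zbuf = np.full((h, w), -np.inf)  # largest disparity seen per target pixel
    for y in range(h):
        for x in range(w):
            d = float(disparity[y, x])
            # A point at column x in the left view appears at x - d in the right view.
            xt = int(np.rint(x - d))
            # Larger disparity means closer to the camera, so it wins on collisions.
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                warped[y, xt] = left[y, x]
                hole[y, xt] = 0.0
    return warped, hole


def make_condition(left, disparity, branch):
    """Build the (frame, mask) condition for one of the two branches."""
    if branch == "left_to_right":
        # Warped frame plus hole mask: the model has to fill the holes.
        return forward_warp(left, disparity)
    if branch == "left_to_left":
        # No warping; a zero mask signals that nothing needs to be filled.
        return left.copy(), np.zeros(left.shape[:2], dtype=np.float32)
    raise ValueError(f"unknown branch: {branch}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((4, 8, 3))
    disp = np.full((4, 8), 2.0)
    # Mimic the random branch sampling used during training.
    branch = rng.choice(["left_to_right", "left_to_left"])
    cond, mask = make_condition(frame, disp, branch)
    print(branch, cond.shape, mask.mean())
```

The per-pixel loop is deliberately simple rather than fast. For video, the same function would be applied frame by frame, and the degradation step mentioned in the caption would be applied to the inputs before warping or conditioning.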