M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari

TL;DR

M2SVid addresses monocular-to-stereo video conversion by refining the warped right view through an end-to-end diffusion-based inpainting framework. It extends Stable Video Diffusion with conditioning on the left video, the warped right view, and disocclusion masks, and employs full attention for disoccluded tokens to leverage neighboring frames and left-view details. The model is trained end-to-end with image-space losses, enabling single-step, feed-forward inference that significantly speeds up rendering while preserving high-frequency content. Quantitative and human studies show M2SVid outperforms StereoCrafter and SVG in quality and perceptual realism, with substantial runtime advantages, making practical, scalable stereoscopic video conversion feasible on public datasets.

Abstract

We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user study, while being 6x faster.
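
To make the "depth-based reprojection" step concrete, here is a minimal sketch of forward-warping a single scanline from the left view to the right view. It is an illustration under stated assumptions, not the paper's implementation: the function name `warp_scanline`, the per-pixel disparity input (in pixels, derived from depth elsewhere), the nearest-pixel rounding, and the `hole` placeholder are all hypothetical choices made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def warp_scanline(
    left_row: Sequence[float],
    disparity_row: Sequence[float],
    hole: Optional[float] = None,
) -> Tuple[List[Optional[float]], List[bool]]:
    """Forward-warp one image row from the left view to the right view.

    Assumption: a point seen at column x in the left view appears at
    column x - d in the right view, where d >= 0 is its disparity in pixels.
    Returns the warped row and a disocclusion mask (True = no source pixel).
    """
    width = len(left_row)
    warped: List[Optional[float]] = [hole] * width
    # Largest disparity written to each target column so far. Larger
    # disparity means closer to the camera, so it wins on collisions.
    best_disp = [float("-inf")] * width

    for x, (value, d) in enumerate(zip(left_row, disparity_row)):
        xr = round(x - d)  # nearest target column in the right view
        if 0 <= xr < width and d > best_disp[xr]:
            best_disp[xr] = d
            warped[xr] = value

    # Columns that never received a pixel are disoccluded.
    mask = [b == float("-inf") for b in best_disp]
    return warped, mask


if __name__ == "__main__":
    row = [1, 2, 3, 4, 5, 6]
    disp = [0, 0, 0, 2, 2, 0]  # pixels 4 and 5 belong to a nearer object
    print(warp_scanline(row, disp))
```

In this toy row the nearer object shifts two columns to the left and covers part of the background, while the columns it vacated receive no source pixel. Those empty columns form the disocclusion mask, which is the region an inpainting model would have to fill.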

Paper Structure

This paper contains 35 sections, 5 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: We present M2SVid, an end-to-end video inpainting and refinement approach for the monocular-to-stereo video conversion task. Given an initial right view, obtained e.g. via depth-based warping, our method inpaints the missing regions and refines the artifacts introduced by warping in a feed-forward manner.
  • Figure 2: Overview of our monocular-to-stereo conversion pipeline. Given an input monocular video, we first estimate per-pixel depth, which is used to warp the input video to a right camera view. The input video, the warped video, as well as the disocclusion masks are then passed to our video inpainting and refinement module to generate the final right view.
  • Figure 3: An overview of our proposed stereoscopic video refinement method. Our model inpaints the disoccluded regions in the warped right view, and corrects possible artifacts introduced by warping errors. The model takes the VAE encodings of the input left view, reprojected right view, and the disocclusion mask as conditioning to the U-Net. The latent encodings of the refined right view are then generated in a single denoising step, and then decoded by the VAE Decoder to generate the output right video. In order to effectively utilize the information from neighboring frames for inpainting, we extend the spatial attention layer in SVD to compute full attention for the disoccluded tokens. The model is trained end-to-end by minimizing image space and latent space losses.
  • Figure 4: Qualitative comparison of our approach with state-of-the-art methods SVG [dai2024svg] and StereoCrafter [shi2024stereocrafter]. Our approach can effectively preserve the high-frequency information from the input video and generate high-quality right views.
  • Figure 5: Impact of our left-view conditioning (Sec. \ref{left-cond}).
  • ...and 7 more figures
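
The attention change described in the Figure 3 caption can be pictured as a boolean mask over video tokens. The sketch below is one plausible reading, not the paper's code: it assumes tokens are flattened frame by frame, that ordinary tokens attend only within their own frame (spatial attention), and that disoccluded query tokens may attend to every token in every frame. The function name `build_attention_mask` and the nested-list representation are hypothetical.

```python
from typing import List, Sequence


def build_attention_mask(disoccluded: Sequence[Sequence[bool]]) -> List[List[bool]]:
    """Build allowed[q][k]: may query token q attend to key token k?

    disoccluded[t][s] is True when spatial token s of frame t lies in a
    disoccluded region. Tokens are flattened as index = t * S + s.
    """
    num_frames = len(disoccluded)
    tokens_per_frame = len(disoccluded[0])
    total = num_frames * tokens_per_frame

    allowed: List[List[bool]] = []
    for q in range(total):
        q_frame, q_pos = divmod(q, tokens_per_frame)
        # Disoccluded queries get full attention over all frames.
        full_attention = disoccluded[q_frame][q_pos]
        row = [
            full_attention or (k // tokens_per_frame == q_frame)
            for k in range(total)
        ]
        allowed.append(row)
    return allowed


if __name__ == "__main__":
    # Two frames with two tokens each; only token 1 of frame 0 is disoccluded.
    for row in build_attention_mask([[False, True], [False, False]]):
        print(row)
```

Restricting full attention to disoccluded queries keeps the cost close to per-frame spatial attention when holes are small, while letting hole tokens look at neighboring frames where the missing content may be visible. A mask of this shape, with True meaning "may attend", matches the boolean `attn_mask` convention of PyTorch's `scaled_dot_product_attention`, though wiring it into SVD is outside the scope of this sketch.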