StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

Sijie Zhao; Wenbo Hu; Xiaodong Cun; Yong Zhang; Xiaoyu Li; Zhe Kong; Xiangjun Gao; Muyao Niu; Ying Shan

TL;DR

StereoCrafter is a diffusion-based framework for converting monocular videos into high-fidelity stereoscopic 3D suitable for VR/AR displays. It works in two stages: depth-based video splatting warps the input (left) video into a right view and extracts an occlusion mask, and a diffusion-based stereo video inpainting model then fills the occluded regions, with foundation models serving as priors for both depth estimation and inpainting. Auto-regressive and tiled processing let the method handle long and high-resolution videos, and a dedicated data pipeline builds a large-scale training set from stereo videos. The reported results show improved visual fidelity and temporal consistency over traditional 2D-to-3D methods and existing video inpainting baselines, pointing to practical use on devices such as Apple Vision Pro.

Abstract

This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts performance to ensure the high-fidelity generation required by display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting the occlusion mask, and stereo video inpainting. We utilize pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos of varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to construct a large-scale and high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
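The first step, depth-based splatting, can be illustrated on a single scanline. The following is a minimal sketch, not the authors' implementation: it assumes a rectified stereo convention in which a left-view pixel at column x lands at column x - d in the right view, rounds to the nearest pixel, and lets the pixel with the larger disparity (the nearer one) win when several pixels collide. The function name `splat_row` and these simplifications are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def splat_row(
    left: List[float], disparity: List[float]
) -> Tuple[List[Optional[float]], List[bool]]:
    """Forward-splat one scanline of the left view into the right view.

    Returns the warped scanline (None where nothing landed) and an
    occlusion mask that is True at the holes to be inpainted.
    Assumption: x_right = x_left - disparity, nearest-pixel rounding.
    """
    width = len(left)
    warped: List[Optional[float]] = [None] * width
    # Largest disparity seen so far at each target column (a z-buffer).
    best_disp = [float("-inf")] * width

    for x in range(width):
        tx = int(round(x - disparity[x]))
        # Depth-aware conflict resolution: the nearer pixel overwrites.
        if 0 <= tx < width and disparity[x] > best_disp[tx]:
            warped[tx] = left[x]
            best_disp[tx] = disparity[x]

    mask = [value is None for value in warped]
    return warped, mask


if __name__ == "__main__":
    # Columns 2-3 are a foreground object with disparity 2.
    warped_row, hole_mask = splat_row([10, 20, 30, 40, 50], [0, 0, 2, 2, 0])
    print(warped_row)
    print(hole_mask)
```

In this toy input the foreground pixels shift left and cover the background, leaving holes at the columns they vacated. Those holes are the kind of region the occlusion mask marks for the stereo inpainting step.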

Paper Structure (19 sections, 10 figures)

Introduction
Related Work
2D-to-3D Video Conversion.
Dynamic View Synthesis from Monocular Videos.
Video Diffusion Models.
Methodology
Overview
Depth-based Video Splatting
Stereo Video Inpainting
Dataset Construction
Experiments
Implementation Details
Datasets.
Training Details.
Comparison to 2D-to-3D Video Conversion
...and 4 more sections

Figures (10)

Figure 1: We propose a framework to convert any 2D video into an immersive stereoscopic 3D one that can be viewed on different display devices, such as 3D glasses, Apple Vision Pro, and 3D displays. It can be applied to various video sources, such as movies, vlogs, 3D cartoons, and AIGC videos. We hope this approach can revolutionize the way we experience digital media in the future.
Figure 2: Overall framework of StereoCrafter, which contains two main stages. In the first stage, the video depth is estimated from the monocular video and we obtain the warped video and its occlusion mask through depth-based video splatting with the left video and the video depth as input. Then, we train a stereo video inpainting model to fill in the hole region of the warped video according to the occlusion mask to synthesize the right video.
Figure 3: Illustration of our depth-based forward splatting. The image on the right is created by splatting the input pixels according to the disparity. We use a depth-aware method to resolve any ambiguity when multiple pixels are splatted to the same pixel in the right view.
Figure 4: The pipeline of our approach for constructing the training dataset. After curating a large number of stereo videos, we generate the video depth/disparity, warped left video, and occlusion mask for each data sample, while using the right video as the ground truth.
Figure 5: Illustration of our approach for handling videos of arbitrary length.
...and 5 more figures
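
The captions above do not spell out how videos of arbitrary length are split, so the following is only a sketch of one common way to set up auto-regressive processing: cut the video into fixed-length clips that overlap by a few frames, so that each clip can reuse frames already generated for the previous one. The function name `segment_ranges` and the parameters `clip_len` and `overlap` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def segment_ranges(
    num_frames: int, clip_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into overlapping half-open clips.

    Each clip after the first starts `overlap` frames before the end of
    the previous one. Assumed scheme for illustration only.
    """
    if clip_len <= 0 or not 0 <= overlap < clip_len:
        raise ValueError("need clip_len > 0 and 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []

    ranges: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        ranges.append((start, end))
        if end >= num_frames:
            break
        # Step back so the next clip shares `overlap` frames with this one.
        start = end - overlap
    return ranges


if __name__ == "__main__":
    for clip_start, clip_end in segment_ranges(10, 4, 1):
        print(clip_start, clip_end)
```

The same index arithmetic could in principle produce overlapping tile offsets along the width or height of a frame for tiled processing, though how the overlaps are blended is a separate design choice that this sketch does not address.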
