SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting
Jiale Zhang, Qianxi Jia, Yang Liu, Wei Zhang, Wei Wei, Xin Tian
TL;DR
This work tackles stereo video conversion from monocular footage by introducing SpatialMe, a two-stage framework that uses depth-guided warping and occlusion-aware blend-inpainting. The core innovation combines a multi-branch inpainting module (Poly-based, DL-based, and DE-based) with a mask-based hierarchical feature update refiner (MHFU) and a disparity expansion strategy to prevent foreground bleeding, achieving high-fidelity right-view synthesis. To address data scarcity, the authors release StereoV1K, a large real-world stereo video dataset with over 500k frames captured at 1180×1180, facilitating robust benchmarking. Empirical results on StereoV1K show SpatialMe outperforms state-of-the-art methods in both quantitative metrics and visual quality, underscoring its potential for VR/AR content pipelines and for advancing stereo-video research.
Abstract
Stereo video conversion aims to transform monocular videos into an immersive stereo format. Despite advances in novel view synthesis, two major challenges remain: i) the difficulty of achieving high-fidelity and stable results, and ii) the scarcity of high-quality stereo video data. In this paper, we introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. Specifically, we propose a mask-based hierarchy feature update (MHFU) refiner, which integrates and refines the outputs of the designed multi-branch inpainting module using a feature update unit (FUU) and a mask mechanism. We also propose a disparity expansion strategy to address the problem of foreground bleeding. Furthermore, we construct a high-quality real-world stereo video dataset, StereoV1K, to alleviate the data shortage. It contains 1000 stereo videos captured in the real world at a resolution of 1180×1180, covering various indoor and outdoor scenes. Extensive experiments demonstrate the superiority of our approach in generating stereo videos over state-of-the-art methods.
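
To make the depth-warping step concrete, the following is a minimal NumPy sketch of forward-warping a left view to a right view with a per-pixel disparity map. It is an illustration under stated assumptions, not the SpatialMe implementation: the function names `warp_to_right_view` and `dilate_disparity` are hypothetical, the disparity is assumed to be given in pixels, and the horizontal max-filter is only one simple way to expand disparity around foreground edges, which may differ from the disparity expansion strategy used in the paper. The returned hole mask marks the disoccluded pixels that an inpainting stage would have to fill.

```python
import numpy as np


def dilate_disparity(disparity, radius=1):
    """Horizontal max-filter over a (H, W) disparity map.

    Assumption: widening the foreground disparity by a few pixels is used
    here as a stand-in for a disparity expansion step; it is not taken
    from the paper.
    """
    h, w = disparity.shape
    padded = np.pad(disparity, ((0, 0), (radius, radius)), mode="edge")
    # One shifted copy of the map per offset in [-radius, +radius].
    windows = [padded[:, i:i + w] for i in range(2 * radius + 1)]
    return np.max(windows, axis=0)


def warp_to_right_view(left, disparity):
    """Forward-warp `left` (H, W) or (H, W, C) using `disparity` (H, W).

    Each left pixel at column x lands at column x - d in the right view.
    When several pixels land on the same target, the one with the larger
    disparity (the nearer surface) wins. Returns the warped image and a
    boolean mask that is True where no source pixel landed (holes).
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    zbuf = np.full((h, w), -np.inf)  # best disparity seen per target pixel
    for y in range(h):
        for x in range(w):
            d = float(disparity[y, x])
            xt = int(round(x - d))
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                right[y, xt] = left[y, x]
    holes = np.isinf(zbuf)
    return right, holes
```

On a single-row image with a two-pixel foreground object, the warp moves the foreground left and leaves a two-pixel hole where the background was uncovered. The explicit double loop keeps the occlusion ordering easy to follow; a practical pipeline would vectorize it or run it on the GPU.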
