SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting

Jiale Zhang, Qianxi Jia, Yang Liu, Wei Zhang, Wei Wei, Xin Tian

TL;DR

This work tackles stereo video conversion from monocular footage by introducing SpatialMe, a two-stage framework that uses depth-guided warping and occlusion-aware blend-inpainting. The core innovation combines a multi-branch inpainting module (Poly-based, DL-based, and DE-based) with a mask-based hierarchical feature update refiner (MHFU) and a disparity expansion strategy to prevent foreground bleeding, achieving high-fidelity right-view synthesis. To address data scarcity, the authors release StereoV1K, a large real-world stereo video dataset with over 500k frames captured at 1180×1180, facilitating robust benchmarking. Empirical results on StereoV1K show SpatialMe outperforms state-of-the-art methods in both quantitative metrics and visual quality, underscoring its potential for VR/AR content pipelines and for advancing stereo-video research.

Abstract

Stereo video conversion aims to transform monocular videos into an immersive stereo format. Despite advances in novel view synthesis, two major challenges remain: i) the difficulty of achieving high-fidelity and stable results, and ii) the scarcity of high-quality stereo video data. In this paper, we introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. Specifically, we propose a mask-based hierarchy feature update (MHFU) refiner, which integrates and refines the outputs of the proposed multi-branch inpainting module using a feature update unit (FUU) and a mask mechanism. We also propose a disparity expansion strategy to address the problem of foreground bleeding. Furthermore, we construct StereoV1K, a high-quality real-world stereo video dataset, to alleviate the data shortage. It contains 1000 stereo videos captured in the real world at a resolution of 1180×1180, covering various indoor and outdoor scenes. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods in generating stereo videos.

Paper Structure

This paper contains 16 sections, 11 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our method converts any 2D video to stereo format via depth-warping and blend-inpainting, enabling viewing with 3D glasses (anaglyph) or on AR/VR devices like Apple Vision Pro.
  • Figure 2: Overview of our framework, which consists of two stages. In the first stage, a depth estimation model predicts the depth of the left-view frames, which guides the warping to right-view frames and the generation of occlusion masks. In the second stage, a multi-branch inpainting module, comprising traditional (Poly-based), deep learning (DL-based), and disparity expansion (DE-based) branches, fills the occluded regions indicated by the occlusion masks. The inpainted results are then fused and refined by a mask-based hierarchical feature update refiner to produce the final right-view frames.
  • Figure 3: Illustration of the disparity expansion strategy. Using this strategy yields more realistic inpainting because the occluded regions are filled from the background context.
  • Figure 4: Illustration of the side-by-side videos in StereoV1K.
  • Figure 5: Qualitative comparison with several state-of-the-art methods. Our approach obtains significantly better results.
  • ...and 2 more figures
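
The first stage described in Figure 2 (depth-guided warping plus occlusion-mask generation) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes integer per-pixel disparities already derived from depth, a grayscale image stored as nested lists, and hypothetical function names. The `expand_disparity` helper shows one common way to keep foreground pixels off the border of a hole (a horizontal max filter on the disparity map); whether the paper's disparity expansion strategy works this way is an assumption.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def warp_left_to_right(
    left: Image, disparity: List[List[int]]
) -> Tuple[Image, List[List[bool]]]:
    """Forward-warp a left view into the right view.

    Assumes integer, non-negative disparities in pixels, where a larger
    disparity means the pixel is closer to the camera.
    Returns the warped image and an occlusion mask (True = hole to inpaint).
    """
    height, width = len(left), len(left[0])
    right = [[0] * width for _ in range(height)]
    # Z-buffer: disparity of the pixel currently written at each target.
    best = [[-1] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            d = disparity[y][x]
            target_x = x - d  # content shifts left in the right view
            # Nearer pixels (larger disparity) win when two sources collide.
            if 0 <= target_x < width and d > best[y][target_x]:
                best[y][target_x] = d
                right[y][target_x] = left[y][x]

    # Targets that no source pixel reached are the occluded regions.
    occlusion_mask = [[b < 0 for b in row] for row in best]
    return right, occlusion_mask


def expand_disparity(disparity: List[List[int]], radius: int) -> List[List[int]]:
    """Hypothetical helper: horizontal max filter over the disparity map.

    Background pixels next to a foreground edge inherit the foreground
    disparity, so the hole ends up bordered by background content only.
    """
    width = len(disparity[0])
    return [
        [max(row[max(0, x - radius): min(width, x + radius + 1)])
         for x in range(width)]
        for row in disparity
    ]


# One scanline: pixels 30 and 40 are a foreground object with disparity 2.
left_row = [[10, 20, 30, 40, 50, 60]]
disp_row = [[0, 0, 2, 2, 0, 0]]

plain, plain_mask = warp_left_to_right(left_row, disp_row)
# plain      == [[30, 40, 0, 0, 50, 60]]
# plain_mask == [[False, False, True, True, False, False]]
# The hole touches foreground pixel 40 on its left side.

expanded, expanded_mask = warp_left_to_right(
    left_row, expand_disparity(disp_row, radius=1)
)
# expanded      == [[30, 40, 50, 0, 0, 60]]
# expanded_mask == [[False, False, False, True, True, False]]
# The hole is now bordered by background pixels 50 and 60 only.
```

In the plain warp, an inpainting step that copies from both sides of the hole would pull in the foreground value 40, which is the foreground bleeding the abstract mentions. After expansion, both neighbours of the hole are background. The price is a slight displacement of the background strip next to the object.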