StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen

TL;DR

StereoPilot addresses depth ambiguity and inefficiency in monocular-to-stereo conversion by learning end-to-end view synthesis that leverages pretrained generative priors. It introduces UniStereo, a large-scale dataset that unifies parallel and converged stereo formats, and a diffusion-based feed-forward model with a learnable domain switcher and cycle-consistency loss to handle both formats. Across Stereo4D and 3DMovie benchmarks, StereoPilot achieves state-of-the-art fidelity and significantly faster inference by avoiding iterative diffusion sampling. Acknowledging current non-real-time latency, the work points to autoregressive extensions for real-time applications as future work.

Abstract

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic monocular-to-stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.

Paper Structure

This paper contains 35 sections, 16 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Depth Ambiguity Issue. As the legend in the upper-left corner shows, a specular reflection produces two depths at the mirror position: the depth of the mirror surface $d_S$ and the depth of the object's reflection $d_R$. In the physical world, these two points are warped separately, each according to its own depth. Depth estimation algorithms, however, cannot predict multiple depths at the same position, so the inverse relationship between depth and disparity breaks down. As a result, Depth-Warp-Inpaint (DWI) methods predict results with incorrect disparity.
  • Figure 2: Parallel vs. Converged. In the parallel setup, when both eyes observe the same subject, the projected image points on the left and right views are denoted as $\mathbf{X_L}$ and $\mathbf{X_R}$, and their absolute difference $|\mathbf{X_L} - \mathbf{X_R}|$ defines the disparity $\boldsymbol{s}$. According to geometric relationships derived from similar triangles, $\boldsymbol{b}$, $\boldsymbol{f}$, $\boldsymbol{d}$, and $\boldsymbol{s}$ satisfy an inverse proportionality between disparity and depth when the baseline $\boldsymbol{b}$ and focal length $\boldsymbol{f}$ remain constant. In the converged configuration, a Zero-disparity Projection Plane is present: objects in front of this plane yield positive disparity, while those behind it produce negative disparity.
  • Figure 3: The inherent stochasticity of generative models can cause them to fabricate objects not present in the source view. As this figure illustrates, the right view generated by ReCamMaster erroneously introduces new content, e.g., a car and a man (highlighted by red bounding boxes), that does not exist in the original input.
  • Figure 4: UniStereo processing pipeline. We use green icons with numbered steps to depict the Stereo4D pipeline: starting from the raw VR180 videos, we set hfov = 90° and specify the projection resolution to produce the final left- and right-eye monocular videos. Simultaneously, blue icons with numbered steps denote the 3DMovie pipeline: we segment the source films into clips, filter out non-informative segments, convert from side-by-side (SBS) to left/right monocular views, and remove black borders. All resulting videos are captioned using ShareGPT4Video (Chen et al., 2024).
  • Figure 5: The training framework of the proposed StereoPilot. StereoPilot uses a single-step feed-forward architecture (Diffusion as Feed-Forward) that incorporates a learnable domain switcher $s$ to unify conversion for both parallel and converged stereo formats. The entire model is optimized using a cycle-consistent training strategy, combining reconstruction and cycle-consistency losses to ensure high fidelity and precise geometric alignment. The blue and orange lines represent the Left-to-Right and Right-to-Left reconstruction processes, respectively, and the orange dashed line denotes the $L \rightarrow R \rightarrow L$ cycle-consistency path.
  • ...and 8 more figures
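
The geometry described in the captions of Figures 1 and 2 can be illustrated numerically. The sketch below is not taken from the paper: it assumes the standard pinhole relation for a parallel rig, $s = f b / d$, and it models the converged case as a simple horizontal image shift that zeroes disparity at a chosen plane depth, which is a simplification of a true toe-in rig. All function names and numeric values are hypothetical and chosen only for illustration.

```python
def parallel_disparity(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels for a parallel rig, assuming s = f * b / d."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def converged_disparity(focal_px: float, baseline_m: float,
                        depth_m: float, zero_plane_m: float) -> float:
    """Signed disparity relative to a zero-disparity plane.

    Assumption: convergence is modeled as a constant image shift, so the
    result is the parallel disparity minus the disparity at the plane.
    Points in front of the plane come out positive, points behind negative.
    """
    at_point = parallel_disparity(focal_px, baseline_m, depth_m)
    at_plane = parallel_disparity(focal_px, baseline_m, zero_plane_m)
    return at_point - at_plane


# Hypothetical camera and scene values (not from the paper).
FOCAL_PX = 1000.0        # focal length in pixels
BASELINE_M = 0.0625      # distance between the two views, in metres
D_SURFACE_M = 2.0        # d_S: depth of the mirror surface
D_REFLECTION_M = 4.0     # d_R: apparent depth of the reflected object

# The same pixel carries two physically valid disparities.
s_surface = parallel_disparity(FOCAL_PX, BASELINE_M, D_SURFACE_M)
s_reflection = parallel_disparity(FOCAL_PX, BASELINE_M, D_REFLECTION_M)

# A single-depth estimator that reports d_S would warp the reflection
# with s_surface, which is off by this many pixels.
mismatch_px = abs(s_surface - s_reflection)

if __name__ == "__main__":
    print(f"surface disparity:    {s_surface:.3f} px")
    print(f"reflection disparity: {s_reflection:.3f} px")
    print(f"warp error:           {mismatch_px:.3f} px")
```

Under these assumed values, doubling the depth halves the disparity, so a depth map that stores only one value per pixel necessarily misplaces either the mirror surface or the reflected content. The same functions show the sign change across the zero-disparity plane in the converged model.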