Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu

TL;DR

This work introduces Mono2Stereo, a large-scale stereo conversion benchmark with 2.42 million left-right pairs spanning diverse scenes, and a Stereo Intersection-over-Union (SIoU) metric that correlates strongly with human judgments of stereo quality. It shows that traditional pixel-based metrics fail to capture stereo-relevant disparities and edges, and that one-stage diffusion methods yield high image fidelity but a weak stereo effect, while two-stage methods strengthen the stereo effect at the cost of image quality. To address this trade-off, the authors propose a dual-condition latent diffusion baseline trained with an Edge Consistency (EC) loss, achieving both high-quality generation and convincing stereo effects. The dataset, the SIoU metric, and the dual-condition approach with EC loss are to be released openly to accelerate progress in stereo conversion and related 3D content creation.

Abstract

With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider all pixels equally, failing to concentrate on the regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or a warp-and-inpaint pipeline, facing the challenges of a degraded stereo effect and image distortion, respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model that harmonizes the stereo effect and image quality simultaneously and notably surpasses current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at mono2stereo-bench.github.io.
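As a rough illustration of the first finding, the toy sketch below renders a bright square on a dark background, shifts it horizontally by a small disparity to imitate the right view, and counts how many pixels actually change. This is not the authors' evaluation code: the 64x64 scene, the 16-pixel square, the 2-pixel shift, and the helper names `make_view` and `changed_fraction` are all assumptions chosen for illustration.

```python
def make_view(width, height, x0, size, shift=0):
    """Render a grayscale view: a bright square on a dark background.

    `shift` moves the square to the left, mimicking horizontal disparity.
    All parameters are hypothetical values for this toy scene.
    """
    view = [[0] * width for _ in range(height)]
    y0 = (height - size) // 2
    for y in range(y0, y0 + size):
        for x in range(x0 - shift, x0 - shift + size):
            if 0 <= x < width:
                view[y][x] = 255
    return view


def changed_fraction(left, right):
    """Return the fraction of pixels whose value differs between two views."""
    total = sum(len(row) for row in left)
    changed = sum(
        1
        for row_l, row_r in zip(left, right)
        for a, b in zip(row_l, row_r)
        if a != b
    )
    return changed / total


left = make_view(64, 64, x0=24, size=16)
right = make_view(64, 64, x0=24, size=16, shift=2)
print(changed_fraction(left, right))
```

In this toy scene only 64 of the 4096 pixels change (about 1.6%), all of them in the columns at the square's left and right edges. An error averaged over every pixel is therefore dominated by regions that carry no stereo information, which is the motivation the abstract gives for a metric that prioritizes disparity.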

Paper Structure

This paper contains 32 sections, 3 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Examples from the Mono2Stereo dataset, visualized in anaglyph (red-blue) stereo format. We categorize the scenes into five groups based on disparity range, geometric complexity, and color distribution: "Indoor" and "Outdoor" with distinct disparity ranges; "Simple" and "Complex" with varying geometric complexity; and "Animation" representing a different color distribution compared to natural images.
  • Figure 2: Difference visualization. Brighter (red) regions indicate larger differences between the left-view and right-view images, while darker regions represent smaller differences. Due to the small camera baseline, differences are primarily concentrated along object boundaries.
  • Figure 3: Training pipeline of the dual-condition model. The images are fed into the VAE encoder to obtain the corresponding latent representations. The geometric and viewpoint conditions are concatenated as the input to the UNet, and only the UNet is optimized during training. To overcome degradation issues, additional constraints are applied to the edges of the velocity representations.
  • Figure 4: Visualization of the optimization target. (Middle) Heatmap generated from "Velocity" using an edge detection operator, overlaid on the input image. (Right) The "Velocity" exhibits a strong spatial correlation with the image content.
  • Figure 5: Visual comparison of different methods using anaglyph (red-blue) stereo. StereoDiffusion and OWL3D exhibit artifacts such as unreasonable offsets, while Dual Condition remains more faithful to the ground truth. The yellow boxes highlight the main differences.
  • ...and 8 more figures
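
Several of the captions above refer to the anaglyph (red-blue) stereo format. The source does not say how those figures were rendered, so the following is only a minimal sketch of one common convention (red channel from the left view, green and blue channels from the right view). The function name `make_anaglyph` and the nested-list image representation are assumptions for illustration; a real pipeline would more likely operate on image arrays.

```python
def make_anaglyph(left, right):
    """Combine two RGB views into a single anaglyph image.

    Each view is a list of rows, and each row is a list of (r, g, b) tuples.
    Assumed convention: red comes from the left view, green and blue
    come from the right view.
    """
    same_height = len(left) == len(right)
    same_widths = all(len(a) == len(b) for a, b in zip(left, right))
    if not (same_height and same_widths):
        raise ValueError("left and right views must have the same size")
    return [
        [(l_px[0], r_px[1], r_px[2]) for l_px, r_px in zip(row_l, row_r)]
        for row_l, row_r in zip(left, right)
    ]


# A 1x2 toy image pair: each output pixel mixes channels from both views.
left = [[(200, 10, 10), (0, 0, 0)]]
right = [[(0, 0, 0), (10, 180, 220)]]
anaglyph = make_anaglyph(left, right)
print(anaglyph)
```

Where the two views agree, the anaglyph looks like the original image. Where they differ, which the Figure 2 caption describes as mainly along object boundaries, colored fringes appear, and these are what make misplaced offsets visible in comparisons such as Figure 5.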