Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu
TL;DR
This work introduces Mono2Stereo, a large-scale stereo conversion benchmark with 2.42 million left-right pairs spanning diverse scenes, and a Stereo Intersection-over-Union (SIoU) metric that correlates strongly with human judgments of stereo quality. It shows that traditional pixel-based metrics fail to capture the disparities and edges that matter for the stereo effect, and that one-stage diffusion methods yield high image fidelity but a weak stereo effect, while two-stage warp-and-inpaint methods strengthen the stereo effect at the cost of image quality. To address this trade-off, the authors propose a dual-condition latent diffusion baseline trained with an Edge Consistency (EC) loss, achieving both high-quality generation and a convincing stereo effect. The dataset, the SIoU metric, and the dual-condition approach with EC loss are released openly to accelerate progress in stereo conversion and related 3D content creation.
Abstract
With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider all pixels, failing to concentrate on the regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or a warp-and-inpaint pipeline, facing the challenges of a degraded stereo effect and image distortion, respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union (SIoU), which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model that harmonizes stereo effect and image quality and notably surpasses current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at mono2stereo-bench.github.io.
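
For readers unfamiliar with the warp-and-inpaint pipeline mentioned in the second finding, the warping step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `forward_warp`, the assumption that a non-negative per-pixel disparity map is already available (for example, derived from a monocular depth estimate), and the nearest-pixel rounding are all illustrative choices. Each left-view pixel is shifted left by its disparity, nearer pixels (larger disparity) win when several land on the same target, and unfilled target pixels are reported as holes that a separate inpainting model would have to fill.

```python
import numpy as np

def forward_warp(left, disparity):
    """Warp a left view to a right view (hypothetical helper, illustrative only).

    left:      (H, W) or (H, W, C) array, the left view.
    disparity: (H, W) array of non-negative pixel disparities (assumed given).
    Returns (right, holes): the warped view and a boolean mask of pixels
    that received no source pixel and would need inpainting.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    # Largest disparity written so far at each target pixel; -1 means empty.
    zbuf = np.full((h, w), -1.0)
    for y in range(h):
        for x in range(w):
            d = float(disparity[y, x])
            xt = int(round(x - d))  # rectified stereo: x_right = x_left - d
            # Keep the nearest surface (largest disparity) on collisions.
            if 0 <= xt < w and d > zbuf[y, xt]:
                right[y, xt] = left[y, x]
                zbuf[y, xt] = d
    holes = zbuf < 0
    return right, holes

# Tiny example: one row, the two rightmost pixels are nearer (disparity 1).
left = np.array([[10, 20, 30, 40]])
disparity = np.array([[0, 0, 1, 1]])
right, holes = forward_warp(left, disparity)
```

In this toy row the foreground pixels slide one column to the left and occlude part of the background, leaving a hole at the right edge. Such holes and occlusion boundaries are where warp-and-inpaint methods must synthesize content, which is consistent with the image distortion the abstract attributes to this family of methods.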
