Unsupervised Stereo via Multi-Baseline Geometry-Consistent Self-Training
Peng Xu, Zhiyu Xiang, Tingming Bai, Tianyu Pu, Kai Wang, Chaojie Ji, Zhihao Yang, Eryun Liu
TL;DR
This work tackles occlusion in unsupervised stereo with S$^3$, an asymmetric teacher-student framework in which the teacher and student observe different target views, so that multi-baseline geometry consistency can supply reliable supervision in regions that are occluded for the student. The method combines a geometry-consistent loss with an occlusion-aware weighting strategy, adds photometric and smoothness regularizers, and uses an EMA (momentum) teacher trained from scratch. A synthetic multi-baseline dataset, MBS20K, is built with the CARLA simulator to support training, and novel view extrapolation enables KITTI-style finetuning. Experiments on KITTI 2012/2015 and zero-shot evaluations show state-of-the-art unsupervised performance and robust cross-domain, cross-weather generalization, pointing to practical stereo systems that can be trained without ground-truth disparities.
Abstract
Photometric loss and pseudo-label-based self-training are two widely used methods for training stereo networks on unlabeled data. However, they both struggle to provide accurate supervision in occluded regions. The former lacks valid correspondences, while the latter's pseudo labels are often unreliable. To overcome these limitations, we present S$^3$, a simple yet effective framework based on multi-baseline geometry consistency. Unlike conventional self-training where teacher and student share identical stereo pairs, S$^3$ assigns them different target images, introducing natural visibility asymmetry. Regions occluded in the student's view often remain visible and matchable to the teacher, enabling reliable pseudo labels even in regions where photometric supervision fails. The teacher's disparities are rescaled to align with the student's baseline and used to guide student learning. An occlusion-aware weighting strategy is further proposed to mitigate unreliable supervision in teacher-occluded regions and to encourage the student to learn robust occlusion completion. To support training, we construct MBS20K, a multi-baseline stereo dataset synthesized using the CARLA simulator. Extensive experiments demonstrate that S$^3$ provides effective supervision in both occluded and non-occluded regions, achieves strong generalization performance, and surpasses previous state-of-the-art methods on the KITTI 2015 and 2012 benchmarks.
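The rescaling and weighting steps described above can be illustrated with a minimal NumPy sketch. It assumes rectified pairs that share the same reference image and focal length, so that disparity is proportional to baseline ($d = fB/Z$). The function names, the weight value of 0.1, and the momentum of 0.99 are hypothetical placeholders for illustration, not the paper's actual formulation or settings.

```python
import numpy as np


def rescale_disparity(teacher_disp, teacher_baseline, student_baseline):
    """Map teacher disparities onto the student's baseline.

    Assumes both pairs share the reference view and focal length, so
    d = f * B / Z and disparity scales linearly with the baseline B.
    """
    return teacher_disp * (student_baseline / teacher_baseline)


def weighted_l1(student_disp, pseudo_label, weight):
    """Weighted mean absolute error between student output and pseudo labels."""
    total = np.sum(weight * np.abs(student_disp - pseudo_label))
    return float(total / np.maximum(np.sum(weight), 1e-8))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Exponential moving average of student parameters (momentum teacher)."""
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }


# Toy 2x2 example with made-up values.
teacher_disp = np.array([[20.0, 40.0], [10.0, 30.0]])
pseudo_label = rescale_disparity(teacher_disp, teacher_baseline=1.0, student_baseline=0.5)

# Hypothetical weighting: down-weight pixels that are occluded for the teacher.
# The paper's occlusion-aware weighting may differ from this placeholder.
teacher_occluded = np.array([[False, True], [False, False]])
weight = np.where(teacher_occluded, 0.1, 1.0)

student_disp = np.array([[11.0, 25.0], [5.0, 15.0]])
loss = weighted_l1(student_disp, pseudo_label, weight)

# One EMA step on a single hypothetical scalar parameter.
new_teacher = ema_update({"w": np.array(1.0)}, {"w": np.array(0.0)}, momentum=0.99)
```

In this sketch the pseudo label for each pixel is the teacher's disparity multiplied by the baseline ratio, and pixels flagged as teacher-occluded contribute less to the loss. How the occlusion mask is estimated, and how the weights encourage occlusion completion, is left open here.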
