
Unsupervised Stereo via Multi-Baseline Geometry-Consistent Self-Training

Peng Xu, Zhiyu Xiang, Tingming Bai, Tianyu Pu, Kai Wang, Chaojie Ji, Zhihao Yang, Eryun Liu

TL;DR

This work tackles the occlusion challenge in unsupervised stereo by introducing S$^3$, an asymmetric teacher-student framework in which the teacher and student observe different target views, enabling reliable supervision in occluded regions through multi-baseline geometry consistency. The method combines a geometry-consistent loss with an occlusion-aware weighting strategy, alongside photometric and smoothness regularizers, and uses a teacher that is trained from scratch and updated as an exponential moving average (EMA) of the student. A synthetic multi-baseline dataset, MBS20K, is built with the CARLA simulator to support training, and novel view extrapolation enables KITTI-style finetuning. Experiments on KITTI 2012/2015 and in zero-shot settings show state-of-the-art unsupervised performance and robust cross-domain and cross-weather generalization, highlighting the practical potential for real-world stereo systems trained without ground-truth disparities.

Abstract

Photometric loss and pseudo-label-based self-training are two widely used methods for training stereo networks on unlabeled data. However, they both struggle to provide accurate supervision in occluded regions. The former lacks valid correspondences, while the latter's pseudo labels are often unreliable. To overcome these limitations, we present S$^3$, a simple yet effective framework based on multi-baseline geometry consistency. Unlike conventional self-training where teacher and student share identical stereo pairs, S$^3$ assigns them different target images, introducing natural visibility asymmetry. Regions occluded in the student's view often remain visible and matchable to the teacher, enabling reliable pseudo labels even in regions where photometric supervision fails. The teacher's disparities are rescaled to align with the student's baseline and used to guide student learning. An occlusion-aware weighting strategy is further proposed to mitigate unreliable supervision in teacher-occluded regions and to encourage the student to learn robust occlusion completion. To support training, we construct MBS20K, a multi-baseline stereo dataset synthesized using the CARLA simulator. Extensive experiments demonstrate that S$^3$ provides effective supervision in both occluded and non-occluded regions, achieves strong generalization performance, and surpasses previous state-of-the-art methods on the KITTI 2015 and 2012 benchmarks.
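
The rescaling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes rectified pinhole cameras that share a focal length and a reference view, so that disparity is d = f * B / Z and scales linearly with the baseline B. The function names, the weighted L1 form of the loss, and all numeric values below are hypothetical choices for illustration; the paper's actual geometry-consistent loss and occlusion-aware weights may differ.

```python
import numpy as np


def rescale_teacher_disparity(teacher_disp, teacher_baseline, student_baseline):
    """Rescale a disparity map from the teacher's baseline to the student's.

    Assumption: rectified cameras with a shared focal length f, so that
    d = f * B / Z and, for the same reference pixel, d_s = d_t * (B_s / B_t).
    """
    if teacher_baseline <= 0 or student_baseline <= 0:
        raise ValueError("baselines must be positive")
    return np.asarray(teacher_disp, dtype=np.float64) * (
        student_baseline / teacher_baseline
    )


def weighted_l1_loss(student_disp, pseudo_label, weight):
    """Weighted mean absolute error between student output and pseudo labels.

    `weight` is a per-pixel map (hypothetical form); zero removes a pixel
    from supervision, larger values emphasize it.
    """
    weight = np.asarray(weight, dtype=np.float64)
    total = weight.sum()
    if total == 0:
        return 0.0
    residual = np.abs(np.asarray(student_disp, dtype=np.float64) - pseudo_label)
    return float((weight * residual).sum() / total)


# Illustrative values only: the teacher sees a baseline twice the student's.
teacher_disp = np.array([[20.0, 40.0], [10.0, 80.0]])
pseudo = rescale_teacher_disparity(
    teacher_disp, teacher_baseline=1.0, student_baseline=0.5
)
student_disp = np.array([[10.0, 22.0], [5.0, 40.0]])
weight = np.array([[1.0, 1.0], [1.0, 0.0]])  # last pixel excluded
loss = weighted_l1_loss(student_disp, pseudo, weight)
```

Because the two stereo pairs share a reference view, the rescaled teacher map is already pixel-aligned with the student's prediction, so no warping is needed before comparing them in this sketch.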

Paper Structure

This paper contains 23 sections, 8 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Top. Illustration of learning behaviors in occluded regions: (a) Photometric loss takes a shortcut by copying disparities from nearby visible areas; (b) Occlusion-masked photometric loss (masked-PL) fails to supervise occlusion completion, leaving these regions poorly estimated; (c) Our S$^3$ accurately predicts disparities even in heavily occluded areas. Bottom. S$^3$ achieves outstanding unsupervised performance on the (d) KITTI 2015 [KITTI2015] and (e) KITTI 2012 [KITTI2012] benchmarks, as well as zero-shot generalization in (f).
  • Figure 2: Overview of S$^3$. Stereo pairs sharing the same reference view but different target views are fed into the teacher and student, respectively. This asymmetric configuration allows the teacher to observe regions that are occluded in the student’s view. The teacher’s disparities are rescaled to align with the student’s baseline and used to supervise the student through geometry-consistent loss with occlusion-aware weighting. Data augmentation is applied to the student to encourage robust feature learning. The teacher is updated via exponential moving average (EMA) of the student’s weights with gradients stopped.
  • Figure 3: Visualization of the occlusion-aware weight map. Red areas indicate regions occluded in the teacher’s target view, while green areas correspond to regions occluded in the student’s target view but visible in the teacher’s view. The remaining regions are visible in both views.
  • Figure 4: Quantitative comparison on the KITTI training sets [KITTI2012, KITTI2015]. The D1 error over all valid pixels is reported. ZOLE [ZOLE] and CST-Stereo [cst-stereo] do not provide results on KITTI 2012.
  • Figure 5: Qualitative comparison of the occluded areas between IGEV-Stereo [igev], Selective-IGEV [wang2024selective], and our S$^3$-IGEV. Please zoom in for more details.
  • ...and 11 more figures
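
The EMA teacher update described in the Figure 2 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: plain NumPy arrays stand in for network weights, the function name `ema_update` is hypothetical, and the momentum values are assumed, since the source does not state one.

```python
import numpy as np


def ema_update(teacher_params, student_params, momentum=0.999):
    """Update teacher weights in place as an EMA of the student weights.

    teacher <- momentum * teacher + (1 - momentum) * student

    The default momentum is an assumed value. No gradient flows through
    this update; the student is only read, never modified.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    for name, student_value in student_params.items():
        teacher_value = teacher_params[name]
        teacher_value *= momentum
        teacher_value += (1.0 - momentum) * student_value
    return teacher_params


# Toy parameters standing in for real network weights.
student = {"w": np.array([1.0, 2.0])}
teacher = {"w": np.array([0.0, 0.0])}
ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, which is the usual reason for preferring an averaged teacher over a direct copy of the student when generating pseudo labels.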