Consistency-aware Self-Training for Iterative-based Stereo Matching

Jingyi Zhou, Peng Ye, Haoyu Zhang, Jiakang Yuan, Rao Qiang, Liu YangChenXu, Wu Cailin, Feng Xu, Tao Chen

TL;DR

The paper tackles the challenge of leveraging unlabeled real-world data for iterative-based stereo matching, where reliance on labeled data and the use of cost volumes limit generalization. It introduces CST-Stereo, a consistency-aware self-training framework that uses a teacher–student setup with EMA updates, a soft filtering module (MRPCF and IPCF) to gauge pseudo-label reliability, and a soft-weighted loss that fuses multi-resolution and iterative-consistency signals. The method yields significant gains across in-domain, domain adaptation, and domain generalization benchmarks, achieving state-of-the-art or competitive results on Middlebury, KITTI2015, ETH3D, and related datasets. This approach enhances robustness to unlabeled real-world data and improves generalization in diverse scenarios, with potential for integration alongside other domain adaptation techniques.

Abstract

Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods rely heavily on labeled data and face challenges with unlabeled real-world data. To this end, we propose, for the first time, a consistency-aware self-training framework for iterative-based stereo matching, which leverages real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction. Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels. It consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter, which assess the prediction fluctuations across multiple resolutions and across iterative optimization, respectively. Further, we introduce a consistency-aware soft-weighted loss that adjusts the weight of pseudo-labels accordingly, mitigating the error accumulation and performance degradation caused by incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method achieves further improvements over the current SOTA methods on several benchmark datasets.
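
The teacher-student mechanics described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `ema_update` and `soft_weighted_l1`, the decay value, the L1 error, and the normalization by the sum of weights are all assumptions chosen for illustration, and parameters and disparities are plain floats rather than tensors.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an exponential moving average of the student.

    `teacher` and `student` map parameter names to floats (hypothetical layout).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def soft_weighted_l1(student_pred, pseudo_label, weights):
    """Weighted mean absolute error between student output and pseudo-labels.

    Pixels with a low reliability weight contribute less to the loss.
    Normalizing by the total weight is an assumption of this sketch.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_error = sum(
        w * abs(s - t) for s, t, w in zip(student_pred, pseudo_label, weights)
    )
    return weighted_error / total_weight


# Toy usage: one scalar parameter and two pixels.
teacher = {"w": 1.0}
student = {"w": 2.0}
teacher = ema_update(teacher, student, decay=0.9)

# The second pixel has weight 0, so its large error is ignored entirely.
loss = soft_weighted_l1([10.0, 20.0], [12.0, 30.0], [1.0, 0.0])
```

In this toy run the unreliable second pixel is fully suppressed, which is the behavior a soft weighting is meant to approximate more gradually than a hard threshold would.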

Paper Structure

This paper contains 33 sections, 7 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of our Consistency-aware Self-Training (CST) method, which can leverage unlabeled data in multiple scenarios to boost the iterative-based stereo-matching model.
  • Figure 2: Overview of our proposed CST-Stereo. Our framework leverages unlabeled stereo data in a teacher-student manner. In detail, the student receives strongly augmented images and learns from the teacher's predictions, with model parameters updated in a delayed manner with EMA for cyclic enhancement. Further, the consistency-aware soft filtering module is applied to evaluate the reliability of teacher-predicted pseudo-labels, which includes a multi-resolution prediction consistency filter and an iterative prediction consistency filter. Finally, the consistency-aware soft-weighted loss is calculated for optimization.
  • Figure 3: Correlations between error regions and multi-resolution prediction consistency. (a) Referenced image. (b) Error map. (c) Multi-resolution prediction consistency map (Darker areas denote lower consistency).
  • Figure 4: Correlations between error regions and iterative prediction consistency. (a) Referenced image. (b) Error map. (c) Iterative prediction consistency map (Darker areas denote lower consistency).
  • Figure 5: Visualization of our method boosting the in-domain performance on the Middlebury datasets. (a) Referenced images. (b) Selective-IGEV wang2024selective. (c) CST-Selective-IGEV. (d) Ground truth.
  • ...and 4 more figures
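
The consistency maps in Figures 3 and 4 relate prediction fluctuation to error. One hedged way to picture such a map is sketched below; the paper's actual filter definitions are not given in this section, so the use of the per-pixel standard deviation, the exponential mapping, and the names `consistency_weights` and `temperature` are assumptions made only for illustration. The same function could be applied to predictions from several iterations or to predictions from several resolutions resampled to a common size.

```python
import math
from statistics import pstdev


def consistency_weights(predictions, temperature=1.0):
    """Map per-pixel fluctuation across K predictions to a weight in (0, 1].

    `predictions` is a list of K disparity maps, each a flat list of per-pixel
    values of equal length. A weight of 1.0 means all K predictions agree.
    """
    num_pixels = len(predictions[0])
    weights = []
    for i in range(num_pixels):
        values = [pred[i] for pred in predictions]
        spread = pstdev(values)  # population standard deviation across predictions
        weights.append(math.exp(-spread / temperature))
    return weights


# Toy usage: pixel 0 is stable across three iterations, pixel 1 oscillates.
iter_preds = [[5.0, 8.0], [5.0, 12.0], [5.0, 10.0]]
weights = consistency_weights(iter_preds)
```

Here the stable pixel keeps full weight while the oscillating pixel is down-weighted, matching the observation that low-consistency regions tend to coincide with larger errors.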