Robust Synthetic-to-Real Transfer for Stereo Matching

Jiawei Zhang, Jiahe Li, Lei Huang, Xiaohan Yu, Lin Gu, Jin Zheng, Xiao Bai

TL;DR

This paper explores fine-tuning stereo matching networks without compromising their robustness to unseen domains. It proposes a framework that exploits the difference between Ground Truth (GT) and Pseudo Label (PL) supervision during fine-tuning, consisting of a frozen Teacher, an exponential moving average (EMA) Teacher, and a Student network.

Abstract

With advancements in domain generalized stereo matching networks, models pre-trained on synthetic data demonstrate strong robustness to unseen domains. However, few studies have investigated the robustness after fine-tuning them in real-world scenarios, during which the domain generalization ability can be seriously degraded. In this paper, we explore fine-tuning stereo matching networks without compromising their robustness to unseen domains. Our motivation stems from comparing Ground Truth (GT) versus Pseudo Label (PL) for fine-tuning: GT degrades, but PL preserves the domain generalization ability. Empirically, we find the difference between GT and PL implies valuable information that can regularize networks during fine-tuning. We also propose a framework to utilize this difference for fine-tuning, consisting of a frozen Teacher, an exponential moving average (EMA) Teacher, and a Student network. The core idea is to utilize the EMA Teacher to measure what the Student has learned and dynamically improve GT and PL for fine-tuning. We integrate our framework with state-of-the-art networks and evaluate its effectiveness on several real-world datasets. Extensive experiments show that our method effectively preserves the domain generalization ability during fine-tuning.
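
The abstract names three networks (a frozen Teacher, an EMA Teacher, and a Student) but does not spell out the update rule here. As a minimal sketch of the standard exponential moving average update only, the snippet below treats each network as a plain dictionary of scalar parameters. The momentum value, the dictionary representation, the helper name `ema_update`, and the choice to initialize all three networks from the same pre-trained weights are illustrative assumptions, not details taken from the paper.

```python
import copy


def ema_update(ema_teacher, student, momentum=0.999):
    """Move each EMA Teacher parameter a small step toward the Student's.

    Hypothetical helper: parameters are plain floats keyed by name.
    """
    for name, value in student.items():
        ema_teacher[name] = momentum * ema_teacher[name] + (1.0 - momentum) * value
    return ema_teacher


# Assumption: all three networks start from the same pre-trained weights.
pretrained = {"w": 1.0, "b": 0.0}
frozen_teacher = copy.deepcopy(pretrained)  # never updated
ema_teacher = copy.deepcopy(pretrained)     # tracks the Student slowly
student = copy.deepcopy(pretrained)         # trained by the optimizer

# Stand-in for one optimizer step on the Student.
student["w"], student["b"] = 2.0, 1.0

# A large momentum keeps the EMA Teacher close to its previous state.
ema_update(ema_teacher, student, momentum=0.9)
```

Because the EMA Teacher lags behind the Student, comparing its predictions with those of the frozen Teacher gives a smoothed view of what fine-tuning has changed so far, which is the role the abstract assigns to it.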

Paper Structure (28 sections, 10 equations, 19 figures, 11 tables)

Figures (19)

  • Figure 1: Target-domain and cross-domain performance of networks pre-trained on synthetic data, fine-tuned with GT, and fine-tuned with our method (best viewed in color). IGEV-Stereo xu2023iterative is used as the baseline model. D1 error (lower is better) is used for evaluation. (a) and (b) fine-tune networks on the KITTI and Booster datasets, respectively. Our method performs well on both target and unseen domains. We also evaluate the robustness of KITTI fine-tuned networks on DrivingStereo in (a.2), where our method is more robust to challenging weather.
  • Figure 2: Visualization results on both target and unseen domains. IGEV-Stereo xu2023iterative is used as the baseline model. The network pre-trained on synthetic data is robust to unseen domains but can still fail on unseen challenges such as transparent or mirror (ToM) surfaces. Fine-tuning the network with GT improves target-domain performance but seriously degrades the domain generalization ability. Our DKT fine-tuning framework performs well on target and unseen domains simultaneously.
  • Figure 3: Visualization of each region resulting from our division. We divide GT and PL based on their consistency.
  • Figure 4: Evaluation curves of target-domain and cross-domain performance during fine-tuning with GT or PL.
  • Figure 5: Overview of the DKT framework. It uses the prediction from the EMA Teacher to improve GT and PL during fine-tuning.
  • ...and 14 more figures
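
Figure 1 reports D1 error without defining it on this page. The sketch below assumes the common KITTI 2015 convention, in which a valid pixel counts as an outlier when its disparity error exceeds both 3 pixels and 5% of the ground-truth disparity. The function name `d1_error`, the flat-list inputs, and the rule that a ground-truth value of 0 marks an invalid pixel are illustrative assumptions.

```python
def d1_error(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Return the percentage of valid pixels that are disparity outliers.

    Assumed convention: a pixel is an outlier when its absolute error is
    greater than abs_thresh pixels AND greater than rel_thresh * gt.
    Pixels with gt <= 0 are treated as having no ground truth.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        return float("nan")
    outliers = sum(
        1
        for p, g in valid
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * outliers / len(valid)


# Four pixels: errors of 0.5, 6.0, and 2.0 px, plus one without ground truth.
pred = [10.0, 20.0, 50.0, 5.0]
gt = [10.5, 26.0, 52.0, 0.0]
score = d1_error(pred, gt)  # one outlier among three valid pixels
```

In this toy input only the second pixel is an outlier, so the score is one third of the valid pixels expressed as a percentage.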

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3