MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min
TL;DR
MaDis-Stereo tackles data scarcity in Transformer-based stereo depth estimation by injecting locality inductive bias through Masked Image Modeling (MIM) and stabilizing training with a teacher–student EMA distillation framework. The method masks stereo image patches, reconstructs them with cross-view attention, and simultaneously predicts disparities, while the teacher provides dense pseudo-disparities to complement sparse ground-truth labels. Key contributions include the dual-network architecture with EMA updating, a calibrated masking ratio of 40%, and the use of pseudo disparity maps to guide learning, yielding state-of-the-art results on KITTI 2015 and ETH3D and improved locality focus as evidenced by attention-distance analyses. This approach enhances the applicability of Transformer-based stereo models in data-limited scenarios and offers a practical pathway to more accurate dense depth prediction in real-world settings.
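As a rough illustration of the masking step, the sketch below selects 40% of patch indices uniformly at random. It is a minimal sketch, not the authors' implementation: the function name `random_patch_mask`, the flat patch indexing, and the `seed` parameter are assumptions introduced here for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.4, seed=None):
    """Return a list of booleans where True marks a masked patch.

    Hypothetical helper: the 40% default follows the masking ratio quoted
    above; everything else (name, signature, uniform sampling) is assumed.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    # Sample distinct patch indices without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Example: mask 40 of 100 patches of one view.
mask = random_patch_mask(num_patches=100, mask_ratio=0.4, seed=0)
```

In a stereo setting the same routine would be applied to the patch grids of both views; whether the two views share a mask is not stated here.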
Abstract
In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) when training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps address the data scarcity issue, reconstructing masked tokens and then performing stereo matching on them makes training considerably harder, particularly in terms of stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. The proposed method achieves state-of-the-art performance on several stereo matching benchmarks, such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide attention distance measurements.
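The EMA teacher update mentioned above can be sketched as follows. This is a minimal sketch of a generic EMA rule, teacher = m * teacher + (1 - m) * student, and not the authors' code: the function name `ema_update`, the dictionary-of-floats weight representation, and the momentum values are assumptions chosen for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Update teacher weights in place with an exponential moving average.

    Hypothetical helper: weights are plain floats keyed by parameter name,
    and the default momentum is an assumed value, not one from the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    for name, student_value in student.items():
        # The teacher drifts slowly toward the student; it receives no gradients.
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


# Example: one EMA step after a student optimizer update.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
```

Because the teacher averages the student over many steps, its disparity predictions change more smoothly than the student's, which is what makes them usable as pseudo supervisory signals.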
