MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling

Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min

TL;DR

MaDis-Stereo tackles data scarcity in Transformer-based stereo depth estimation by injecting locality inductive bias through Masked Image Modeling (MIM) and stabilizing training with a teacher–student EMA distillation framework. The method masks stereo image patches, reconstructs them with cross-view attention, and simultaneously predicts disparities, while the teacher provides dense pseudo-disparities to complement sparse ground-truth labels. Key contributions include the dual-network architecture with EMA updating, a calibrated masking ratio of 40%, and the use of pseudo disparity maps to guide learning, yielding state-of-the-art results on KITTI 2015 and ETH3D and improved locality focus as evidenced by attention-distance analyses. This approach enhances the applicability of Transformer-based stereo models in data-limited scenarios and offers a practical pathway to more accurate dense depth prediction in real-world settings.
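
The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice to draw a separate mask for each view, and the rounding of the masked-patch count are assumptions made for illustration. Only the 40% ratio and the idea of passing visible tokens to the encoder come from the text.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.4, rng=None):
    """Return a boolean vector of length num_patches; True marks a masked patch.

    mask_ratio=0.4 mirrors the 40% ratio quoted in the summary; everything
    else here is a hypothetical helper, not the paper's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    # Sample the masked patch indices uniformly without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

def visible_tokens(tokens, mask):
    """Keep only the unmasked tokens; tokens has shape (num_patches, dim)."""
    return tokens[~mask]
```

In use, one mask would be drawn per view (an assumption) and only the visible tokens of the left and right images would be passed to the encoder.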

Abstract

In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps to resolve the data scarcity issue, the dual task of reconstructing masked tokens and subsequently performing stereo matching poses significant difficulties, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. The proposed method achieves state-of-the-art performance on several stereo matching benchmarks, such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide attention distance measurements.
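
As a rough illustration of the teacher-student scheme, the sketch below shows an EMA parameter update and one way to combine sparse ground-truth supervision with dense teacher pseudo-disparities. The momentum value, the L1 form of both terms, the `pseudo_weight` factor, and all function names are assumptions; the abstract states only that the teacher is EMA-updated and that its predictions act as pseudo supervisory signals.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """In-place EMA over parameter dicts: teacher <- m * teacher + (1 - m) * student.

    The momentum value is an illustrative assumption.
    """
    for name, s in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s
    return teacher

def complementary_disparity_loss(pred, gt, gt_valid, pseudo, pseudo_weight=1.0):
    """L1 to sparse ground truth on valid pixels plus L1 to dense pseudo-disparities.

    gt_valid is a boolean map of pixels that carry a ground-truth label.
    The L1 form and the weighting are assumptions, not the paper's exact loss.
    """
    gt_term = np.abs(pred - gt)[gt_valid].mean() if gt_valid.any() else 0.0
    pseudo_term = np.abs(pred - pseudo).mean()
    return gt_term + pseudo_weight * pseudo_term
```

The teacher receives no gradient in this sketch; it only tracks a smoothed copy of the student weights, which is the usual motivation for using EMA targets to stabilize training.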

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparative illustration of conventional approaches and the proposed method. (a) Current supervised learning approaches for stereo depth estimation, implemented with either CNNs [raftnetcrestereo] or Transformers [croco2], typically start from a pre-trained encoder and train the model by comparing predicted disparity maps with ground truth depth labels using a supervised depth loss. (b) Initialized with a Transformer encoder pre-trained using Masked Image Modeling (MIM) [MAE], the proposed method leverages both the self-supervised masking-and-reconstruction strategy and the supervised depth loss for stereo depth estimation.
  • Figure 2: Overall architecture of the proposed MaDis-Stereo. It consists of (bottom) a main stereo network (student) that processes masked stereo images and (top) an auxiliary stereo network (teacher), whose weights are updated with an exponential moving average (EMA). The ViT-Base [vit] encoder processes visible tokens from both the masked left and right views to extract image features. The left and right features are then fed into the ViT-Base decoder, which consists of cross-attention blocks. Following [croco2], the RefineNet-based feature fusion block [refinenet] is employed as a head module to produce disparity maps, and an additional linear layer is used to reconstruct the masked image patches [simmim]. Here, $N$ is set to 12. The ground truth disparity maps and pseudo disparity maps are used as complementary supervision signals: the blue circles on the right represent the pseudo disparities ('full-dense') generated by the teacher network, and the green circles represent the ground truth disparities ('sparse').
  • Figure 3: Comparison of attention distance maps. The plot shows the average attention distance computed for the 12 attention heads of each layer. The per-head averages, represented by dots, indicate that MaDis-Stereo attends more locally than CroCo-Stereo [croco2].
  • Figure 4: Qualitative results from the KITTI 2015 Leaderboard. The left column (a) presents the disparity maps predicted by MaDis-Stereo, the middle column (b) those from IGEV-Stereo, and the right column (c) those from CroCo-Stereo. Bounding boxes mark artifacts and blurred regions for comparison.
  • Figure 5: Ablation study on the Exponential Moving Average (EMA)-updated teacher on KITTI 2015. The EMA structure stabilizes the training of stereo depth estimation networks when masking is applied to the inputs.
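
The attention-distance analysis referenced in Figure 3 can be sketched as follows. This is a common way to compute a mean attention distance and is offered under assumptions: the function name, the 16-pixel patch size, and the restriction to self-attention over a single view's patch grid are illustrative choices, not details confirmed by the captions.

```python
import numpy as np

def mean_attention_distance(attn, grid_h, grid_w, patch_size=16):
    """Per-head average attention distance in pixels.

    attn: array of shape (heads, N, N) whose rows sum to 1, N = grid_h * grid_w.
    Each query's distance is the attention-weighted mean of the Euclidean
    distances to all key patches; the result averages this over queries.
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch_size  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]                    # (N, N, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))                          # (N, N)
    return (attn * dist[None]).sum(axis=-1).mean(axis=-1)             # (heads,)
```

A head that attends only to its own patch has distance zero, while a head with uniform attention approaches the mean pairwise patch distance, so lower values indicate more local attention.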