UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching

Soomin Kim, Hyesong Choi, Jihye Ahn, Dongbo Min

TL;DR

This paper proposes UniTT-Stereo, a method that maximizes the potential of Transformer-based stereo architectures by unifying the self-supervised learning used for pre-training with a supervised stereo matching framework. It uses a dual-task learning scheme that reconstructs masked regions of an input image while simultaneously predicting corresponding points in the paired image.

Abstract

Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which limits the performance gains of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying the self-supervised learning used for pre-training with a stereo matching framework based on supervised learning. Specifically, we explore the effectiveness of reconstructing features of masked portions of an input image while at the same time predicting corresponding points in another image, from the perspective of locality inductive bias, which is crucial when training models with limited training data. Moreover, to address the challenging joint task of reconstruction and prediction, we present a new strategy that varies the masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks, including the ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and an analysis of locality inductive bias based on attention maps.
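
The reconstruction-and-prediction scheme with a varying masking ratio can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the ratio range, the sampling distribution, or the exact loss forms, so the uniform sampling in `sample_mask`, the `ratio_range` default, the MSE reconstruction term, the L1 disparity term, and `recon_weight` are all assumptions introduced here.

```python
import numpy as np


def sample_mask(num_patches, rng, ratio_range=(0.0, 0.5)):
    """Draw a masking ratio, then mask that fraction of patches at random.

    ASSUMPTION: the ratio is sampled uniformly from `ratio_range`; the
    source only states that the masking ratio varies during training.
    Returns (mask, ratio) where mask[i] is True if patch i is masked.
    """
    lo, hi = ratio_range
    ratio = rng.uniform(lo, hi)
    num_masked = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask, ratio


def dual_task_loss(recon_feat, target_feat, mask, disp_pred, disp_gt,
                   recon_weight=1.0):
    """Combine a reconstruction term with a disparity term.

    ASSUMPTION: MSE on masked patches and L1 on disparity are stand-ins
    for the paper's stereo-tailored losses, which are not given here.
    """
    if mask.any():
        # Only masked patches contribute to the reconstruction term.
        recon_loss = np.mean((recon_feat[mask] - target_feat[mask]) ** 2)
    else:
        recon_loss = 0.0
    disp_loss = np.mean(np.abs(disp_pred - disp_gt))
    return disp_loss + recon_weight * recon_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask, ratio = sample_mask(num_patches=196, rng=rng)
    print(f"ratio={ratio:.2f}, masked patches={int(mask.sum())}")
```

Calling `sample_mask` once per training step gives each step a different amount of visible context, which is one plausible reading of "varying the masking ratio".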

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Architecture Overview of UniTT-Stereo. The visible tokens of the masked left image and the original right image are fed into the Siamese ViT encoder for feature extraction, and the resulting image features are then fed into the Inter Image Information Exchange (I3E) Transformer decoder, which is based on cross-attention layers. The masked left image is reconstructed through a linear head, while the disparity map is predicted by the RefineNet-based fusion module. Note that the masking ratio varies to ensure the model learns effectively across a range of information scales. The proposed reconstruction-and-prediction strategy introduces locality inductive bias when training the Transformer-based stereo matching network, achieving competitive performance on various stereo benchmarks.
  • Figure 2: Attention Analysis: (a) Attention distance plot; Plain refers to the method where the same architecture is used with disparity loss alone for supervised learning, without incorporating our masking approach. Ours refers to the case where our Unified Training method is applied. (b) Attention map visualization; Brighter colors indicate higher attention scores.
  • Figure 3: Fourier Analysis: (a) Fourier analysis of the feature maps of the decoder; The ratio of high-frequency components to low-frequency components is reported as a log amplitude difference, that is, the difference in log amplitude between $f = \pi$ (the highest frequency) and $f = 0$ (the lowest frequency). This indicates how much the high-frequency components stand out compared to the low-frequency components. (b) The results from ETH3D test data; By amplifying and utilizing high-frequency information in the process of generating disparity maps, the resulting maps tend to have sharper boundaries and more fine-grained details.
  • Figure 4: Attention Map by Varying Losses: (a) Example process of attention map visualization; Example query, key, and value are fed into the self-attention layer in the encoder or the cross-attention layer in the decoder. An expected attention map is created using the ground truth disparity value of the query to determine the location of the corresponding patch. (b) Attention map visualization when the model is trained solely using each individual loss; Brighter colors indicate higher attention scores. Since the consistency loss applies to features processed by the encoder and does not directly influence the decoder, we do not visualize the cross-attention map in the case where the model was trained using only the consistency loss.
  • Figure 5: Qualitative comparison on KITTI 2015 and 2012: The first row shows the result on KITTI 2015. UniTT-Stereo outputs clearer boundaries for objects compared to other models. The second row shows the result on KITTI 2012. Our model produces an accurate and sharp disparity map even in low texture areas with blurring.
  • ...and 2 more figures
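
The log amplitude difference described in the Figure 3 caption can be sketched as below. This is an illustrative reading rather than the authors' analysis code: the caption does not say how frequencies are sampled or aggregated, so taking the Nyquist bin on both axes as $f = \pi$, averaging amplitudes over channels, and the `eps` stabilizer are assumptions, and `delta_log_amplitude` is a hypothetical name.

```python
import numpy as np


def delta_log_amplitude(feature_map, eps=1e-12):
    """Log amplitude at the highest frequency minus log amplitude at f = 0.

    feature_map: array of shape (C, H, W) with even H and W.
    ASSUMPTION: f = pi is taken as the Nyquist bin (H/2, W/2) of the 2D
    FFT, and amplitudes are averaged over channels before the log.
    """
    _, h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even to have a Nyquist bin")
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    amplitude = np.abs(spectrum).mean(axis=0)
    log_amp = np.log(amplitude + eps)
    # In the unshifted FFT layout, index 0 is f = 0 and index N/2 is f = pi.
    return log_amp[h // 2, w // 2] - log_amp[0, 0]


if __name__ == "__main__":
    ii, jj = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    checker = np.where((ii + jj) % 2 == 0, 1.0, -1.0)  # pure f = pi pattern
    smooth = np.ones((8, 8))                           # pure f = 0 pattern
    mixed = np.stack([smooth + 0.5 * checker] * 4)
    print(delta_log_amplitude(mixed))
```

A larger (less negative) value means high-frequency content is more prominent relative to the DC component, which matches the caption's interpretation of the plot.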