Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching

Yuran Wang, Yingping Liang, Hesong Li, Ying Fu

TL;DR

This work proposes leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo, with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning.

Abstract

The generalization and performance of stereo matching networks are limited by the domain gap of existing synthetic datasets and the sparseness of ground-truth (GT) labels in real datasets. In contrast, monocular depth estimation has achieved significant advancements, benefiting from large-scale depth datasets and self-supervised strategies. To bridge the performance gap between monocular depth estimation and stereo matching, we propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. In the pre-training stage, we design a data generation pipeline that synthesizes stereo training data from monocular images. This pipeline uses monocular depth for warping and novel view synthesis and employs our proposed Edge-Aware (EA) inpainting module to fill in missing content in the generated images. In the fine-tuning stage, we introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy that encourages the distributions of predictions to align with dense monocular depths. This strategy mitigates edge blurring caused by sparse real-world labels and enhances overall consistency. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot generalization capabilities. Furthermore, domain-specific fine-tuning using our pre-trained model and the S2DKD strategy significantly improves in-domain performance. The code will be made available soon.
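The warping step of the data generation pipeline can be illustrated with a minimal sketch. It assumes a rectified pair in which a left-view pixel at column x appears at column x - d in the right view, and it resolves collisions by keeping the pixel with the larger disparity (the nearer surface). The function name `forward_warp_to_right` and the nearest-pixel rounding are illustrative assumptions, not the authors' implementation; target pixels that receive no source pixel are the occlusion holes that an inpainting module would then fill.

```python
import numpy as np

def forward_warp_to_right(left, disparity):
    """Forward-warp a left image into a right view (illustrative sketch).

    left:      (H, W) or (H, W, C) array of pixel values.
    disparity: (H, W) array of non-negative disparities in pixels.
    Returns (right, holes), where holes is True wherever no source
    pixel landed, i.e. the regions left for inpainting.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    # Largest disparity written so far at each target pixel (z-buffer).
    zbuf = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            d = float(disparity[y, x])
            xt = int(round(x - d))  # target column in the right view
            # Keep the nearer surface when two pixels map to one target.
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                right[y, xt] = left[y, x]
    holes = np.isinf(zbuf)
    return right, holes
```

On a toy 1x6 row whose last three pixels have disparity 2 and whose first three have disparity 0, the foreground pixels overwrite the background they cover, and the two rightmost target columns remain holes.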

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison between Monocular Depth and Stereo Matching for depth estimation with histogram. The Monocular Depth approach provides fine-grained details but suffers from scale ambiguity, whereas Stereo Matching delivers real-world metrics but with coarser results. We carefully design data generation and distributed matching losses to perform knowledge transfer from monocular to stereo.
  • Figure 2: The overall architecture of our proposed method consists of two main stages: (A) Pre-training stage with a data generation pipeline. We estimate and rescale monocular depth maps to construct stereo pairs for training deep stereo networks. (B) Fine-tuning stage with knowledge transfer. We leverage the estimated monocular depth to enhance the prediction of details, which are otherwise lacking in the sparse ground truth.
  • Figure 3: Comparison between the naive inpainting module and the edge-aware (EA) inpainting module. Our inpainting method generates no artifacts and produces more realistic background images. We present (a) the right-view image warped from the left-view image, with occlusion holes; (b) the image inpainted from (a) using Stable Diffusion (SD); (c) the right-view image warped with EA; and (d) the image inpainted from (c) using SD.
  • Figure 4: Comparison between disparity from KITTI and from monocular model. We present (a) GT disparity; (b) gradient map calculated using (a); (c) disparity from monocular estimation and (d) gradient map calculated using (c).
  • Figure 5: Qualitative results of IGEV trained with and without our proposed S2DKD strategy and our generated DiffMFS dataset. The default model is pre-trained on SceneFlow and fine-tuned on the KITTI training set using the official implementation.
  • ...and 1 more figure
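Figures 1 and 2 note that monocular depth is scale-ambiguous and must be rescaled before it can be compared with metric disparity. One common way to do this, shown here as a sketch and not as the procedure the paper uses, is a least-squares scale-and-shift fit of the monocular prediction to the sparse valid ground-truth pixels. The function name `align_scale_shift` and the `valid` mask argument are hypothetical.

```python
import numpy as np

def align_scale_shift(mono, sparse_gt, valid):
    """Fit scale * mono + shift to sparse_gt on valid pixels (sketch).

    mono:      (H, W) relative monocular disparity (arbitrary scale).
    sparse_gt: (H, W) metric disparity, meaningful only where valid.
    valid:     (H, W) boolean mask of pixels that have GT labels.
    Returns (aligned, scale, shift), with aligned dense over the image.
    """
    m = mono[valid].astype(np.float64)
    g = sparse_gt[valid].astype(np.float64)
    # Solve min over (scale, shift) of ||scale * m + shift - g||^2.
    A = np.stack([m, np.ones_like(m)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, g, rcond=None)
    return scale * mono + shift, float(scale), float(shift)
```

The aligned map is dense even though the fit uses only the sparse labels, which is what makes it usable as a dense reference next to sparse ground truth such as the KITTI disparity in Figure 4.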