Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation

Runze Chen, Haiyong Luo, Fang Zhao, Jingze Yu, Yupeng Jia, Juan Wang, Xuepeng Ma

TL;DR

This work devises a novel approach that reduces over-reliance on local textures, improving robustness against missing or interfering patterns. It also incorporates a semantic expert model as the teacher and constructs inter-model feature dependencies via learnable isomorphic graphs, enabling aggregation of semantic structural knowledge.

Abstract

Monocular depth estimation, enabled by self-supervised learning, is a key technique for 3D perception in computer vision. However, it faces significant challenges in real-world scenarios, which encompass adverse weather variations, motion blur, as well as scenes with poor lighting conditions at night. Our research reveals that we can divide monocular depth estimation into three sub-problems: depth structure consistency, local texture disambiguation, and semantic-structural correlation. Our approach tackles the non-robustness of existing self-supervised monocular depth estimation models to interference textures by adopting a structure-centered perspective and utilizing the scene structure characteristics demonstrated by semantics and illumination. We devise a novel approach to reduce over-reliance on local textures, enhancing robustness against missing or interfering patterns. Additionally, we incorporate a semantic expert model as the teacher and construct inter-model feature dependencies via learnable isomorphic graphs to enable aggregation of semantic structural knowledge. Our approach achieves state-of-the-art out-of-distribution monocular depth estimation performance across a range of public adverse scenario datasets. It demonstrates notable scalability and compatibility, without necessitating extensive model engineering. This showcases the potential for customizing models for diverse industrial applications.

Paper Structure

This paper contains 15 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Key Sub-Problems in Monocular Depth Estimation. This figure highlights three areas critical for enhancing depth estimation: Depth Structure Consistency, which ensures smooth depth transitions across frames; Local Texture Disambiguation, which addresses the model's over-dependence on local textures (a dependence that compromises depth estimation robustness) by improving performance in diverse or texture-sparse environments; and Semantic and Structural Correlation, which leverages object semantics and structure for continuous depth inference.
  • Figure 2: Overview of the Structure-Centric Monocular Depth Estimation Approach. The solid lines in the figure indicate data flow, while the dashed lines represent constraints during the optimization process. The operation vec represents the process of flattening these features into vectors. The operation proj denotes the re-projection of views.
  • Figure 3: Visualization of intermediate feature maps. For features $F_t^{(\mathrm{s})}$ and $F_t^{(\mathrm{t})}$, yellow indicates high values. For predicted depth $\hat{D}_t$, purple denotes high values. We provide a detailed description of the computation method for visual feature maps in the supplementary materials.
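
The caption of Figure 2 mentions a proj operation that re-projects views. As a rough illustration only, the sketch below shows the pinhole re-projection commonly used in self-supervised depth estimation: a pixel is back-projected with its depth, moved by a relative camera pose, and projected into the other view. The function name `reproject_pixel`, the intrinsics `fx, fy, cx, cy`, and the pose `R, t` are assumptions made for this example; the source does not specify how the paper implements proj.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map a target-view pixel (u, v) with known depth into the source view.

    Hypothetical helper: a standard pinhole model, not the paper's code.
    R is a 3x3 rotation (nested lists), t a 3-vector, both target -> source.
    Returns (u', v') or None if the point lies behind the source camera.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth

    # Rigid transform into the source camera frame: X' = R X + t.
    xs = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    ys = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zs = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]

    # Points at or behind the camera plane cannot be projected.
    if zs <= 0:
        return None

    # Perspective projection back to pixel coordinates.
    return (fx * xs / zs + cx, fy * ys / zs + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With no camera motion the pixel maps to itself (up to rounding).
same = reproject_pixel(100.0, 50.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                       IDENTITY, [0.0, 0.0, 0.0])

# A sideways translation shifts the pixel by fx * t_x / depth.
shifted = reproject_pixel(100.0, 50.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                          IDENTITY, [0.5, 0.0, 0.0])
```

In this formulation the size of the pixel shift depends on the predicted depth, which is why a photometric comparison between the re-projected and the observed view can act as a training signal when no ground-truth depth is available.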