Embodiment: Self-Supervised Depth Estimation Based on Camera Models

Jinchang Zhang, Praveen Kumar Reddy, Xue-Iuan Wong, Yiannis Aloimonos, Guoyu Lu

TL;DR

This work tackles scale ambiguity in monocular depth estimation by introducing physics depth, a supervision signal derived from the camera's intrinsic and extrinsic parameters together with semantic ground-plane cues. The approach embeds the camera model into a depth-learning framework: the network is first trained with physics-depth priors and then refined with self-supervised photometric and 2D spatial losses. The key contributions are absolute-scale depth from monocular cues, Dense Physics Depth obtained by propagating depth from the ground to ground-connected regions and inpainting, and a two-stage training regime that combines physics-depth supervision with standard self-supervision. The paper reports state-of-the-art results on KITTI, Cityscapes, and Make3D, and presents the method as improving robustness and accuracy for 3D structure modeling and self-supervised monocular depth estimation in real-world robotics and vision tasks.

Abstract

Depth estimation is a critical topic for robotics and vision-related tasks. In monocular depth estimation, supervised learning requires expensive ground-truth labeling, whereas self-supervised methods have great potential because they incur no labeling cost. However, self-supervised learning still lags well behind supervised learning in 3D reconstruction and depth estimation performance. Scale is also a major issue for monocular unsupervised depth estimation, which commonly still needs a ground-truth scale from GPS, LiDAR, or existing maps for correction. In the era of deep learning, existing methods primarily rely on exploring image relationships to train unsupervised neural networks, while the physical properties of the camera itself, such as its intrinsics and extrinsics, are often overlooked. These physical properties are not just mathematical parameters; they are embodiments of the camera's interaction with the physical world. By embedding these physical properties into the deep learning model, we can calculate depth priors for ground regions and regions connected to the ground based on physical principles, providing free supervision signals without the need for additional sensors. This approach is not only easy to implement but also improves the results of all unsupervised methods by embedding the camera's physical properties into the model, thereby achieving an embodied understanding of the real world.
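The idea of computing a depth prior for ground regions from the camera model can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an undistorted pinhole camera, a perfectly flat ground plane, zero roll, and camera coordinates with y pointing down and z pointing forward. Under those assumptions a ground pixel in image row $v$ has depth $z = h / (\cos\theta \cdot (v - c_y)/f_y + \sin\theta)$, where $h$ is the camera height above the ground and $\theta$ is the downward pitch. The function name `ground_plane_depth` and all of its parameters are hypothetical, and the numbers in the usage lines are illustrative only.

```python
import numpy as np

def ground_plane_depth(height, width, fy, cy, cam_height, pitch=0.0, ground_mask=None):
    """Metric z-depth of a flat ground plane seen by a pinhole camera.

    Assumptions (not taken from the paper): flat ground, zero roll,
    no lens distortion, camera axes x right / y down / z forward.
    Pixels that cannot see the ground are returned as NaN.
    """
    # Normalized y-component of the viewing ray for every image row.
    v = np.arange(height, dtype=np.float64)[:, None]
    ray_y = (v - cy) / fy

    # Dot product of the viewing ray (x, ray_y, 1) with the ground normal
    # expressed in the camera frame, (0, cos(pitch), sin(pitch)).
    denom = np.cos(pitch) * ray_y + np.sin(pitch)
    denom = np.broadcast_to(denom, (height, width))

    # Only rays that point toward the ground intersect the plane in front of the camera.
    valid = denom > 1e-6
    if ground_mask is not None:
        # Restrict to pixels labeled as ground, e.g. by a semantic segmentation model.
        valid = valid & np.asarray(ground_mask, dtype=bool)

    depth = np.full((height, width), np.nan)
    depth[valid] = cam_height / denom[valid]
    return depth

# Illustrative values only: a 400x8 image, fy = 700 px, principal row 200,
# camera mounted 1.65 m above the ground.
prior = ground_plane_depth(400, 8, fy=700.0, cy=200.0, cam_height=1.65)
print(prior[300, 0])  # depth of a ground pixel 100 rows below the principal point
```

Rows at or above the horizon get NaN because their rays never reach the ground. A prior of this kind is sparse by construction, which is consistent with the paper's need for a separate densification step.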

Paper Structure (19 sections, 13 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Framework of the unsupervised 3D scene reconstruction network based on physics depth calculated from camera models. We first calculate the physics depth of flat ground areas in the input image using the camera model and semantic segmentation results. This physics depth serves as a label for supervised learning, providing a foundation for initial depth estimation. In the first stage, we train the depth estimation network with these labels. In the subsequent self-supervised stage, we introduce photometric and 2D spatial losses, which optimize depth estimation based on image characteristics without relying on depth labels.
  • Figure 2: Physics Depth Methodology demonstrated on KITTI.
  • Figure 3: Error distribution of physics depth.
  • Figure 4: Qualitative results on Make3D (zero-shot). From left to right, the models are Monodepth2 [godard2019digging], RA-Depth [he2022ra], MonoViT [zhao2022monovit], SQLDepth [wang2023sqldepth], and ours.
  • Figure 5: Qualitative results on KITTI. From top to bottom, the models are MonoViT [zhao2022monovit], RA-Depth [he2022ra], ManyDepth [watson2021temporal], and ours.