CCDepth: A Lightweight Self-supervised Depth Estimation Network with Enhanced Interpretability

Xi Zhang, Yaru Xue, Shaocheng Jia, Xin Pei

TL;DR

CCDepth is a novel hybrid self-supervised depth estimation network that combines convolutional neural networks (CNNs) with the white-box CRATE (Coding RAte reduction TransformEr) network. It achieves performance comparable to state-of-the-art methods with a significantly smaller model size.

Abstract

Self-supervised depth estimation, which requires only monocular image sequences as input, has become increasingly popular in recent years. Current research primarily focuses on improving prediction accuracy. However, the large number of parameters in these models impedes their deployment on edge devices. Moreover, because emerging neural networks are black-box models, they are difficult to analyze, which makes it hard to understand why performance improves. To mitigate these issues, this study proposes a novel hybrid self-supervised depth estimation network, CCDepth, comprising convolutional neural networks (CNNs) and the white-box CRATE (Coding RAte reduction TransformEr) network. The network uses CNNs and CRATE modules to extract local and global information in images, respectively, thereby boosting learning efficiency and reducing model size. Furthermore, incorporating the CRATE modules makes the process of capturing global features mathematically interpretable. Extensive experiments on the KITTI dataset indicate that CCDepth achieves performance comparable to state-of-the-art methods with a significantly smaller model size. In addition, a series of quantitative and qualitative analyses of the inner features of CCDepth further confirm the effectiveness of the proposed method.

Paper Structure (16 sections, 24 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The overall architecture of CCDepth.
  • Figure 2: Workflow of the CRATE layer. Step 1: the input image is split into several patches. Step 2: the two-dimensional (2D) image is flattened into a sequence. Step 3: each patch is embedded as a vector by a multilayer perceptron (MLP). Step 4: CRATE takes the embedded features as input to learn more abstract features. Step 5: the 2D structure of the image is recovered.
  • Figure 3: Qualitative results of self-supervised depth estimation (Monodepth2 [c22]; FSLNet [c30]).
  • Figure 4: Qualitative results of self-supervised depth estimation with different padding modes.
  • Figure 5: Comparison of the percentage of non-zero elements in each CRATE module.
  • ...and 1 more figure
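
The five steps in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustration of the data flow, not the authors' implementation: the function names, the patch size, the embedding width, the single linear layer standing in for the MLP, and the identity placeholder standing in for the CRATE blocks are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch):
    """Steps 1-2: split an (H, W, C) image into patches and flatten to a sequence."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly (assumption)"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c), (gh, gw)

def embed(tokens, weight, bias):
    """Step 3: embed each flattened patch (one linear layer stands in for the MLP)."""
    return tokens @ weight + bias

def crate_placeholder(tokens):
    """Step 4: hypothetical stand-in for the CRATE blocks (identity here)."""
    return tokens

def unflatten(tokens, grid):
    """Step 5: recover the 2D layout as a (gh, gw, D) feature map."""
    gh, gw = grid
    return tokens.reshape(gh, gw, tokens.shape[-1])

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 12, 3))         # toy input, not KITTI data
tokens, grid = patchify(image, patch=4)         # 6 tokens of length 48
weight = rng.standard_normal((48, 16))          # hypothetical embedding width of 16
bias = np.zeros(16)
features = unflatten(crate_placeholder(embed(tokens, weight, bias)), grid)
print(features.shape)                           # (2, 3, 16)
```

Because the token order follows the row-major patch grid, the final reshape is enough to restore the spatial layout, which is what lets a sequence model sit between convolutional stages.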