Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation

Zhimeng Zheng, Tao Huang, Gongsheng Li, Zuyi Wang

TL;DR

DisDepth targets efficient monocular depth estimation (MDE) by transferring knowledge from transformer teachers to CNN students through cross-architecture knowledge distillation. It augments a simple CNN-based MDE framework with a local-global convolution (LG-Conv) that captures both local and global context, and it acclimates the teacher with a ghost decoder so that the teacher's features align with the student's. An attentive distillation loss (L_kd) uses spatial attention to emphasize task-relevant regions during distillation, while feature acclimation modules (FAMs) and a fixed ghost decoder stabilize the cross-architecture transfer. On KITTI and NYU Depth V2, DisDepth achieves competitive depth accuracy with substantially fewer FLOPs and parameters than transformer-heavy methods, which makes it a practical candidate for edge devices.

Abstract

Recently, the performance of monocular depth estimation (MDE) has been significantly boosted by the integration of transformer models. However, transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions. This limitation hinders their deployment on resource-limited devices. In this paper, we propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models. Concretely, we first build a simple framework for convolution-based MDE, which is then enhanced with a novel local-global convolution module to capture both local and global information in the image. To effectively distill valuable information from the transformer teacher and bridge the gap between convolution and transformer features, we introduce a method to acclimate the teacher with a ghost decoder. The ghost decoder is a copy of the student's decoder, and adapting the teacher with the ghost decoder aligns the teacher's features to be student-friendly while preserving their original performance. Furthermore, we propose an attentive knowledge distillation loss that adaptively identifies features valuable for depth estimation. This loss guides the student to focus more on attentive regions, improving its performance. Extensive experiments on the KITTI and NYU Depth V2 datasets demonstrate the effectiveness of DisDepth. Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
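The attentive distillation loss described in the abstract can be illustrated with a minimal NumPy sketch. The exact form of L_kd is not given in this summary, so everything here is an assumption made for illustration: the function name `attentive_kd_loss`, the use of a squared-error term, and the normalization of the attention map are all hypothetical. The attention map is simply passed in rather than predicted by a learned module.

```python
import numpy as np


def attentive_kd_loss(student_feat, teacher_feat, attention):
    """Attention-weighted squared error between two feature maps.

    Hypothetical stand-in for an attentive KD loss; not the paper's formula.
    student_feat, teacher_feat: arrays of shape (C, H, W).
    attention: array of shape (H, W) with non-negative weights.
    """
    if student_feat.shape != teacher_feat.shape:
        raise ValueError("feature maps must have the same shape")
    if attention.shape != student_feat.shape[1:]:
        raise ValueError("attention must match the spatial size of the features")
    # Per-pixel squared error, averaged over channels -> shape (H, W).
    per_pixel = np.mean((student_feat - teacher_feat) ** 2, axis=0)
    # Normalize the weights so the loss scale does not depend on their sum.
    weights = attention / (attention.sum() + 1e-8)
    return float(np.sum(weights * per_pixel))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8, 8))
    student = teacher.copy()
    student[:, :4, :] += 1.0          # the student is wrong only in the top half
    top = np.zeros((8, 8))
    top[:4, :] = 1.0                  # attention on the top half
    bottom = 1.0 - top                # attention on the bottom half
    print("attending to the erroneous region:", attentive_kd_loss(student, teacher, top))
    print("attending elsewhere:", attentive_kd_loss(student, teacher, bottom))
```

With a uniform attention map this sketch reduces to a plain mean squared error over the features, so the attention only changes which spatial locations dominate the gradient that the student receives.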

Paper Structure (17 sections, 3 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: RMSE (lower is better) and FLOPs of existing MDE models and our DisDepth models on the KITTI dataset. DisDepth variants (green squares) achieve competitive accuracy at markedly lower computational cost.
  • Figure 2: Framework of our cross-architecture KD. We introduce feature acclimation modules (FAMs) to adapt the outputs of the teacher encoder, followed by a loss attention module (LAM) that identifies valuable regions in the features. These modules are optimized with a ghost decoder and the task loss. Distillation is conducted between the student features and the adapted teacher features. The parameters of the teacher encoder and the ghost decoder are fixed.
  • Figure 3: Illustration of our proposed local-global convolution (LG-Conv).
  • Figure 4: Architecture of loss attention module (LAM).
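The teacher acclimation step summarized in the Figure 2 caption (trainable adapters on top of a fixed teacher encoder, optimized through a fixed ghost decoder with the task loss) can be sketched on a toy problem. This is a rough illustration under strong assumptions, not the paper's architecture: the FAM is modeled as a single linear channel-mixing matrix, the ghost decoder as a fixed linear read-out, the task loss as a mean squared error on depth, and the LAM is omitted. The name `acclimate_teacher` and all shapes are hypothetical.

```python
import numpy as np


def acclimate_teacher(teacher_feat, target, ghost_decoder, steps=200, lr=0.05):
    """Fit a linear FAM so that a frozen ghost decoder matches the task target.

    teacher_feat: (C_t, N) features from the frozen teacher encoder, N pixels.
    target: (N,) ground-truth depth values.
    ghost_decoder: (C_s,) frozen linear decoder weights (a copy of the student's).
    Returns the FAM weights of shape (C_s, C_t) and the task-loss history.
    """
    c_s, c_t = ghost_decoder.shape[0], teacher_feat.shape[0]
    n = target.shape[0]
    rng = np.random.default_rng(0)
    fam = rng.normal(scale=0.1, size=(c_s, c_t))  # the only trainable weights
    history = []
    for _ in range(steps):
        adapted = fam @ teacher_feat       # (C_s, N) adapted teacher features
        pred = ghost_decoder @ adapted     # (N,) depth from the frozen decoder
        residual = pred - target
        history.append(float(np.mean(residual ** 2)))
        # Gradient of the mean squared task loss with respect to the FAM only.
        grad = (2.0 / n) * np.outer(ghost_decoder, teacher_feat @ residual)
        fam -= lr * grad
    return fam, history


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    teacher_feat = rng.normal(size=(6, 256))           # toy frozen teacher output
    target = rng.normal(size=6) @ teacher_feat         # toy depth, linearly decodable
    ghost_decoder = np.array([0.5, -0.5, 0.5, -0.5])   # toy copy of student decoder
    fam, history = acclimate_teacher(teacher_feat, target, ghost_decoder)
    print("task loss before and after acclimation:", history[0], history[-1])
```

Because the decoder is frozen, the adapter has to produce features that this particular decoder can read, which is the sense in which the adapted teacher features become student-friendly; distillation would then compare the student's features against `fam @ teacher_feat` rather than the raw teacher features.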