PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation

Yue-Jiang Dong, Yuan-Chen Guo, Ying-Tian Liu, Fang-Lue Zhang, Song-Hai Zhang

TL;DR

PPEA-Depth tackles the static-scene limitation in self-supervised monocular depth estimation by introducing progressive parameter-efficient adaptation with encoder and decoder adapters. It uses a two-stage training regime: first on static-scene data to learn robust depth priors, then on dynamic scenes, where the adapters are updated while the core weights remain largely fixed. It achieves state-of-the-art results on KITTI, CityScapes, and DDAD. The approach preserves generalized pre-trained patterns, reduces tunable parameters by up to 90%, and adapts data-efficiently (with as little as 3% of the data in Stage 2) while remaining robust to object motion. The method advances practical depth estimation in real-world dynamic environments and suggests that similar adapter-based transfer could extend to other tasks with loosely constrained losses.

Abstract

Self-supervised monocular depth estimation is important for applications spanning autonomous driving and robotics. However, the reliance on self-supervision introduces a strong static-scene assumption, which makes it hard to achieve optimal performance in dynamic scenes, and such scenes are prevalent in most real-world situations. To address this issue, we propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation. The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, followed by an expansion to more intricate datasets involving dynamic scenes. To facilitate this process, we design compact encoder and decoder adapters to enable parameter-efficient tuning, allowing the network to adapt effectively. The adapters not only preserve generalized patterns from pre-trained image models but also carry knowledge gained in the preceding phase into the subsequent one. Extensive experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on the KITTI, CityScapes, and DDAD datasets.
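
To make the idea of parameter-efficient tuning concrete, the following is a minimal sketch of how one might mark only adapter parameters as trainable and measure the tunable fraction. The parameter names, the `"adapter"` naming convention, and all parameter counts are hypothetical toy values chosen for illustration; none of them are taken from the paper.

```python
def select_trainable(param_sizes, trainable_keyword="adapter"):
    """Return the names of parameters to update; all others stay frozen.

    Assumption: adapter parameters are identified by a keyword in their name.
    """
    return [name for name in param_sizes if trainable_keyword in name]


def trainable_fraction(param_sizes, trainable):
    """Fraction of all parameters that would receive gradient updates."""
    total = sum(param_sizes.values())
    tuned = sum(param_sizes[name] for name in trainable)
    return tuned / total


# Toy, made-up parameter counts (NOT the real network's sizes).
toy_params = {
    "encoder.block1.weight": 4000,
    "encoder.block1.adapter.down": 250,
    "encoder.block1.adapter.up": 250,
    "decoder.conv.weight": 5000,
    "decoder.adapter.down": 250,
    "decoder.adapter.up": 250,
}

trainable = select_trainable(toy_params)
fraction = trainable_fraction(toy_params, trainable)
print(f"tunable parameters: {fraction:.0%} of the toy model")
```

In this toy setup only the four adapter entries are selected, so most of the pre-trained weights are left untouched, which is the mechanism by which adapter tuning can retain previously learned patterns.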

Paper Structure (44 sections, 5 equations, 8 figures, 16 tables)


Figures (8)

  • Figure 1: Previous Paradigm vs. Our Paradigm. The conventional training approach uses the same process for both static and dynamic datasets: a pre-trained image model serves as the encoder, and all U-Net parameters are fine-tuned for each dataset. In contrast, our two-stage training paradigm integrates adapters to progressively tailor the pre-trained image model for depth perception, initially on simple datasets (primarily static scenes) and then on intricate datasets (with dynamic scenes).
  • Figure 2: (a) The depth network is a U-Net structure that predicts depth from three consecutive frames. (b) The pose network regresses the relative camera pose given two images. (c) The adapter is a bottleneck structure with a skip connection. (d) Structure of the RepLKNet backbone. (e) Our encoder adapter design. We attach encoder adapters to the pre-trained RepLKBlock and ConvFFN. (f) Our decoder adapter design. Lerp denotes linear interpolation.
  • Figure 3: Comparisons of Different Training Strategies on CityScapes. Training from domain adaptation yields better depth estimates on vehicles and cyclists than training from scratch. Tuning the decoder adapter improves depth estimates in the upper portion of the car compared to the fully fine-tuned decoder.
  • Figure 4: Different Encoder Adapter Designs. Design (a) is our final choice.
  • Figure 5: Detailed Structure of Teacher and Student Depth Network.
  • ...and 3 more figures
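
The Figure 2(c) caption describes the adapter as a bottleneck structure with a skip connection. The sketch below illustrates that general pattern in plain Python on a single feature vector. The class name, the ReLU nonlinearity, the fully connected projections, and the zero initialization of the up-projection are all assumptions made for illustration, not details confirmed by the source.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    """Element-wise ReLU (assumed nonlinearity)."""
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Hypothetical adapter: down-project, nonlinearity, up-project, add skip."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (dim -> bottleneck) with small random init (assumption).
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (bottleneck -> dim), zero-initialized so the adapter
        # starts as an identity mapping (a common choice, assumed here).
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(matvec(self.W_down, x))
        delta = matvec(self.W_up, hidden)
        # Skip connection: the input passes through unchanged plus a residual.
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        down = len(self.W_down) * len(self.W_down[0])
        up = len(self.W_up) * len(self.W_up[0])
        return down + up


adapter = BottleneckAdapter(dim=8, bottleneck=2)
features = [float(i) for i in range(8)]
print(adapter(features) == features)   # identity at initialization
print(adapter.num_params(), "adapter params vs", 8 * 8, "for a dense 8x8 layer")
```

Because the residual branch is zero at initialization in this sketch, inserting the adapter does not change the frozen layer's output until training updates it, and the narrow bottleneck keeps the number of added parameters small.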