
DINO-SD: Champion Solution for ICRA 2024 RoboDepth Challenge

Yifan Mao, Ming Li, Jian Liu, Jiayang Liu, Zihan Qin, Chunxi Chu, Jialei Xu, Wenbo Zhao, Junjun Jiang, Xianming Liu

TL;DR

The paper addresses the robustness of surround-view depth estimation under out-of-distribution (OoD) corruptions without requiring additional training data. It introduces DINO-SD, a six-view depth estimator that pairs a DINOv2 encoder with Multiview-DPT (M-DPT) and DPT decoders and uses adjacent-view cross attention to fuse information from neighboring views. Training is supervised by LiDAR ground truth and combines a scale-invariant log loss $L_{silog}$, a smoothness loss $L_{smooth}$, and an AugMix-based consistency loss $L_{AugMix}$: $L = L_{silog} + \alpha L_{smooth} + \beta L_{AugMix}$, with $\alpha = 10^{-3}$ and $\beta = 10^{-2}$. At test time, OoD inputs are denoised. Empirically, DINO-SD achieves state-of-the-art results on RoboDepth Track 4, ablations validate the benefits of adjacent-view cross attention and AugMix-based training, and the model is robust across 18 corruption types without data augmentation beyond the proposed strategy. This work advances reliable, dense surround-view depth estimation for autonomous driving by reducing reliance on extra data and improving generalization to real-world degradations.
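To make the combined objective concrete, the following is a minimal sketch of how the three terms could be weighted, using the values $\alpha = 10^{-3}$ and $\beta = 10^{-2}$ stated above. The function names, the `variance_focus` value, and the particular form of the scale-invariant log loss are illustrative assumptions, not the authors' implementation; the smoothness and AugMix terms are passed in as precomputed scalars because their exact definitions are not given in this summary.

```python
import math

ALPHA = 1e-3  # weight of the smoothness term (value from the text)
BETA = 1e-2   # weight of the AugMix consistency term (value from the text)


def silog_loss(pred, gt, variance_focus=0.85):
    """Scale-invariant log loss over pixels with valid LiDAR depth (gt > 0).

    variance_focus=0.85 is an assumed default, not taken from the paper.
    pred and gt are flat sequences of positive depths; gt == 0 marks
    pixels without a LiDAR return, which are skipped.
    """
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    if not diffs:
        return 0.0
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - variance_focus * mean * mean, 0.0))


def total_loss(l_silog, l_smooth, l_augmix, alpha=ALPHA, beta=BETA):
    """Weighted sum L = L_silog + alpha * L_smooth + beta * L_AugMix."""
    return l_silog + alpha * l_smooth + beta * l_augmix


if __name__ == "__main__":
    pred = [10.0, 20.0, 5.0]
    gt = [10.0, 20.0, 0.0]  # third pixel has no LiDAR return
    l_silog = silog_loss(pred, gt)
    # The smoothness and AugMix values below are made-up placeholders.
    print(total_loss(l_silog, l_smooth=2.0, l_augmix=3.0))
```

With these weights the supervised term dominates, and the smoothness and consistency terms act as light regularizers.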

Abstract

Surround-view depth estimation is a crucial task that aims to acquire depth maps of the surrounding views. It has many real-world applications, such as autonomous driving, AR/VR, and 3D reconstruction. However, most of the data in autonomous driving datasets is collected in daytime scenarios, which leads to poor depth model performance on out-of-distribution (OoD) data. While some works try to improve the robustness of depth models under OoD data, these methods either require additional training data or lack generalizability. In this report, we introduce DINO-SD, a novel surround-view depth estimation model. DINO-SD does not need additional data and is strongly robust. DINO-SD achieves the best performance in Track 4 of the ICRA 2024 RoboDepth Challenge.

Paper Structure (11 sections, 6 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Our DINO-SD model uses the pretrained DINOv2 as the encoder, with M-DPT and DPT as the decoders.
  • Figure 2: Our training and testing pipeline.