Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

Zhiyuan Cheng; Cheng Han; James Liang; Qifan Wang; Xiangyu Zhang; Dongfang Liu

TL;DR

The paper addresses the vulnerability of monocular depth estimation (MDE) to physical-world adversarial attacks by introducing a self-supervised adversarial training framework that requires no ground-truth depth. It combines view-synthesis-based data generation with an $L_0$-norm surrogate for perturbations, using a photometric reconstruction loss $L_p$ to jointly harden a base MDE model (e.g., Monodepth2) and a pose estimator through two-view consistency. The approach outperforms contrastive-learning and supervised baselines under white-box, black-box, and physical-world attacks, preserving benign depth accuracy while significantly reducing adversarial depth errors, and it extends to indoor scenes and to more advanced networks such as Manydepth. For vision-based depth systems in autonomous driving, this enables robust training from realistic, low-cost synthetic data and targeted perturbations. The practical result is more reliable 3D perception in the presence of adversarial patches and environmental variability.
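To make the role of the photometric reconstruction loss $L_p$ more concrete, the following is a minimal sketch of an L1-only photometric error between a target view and a view synthesized from a neighbouring frame. It is an illustration under stated assumptions, not the paper's implementation: the function name `photometric_l1` and the optional `valid_mask` argument are hypothetical, and Monodepth2-style training typically combines an SSIM term with the L1 term, which is omitted here for brevity.

```python
import numpy as np

def photometric_l1(target, synthesized, valid_mask=None):
    """Mean absolute per-pixel difference between two (H, W, C) images.

    Hypothetical helper for illustration: `target` is the real view and
    `synthesized` is the view reconstructed from another frame using the
    predicted depth and pose (the warping step is not shown here).
    """
    target = np.asarray(target, dtype=np.float64)
    synthesized = np.asarray(synthesized, dtype=np.float64)
    # Average the absolute error over the channel axis -> (H, W) error map
    err = np.abs(target - synthesized).mean(axis=-1)
    if valid_mask is None:
        return float(err.mean())
    # Assumption: pixels that project outside the source view are masked out
    valid_mask = np.asarray(valid_mask, dtype=bool)
    return float(err[valid_mask].mean())
```

A lower value means the predicted depth and pose explain the second view better, which is why this quantity can serve as a training signal without depth labels.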

Abstract

Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating $L_0$-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance.
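As a rough illustration of what an $L_0$-norm bound on a perturbation means, the sketch below projects a perturbation onto the set of perturbations that touch at most `k` pixels by keeping the `k` pixels with the largest magnitude and zeroing the rest. This top-k projection is one simple way to enforce such a bound and is an assumption for illustration; the summary above does not specify how the paper enforces or approximates the constraint, and the name `project_l0` is hypothetical.

```python
import numpy as np

def project_l0(delta, k):
    """Zero all but the k pixels of `delta` with the largest magnitude.

    `delta` has shape (H, W) or (H, W, C). A pixel's magnitude is the sum
    of absolute values over its channels. Illustrative sketch only.
    """
    delta = np.asarray(delta, dtype=np.float64)
    h, w = delta.shape[:2]
    if k <= 0:
        return np.zeros_like(delta)
    if k >= h * w:
        return delta.copy()
    # Per-pixel magnitude, collapsing the channel axis if there is one
    mag = np.abs(delta).reshape(h, w, -1).sum(axis=-1).ravel()
    # Indices of the k largest magnitudes (order within the k is irrelevant)
    keep_idx = np.argpartition(mag, -k)[-k:]
    mask = np.zeros(h * w, dtype=bool)
    mask[keep_idx] = True
    mask = mask.reshape(h, w)
    if delta.ndim == 3:
        mask = mask[..., None]  # broadcast the pixel mask over channels
    return delta * mask
```

Unlike an $L_\infty$ bound, which limits how much every pixel may change, this constraint limits how many pixels change at all, which is closer to the footprint of a physical patch.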

Paper Structure (33 sections, 11 equations, 13 figures, 13 tables)

Introduction
Related Work
Monocular Depth Estimation
MDE Attack and Defense
Adversarial Robustness
Methodology
Self-supervised MDE Training
View Synthesis To Avoid Physical Scene Mutation
Robust Adversarial Perturbations
Evaluation
Experimental Setup
Main Results
Benign Performance
Quality of the view synthesis
White-box Attacks
...and 18 more sections

Figures (13)

Figure 1: Self-supervised adversarial training of MDE with view synthesis.
Figure 2: The pipeline of self-supervised adversarial training of monocular depth estimation.
Figure 3: (a) A top-down bird's-eye view of the relative positions of the camera and the target object. (b) The 3D coordinates of the object's four corners in the camera frame. (c) Projection of the physical-world object onto the two views.
Figure 4: More examples of view synthesis with different background scenes, target objects, and distances $z_c$ of the object. An object mask is used to remove the background of the 2D object image.
Figure 5: The reference images used in our human study.
...and 8 more figures
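
The geometry described in the Figure 3 caption (four object corners in the camera frame, projected onto two views) can be sketched with a standard pinhole camera model. This is a hedged illustration, not the paper's code: the function names, the fronto-parallel placement of the object, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions; only the distance $z_c$ comes from the captions.

```python
import numpy as np

def corners_camera_frame(width, height, z_c, x_offset=0.0, y_offset=0.0):
    """Four corners of a fronto-parallel rectangle at depth z_c.

    Camera frame convention assumed here: x right, y down, z forward.
    Order: top-left, top-right, bottom-right, bottom-left.
    """
    hw, hh = width / 2.0, height / 2.0
    return np.array([
        [x_offset - hw, y_offset - hh, z_c],
        [x_offset + hw, y_offset - hh, z_c],
        [x_offset + hw, y_offset + hh, z_c],
        [x_offset - hw, y_offset + hh, z_c],
    ], dtype=np.float64)

def to_second_view(points, rotation, translation):
    """Express points in a second camera's frame via a rigid transform."""
    points = np.asarray(points, dtype=np.float64)
    return points @ np.asarray(rotation).T + np.asarray(translation)

def project_pinhole(points, fx, fy, cx, cy):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    points = np.asarray(points, dtype=np.float64)
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)

# Usage: a 2 m x 1 m object 10 m ahead, seen from two camera positions.
corners = corners_camera_frame(2.0, 1.0, z_c=10.0)
view_a = project_pinhole(corners, fx=100, fy=100, cx=50, cy=50)
# Second camera is 5 m closer along the optical axis (hypothetical pose).
corners_b = to_second_view(corners, np.eye(3), [0.0, 0.0, -5.0])
view_b = project_pinhole(corners_b, fx=100, fy=100, cx=50, cy=50)
```

The two sets of pixel corners define where the object image would be warped into each background view, so a single 2D object image can be placed consistently in both frames without physically altering the scene.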
