Black-box Adversarial Attacks on Monocular Depth Estimation Using Evolutionary Multi-objective Optimization
Renya Daimo, Satoshi Ono, Takahiro Suzuki
TL;DR
This paper studies the security of monocular depth estimation by examining the vulnerability of DNNs to black-box adversarial attacks. It introduces a black-box global optimization framework based on Evolutionary Multi-objective Optimization (EMO) that perturbs only the texture of a target object and requires neither a substitute model nor training data. The attack is formulated as the optimization of a block-wise texture perturbation pattern under two objectives, $f_1$ (depth error) and $f_2$ (perturbation magnitude), and is solved with MOEA/D. The approach yields scene-specific, targeted adversarial examples that cause depth maps to misrepresent the targeted objects (often as if they had disappeared) for both indoor and outdoor models, as demonstrated on networks based on NYU Depth v2 and KITTI. The work highlights practical implications for the vulnerability of proprietary depth-estimation services and outlines future directions for reducing computation and enabling physical attacks via hybridization with boundary-based methods.
Abstract
This paper proposes an adversarial attack method against deep neural networks (DNNs) for monocular depth estimation, i.e., estimating depth from a single image. Single-image depth estimation has improved drastically in recent years owing to the development of DNNs. However, adversarial attacks have revealed vulnerabilities in DNNs for image classification, and DNNs for monocular depth estimation could contain similar vulnerabilities. Research on the vulnerabilities of DNNs for monocular depth estimation has therefore spread rapidly, but many existing studies either assume white-box conditions, where internal information about the DNN is available, or are transferability-based black-box attacks that require a substitute DNN model and a training dataset. Using Evolutionary Multi-objective Optimization, the proposed method analyzes DNNs under the black-box condition in which only the output depth maps are available. In addition, the proposed method requires neither a substitute DNN with an architecture similar to that of the target DNN nor any knowledge of the data used to train the target model. Experimental results showed that the proposed method succeeded in attacking two DNN-based methods trained on indoor and outdoor scenes, respectively.
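To make the two-objective, block-wise formulation summarized in the TL;DR more concrete, the following is a minimal Python sketch of how one candidate perturbation might be evaluated against a black-box depth estimator. It is an illustration under stated assumptions, not the authors' implementation: the function names, the block grid layout, the use of an attacker-chosen `target_depth` map, and the exact forms of $f_1$ (mean absolute depth difference inside the object mask) and $f_2$ (RMS perturbation inside the mask) are all hypothetical choices, and `toy_depth_model` is only a stand-in for a real depth network. The MOEA/D search loop itself is omitted.

```python
import numpy as np


def expand_blocks(block_values, block_size):
    """Upsample a coarse (hb, wb, c) grid so each cell covers block_size x block_size pixels."""
    rows = np.repeat(block_values, block_size, axis=0)
    return np.repeat(rows, block_size, axis=1)


def apply_perturbation(image, mask, block_values, block_size):
    """Add the block-wise perturbation to the masked object only (image values in [0, 1])."""
    delta = expand_blocks(block_values, block_size)
    delta = delta[: image.shape[0], : image.shape[1]]  # crop to the image size
    delta = delta * mask[..., None]                    # leave the background untouched
    return np.clip(image + delta, 0.0, 1.0)


def objectives(image, mask, block_values, block_size, depth_model, target_depth):
    """Return (f1, f2) for one candidate; both are to be minimized in this sketch.

    f1: assumed depth error, mean |predicted - target| depth inside the mask.
    f2: assumed perturbation magnitude, RMS pixel change inside the mask.
    """
    adversarial = apply_perturbation(image, mask, block_values, block_size)
    predicted = depth_model(adversarial)  # the only access to the model: one query
    f1 = float(np.abs(predicted - target_depth)[mask].mean())
    f2 = float(np.sqrt(((adversarial - image)[mask] ** 2).mean()))
    return f1, f2


def toy_depth_model(img):
    """Hypothetical stand-in for a depth network: 'depth' is the channel mean."""
    return img.mean(axis=2)


if __name__ == "__main__":
    image = np.full((8, 8, 3), 0.5)
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 2:6] = True                      # region occupied by the target object
    target_depth = np.ones((8, 8))             # attacker-chosen depth for the object
    candidate = np.full((2, 2, 3), 0.25)       # 2 x 2 grid of 4 x 4-pixel blocks
    print(objectives(image, mask, candidate, 4, toy_depth_model, target_depth))
```

In an evolutionary multi-objective setting, a function like `objectives` would be called once per candidate solution, and the optimizer would retain candidates that trade off a low depth error against a small perturbation. Because only `depth_model(adversarial)` is queried, no gradients or internal parameters of the model are needed.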
