
APARATE: Adaptive Adversarial Patch for CNN-based Monocular Depth Estimation for Autonomous Navigation

Amira Guesmi, Muhammad Abdullah Hanif, Ihsen Alouani, Muhammad Shafique

TL;DR

This work tackles the security risk of monocular depth estimation (MDE) in autonomous navigation by introducing APARATE, an adaptive physical adversarial patch that is sensitive to object shape, size, and distance. APARATE uses two masks and a patch transformer to expand the patch’s influence beyond the overlapped region, enabling complete concealment or depth distortion of targeted objects. A penalized depth loss, decomposed into overlapped and non-overlapped components and augmented with non-printability and total-variation terms, drives a gradient-based optimization of the patch, while an automated object-detection step localizes targets for training. Experiments on KITTI with three MDE models show substantial depth errors and near-complete region disruption, outperforming prior patches and maintaining impact under common defenses. The results highlight the need for robust defenses in real-world autonomous systems and provide a framework for evaluating adversarial patches in depth perception tasks.
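The TL;DR lists the ingredients of the training objective but not their exact form. The following is a minimal NumPy sketch of how they could fit together: the depth term is split into the part of the focus region that the patch overlaps and the non-overlapped remainder, and non-printability and total-variation regularizers are added. Everything beyond that split is an assumption made for illustration. This includes the L1 error against a target disparity (0 stands in for "far away", i.e., the concealment mode), modeling the "penalty" as an up-weighting of the non-overlapped term, the simplified nearest-colour non-printability score, and all weights (`w_non_overlap`, `lam_nps`, `lam_tv`). In an actual attack these terms would be computed in an autodiff framework so that the gradient with respect to the patch is available.

```python
import numpy as np


def masked_mean(values: np.ndarray, mask: np.ndarray) -> float:
    """Mean of `values` over pixels where `mask` is 1; 0.0 if the mask is empty."""
    total = mask.sum()
    return float((values * mask).sum() / total) if total > 0 else 0.0


def total_variation(patch: np.ndarray) -> float:
    """Anisotropic TV of an (H, W, C) patch: mean absolute difference between neighbouring pixels."""
    dh = np.abs(patch[1:, :, :] - patch[:-1, :, :]).mean()
    dw = np.abs(patch[:, 1:, :] - patch[:, :-1, :]).mean()
    return float(dh + dw)


def non_printability(patch: np.ndarray, printable: np.ndarray) -> float:
    """Simplified NPS (assumption): mean distance from each patch pixel to its nearest printable colour."""
    # (H, W, 1, 3) - (1, 1, K, 3) -> per-colour distances of shape (H, W, K)
    dists = np.linalg.norm(patch[:, :, None, :] - printable[None, None, :, :], axis=-1)
    return float(dists.min(axis=-1).mean())


def penalized_depth_loss(disp, m_patch, m_focus, target_disp=0.0, w_non_overlap=2.0):
    """Hypothetical depth term on a normalized disparity map `disp` of shape (H, W).

    The focus region is split into the part overlapped by the patch (m_patch)
    and the rest (m_focus minus m_patch); the second part is up-weighted by
    `w_non_overlap`, an assumed form of the "penalty".
    """
    err = np.abs(disp - target_disp)  # distance to the attacker's target disparity
    overlapped = m_patch * m_focus
    non_overlapped = m_focus * (1.0 - m_patch)
    return masked_mean(err, overlapped) + w_non_overlap * masked_mean(err, non_overlapped)


def total_loss(disp, patch, m_patch, m_focus, printable, lam_nps=0.01, lam_tv=2.5):
    """Depth term plus non-printability and total-variation regularizers (weights are placeholders)."""
    return (penalized_depth_loss(disp, m_patch, m_focus)
            + lam_nps * non_printability(patch, printable)
            + lam_tv * total_variation(patch))


# Toy usage: constant disparity, whole image is the target, 2x2 patch in the centre.
disp = np.full((4, 4), 0.5)
m_focus = np.ones((4, 4))
m_patch = np.zeros((4, 4))
m_patch[1:3, 1:3] = 1.0
printable = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
patch = np.ones((2, 2, 3))  # flat white patch, so TV = 0 and NPS = 0
print(total_loss(disp, patch, m_patch, m_focus, printable))  # 1.5
```

With a uniform error of 0.5, the overlapped term contributes 0.5 and the up-weighted non-overlapped term contributes 2 × 0.5, so the toy loss is 1.5. Raising `w_non_overlap` pushes the optimizer to spread the effect beyond the patch footprint, which is the behaviour the summary attributes to APARATE.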

Abstract

Monocular depth estimation (MDE) has recently advanced significantly in performance, largely owing to the adoption of architectures such as convolutional neural networks (CNNs) and Transformers. However, the susceptibility of these models to adversarial attacks has become a notable concern, especially in safety- and security-critical domains. This concern is particularly acute for MDE because of its role in applications such as autonomous driving and robotic navigation, where accurate scene understanding is essential. To assess the vulnerability of CNN-based depth prediction, recent work has designed adversarial patches against MDE. However, existing approaches do not produce a comprehensive, substantially disruptive impact on the vision system; their influence is partial and confined to local areas. These methods cause erroneous depth predictions only within the region where the patch overlaps the input image, and they ignore characteristics of the target object such as its size, shape, and position. In this paper, we introduce a novel adversarial patch named APARATE. It can selectively undermine MDE in two distinct ways: by distorting the estimated distances, or by creating the illusion that an object has disappeared from the autonomous system's view. Notably, APARATE is designed to be sensitive to the shape and scale of the target object, and its influence extends beyond the patch's immediate vicinity. When applied to CNN-based MDE models, APARATE results in a mean depth estimation error above $0.5$ and affects up to $99\%$ of the targeted region. For Transformer-based MDE, it yields an error of $0.34$ and substantially affects $94\%$ of the target region.
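The abstract reports two quantities per setting: a mean depth estimation error and the share of the target region that the patch affects. Their exact definitions are not given in this summary. The sketch below assumes that both are computed on normalized disparity maps inside the target-object mask, and that a pixel counts as affected when its disparity changes by more than a threshold. The threshold value of 0.1 and both function names are illustrative assumptions, not the paper's definitions.

```python
import numpy as np


def mean_depth_error(disp_clean: np.ndarray, disp_adv: np.ndarray, region: np.ndarray) -> float:
    """Mean absolute change of the normalized disparity inside the binary target mask."""
    diff = np.abs(disp_adv - disp_clean)
    return float(diff[region.astype(bool)].mean())


def affected_ratio(disp_clean: np.ndarray, disp_adv: np.ndarray, region: np.ndarray,
                   threshold: float = 0.1) -> float:
    """Fraction of target pixels whose disparity changed by more than `threshold` (hypothetical value)."""
    diff = np.abs(disp_adv - disp_clean)[region.astype(bool)]
    return float((diff > threshold).mean())


# Toy 2x2 target region: three pixels shift strongly, one barely moves.
clean = np.zeros((2, 2))
adv = np.array([[0.6, 0.6], [0.6, 0.05]])
target = np.ones((2, 2))
print(mean_depth_error(clean, adv, target))  # ≈ 0.4625
print(affected_ratio(clean, adv, target))    # 0.75
```

Read this way, an affected ratio of $99\%$ would mean that almost every pixel of the target object moved past the chosen threshold, which is the kind of region-wide effect the abstract contrasts with earlier, locally confined patches.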

Paper Structure

This paper contains 18 sections, 17 equations, 5 figures, 10 tables, and 1 algorithm.

Figures (5)

  • Figure 1: APARATE makes the object disappear entirely. In contrast, the adversarial patches proposed by Yamanaka et al. [Yamanaka_Access] and Cheng et al. [Cheng_ECCV] are weak: they affect the depth of only a small region of the target object, restricted to the area where the patch overlaps the input image.
  • Figure 2: Overview of the proposed approach. Given a pre-trained object detector, we generate two masks: the patch mask ($M_p$), corresponding to the location of the patch at the center of the target object, and the focus mask ($M_f$), corresponding to the area covered by the object and the attacked region. We feed the patch to the patch transformer and apply the geometric transformations described in Section \ref{GT}. Next, we render the patch onto the input image using information from the object detector, as described in Section \ref{PA}. We then perform a forward pass, i.e., we feed the resulting adversarial image to the MDE model, and apply the generated masks to compute the required loss functions. Finally, we compute the gradient with respect to the patch and use it to update the patch $P$.
  • Figure 3: Depth prediction with and without the penalized loss function: (Top) the input images, (Middle) results without the penalized depth loss (i.e., with the conventional loss), (Bottom) results with the penalized depth loss.
  • Figure 4: Impact of APARATE on pedestrian/cyclist class.
  • Figure 5: Impact of APARATE on car class.
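Figure 2 derives two masks from the object detector and renders the patch at the center of the target object. The sketch below covers only those two steps, under stated assumptions: a single axis-aligned box `(x0, y0, x1, y1)` from the detector, a square patch whose side is a hypothetical fraction `patch_scale` of the shorter box side, nearest-neighbour resizing, and a focus mask simplified to the object box. The paper's patch transformer applies further geometric transformations that are not reproduced here, and all function names are illustrative.

```python
import numpy as np


def make_masks(img_hw, box, patch_scale=0.5):
    """Build the patch mask M_p and focus mask M_f from one detector box.

    box = (x0, y0, x1, y1) in pixel coordinates, assumed to lie inside the image.
    patch_scale (hypothetical) sets the patch side as a fraction of the shorter
    box side, so the patch grows and shrinks with the apparent size of the object.
    """
    h, w = img_hw
    x0, y0, x1, y1 = box
    m_focus = np.zeros((h, w), dtype=np.float32)
    m_focus[y0:y1, x0:x1] = 1.0                  # simplification: focus region = object box
    side = max(1, int(patch_scale * min(x1 - x0, y1 - y0)))
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2      # centre of the target object
    px0, py0 = cx - side // 2, cy - side // 2    # top-left corner of the patch
    m_patch = np.zeros((h, w), dtype=np.float32)
    m_patch[py0:py0 + side, px0:px0 + side] = 1.0
    return m_patch, m_focus, (px0, py0, side)


def resize_nearest(patch, side):
    """Nearest-neighbour resize of an (H, W, C) patch to (side, side, C)."""
    rows = np.arange(side) * patch.shape[0] // side
    cols = np.arange(side) * patch.shape[1] // side
    return patch[rows][:, cols]


def render_patch(image, patch, m_patch, placement):
    """Paste the scaled patch into `image` (H, W, C) wherever m_patch == 1."""
    px0, py0, side = placement
    canvas = np.zeros_like(image)
    canvas[py0:py0 + side, px0:px0 + side] = resize_nearest(patch, side)
    m = m_patch[..., None]                       # broadcast the mask over colour channels
    return (1.0 - m) * image + m * canvas


image = np.zeros((10, 10, 3))
m_patch, m_focus, placement = make_masks((10, 10), box=(2, 2, 8, 8))
adv = render_patch(image, np.ones((6, 6, 3)), m_patch, placement)
print(int(m_patch.sum()), int(m_focus.sum()), float(adv.sum()))  # 9 36 27.0
```

Because the patch side is tied to the box size in this sketch, a distant (small) object receives a smaller patch than a nearby one. This is one simple way to make placement scale-aware. The rendered image would then go through the MDE model, and the two masks would select the overlapped and non-overlapped regions for the loss.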