SSAP: A Shape-Sensitive Adversarial Patch for Comprehensive Disruption of Monocular Depth Estimation in Autonomous Navigation Applications
Amira Guesmi, Muhammad Abdullah Hanif, Ihsen Alouani, Bassem Ouni, Muhammad Shafique
TL;DR
SSAP is a shape-sensitive adversarial patch that comprehensively disrupts monocular depth estimation (MDE), either by distorting the estimated distance of a target object or by making the object appear to vanish from the depth prediction. It combines a detector-guided patch with dual masks, a patch transformation block, and a penalized depth loss so that the attack affects the whole target region across varying distances and scales, and it is effective against both CNN-based and Transformer-based MDE models. Experiments on KITTI (and CASIA for pedestrians) report a mean depth estimation error exceeding 0.5 with up to 99% of the target region affected for CNN-based models; on Transformer-based models such as MIMDepth, the error is about 0.59 with 99% of the target region affected, outperforming prior attacks. The work highlights significant security implications for autonomous navigation and underscores the need for robust defenses against cross-model, shape-aware adversarial patches.
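To make the idea of a detector-guided, scale-aware patch concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bounding box has already been produced by some object detector, resizes a square patch relative to that box, pastes it at the box center, and returns two boolean masks. Reading "dual masks" as one mask for the patch and one for the object is an assumption, and `resize_nearest`, `apply_patch`, and `rel_size` are hypothetical names introduced only for illustration.

```python
import numpy as np


def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array to (out_h, out_w, C)."""
    rows = np.arange(out_h) * patch.shape[0] // out_h
    cols = np.arange(out_w) * patch.shape[1] // out_w
    return patch[rows[:, None], cols]


def apply_patch(image, patch, box, rel_size=0.5):
    """Paste `patch` at the centre of `box`, scaled relative to the box size.

    image:    (H, W, C) array, left unmodified
    patch:    (h, w, C) array
    box:      (x0, y0, x1, y1) pixel box, assumed to come from a detector
    rel_size: hypothetical knob, patch side as a fraction of the shorter box side
    Returns the patched image, a patch mask, and an object mask (both boolean).
    """
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    side = max(1, round(rel_size * min(box_h, box_w)))
    resized = resize_nearest(patch, side, side)

    # Centre the resized patch inside the detected box.
    top = y0 + (box_h - side) // 2
    left = x0 + (box_w - side) // 2

    out = image.copy()
    out[top:top + side, left:left + side] = resized

    # Assumed interpretation of the two masks: where the patch is,
    # and which region belongs to the target object.
    patch_mask = np.zeros(image.shape[:2], dtype=bool)
    patch_mask[top:top + side, left:left + side] = True
    object_mask = np.zeros(image.shape[:2], dtype=bool)
    object_mask[y0:y1, x0:x1] = True
    return out, patch_mask, object_mask
```

Because the patch size is tied to the box, a nearer (larger) object receives a larger patch and a farther one a smaller patch, which is one simple way to expose a patch to different scales during training. A real pipeline would likely use a segmentation mask rather than a rectangular box for the object region.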
Abstract
Monocular depth estimation (MDE) has advanced significantly, primarily through the integration of convolutional neural networks (CNNs) and, more recently, Transformers. However, concerns about their susceptibility to adversarial attacks have emerged, especially in safety-critical domains like autonomous driving and robotic navigation. Existing approaches for assessing CNN-based depth prediction methods have fallen short in inducing comprehensive disruptions to the vision system, and their effect is often limited to specific local areas. In this paper, we introduce SSAP (Shape-Sensitive Adversarial Patch), a novel approach designed to comprehensively disrupt MDE in autonomous navigation applications. Our patch is crafted to selectively undermine MDE in two distinct ways: by distorting estimated distances or by creating the illusion of an object disappearing from the system's perspective. Notably, our patch is shape-sensitive, meaning it considers the specific shape and scale of the target object, thereby extending its influence beyond immediate proximity. Furthermore, our patch is trained to remain effective across different scales and distances from the camera. Experimental results demonstrate that our approach induces a mean depth estimation error surpassing 0.5, impacting up to 99% of the targeted region for CNN-based MDE models. Additionally, we investigate the vulnerability of Transformer-based MDE models to patch-based attacks, revealing that SSAP yields a significant error of 0.59 and exerts substantial influence over 99% of the target region on these models.
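The two headline numbers, a mean depth estimation error and a fraction of the target region that is affected, can be illustrated with a short sketch. This is one plausible way to compute such quantities and not the paper's exact definition, which is not given in this section. The function name `region_metrics` and the `change_threshold` parameter are hypothetical, and depth maps are assumed to be arrays of the same shape on a common scale.

```python
import numpy as np


def region_metrics(clean_depth, adv_depth, target_mask, change_threshold=0.1):
    """Summarise how much an attack changes predicted depth inside a region.

    clean_depth, adv_depth: (H, W) depth predictions without / with the patch
    target_mask:            (H, W) boolean mask of the target object
    change_threshold:       hypothetical cutoff above which a pixel counts
                            as affected (an assumption, not from the paper)
    Returns (mean absolute depth change, fraction of affected pixels),
    both computed over the masked region only.
    """
    if not target_mask.any():
        raise ValueError("target_mask selects no pixels")
    diff = np.abs(adv_depth - clean_depth)[target_mask]
    mean_error = float(diff.mean())
    affected_ratio = float((diff > change_threshold).mean())
    return mean_error, affected_ratio
```

Restricting both statistics to the object mask matches the abstract's emphasis on the targeted region: an attack that only perturbs pixels directly under the patch would score a low affected ratio even if its local error were large.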
