Toward Inherently Robust VLMs Against Visual Perception Attacks
Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Bing Li, Mert D. Pesé
TL;DR
The paper tackles the vulnerability of autonomous vehicle (AV) perception to unseen adversarial attacks by introducing Vehicle Vision-Language Models (V2LMs), vision-language models fine-tuned for traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). It shows that V2LMs are inherently robust without adversarial training: they outperform conventional DNNs under unseen attacks while maintaining high clean accuracy. The study evaluates two deployment modes, Solo (task-specific V2LMs) and Tandem (one V2LM for all three tasks), and demonstrates that Tandem achieves comparable robustness with a substantially reduced memory footprint. These findings suggest that V2LMs offer a practical and scalable path to secure AV perception; the paper also discusses integration strategies and latency considerations for real-time deployment.
Abstract
Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attack, DNN performance drops by 33-74%, whereas V2LM performance declines by under 8% on average. Tandem achieves robustness comparable to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest that V2LMs are a promising path toward secure, robust AV perception.
