Toward Inherently Robust VLMs Against Visual Perception Attacks

Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Bing Li, Mert D. Pesé

TL;DR

The paper addresses the vulnerability of autonomous vehicle (AV) perception to unseen adversarial attacks by introducing Vehicle Vision-Language Models (V2LMs), vision-language models fine-tuned for traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). It shows that V2LMs are inherently robust without adversarial training: they outperform conventional DNNs under unseen attacks while maintaining high clean accuracy. The study evaluates two deployment modes, Solo (one task-specific V2LM per task) and Tandem (a single V2LM for all three tasks), and demonstrates that Tandem achieves comparable robustness with a substantially smaller memory footprint. These findings position V2LMs as a practical and scalable path to secure AV perception; the paper also discusses integration strategies and latency considerations for real-time deployment.

Abstract

Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attacks, DNNs drop 33-74%, whereas V2LMs decline by under 8% on average. Tandem achieves comparable robustness to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest V2LMs are a promising path toward secure, robust AV perception.
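The abstract reports accuracy drops (33-74% for DNNs, under 8% on average for V2LMs) without stating whether these are absolute drops in percentage points or drops relative to clean accuracy. As a minimal sketch, the hypothetical helper `accuracy_drop` below computes both readings from a clean and an adversarial accuracy; the function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def accuracy_drop(clean_acc: float, adv_acc: float) -> tuple:
    """Return (absolute_drop, relative_drop) for accuracies given in percent.

    absolute_drop: clean minus adversarial accuracy, in percentage points.
    relative_drop: the same difference as a percentage of clean accuracy.
    Hypothetical helper; the paper does not specify which reading it uses.
    """
    if not 0.0 < clean_acc <= 100.0:
        raise ValueError("clean_acc must be in (0, 100]")
    if not 0.0 <= adv_acc <= 100.0:
        raise ValueError("adv_acc must be in [0, 100]")
    absolute = clean_acc - adv_acc
    relative = 100.0 * absolute / clean_acc
    return absolute, relative


# Illustrative numbers only (not taken from the paper).
absolute, relative = accuracy_drop(clean_acc=90.0, adv_acc=45.0)
print(f"absolute drop: {absolute:.1f} points, relative drop: {relative:.1f}%")
```

The two readings diverge most when clean accuracy is low, so it is worth checking which one a reported figure uses before comparing models.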

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Impact of adversarial training (AT) on NVILA, Qwen-VL, YOLOv5-cls, and ViT-small. (a) Adversarial inputs. (b) Benign inputs.
  • Figure 2: Examples of adversarial attacks targeting AV perception.
  • Figure 3: Possible integration of V2LM with perception system for attack mitigation. I_roadview is the image captured by AV cameras, with the output sent to the planning and control modules for action.
  • Figure 4: Comparison of Solo and Tandem Modes.
  • Figure 5: IoU values before and after fine-tuning.
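
For reference, the IoU (intersection over union) metric named in Figure 5 can be sketched as follows, assuming axis-aligned bounding boxes in `(x1, y1, x2, y2)` form. The caption does not confirm the box format or how the paper computes IoU, so the hypothetical function `iou` shows only the standard definition.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).

    Assumes x1 <= x2 and y1 <= y2 for both boxes. Hypothetical helper
    showing the standard definition, not the paper's evaluation code.
    """
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1.0 means the predicted and ground-truth boxes coincide, and 0.0 means they do not overlap at all.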