Phantom Menace: Exploring and Enhancing the Robustness of VLA Models Against Physical Sensor Attacks

Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Zhangrui Chen, Hanwen Yu, Bohan Qian, Ruochen Zhou, Xiaoyu Ji, Wenyuan Xu

TL;DR

This work demonstrates that Vision-Language-Action (VLA) systems are vulnerable to physical sensor attacks that perturb cameras and microphones, potentially causing unsafe robot behavior. It introduces the Real-Sim-Real framework to automatically simulate a diverse set of physics-based sensor perturbations and validate them on real robots, enabling large-scale, cross-model robustness assessments. The study evaluates four VLA architectures across LIBERO datasets, reveals pronounced susceptibility with attack- and task-dependent patterns, and confirms the simulator's predictions through real-world experiments. An adversarial-training defense is proposed and shown to improve robustness against out-of-distribution perturbations while maintaining performance on benign data, underscoring the need for standardized robustness benchmarks in safety-critical deployments.

Abstract

Vision-Language-Action (VLA) models revolutionize robotic systems by enabling end-to-end perception-to-action pipelines that integrate multiple sensory modalities, such as visual signals processed by cameras and auditory signals captured by microphones. This multi-modality integration allows VLA models to interpret complex, real-world environments using diverse sensor data streams. Although VLA-based systems rely heavily on sensory input, the security of VLA models against physical-world sensor attacks remains critically underexplored. To address this gap, we present the first systematic study of physical sensor attacks against VLAs, quantifying the influence of sensor attacks and investigating defenses for VLA models. We introduce a novel "Real-Sim-Real" framework that automatically simulates physics-based sensor attack vectors, including six attacks targeting cameras and two targeting microphones, and validates them on real robotic systems. Through large-scale evaluations across various VLA architectures and tasks under varying attack parameters, we demonstrate significant vulnerabilities, with susceptibility patterns that reveal critical dependencies on task types and model designs. We further develop an adversarial-training-based defense that enhances VLA robustness against out-of-distribution physical perturbations caused by sensor attacks while preserving model performance. Our findings expose an urgent need for standardized robustness benchmarks and mitigation strategies to secure VLA deployments in safety-critical environments.
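
As a rough illustration of what an evaluation "under varying attack parameters" can look like, the sketch below sweeps one perturbation over several intensities and records a success rate at each. Everything in it is a hypothetical stand-in: `overexpose` is a simple blend toward white, not one of the paper's physics-based attack models, and `toy_policy` is a contrast check, not a VLA model.

```python
# Toy robustness sweep: success rate of a policy as attack intensity grows.
# All names are hypothetical stand-ins, not the paper's simulator or models.

def overexpose(image, intensity):
    """Blend every pixel toward full brightness (1.0); intensity is in [0, 1]."""
    return [[(1 - intensity) * p + intensity for p in row] for row in image]

def toy_policy(image, min_contrast=0.35):
    """'Succeeds' only if the dark target still stands out from the background."""
    pixels = [p for row in image for p in row]
    return (max(pixels) - min(pixels)) > min_contrast

def success_rate(policy, perturb, intensity, episodes):
    """Fraction of episodes the policy completes under one attack intensity."""
    wins = sum(policy(perturb(image, intensity)) for image in episodes)
    return wins / len(episodes)

# Four 2x2 "scenes": white background, one target pixel of varying darkness.
episodes = [[[1.0, 1.0], [1.0, t]] for t in (0.0, 0.2, 0.4, 0.6)]
intensities = [0.0, 0.25, 0.5, 0.75]
rates = [success_rate(toy_policy, overexpose, i, episodes) for i in intensities]

for i, r in zip(intensities, rates):
    print(f"intensity={i:.2f}  success_rate={r:.2f}")
```

In a real study the perturbation would be a calibrated sensor-attack model and each episode a full task rollout, but the loop structure (attack type, intensity, success rate) stays the same.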

Paper Structure

This paper contains 41 sections, 3 equations, 5 figures, 4 tables, 6 algorithms.

Figures (5)

  • Figure 1: Overview of the "Real-Sim-Real" framework. We demonstrate that VLA models are vulnerable to physical sensor attacks, where attackers inject malicious signals (e.g., laser, electromagnetic interference, ultrasound) into cameras and microphones, leading to severe consequences in real-world deployments. Our framework automatically evaluates these physical attack vectors to quantify their impact, and we propose defenses to enhance VLA robustness.
  • Figure 2: Architecture and pipeline of VLA models. A VLA model comprises a VLM and an action decoder. The VLM employs visual encoders and a text encoder to transform image and text data into multimodal embeddings. These embeddings are then processed by an LLM backbone to generate action tokens, which are subsequently decoded by an action decoder into corresponding physical robot actions.
  • Figure 3: We implement and simulate eight sensor attacks, including six targeting cameras and two targeting microphones, covering laser, light, acoustic, and EM signals. Each attack is shown at several intensities, increasing from left to right.
  • Figure 4: Real-world experiment setup. A Franka Panda equipped with a wrist camera, a full camera, and a microphone serves as the target VLA system. Attack devices include the EMI platform, projection platform, laser platform, and ultrasound platform.
  • Figure 5: Real-world attack consequence.
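
The data flow described in the Figure 2 caption (encoders, multimodal embeddings, LLM backbone, action tokens, action decoder) can be sketched as a toy pipeline. This is a minimal sketch under stated assumptions: every component below is a hypothetical stand-in function, and the uniform binning of actions into `NUM_BINS` tokens over [-1, 1] is an illustrative choice, not something the captions specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

NUM_BINS = 5  # assumed size of the discrete action vocabulary (illustrative)

@dataclass
class ToyVLA:
    """Four-stage pipeline: encoders -> embeddings -> action tokens -> actions."""
    visual_encoder: Callable[[Sequence[float]], List[float]]
    text_encoder: Callable[[str], List[float]]
    llm_backbone: Callable[[List[float]], List[int]]
    action_decoder: Callable[[List[int]], List[float]]

    def act(self, image: Sequence[float], instruction: str) -> List[float]:
        # Concatenate per-modality embeddings into one multimodal sequence.
        multimodal = self.visual_encoder(image) + self.text_encoder(instruction)
        action_tokens = self.llm_backbone(multimodal)
        return self.action_decoder(action_tokens)

def mean_pixel(image):          # stand-in visual encoder
    return [sum(image) / len(image)]

def word_count(instruction):    # stand-in text encoder
    return [min(len(instruction.split()) / 8, 1.0)]

def to_tokens(embedding):       # stand-in backbone: quantize each value to a bin
    return [round(min(max(v, 0.0), 1.0) * (NUM_BINS - 1)) for v in embedding]

def detokenize(tokens):         # map bin index back to a value in [-1, 1]
    return [-1 + 2 * t / (NUM_BINS - 1) for t in tokens]

model = ToyVLA(mean_pixel, word_count, to_tokens, detokenize)
clean_action = model.act([1.0, 1.0], "pick up the cup")
attacked_action = model.act([0.0, 0.0], "pick up the cup")  # blacked-out camera
print(clean_action, attacked_action)
```

Because the action is computed end to end from the raw sensor input, changing only the image changes the emitted action, which is the dependency a camera-side attack exploits.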