
Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du, Shuaihang Chen, Yu Tian, Mingjie Wei, Chao Yu, Zhaoxia Yin

Abstract

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making end-to-end optimization difficult. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
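The core idea behind FBD, as summarized above, is to render the background with the original simulator while rendering only the target object with a differentiable renderer, then composite the two into one observation so that gradients from the VLA loss can reach the object texture. A minimal numpy sketch of this composition step is below; the function name `composite_fbd` and the toy mask/color values are illustrative assumptions, not the paper's implementation (which uses MuJoCo and Nvdiffrast with aligned geometry and lighting).

```python
import numpy as np

def composite_fbd(mask, fg, bg):
    """Illustrative FBD-style composition (assumption, not the paper's code):
    `fg` is the target object image from a differentiable renderer,
    `mask` its coverage (1 where the object is visible), and `bg` the
    simulator-rendered background. Gradients w.r.t. the object texture
    flow only through the `mask * fg` term."""
    return mask * fg + (1.0 - mask) * bg

# Toy 4x4 RGB example: gray background, bright object patch in the center.
H, W = 4, 4
bg = np.full((H, W, 3), 0.2)
fg = np.full((H, W, 3), 0.8)
mask = np.zeros((H, W, 1))
mask[1:3, 1:3] = 1.0
obs = composite_fbd(mask, fg, bg)  # composited observation fed to the VLA
```

In the actual pipeline the same expression would be written in an autodiff framework so that back-propagating the VLA objective through `obs` updates the texture; the background branch contributes no gradient, which is what keeps the original simulation environment untouched.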


Paper Structure

This paper contains 15 sections, 13 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between Tex3D and existing attack paradigms. Bottom-right: VLA exhibits a certain degree of generalization under color changes and Gaussian noise perturbations, but its task failure rate rises sharply under Tex3D.
  • Figure 2: Overview of Tex3D. FBD renders the background in MuJoCo and the target object in Nvdiffrast, with cross-renderer alignment of geometric parameters $(\mathbf{P}_t,\mathcal{V}_t,\mathbf{M}_t)$ and lighting parameters $(I_a,I_d,\rho)$ for photometrically consistent scene composition. The composited observation is fed into the frozen VLA model, and gradients from untargeted or targeted objectives are back-propagated to directly optimize the object texture. TAAO further applies dynamics-guided weighting over critical frames, enabling temporally effective adversarial 3D texture optimization over complex long-horizon manipulation trajectories.
  • Figure 3: Qualitative results of Tex3D on manipulation tasks. For each task, the green row shows the clean rollout, whereas the red row shows the adversarial rollout under Tex3D.
  • Figure 4: Robustness comparison of Tex3D and 2D patch-based baselines under varying camera angles, object rotations, object positions (digital simulation), and position offsets in the physical world. Task failure rate (%, $\uparrow$) is reported.
  • Figure 5: Physical-world qualitative comparison. First row: clean samples; second row: results under 2D patch-based attacks; third row: results under Tex3D.
  • ...and 4 more figures
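The Figure 2 caption notes that TAAO applies dynamics-guided weighting over critical frames of the manipulation trajectory. One plausible reading, sketched below under stated assumptions, is to weight each frame's adversarial loss by how strongly the action changes around it, so that behaviorally critical frames (e.g. grasp or release moments) dominate the objective. The velocity proxy and softmax-style normalization here are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def frame_weights(actions, temperature=1.0):
    """Illustrative dynamics-guided frame weighting (assumption, not the
    paper's exact TAAO rule): frames where the action changes sharply are
    treated as behaviorally critical and get larger loss weights.
    `actions` has shape (T, action_dim); returns weights summing to 1."""
    vel = np.linalg.norm(np.diff(actions, axis=0), axis=1)  # per-step change
    vel = np.concatenate([[vel[0]], vel])                   # pad first frame
    w = np.exp(vel / temperature)                           # emphasize peaks
    return w / w.sum()

# Toy 5-frame trajectory: the action jumps at frames 2 and 4.
actions = np.array([[0, 0], [0, 0], [1, 0], [1, 0], [1, 1]], dtype=float)
w = frame_weights(actions)  # higher weight on the jump frames
```

With weights like these, the total adversarial loss becomes a weighted sum over frames of the per-frame VLA objective, concentrating the texture optimization on the moments that decide task success.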