Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object

Chanmi Lee, Minsung Yoon, Woojae Kim, Sebin Lee, Sung-eui Yoon

TL;DR

This work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects based on differentiable rendering. It also integrates saliency-guided perturbations to redirect policy attention and designs a targeted loss that persistently drives robots toward adversarial objects.

Abstract

Neural network-based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera-object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.
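
The perspective-distortion argument for 2D patches can be illustrated with a small geometric sketch. It is not taken from the paper: it assumes a far-field pinhole approximation in which a flat patch's apparent area scales with $\cos\phi / d^2$ (with $\phi$ the angle between the patch normal and the line of sight and $d$ the camera distance), and it uses a sphere as a hypothetical stand-in for a 3D object whose silhouette does not depend on $\phi$. The function names are illustrative only.

```python
import math


def patch_apparent_area(area, distance, phi_deg):
    """Approximate apparent area of a flat patch (far-field assumption).

    The patch is foreshortened by cos(phi) and shrinks with 1 / distance^2.
    Angles at or beyond 90 degrees are clamped to zero (patch not visible).
    """
    foreshortening = max(math.cos(math.radians(phi_deg)), 0.0)
    return area * foreshortening / distance ** 2


def sphere_apparent_area(radius, distance):
    """Approximate apparent area of a sphere (assumed stand-in for a 3D object).

    The silhouette is a disc of area pi * r^2 from every viewing direction,
    so only the distance term remains.
    """
    return math.pi * radius ** 2 / distance ** 2


if __name__ == "__main__":
    # Compare how each shape's apparent area changes with the viewing angle.
    for phi in (0, 30, 60, 85):
        patch = patch_apparent_area(area=0.01, distance=0.5, phi_deg=phi)
        sphere = sphere_apparent_area(radius=0.05, distance=0.5)
        print(f"phi={phi:2d} deg  patch={patch:.5f}  sphere={sphere:.5f}")
```

Under these assumptions the patch loses half of its apparent area at $\phi = 60^\circ$ and nearly vanishes at grazing angles, while the sphere's apparent area changes only with distance.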

Paper Structure

This paper contains 33 sections, 6 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Visuomotor policy deception using a 3D adversarial object. (a) The policy successfully guides the robot to its target $O_{\text{goal}}$. (b) Our adversarial object $O_{\text{adv}}$ manipulates the visual input, compelling the policy to misguide the robot towards itself instead of the true target $O_{\text{goal}}$.
  • Figure 2: Apparent size comparison of a 3D adversarial object and a 2D patch. The 2D patch significantly shrinks and distorts, especially at large viewing angles $\phi$, unlike the more stable 3D object.
  • Figure 3: Overview of the proposed method. (a) Coarse-to-Fine (C2F) Pose Scheduling: From a set of task-feasible initial configurations (poses where the original policy succeeds), we schedule viewpoint sampling using a distance-based Beta distribution. The scheduler progressively shifts focus from distant viewpoints (Coarse stage) to closer ones (Fine stage). (b) 3D Adversarial Object Optimization Pipeline: Guided by the Expectation over Transformation (EOT) framework, the pipeline optimizes the adversarial texture $T$ through short policy rollouts from each initial pose $\tau_i$. In each step, the policy $\pi_\omega$ processes a composite image $I_\text{adv}$ (formed from $I_\text{diff}$ and $I_\text{sim}$) to output an action. A targeted adversarial loss is then computed from the resulting action, guiding the robot toward the adversarial object $O_\text{adv}$. The total loss, reflecting actual image-action pairs from the rollout, is backpropagated to update the texture $T$.
  • Figure 4: Visualization of texture update patterns under different scheduling strategies: (a) Coarse-to-Fine, (b) Fine-to-Coarse, (c) Non-staged, (d) Coarse-only, (e) Fine-only.
  • Figure 5: Comparison of policy saliency maps: (a) before vs. (b) after the 3D adversarial attack.
  • ...and 5 more figures
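
The Coarse-to-Fine pose scheduling described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption only says that viewpoints are sampled with a distance-based Beta distribution whose focus moves from distant to closer poses. The linear interpolation of the Beta parameters, the `strength` value, the normalization of distances, and the names `c2f_beta_params` and `sample_poses` are all hypothetical.

```python
import random


def c2f_beta_params(progress, strength=4.0):
    """Return hypothetical Beta(a, b) parameters for a training progress in [0, 1].

    The Beta variable u is a normalized distance (0 = nearest, 1 = farthest).
    Early in training a > b, so mass sits near u = 1 (coarse, distant views);
    late in training b > a, so mass sits near u = 0 (fine, close views).
    """
    progress = min(max(progress, 0.0), 1.0)
    a = 1.0 + strength * (1.0 - progress)
    b = 1.0 + strength * progress
    return a, b


def sample_poses(poses, progress, batch_size, rng):
    """Sample initial poses whose camera-object distance follows the schedule.

    poses: list of (pose_id, distance) pairs, assumed to be task-feasible.
    Each Beta draw is mapped to a target distance, and the feasible pose
    closest to that target distance is selected.
    """
    d_min = min(d for _, d in poses)
    d_max = max(d for _, d in poses)
    a, b = c2f_beta_params(progress)
    batch = []
    for _ in range(batch_size):
        u = rng.betavariate(a, b)
        target = d_min + u * (d_max - d_min)
        batch.append(min(poses, key=lambda p: abs(p[1] - target)))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feasible poses with distances between 0.2 and 1.0.
    poses = [(i, 0.2 + 0.8 * i / 49) for i in range(50)]
    for progress in (0.0, 0.5, 1.0):
        batch = sample_poses(poses, progress, batch_size=1000, rng=rng)
        mean_d = sum(d for _, d in batch) / len(batch)
        print(f"progress={progress:.1f}  mean sampled distance={mean_d:.3f}")
```

In this sketch the mean sampled distance decreases as `progress` grows, which mirrors the coarse-to-fine shift in the caption; a reversed or fixed parameter schedule would correspond to the Fine-to-Coarse and Non-staged variants compared in Figure 4.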