Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Zijiao Yang, Xiangxi Shi, Eric Slyman, Stefan Lee

TL;DR

The paper addresses the vulnerability of Vision-and-Language Navigation (VLN) agents to adversarial environmental changes by introducing a whitebox attack that optimizes the appearance of a 3D object via differentiable rendering to steer agent behavior. It demonstrates two attack modalities—immediate stopping and following an attacker-defined trajectory—on HAMT-based VLN models using R2R and RxR datasets, with significant disruption to instruction following and path planning. Key contributions include a formal attack framework, extensive empirical evaluation, and ablations that reveal how factors like object size, viewpoint rendering, and training diversity influence attack success. The findings underscore important security implications for embodied agents and motivate defenses and broader investigation into robust multimodal navigation systems.

Abstract

Assistive embodied agents that can be instructed in natural language to perform tasks in open-world environments have the potential to significantly impact labor tasks like manufacturing or in-home care -- benefiting the lives of those who come to depend on them. In this work, we consider how this benefit might be hijacked by local modifications in the appearance of the agent's operating environment. Specifically, we take the popular Vision-and-Language Navigation (VLN) task as a representative setting and develop a whitebox adversarial attack that optimizes a 3D attack object's appearance to induce desired behaviors in pretrained VLN agents that observe it in the environment. We demonstrate that the proposed attack can cause VLN agents to ignore their instructions and execute alternative actions after encountering the attack object -- even for instructions and agent paths not considered when optimizing the attack. For these novel settings, we find our attacks can induce early-termination behaviors or divert an agent along an attacker-defined multi-step trajectory. Under both conditions, environmental attacks significantly reduce agent capabilities to successfully follow user instructions.

Paper Structure

This paper contains 16 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: We directly optimize the appearance of an in-environment object to control the trajectory of a trained VLN agent using a differentiable renderer. (1) Adversarial observations are rendered at an attack viewpoint containing the attack object. (2) The VLN agent takes this observation as input and (3) we supervise the agent's trajectory from this point to match a predetermined attack trajectory. (4) We compute loss gradients with respect to the object texture and use them to (5) update the object's appearance in the 3D mesh.
  • Figure 2: Example original and attacked objects for a desk (left), cabinet (middle), and sofa (right) from trajectory-level attacks.
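
The five steps in the Figure 1 caption amount to a gradient loop over the object's texture while the agent stays frozen. The sketch below illustrates that loop on a toy problem and is not the paper's implementation: the "renderer" simply appends texels to fixed scene features, the "agent" is a hypothetical linear layer with a softmax over actions (not HAMT), and the attack target is a single stop action rather than a multi-step trajectory. All names (`render`, `action_probs`, `optimize_texture`, `STOP`) are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def render(background, texture):
    # Toy stand-in for the differentiable renderer (assumption): the observation
    # is the fixed scene features followed by the attack object's texels.
    return list(background) + list(texture)

def action_probs(weights, background, texture):
    # Hypothetical frozen "agent": one linear layer plus softmax over actions.
    obs = render(background, texture)
    return softmax([sum(w * x for w, x in zip(row, obs)) for row in weights])

def optimize_texture(weights, background, texture, target, steps=500, lr=0.05):
    texture = list(texture)
    offset = len(background)  # texel columns start after the scene features
    for _ in range(steps):
        probs = action_probs(weights, background, texture)
        # Gradient of cross-entropy w.r.t. each logit: p_k - 1[k == target].
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Chain rule through the linear agent and the toy renderer.
        grads = [sum(d * row[offset + j] for d, row in zip(dlogits, weights))
                 for j in range(len(texture))]
        # Gradient step on the texture only, clipped to the valid range [0, 1].
        texture = [min(1.0, max(0.0, t - lr * g)) for t, g in zip(texture, grads)]
    return texture

rng = random.Random(0)
N_ACTIONS, N_SCENE, N_TEXELS, STOP = 4, 6, 8, 0
weights = [[rng.uniform(-1, 1) for _ in range(N_SCENE + N_TEXELS)]
           for _ in range(N_ACTIONS)]
background = [rng.random() for _ in range(N_SCENE)]
texture0 = [0.5] * N_TEXELS  # neutral gray starting texture

adv_texture = optimize_texture(weights, background, texture0, STOP)
p_before = action_probs(weights, background, texture0)[STOP]
p_after = action_probs(weights, background, adv_texture)[STOP]
print(f"P(stop) before: {p_before:.3f}, after: {p_after:.3f}")
```

Only the texels are updated; the agent weights and the rest of the scene are held fixed, mirroring the whitebox setting in which the attacker changes an object's appearance but not the model. In this toy the loss is convex in the texture and the step size is small, so the stop probability does not decrease across iterations; no such guarantee should be assumed for a real renderer and agent.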