Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion

Chunlong Xie; Jialing He; Shangwei Guo; Jiacheng Wang; Shudong Zhang; Tianwei Zhang; Tao Xiang

TL;DR

This work addresses the vulnerability of vision-language navigation (VLN) agents powered by foundation models by introducing AdvOF, a three-component framework that creates adversarial 3D objects to mislead VLM-based perception across multiple views. AdvOF combines aligned object rendering, cross-modal collaborative optimization, and view-aware object fusion to ensure consistent, multi-view attack effectiveness while preserving physical plausibility. Empirical results across four VLN agents and multiple datasets demonstrate that AdvOF achieves state-of-the-art attack performance, with strong transferability to unseen encoders, datasets, and architectures, and reasonable resilience to basic defenses. The findings highlight critical security considerations for service-oriented, VLM-assisted navigation systems and lay groundwork for building robust, QoS-aware deployments in real-world settings.

Abstract

We present Adversarial Object Fusion (AdvOF), a novel attack framework that targets vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundation models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality of service (QoS) are paramount. We use AdvOF to investigate the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first aggregates and aligns the victim object's positions in both 2D and 3D space, then defines and renders the adversarial objects. Next, we collaboratively optimize the adversarial object, with regularization between the adversarial and victim objects across physical properties and VLM perceptions. By assigning importance weights to different views, the optimization proceeds stably across multiple views through iterative fusion of local updates and justifications. Our extensive evaluations demonstrate that AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.
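
The abstract describes fusing per-view local updates using importance weights but does not give the exact rule. The following is only a minimal sketch of one plausible reading, assuming the fusion is a normalized weighted average of per-view update vectors followed by a plain gradient-style step. The function names `fuse_view_updates` and `apply_update`, the flat parameter vector, and the `step_size` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_view_updates(
    updates: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Fuse per-view updates with a normalized weighted average.

    Assumption: each view contributes one update vector of equal length,
    and each view has a non-negative importance weight (hypothetical inputs).
    """
    if not updates or len(updates) != len(weights):
        raise ValueError("need at least one view and one weight per view")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(updates[0])
    fused = [0.0] * dim
    for update, weight in zip(updates, weights):
        if len(update) != dim:
            raise ValueError("all view updates must have the same length")
        for i, value in enumerate(update):
            # Each view contributes in proportion to its normalized weight.
            fused[i] += (weight / total) * value
    return fused


def apply_update(
    params: Sequence[float],
    fused: Sequence[float],
    step_size: float,
) -> List[float]:
    """Take one descent-style step on the (hypothetical) object parameters."""
    return [p - step_size * g for p, g in zip(params, fused)]


# Two views with 2-D updates; the first view is weighted three times as heavily.
fused = fuse_view_updates([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
new_params = apply_update([1.0, 1.0], fused, step_size=2.0)
```

In this sketch the weights only set the relative influence of each view. How AdvOF actually derives the weights, and what its local updates and justifications consist of, is not stated in the abstract.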
