On the Vulnerability of LLM/VLM-Controlled Robotics

Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, Amrit Singh Bedi

TL;DR

The paper investigates a critical reliability gap in LLM/VLM-controlled robotics: sensitivity to small input-modality variations can trigger misalignment and substantial task failures. It formalizes a vulnerability framework with a mathematical objective for perturbation-triggered failures, and proposes perturbation strategies that operate without modifying models. Through experiments on VIMA and Instruct2Act, it shows that perception-physical world perturbations can dramatically reduce success rates (e.g., up to ~29.9% for VIMA), while task type and generalization level modulate robustness, highlighting the need for improved cross-modal alignment and richer training data. The work offers a foundation for robust, safe deployment of LLM/VLM-enabled robots and points to future directions in automated vulnerability analysis and alignment-enhanced training.

Abstract

In this work, we highlight vulnerabilities in robotic systems integrating large language models (LLMs) and vision-language models (VLMs) due to input modality sensitivities. While LLM/VLM-controlled robots show impressive performance across various tasks, their reliability under slight input variations remains underexplored yet critical. These models are highly sensitive to instruction or perceptual input changes, which can trigger misalignment issues, leading to execution failures with severe real-world consequences. To study this issue, we analyze the misalignment-induced vulnerabilities within LLM/VLM-controlled robotic systems and present a mathematical formulation for failure modes arising from variations in input modalities. We propose empirical perturbation strategies to expose these vulnerabilities and validate their effectiveness through experiments on multiple robot manipulation tasks. Our results show that simple input perturbations reduce task execution success rates by 22.2% and 14.6% in two representative LLM/VLM-controlled robotic systems. These findings underscore the importance of input modality robustness and motivate further research to ensure the safe and reliable deployment of advanced LLM/VLM-controlled robotic systems.

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Vulnerability-Triggering Perturbations. We showcase perturbations that induce misalignment-related vulnerabilities in manipulation tasks that would otherwise succeed. These perturbations, applied to both visual and language prompt inputs, trigger misalignment-induced vulnerabilities while minimizing contextual changes: (a) Text-Action Misalignment (blue box) disrupts correspondence between language prompts and LLM action priors by altering action-related components with synonyms. (b) Text-Image Misalignment (orange box) breaks entity correspondence between prompts and visual observations by modifying entity names and attributes with synonyms or phrases. (c) Perception-Physical World Misalignment (magenta box) introduces transformations to robot perceptions, misaligning them with real-world states. Notably, LLM-Action misalignment cannot be directly triggered but arises from upstream perturbations. Once perturbations are introduced, LLM/VLM-controlled robots are highly prone to task execution or action plan failures, significantly reducing their reliability.
  • Figure 2: Misalignment-Induced Vulnerabilities in LLM/VLM-Controlled Robots. LLM/VLM-controlled robots take language prompts and visual observations as inputs. These are processed by language tokenizers and visual encoders and mapped into the LLM’s input embedding space; the outputs are action embeddings, either command lines or target poses. Misalignments occur at four key interfaces: (a) Text-Image. Misalignment between language and visual embeddings in the LLM input space. (b) Text-Action. Misalignment between action tokens in language prompts and the LLM’s priors. (c) Perception-Physical World. Discrepancy between the robot’s perception and real-world ground truth. (d) LLM-Action. Misalignment between the LLM’s action plans (e.g., command lines) and optimal ground-truth actions.
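
The three directly triggerable perturbations described in the figure captions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synonym tables, the function names (`perturb_prompt`, `shift_image`, `success_rate_drop`), and the choice of a horizontal shift as the perception transformation are all assumptions made here for illustration. The sketch only shows the general idea of changing the inputs while leaving the model untouched.

```python
import re

# Hypothetical synonym tables, invented for illustration (not from the paper).
ACTION_SYNONYMS = {"put": "place", "rotate": "turn"}   # Text-Action perturbation
ENTITY_SYNONYMS = {"bowl": "dish", "red": "crimson"}   # Text-Image perturbation


def perturb_prompt(prompt, synonyms):
    """Swap whole words found in `synonyms`; the rest of the prompt is unchanged."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        # Keep the capitalization of the word being replaced.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, prompt)


def shift_image(image, dx, fill=0):
    """Perception perturbation: move a 2D image (list of rows) dx pixels right.

    Vacated pixels are padded with `fill`; the physical scene is unchanged,
    so the observation no longer matches the real-world state.
    """
    width = len(image[0])
    dx = max(0, min(dx, width))
    return [[fill] * dx + row[: width - dx] for row in image]


def success_rate_drop(clean_successes, perturbed_successes, trials):
    """Drop in success rate, in percentage points, over the same number of trials."""
    return 100.0 * (clean_successes - perturbed_successes) / trials


if __name__ == "__main__":
    prompt = "Put the red bowl on the table"
    print(perturb_prompt(prompt, ACTION_SYNONYMS))
    print(perturb_prompt(prompt, ENTITY_SYNONYMS))
    print(shift_image([[1, 2, 3], [4, 5, 6]], dx=1))
```

In an evaluation loop, each perturbed prompt or observation would be fed to the unmodified robot policy, and `success_rate_drop` would compare the perturbed runs against the clean baseline. LLM-Action misalignment has no function of its own here because, as the Figure 1 caption notes, it arises only from upstream perturbations.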