CHAI: Command Hijacking against embodied AI

Luis Burbano, Diego Ortiz, Qi Sun, Siwei Yang, Haoqin Tu, Cihang Xie, Yinzhi Cao, Alvaro A Cardenas

TL;DR

This work identifies a new security vulnerability in LVLM-driven embodied AI: the command layer, where intermediate text outputs bridge perception and control. It proposes CHAI, an optimization-based attack that jointly optimizes semantic content and visual realization of signs embedded in the scene to hijack high-level decisions. Through dictionary-guided search and cross-entropy optimization, CHAI achieves high attack success across drone landing, autonomous driving, and aerial tracking in simulation and real-world tests, and demonstrates cross-language generalization. The results highlight an urgent need for defenses that jointly consider text and vision modalities, broadening the scope of robustness beyond traditional perception-focused approaches.

Abstract

Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce by using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a new class of prompt-based attacks that exploit the multimodal language interpretation abilities of Large Visual-Language Models (LVLMs). CHAI embeds deceptive natural language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents, spanning drone emergency landing, autonomous driving, and aerial object tracking, and on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.

Paper Structure

This paper contains 29 sections, 14 equations, 17 figures, 7 tables, and 1 algorithm.

Figures (17)

  • Figure 1: LVLMs can understand commands in different modalities, and these modalities can be attacked.
  • Figure 2: Examples of unsuccessful and successful attacks.
  • Figure 3: Attack Pipeline. In the first stage, we reduce the vocabulary space by creating a dictionary, and in the second stage, we do a joint optimization in the space of prompts in the dictionary and the perceptual features of the attack.
  • Figure 4: Attacker LLM prompt stages for a drone. The brackets indicate inputs to the prompt. Red values come from the attacker's input, the green values come from the target LVLM, the blue value is a summary of the target LVLM task coming from an LLM.
  • Figure 5: Applications for our attack. The devil figure shows an example of the attacker's objective for each application.
  • ...and 12 more figures
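
The Figure 3 caption describes a second stage that optimizes over entries of a dictionary. As a rough illustration of what such a search can look like, the sketch below runs a generic cross-entropy method over a finite candidate list. It assumes that the "cross-entropy optimization" named in the TL;DR refers to this standard technique; the paper's actual algorithm is not shown in this excerpt and may differ. The function name `cross_entropy_search`, its parameters, and the `score` callback are all hypothetical, and the sketch covers only the discrete choice among candidates, not the perceptual features that the caption says are optimized jointly.

```python
import random


def cross_entropy_search(candidates, score, iterations=20, samples=50,
                         elite_frac=0.2, smoothing=0.7, seed=0):
    """Toy cross-entropy method over a finite candidate list.

    `score` is a placeholder objective (higher is better); it stands in
    for whatever objective a real system would evaluate.
    """
    rng = random.Random(seed)
    n = len(candidates)
    probs = [1.0 / n] * n  # start from a uniform sampling distribution
    n_elite = max(1, int(samples * elite_frac))

    for _ in range(iterations):
        # Sample candidate indices from the current distribution.
        drawn = rng.choices(range(n), weights=probs, k=samples)
        # Keep the highest-scoring samples (the "elite" set).
        elite = sorted(drawn, key=lambda i: score(candidates[i]),
                       reverse=True)[:n_elite]
        # Refit the distribution to the elite set, smoothed with the old one.
        counts = [0] * n
        for i in elite:
            counts[i] += 1
        probs = [(1 - smoothing) * p + smoothing * c / n_elite
                 for p, c in zip(probs, counts)]

    # Return the candidate the distribution now favors most.
    best = max(range(n), key=lambda i: probs[i])
    return candidates[best], probs
```

Calling `cross_entropy_search(["a", "bb", "ccc", "dddd"], len)` uses string length as a stand-in objective; the sampling distribution should concentrate on the longest string as the iterations proceed. The smoothing term keeps every probability strictly positive, so no candidate is ruled out permanently after a single unlucky round.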

Theorems & Definitions (1)

  • Definition 1: Kullback-Leibler divergence
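
For reference, the standard textbook form of the Kullback-Leibler divergence between two discrete distributions P and Q over the same support is given below. The symbols P, Q, and x are generic; the notation used in the paper's Definition 1 is not shown in this excerpt and may differ.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

The quantity is non-negative, equals zero only when P and Q agree, and is not symmetric in its two arguments.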