The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, Jianfeng Ma

TL;DR

The paper identifies indirect environmental jailbreak (IEJ) as a novel, black-box attack surface for embodied AI, in which malicious instructions embedded in the environment subvert safety mechanisms without any direct prompt to the model. It introduces Shawshank, a four-module attack framework, and Shawshank-Forge, a framework for automatic benchmark generation that produces Shawshank-Bench, with 1,632 benign and 3,957 malicious instructions across 544 scenes. Empirical results show that IEJ outperforms prior direct attacks across six Vision-Language Models, achieving higher Attack Success Rates and Harm Risk Scores, and that defenses such as Qwen3Guard and SAP offer only partial mitigation. The work underscores the need for new defense strategies and responsible disclosure, and provides open-source tools to support ongoing research into making embodied AI safer in real-world environments.
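As a quick sanity check on the benchmark size, the counts quoted above can be turned into per-scene averages. This is only arithmetic on the reported totals; how instructions are actually distributed over individual scenes is not stated here, so the figures below are averages, not per-scene guarantees.

```python
# Totals as reported in the summary above.
scenes = 544
benign = 1_632
malicious = 3_957

# Average number of instructions per scene.
benign_per_scene = benign / scenes
malicious_per_scene = malicious / scenes

# Fraction of all instructions in the benchmark that are malicious.
malicious_share = malicious / (benign + malicious)

print(f"benign per scene:    {benign_per_scene:.2f}")     # 3.00
print(f"malicious per scene: {malicious_per_scene:.2f}")  # 7.27
print(f"malicious share:     {malicious_share:.1%}")      # 70.8%
```

The benign count is exactly three per scene, while the malicious average is not a whole number, so the number of malicious instructions cannot be the same for every scene.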

Abstract

The adoption of Vision-Language Models (VLMs) in embodied AI agents, while effective, brings safety concerns such as jailbreaking. Prior work has explored the possibility of directly jailbreaking embodied agents through elaborate multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack that jailbreaks embodied AI via indirect prompts injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not "think twice" about instructions provided by the environment, a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully automated frameworks: SHAWSHANK, the first automatic attack generation framework for IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.

Paper Structure

This paper contains 27 sections, 11 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison between direct jailbreak methods (e.g., BadRobot) and Indirect Environmental Jailbreaking (IEJ) on embodied AI (e.g., Shawshank). Our method results in two outcomes: (a) Jailbreaking and (b) Denial-of-Service (DoS) attacks, both utilizing visual environmental manipulation.
  • Figure 2: Overview of the Shawshank framework. The framework includes four modules: (1) the Initialization Module, which defines the task (e.g., damaging water glasses and windows); (2) the Sampling Module, which generates case constraints using genetic algorithms; (3) the Generate Module, which identifies malicious instructions through a generator, surrogate VLM, and evaluator; and (4) the Placement Module, which suggests placement locations for the malicious instruction (e.g., on the wall). The attack is then executed using social engineering techniques.
  • Figure 3: Overview of Shawshank-Forge. The benchmark generation framework collects scene images through random teleportation, extracts object descriptions, filters invalid and similar scenes, and generates both benign and malicious instructions, ensuring semantic richness and diversity.
  • Figure 4: Overview of Shawshank-Bench, the first benchmark generated using Shawshank-Forge. It includes 544 task scenarios across four environments (kitchen, living room, bathroom, bedroom), 1,632 benign instructions, and 3,957 malicious instructions.
  • Figure 5: Malicious Instruction Distribution. The abbreviations used are as follows: DC for Damage Creatures, DI for Destroy Items, WR for Waste Resources, SD for Self Destroy, and IP for Invasion Privacy.
  • ...and 3 more figures
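
The scene-filtering step named in the Figure 3 caption ("filters invalid and similar scenes") can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Scene` type, the use of Jaccard similarity over extracted object sets, the rule that a scene with too few objects counts as invalid, and both threshold values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    # Hypothetical record for one collected scene image.
    scene_id: str
    objects: frozenset[str]  # object descriptions extracted from the image


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two object sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_scenes(scenes, min_objects=2, max_similarity=0.8):
    """Greedy filter: drop sparse scenes, then near-duplicates of kept scenes.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for scene in scenes:
        if len(scene.objects) < min_objects:
            continue  # assumed notion of "invalid": too few objects for a task
        if any(jaccard(scene.objects, k.objects) > max_similarity for k in kept):
            continue  # too similar to a scene that was already kept
        kept.append(scene)
    return kept


candidates = [
    Scene("kitchen_01", frozenset({"mug", "knife", "stove", "sink"})),
    Scene("kitchen_02", frozenset({"mug", "knife", "stove", "sink"})),  # same objects
    Scene("bathroom_01", frozenset({"towel"})),                         # too sparse
    Scene("bedroom_01", frozenset({"bed", "lamp", "window"})),
]
print([s.scene_id for s in filter_scenes(candidates)])
# ['kitchen_01', 'bedroom_01']
```

A greedy pass like this depends on input order: the first scene of a near-duplicate group is the one that survives. A real pipeline might instead compare image embeddings or cluster scenes before selecting representatives.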