Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Chenxi Li, Xianggan Liu, Dake Shen, Yaosong Du, Zhibo Yao, Hao Jiang, Linyi Jiang, Chengwei Cao, Jingzhe Zhang, RanYi Peng, Peiling Bai, Xiande Huang

TL;DR

We demonstrate an underexplored vulnerability in semantic slot filling: LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign.

Abstract

Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability in semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework for black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. When these prompts are paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks demonstrate the efficacy of StructAttack.

Paper Structure

This paper contains 32 sections, 3 equations, 26 figures, and 10 tables.

Figures (26)

  • Figure 1: Demonstration of our method against GPT-4o. When prompted with a malicious request, GPT-4o effectively refuses to comply. In contrast, StructAttack bypasses the model’s safety mechanisms by decomposing the original malicious query into benign-appearing structural maps, which are then reassembled by LVLMs to reconstruct the concealed malicious intent, thus producing unsafe content. Here, each branch serves as a "Lego Block" that, when combined with others, forms the complete harmful "Semantic Blueprint".
  • Figure 2: The framework of our proposed StructAttack. Based on the vulnerability of semantic slot filling, we first employ a Semantic Slot Decomposition module to decompose the target instruction into malicious and distractor slot types. Then, we embed the decomposed slot types into structured visual prompts that can be combined with a completion-guided text prompt to jailbreak LVLMs.
  • Figure 3: Illustration of (a) the zero-shot semantic slot filling (SSF) process in natural language understanding (NLU), which assigns slot types to the given text segments; (b) the textual SSF attack, which exploits the model’s SSF vulnerability to induce the LLM to fill malicious slot types with harmful content; and (c) the visual SSF attack, which extends this paradigm to LVLMs.
  • Figure 4: Comparison of jailbreak performance using the semantic slot filling vulnerability in textual and visual attacks (e.g., Mind Map, Table, and Sunburst Diagram).
  • Figure 5: Comparison of harmfulness scores across different categories for GPT-4o on AdvBench-M.
  • ...and 21 more figures