Table of Contents
Fetching ...

TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models

Xiangyu Yin, Yi Qi, Jinwei Hu, Zhen Chen, Yi Dong, Xingyu Zhao, Xiaowei Huang, Wenjie Ruan

TL;DR

This work addresses the vulnerability of Vision Language Models to jailbreak prompts by introducing TAIJI, a black-box defense that uses textual anchoring to curb harmful outputs with a single inference. It extracts key phrases from both visual and textual inputs, identifies critical keywords via manual and automatic methods, and rewrites prompts to foreground safety signals. The approach is validated across multiple datasets and VLMs, showing significant reductions in attack success rates while preserving accuracy on benign tasks, and functioning without model parameter access or retraining. The results demonstrate practical, scalable protection for VLMs in safety-critical settings such as autonomous systems and public-facing AI services.

Abstract

Vision Language Models (VLMs) have demonstrated impressive inference capabilities, but remain vulnerable to jailbreak attacks that can induce harmful or unethical responses. Existing defence methods are predominantly white-box approaches that require access to model parameters and extensive modifications, making them costly and impractical for many real-world scenarios. Although some black-box defences have been proposed, they often impose input constraints or require multiple queries, limiting their effectiveness in safety-critical tasks such as autonomous driving. To address these challenges, we propose a novel black-box defence framework called \textbf{T}extual \textbf{A}nchoring for \textbf{I}mmunizing \textbf{J}ailbreak \textbf{I}mages (\textbf{TAIJI}). TAIJI leverages key phrase-based textual anchoring to enhance the model's ability to assess and mitigate the harmful content embedded within both visual and textual prompts. Unlike existing methods, TAIJI operates effectively with a single query during inference, while preserving the VLM's performance on benign tasks. Extensive experiments demonstrate that TAIJI significantly enhances the safety and reliability of VLMs, providing a practical and efficient solution for real-world deployment.

TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models

TL;DR

This work addresses the vulnerability of Vision Language Models to jailbreak prompts by introducing TAIJI, a black-box defense that uses textual anchoring to curb harmful outputs with a single inference. It extracts key phrases from both visual and textual inputs, identifies critical keywords via manual and automatic methods, and rewrites prompts to foreground safety signals. The approach is validated across multiple datasets and VLMs, showing significant reductions in attack success rates while preserving accuracy on benign tasks, and functioning without model parameter access or retraining. The results demonstrate practical, scalable protection for VLMs in safety-critical settings such as autonomous systems and public-facing AI services.

Abstract

Vision Language Models (VLMs) have demonstrated impressive inference capabilities, but remain vulnerable to jailbreak attacks that can induce harmful or unethical responses. Existing defence methods are predominantly white-box approaches that require access to model parameters and extensive modifications, making them costly and impractical for many real-world scenarios. Although some black-box defences have been proposed, they often impose input constraints or require multiple queries, limiting their effectiveness in safety-critical tasks such as autonomous driving. To address these challenges, we propose a novel black-box defence framework called \textbf{T}extual \textbf{A}nchoring for \textbf{I}mmunizing \textbf{J}ailbreak \textbf{I}mages (\textbf{TAIJI}). TAIJI leverages key phrase-based textual anchoring to enhance the model's ability to assess and mitigate the harmful content embedded within both visual and textual prompts. Unlike existing methods, TAIJI operates effectively with a single query during inference, while preserving the VLM's performance on benign tasks. Extensive experiments demonstrate that TAIJI significantly enhances the safety and reliability of VLMs, providing a practical and efficient solution for real-world deployment.

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of TAIJI framework. Left: four distinct types of jailbreak prompts (abbreviated as SD, TYPO, SD+TYPO, FigStep) are utilized in our evaluation, which are specified in Sec. \ref{['settings']}. Middle: Instead of directly feeding these prompts towards victim VLMs, TAIJI extracts key phrases, identifies critical keywords, and rewrites prompts. Right: Revised prompts guide VLMs to generate safe and ethical responses, effectively mitigating jailbreak attacks.
  • Figure 2: The figure showcases ASR across 13 scenarios in MMSafetyBench and a cumulative evaluation in SafeBench, obtained by querying Qwen2-VL. Results in the first 13 histograms are categorized into original settings (without any defence), manually identified key phrase-based defence (indicated with M), and automatically identified key phrase-based defence (indicated with A). Similarly, the defences are divided into manual one and automatic one respectively in the final historgram.
  • Figure 3: Similar to the settings in Fig. \ref{['qualitative_qwen2']}, this figure demonstrates the effectiveness of TAIJI on CogVLM.