Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense

Shuyang Hao, Yiwei Wang, Bryan Hooi, Ming-Hsuan Yang, Jun Liu, Chengcheng Tang, Zi Huang, Yujun Cai

TL;DR

This work addresses the vulnerability of large vision-language models (LVLMs) to jailbreak attacks delivered through visual inputs by proposing ESIII, a two-stage defense that couples visual and textual safeguards. It first creates a universal defensive image $i^{*}_{def}$ via gradient-based optimization to embed security instructions, then synthesizes textual prompts $t_s$ to form a joint input $T$ that is paired with the enhanced image, yielding the output $y^{*} = \mathcal{M}([ W \cdot E(I), T ])$. Experiments on MM-SafetyBench, VLGuard, and MM-Vet show that ESIII significantly reduces jailbreak attack success while preserving performance on benign tasks and incurring negligible inference-time cost; it also transfers across LVLMs and scenarios. Overall, ESIII uses cross-modal defense signals to provide robust, efficient, and broadly applicable LVLM safety improvements suitable for practical deployment.
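The inference-time half of this pipeline (combine a precomputed defensive image with the input image, and place a security instruction before the text query) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the alpha-blend rule, the function names `overlay_defense` and `build_prompt`, and the `alpha` parameter are all assumptions introduced here, since the summary only says the defensive image is overlaid on the input.

```python
from typing import List

# Grayscale image as rows of pixel values in [0, 1] (illustrative representation).
Image = List[List[float]]


def overlay_defense(image: Image, defense: Image, alpha: float = 0.1) -> Image:
    """Blend a defensive image into the input image.

    Assumption: a simple alpha blend followed by clipping to [0, 1].
    The actual overlay rule used by ESIII is not specified in this summary.
    """
    same_rows = len(image) == len(defense)
    same_cols = all(len(r) == len(d) for r, d in zip(image, defense))
    if not (same_rows and same_cols):
        raise ValueError("image and defense must have the same shape")
    return [
        [min(1.0, max(0.0, (1.0 - alpha) * p + alpha * d)) for p, d in zip(row, drow)]
        for row, drow in zip(image, defense)
    ]


def build_prompt(security_instruction: str, query: str) -> str:
    """Place the textual security instruction before the user query."""
    return f"{security_instruction}\n{query}"


if __name__ == "__main__":
    user_image = [[0.0, 1.0], [0.5, 0.5]]
    defensive_image = [[1.0, 0.0], [0.5, 0.5]]  # would come from the optimization stage
    protected_image = overlay_defense(user_image, defensive_image, alpha=0.5)
    prompt = build_prompt("Refuse harmful requests.", "Describe the image.")
    print(protected_image)
    print(prompt)
```

Because the defensive image is computed once and reused, the per-query work in this sketch is a single pixel-wise blend and a string concatenation, which is consistent with the negligible inference-time cost claimed above.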

Abstract

Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks via visual inputs. However, existing defense methods suffer from two key limitations: (1) they focus solely on textual defenses and therefore fail to directly address threats in the visual domain, where the attacks originate, and (2) their additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Building on these insights, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism. First, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Second, we integrate the security instructions from the visual and textual dimensions with the input query. The collaboration between security instructions from different dimensions ensures comprehensive security protection. Extensive experiments demonstrate that our approach effectively fortifies the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and incurring an imperceptible increase in time costs.
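To make "gradient-based optimization" of a universal defensive image concrete, the sketch below runs signed gradient ascent on a single perturbation shared across several images. It is a toy stand-in only: the logistic scorer `safe_prob` replaces the LVLM, and the names `optimize_defense`, `eps`, `step`, and `iters` are hypothetical. The actual ESIII objective and constraints are not given in this summary and are not reproduced here.

```python
import math
from typing import List, Sequence


def clip(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))


def sign(v: float) -> float:
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)


def safe_prob(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Toy differentiable 'model': probability of a safe response, sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def mean_safe_prob(w, b, images, delta) -> float:
    """Average safe-response probability over images after adding delta."""
    total = 0.0
    for x in images:
        shifted = [clip(xi + di, 0.0, 1.0) for xi, di in zip(x, delta)]
        total += safe_prob(w, b, shifted)
    return total / len(images)


def optimize_defense(w, b, images, eps=0.1, step=0.02, iters=50) -> List[float]:
    """Signed gradient ascent on one perturbation shared by all images.

    Maximizes sum_x log safe_prob(x + delta) subject to |delta_j| <= eps.
    The gradient of log sigmoid(w.x + b) with respect to x is (1 - p) * w;
    the pixel-range clip is ignored in the gradient (a simplification).
    """
    delta = [0.0] * len(w)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x in images:
            shifted = [clip(xi + di, 0.0, 1.0) for xi, di in zip(x, delta)]
            p = safe_prob(w, b, shifted)
            for j, wj in enumerate(w):
                grad[j] += (1.0 - p) * wj
        # Step in the sign of the gradient, then project back into the eps-ball.
        delta = [clip(d + step * sign(g), -eps, eps) for d, g in zip(delta, grad)]
    return delta


if __name__ == "__main__":
    w, b = [1.0, -2.0, 0.5], -1.0
    images = [[0.5, 0.5, 0.5], [0.2, 0.8, 0.4]]
    delta = optimize_defense(w, b, images)
    print(mean_safe_prob(w, b, images, [0.0] * 3), mean_safe_prob(w, b, images, delta))
```

Summing the gradient over all images before each update is what makes the perturbation universal in this toy setting: one `delta` is optimized once and then reused for every input.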

Paper Structure

This paper contains 26 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of our defense method, which embeds security instructions into images. By simultaneously adding defensive security instructions in both the visual and textual spaces, our approach effectively defends against state-of-the-art jailbreak attacks.
  • Figure 2: Overview of ESIII. On the image side, gradient-based optimization first embeds specific security instructions into a defensive image, yielding security instructions in the visual domain; the resulting defensive image is then overlaid onto the input image, completing the defense in the visual dimension. On the text side, specific security instructions are placed before the text input, achieving defense in the textual dimension. The collaboration between security instructions from different dimensions ensures comprehensive security protection.
  • Figure 3: The evaluation results of transferability of ESIII across different LVLMs (LLaVA-1.5-13B, MiniGPT4-v2-13B and Qwen-VL-Chat). The results indicate that ESIII maintains excellent defensive effectiveness and benign acceptance rates across various models.
  • Figure 4: Attack success rate (ASR) results across 13 scenarios. ESIII defends effectively in all scenarios, although the degree of effectiveness varies by scenario. "IA" to "GD" denote the 13 sub-datasets of prohibited scenarios.
  • Figure 5: Ten security instructions we used in ESIII.
  • ...and 7 more figures