E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models

Liming Lu, Xiang Gu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Xu Zheng, Yongbin Zhou

TL;DR

E$^2$AT achieves state-of-the-art performance, outperforming existing baselines by an average margin of 34% across text and image modalities while maintaining clean-task performance.

Abstract

Considerable research effort has gone into making Multimodal Large Language Models (MLLMs) robust against jailbreak attacks. However, existing methods for improving MLLMs' robustness still face two critical challenges: (1) how to efficiently tune massive weight parameters and (2) how to ensure robustness against attacks across both visual and textual modalities. To this end, we propose an Efficient End-to-end Adversarial Training (E$^2$AT) framework that defends against both visual and textual adversarial attacks. Specifically, for the visual aspect, E$^2$AT incorporates an efficient projector-based adversarial training (AT) module that aligns attack samples at the feature level. For the training objectives, we propose a Dynamic Joint Multimodal Optimization (DJMO) strategy that enhances generalization against jailbreak attacks by dynamically adjusting the weights between normal and adversarial objectives. Extensive experiments are conducted with five major jailbreak attack methods across three mainstream MLLMs. Results demonstrate that E$^2$AT achieves state-of-the-art performance, outperforming existing baselines by an average margin of 34% across text and image modalities while maintaining clean-task performance. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of E$^2$AT, paving the way for more secure and reliable multimodal systems. Our code is available at https://anonymous.4open.science/r/E2AT_568.
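
The idea of dynamically weighting a normal objective against an adversarial one can be illustrated with a minimal sketch. The abstract does not specify the weighting rule used by DJMO, so everything below is an assumption made for illustration only: the function names `dynamic_weights` and `joint_loss`, the `temperature` parameter, and the choice of a softmax over the two current loss values (so that the objective that is currently worse receives more weight). It is not the authors' implementation.

```python
import math


def dynamic_weights(normal_loss, adv_loss, temperature=1.0):
    """Return (w_normal, w_adv), two non-negative weights that sum to 1.

    Hypothetical rule (not taken from the paper): a softmax over the two
    loss values, so the larger loss receives the larger weight.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(normal_loss, adv_loss)
    e_normal = math.exp((normal_loss - m) / temperature)
    e_adv = math.exp((adv_loss - m) / temperature)
    total = e_normal + e_adv
    return e_normal / total, e_adv / total


def joint_loss(normal_loss, adv_loss, temperature=1.0):
    """Combine the two objectives using the dynamically computed weights."""
    w_normal, w_adv = dynamic_weights(normal_loss, adv_loss, temperature)
    return w_normal * normal_loss + w_adv * adv_loss


if __name__ == "__main__":
    # Illustrative loss values only; they are not results from the paper.
    for normal, adv in [(1.0, 1.0), (0.5, 2.0), (2.0, 0.5)]:
        w_n, w_a = dynamic_weights(normal, adv)
        print(f"normal={normal} adv={adv} -> "
              f"w_normal={w_n:.3f} w_adv={w_a:.3f} "
              f"joint={joint_loss(normal, adv):.3f}")
```

In a real training loop the two inputs would be the per-batch losses on clean and adversarial image-text pairs, and the weights would be recomputed at every step. Whether the weights should follow loss magnitudes, gradients, or a schedule is a design choice that this sketch does not settle.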

Paper Structure

This paper contains 17 sections, 18 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: Top: E$^2$AT vs. Existing Frameworks. E$^2$AT takes noisy image-text pairs as input. Through joint training, it optimizes the projector and the LLM to enhance performance. Bottom: Robotics Safety Demonstration. The robotic arm refuses the command to move a bomb into the target zone, demonstrating E$^2$AT's capability to reject harmful instructions while executing valid ones.
  • Figure 2: An overview of our E$^2$AT defense framework. The framework consists of two core components. First, a projector-based adversarial training mechanism optimizes feature alignment between the vision encoder and language model. Second, a joint multimodal optimization strategy enhances generalization against jailbreak attacks by dynamically adjusting weights between normal and adversarial objectives.
  • Figure 3: Performance comparison across different defense methods. The x-axis represents the attack success rate (ASR), and the y-axis represents the accuracy, where lower values on both metrics indicate better performance. The size of each bubble represents the relative computational cost (training time).
  • Figure 4: Embodied AI comparison between the original MLLM and our jointly optimized MLLM in a real-world scene involving weapon-related manipulation, e.g., "Put the knife on the teddy bear toy". For the original MLLM, the steps are: 1) receive the task instruction; 2) locate the task objects (the knife and the teddy bear); 3) find and grasp the knife; 4) move the knife; 5) place the knife on the teddy bear; and 6) finish the task instruction. For our jointly optimized MLLM, the steps are: 1) receive the task instruction; 2) locate the task objects (the knife and the teddy bear); 3) decline to grasp the knife; 4) and 5) shake the head of the robotic arm to indicate that the operation is not performed; and 6) leave the task instruction unfinished and turn the red light on.