Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak

Haoxuan Ji, Zheng Lin, Zhenxing Niu, Xinbo Gao, Gang Hua

TL;DR

This work introduces an indirect approach to LLM jailbreaking: it builds a multimodal LLM (MLLM) around the target LLM and jailbreaks that MLLM to learn a jailbreaking embedding (JBemb). The embedding is converted into a textual jailbreak prompt (JBtxt) through De-embedding and De-tokenization, enabling efficient LLM jailbreaks via a four-step pipeline and a CLIP-based image-text initialization strategy. Key contributions include a PGD-based MLLM-jailbreak objective, a robust JBemb-to-JBtxt conversion, and an image-text semantic matching method for choosing an effective initial input (JBinit), yielding high attack success rates with significantly reduced runtime. Experiments on AdvBench-M and HarmBench demonstrate strong white-box and black-box performance, notable cross-class generalization, and transferability to several recent LLMs, highlighting both the vulnerability of multimodal components and the practical implications for safety alignment.

Abstract

This paper focuses on jailbreaking attacks against large language models (LLMs), which elicit objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that target the LLM directly, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. We then perform an efficient MLLM jailbreak to obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to jailbreak the target LLM. Compared with direct LLM-jailbreak methods, our indirect approach is more efficient, because MLLMs are more vulnerable to jailbreaking than pure LLMs. Additionally, to improve the attack success rate, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization.

Paper Structure

This paper contains 18 sections, 2 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: The indirect jailbreaking scheme of our approach begins with constructing an MLLM by integrating a visual module. We then perform an efficient MLLM-jailbreak to obtain the JBemb, which is subsequently converted into JBtxt for jailbreaking the target LLM.
  • Figure 2: The full workflow of our approach. At Step 0, before Step 1, we propose an image-text matching scheme to identify an appropriate initial input JBinit.