Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models

Yanxu Mao, Peipei Liu, Tiehan Cui, Zhaoteng Yan, Congying Liu, Datao You

TL;DR

Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models presents JMLLM, a tri-modal jailbreaking framework that combines Alternating Translation, Word Encryption, Feature Collapse, and Harmful Injection to bypass safety constraints across text, image, and speech inputs. It introduces TriJail, the first comprehensive tri-modal jailbreak dataset, and demonstrates state-of-the-art attack success rates with minimal query counts across 13 LLMs on AdvBench and TriJail, while also exploring defense via Harmful Separator. The work provides extensive ablations and cross-model comparisons, underscoring substantial security risks as multimodal models scale and integrate more modalities. These findings emphasize the need for robust, modality-aware defenses to ensure reliable and safe deployment of advanced multimodal LLMs in real-world applications.

Abstract

Large language models (LLMs) are widely applied in various fields of society due to their powerful reasoning, understanding, and generation capabilities. However, the security issues associated with these models are becoming increasingly severe. Jailbreaking attacks, an important method for detecting vulnerabilities in LLMs, have been explored by researchers who attempt to induce these models to generate harmful content through various attack methods. Nevertheless, existing jailbreaking methods face numerous limitations, such as excessive query counts, limited coverage of jailbreak modalities, low attack success rates, and simplistic evaluation methods. To overcome these constraints, this paper proposes a multimodal jailbreaking method: JMLLM. This method integrates multiple strategies to perform comprehensive jailbreak attacks across text, visual, and auditory modalities. Additionally, we contribute a new and comprehensive dataset for multimodal jailbreaking research: TriJail, which includes jailbreak prompts for all three modalities. Experiments on the TriJail dataset and the benchmark dataset AdvBench, conducted on 13 popular LLMs, demonstrate advanced attack success rates and a significant reduction in time overhead.

Paper Structure

This paper contains 35 sections, 26 equations, 14 figures, 12 tables, and 2 algorithms.

Figures (14)

  • Figure 1: The overall framework diagram of JMLLM illustrates the entire process of the jailbreak attack.
  • Figure 2: The overall framework of single-round and multi-round attacks for JMLLM. Hidden Toxicity is the detailed presentation of our four attack strategies.
  • Figure 3: Comparison of GPT-ASR scores across different baseline methods.
  • Figure 4: Comparison of KW-ASR scores across different baseline methods.
  • Figure 5: Comparison of ASR scores between JMLLM and ReNeLLM in different scenarios of AdvBench dataset.
  • ...and 9 more figures