Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models
Yanxu Mao, Peipei Liu, Tiehan Cui, Zhaoteng Yan, Congying Liu, Datao You
TL;DR
This paper presents JMLLM, a tri-modal jailbreaking framework that combines four strategies (Alternating Translation, Word Encryption, Feature Collapse, and Harmful Injection) to bypass safety constraints across text, image, and speech inputs. It also introduces TriJail, which the authors describe as the first comprehensive tri-modal jailbreak dataset. On AdvBench and TriJail, across 13 LLMs, JMLLM reports state-of-the-art attack success rates with minimal query counts; the paper additionally explores a defense, Harmful Separator. Extensive ablations and cross-model comparisons underscore the security risks that grow as multimodal models scale and integrate more modalities, and point to the need for robust, modality-aware defenses before such models are deployed in real-world applications.
Abstract
Large language models (LLMs) are widely applied across many fields because of their powerful reasoning, understanding, and generation capabilities. However, the security issues associated with these models are becoming increasingly severe. Jailbreaking attacks, which attempt to induce a model to generate harmful content, are an important means of detecting vulnerabilities in LLMs and have been explored through various attack methods. Nevertheless, existing jailbreaking methods face several limitations: excessive query counts, limited coverage of jailbreak modalities, low attack success rates, and simplistic evaluation methods. To overcome these constraints, this paper proposes JMLLM, a multimodal jailbreaking method that integrates multiple strategies to perform comprehensive jailbreak attacks across the text, visual, and auditory modalities. Additionally, we contribute TriJail, a new and comprehensive dataset for multimodal jailbreaking research that includes jailbreak prompts for all three modalities. Experiments on TriJail and the benchmark dataset AdvBench, conducted on 13 popular LLMs, demonstrate advanced attack success rates and a significant reduction in time overhead.
