'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs

Chun Wai Chiu, Linghan Huang, Bo Li, Huaming Chen, Kim-Kwang Raymond Choo

TL;DR

This work identifies a novel audio-based jailbreak vulnerability in multimodal LLMs and introduces Flanking Attack, a voice-driven technique that places adversarial prompts within benign narratives to bypass content filters. A semi-automated evaluation framework, using an aligned multimodal LLM for policy-violation detection, demonstrates an average Attack Success Rate of approximately 0.81 across seven forbidden categories and 2,100 prompts. The study systematically analyzes the effectiveness of multi-layered prompts, including Text Prompt and Flanking Attack, and shows that component ablations reduce vulnerability, underscoring the need for robust defenses in audio-enabled LLMs. The authors discuss future directions including audio variation, prompt structure, and multilingual extensions to further stress-test and strengthen safety mechanisms for multimodal AI systems.

Abstract

Large Language Models (LLMs) have seen widespread application across various domains due to their growing ability to process diverse types of input data, including text, audio, image, and video. While LLMs have demonstrated outstanding performance in understanding and generating contexts for different scenarios, they are vulnerable to prompt-based attacks, most of which are delivered via text input. In this paper, we introduce the first voice-based jailbreak attack against multimodal LLMs, termed Flanking Attack, which can deliver different types of input to a multimodal LLM simultaneously. Our work is motivated by recent advancements in monolingual voice-driven large language models, which have introduced new attack surfaces beyond traditional text-based vulnerabilities. To investigate these risks, we examine state-of-the-art multimodal LLMs, which can be accessed via different types of input such as audio, focusing on how adversarial prompts can bypass their defense mechanisms. We propose a novel strategy in which the disallowed prompt is flanked by benign, narrative-driven prompts. This strategy is integrated into the Flanking Attack, which attempts to humanize the interaction context and execute the attack through a fictional setting. Further, to better evaluate attack performance, we present a semi-automated self-assessment framework for policy violation detection. We demonstrate that Flanking Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs, achieving average attack success rates ranging from 0.67 to 0.93 across seven forbidden scenarios.
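
The attack success rate reported above can be read as the fraction of prompts in a forbidden scenario whose response is judged to violate policy. The following is a minimal sketch of that bookkeeping, assuming each judgement is a (scenario, violated) pair. The function name, the scenario labels, and the toy data are hypothetical and are not taken from the paper, and the unweighted mean over scenarios is one possible way to average, not necessarily the authors' choice.

```python
from collections import defaultdict


def attack_success_rate(judgements):
    """Compute per-scenario ASR and the unweighted mean over scenarios.

    `judgements` is an iterable of (scenario, violated) pairs, where
    `violated` is True if the response was judged a policy violation.
    This input format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for scenario, violated in judgements:
        totals[scenario] += 1
        hits[scenario] += int(violated)
    if not totals:
        raise ValueError("no judgements supplied")
    # ASR per scenario: violating responses divided by all responses.
    per_scenario = {s: hits[s] / totals[s] for s in totals}
    # Unweighted (macro) average across scenarios.
    macro_average = sum(per_scenario.values()) / len(per_scenario)
    return per_scenario, macro_average


# Toy data with placeholder scenario names (not results from the paper).
toy = [
    ("scenario_a", True),
    ("scenario_a", False),
    ("scenario_b", True),
    ("scenario_b", True),
]
per_scenario, macro_average = attack_success_rate(toy)
print(per_scenario, macro_average)
```

With an equal number of prompts per scenario, the unweighted mean coincides with the overall fraction of violating responses; with unequal counts the two differ, so a report should state which one it uses.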

Paper Structure

This paper contains 41 sections, 26 figures, 3 tables.

Figures (26)

  • Figure 1: Example prompt and completions for refusals on disallowed categories.
  • Figure 2: Overview of Voice Jailbreak [chu2024comprehensive]
  • Figure 3: Responses to a malicious instruction by the LLAMA2-7B-CHAT model under different generation configurations [huang2023catastrophic].
  • Figure 4: A taxonomy of concepts covered in the survey [shayegani2023survey].
  • Figure 5: Adversarial Embedding Space Attack [russinovich2024great]
  • ...and 21 more figures