Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models

Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

TL;DR

This work comprehensively red-teams the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks.

Abstract

Large Multimodal Models (LMMs) have demonstrated the ability to interact with humans under real-world conditions by combining Large Language Models (LLMs) and modality encoders to align multimodal information (visual and auditory) with text. However, such models raise a new safety question: do models that are safety-aligned on text also exhibit consistent safeguards for multimodal inputs? Despite recent safety-alignment research on vision LMMs, the safety of audio LMMs remains under-explored. In this work, we comprehensively red-team the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks. Our results under these settings demonstrate that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions and exhibit safety vulnerabilities when distracted with non-speech audio noise. Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark. We provide insights into what could cause these reported safety misalignments. Warning: this paper contains offensive examples.

Paper Structure

This paper contains 26 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: t-SNE visualisation of representation of harmful vs. benign questions (see the analysis section). The harmful/benign_text (red and yellow) denotes audio LMMs with text questions; harmful/benign_audio (green and cyan) denotes audio LMMs with audio questions; harmful/benign_llm (violet and pink) denotes backbone LLMs with text questions.
  • Figure 2: The percentage of harmful/safe responses beginning with specific words (%).
  • Figure 3: The ASR-q for non-speech audio input injections across different audio lengths (2-14 seconds).
  • Figure 4: t-SNE visualisation of representation space on types of non-speech audio input. 0s-Harmful/Benign (red and yellow) denote only harmful/benign text question input without non-speech audio. The rest of the representations denote the audio length with the lowest and highest ASR-q across types of non-speech audio.
  • Figure 5: The shape of the representation space on SALMONN-7B under various lengths of silent audio input. "0s" denotes no audio input.
  • ...and 8 more figures