Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

TL;DR

This work targets safety-alignment in Large Audio Language Models (LALMs) and the adverse effect of over-rejection on usefulness. It introduces Reshaping Representation Space (RRS), an unsupervised fine-tuning approach that identifies safety-critical representation features and optimizes cluster-distance to relocate harmful inputs into a refusal zone while preserving benign responses. Empirical results across three Qwen LALMs and three input modes show that RRS achieves competitive or superior safety improvements with only a small average rise in over-rejection (0.88%), and maintains speech chatting performance in several configurations. The method leverages a compact safety dataset and a feature-selection mechanism (Top-m%) to drive targeted, representation-level adjustments without requiring large-scale alignment data, offering a practical path to safer LALMs in audio-enabled contexts.
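The two ingredients named above, Top-m% feature selection and a cluster distance, can be pictured with a small sketch. This is not the authors' implementation: the selection criterion (largest absolute gap between the harmful and benign centroids), the Euclidean distance, the toy vectors, and the helper names `centroid`, `top_m_percent_features`, and `cluster_distance` are all assumptions made for illustration. The summary does not specify how RRS scores features or defines its objective.

```python
from statistics import fmean


def centroid(vectors):
    """Mean vector of a list of equal-length feature vectors."""
    return [fmean(column) for column in zip(*vectors)]


def top_m_percent_features(harmful, benign, m):
    """Indices of the m% of dimensions with the largest centroid gap.

    Assumed criterion for illustration only; the paper's actual
    feature-selection rule may differ.
    """
    gap = [abs(h - b) for h, b in zip(centroid(harmful), centroid(benign))]
    k = max(1, round(len(gap) * m / 100))
    ranked = sorted(range(len(gap)), key=lambda i: gap[i], reverse=True)
    return sorted(ranked[:k])


def cluster_distance(harmful, benign, dims):
    """Euclidean distance between the two centroids, restricted to dims."""
    ch, cb = centroid(harmful), centroid(benign)
    return sum((ch[i] - cb[i]) ** 2 for i in dims) ** 0.5


# Toy 4-dimensional "representations" (hypothetical values).
harmful = [[0.9, 0.1, 2.0, 0.5], [1.1, -0.1, 2.2, 0.4]]
benign = [[1.0, 0.0, -2.0, 0.5], [1.0, 0.2, -1.8, 0.6]]

dims = top_m_percent_features(harmful, benign, m=25)
dist = cluster_distance(harmful, benign, dims)
print(dims, dist)
```

In this toy setting only the third dimension separates the two groups, so it is the one selected. A fine-tuning objective in this spirit would adjust the selected dimensions of harmful inputs while leaving the remaining dimensions, and the benign inputs, largely untouched.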

Abstract

Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, which significantly compromises helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy: it reshapes the model's representation space to enhance the safety-alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three input-modality conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Based on the visualisation of Qwen-Audio in AIAH [aiah], we draw a simple representation space to illustrate the safety-alignment states of models.
  • Figure 2: t-SNE visualisation of the representations of harmful and benign questions during the RRS fine-tuning of Qwen-Audio. Epoch 0 denotes the representation space generated by the vanilla model. Red and blue denote harmful and benign questions, respectively.
  • Figure 3: Dataset structures of Basic, Mirror, and Parallel.