Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment
Kun Wang, Zherui Li, Zhenhong Zhou, Yitong Zhang, Yan Mi, Kun Yang, Yiming Zhang, Junhao Dong, Zhongxiang Sun, Qiankun Li, Yang Liu
TL;DR
This work reveals a critical cross-modality safety gap in omni-modal LLMs (OLLMs) by decoupling modality from semantics with the AdvBench-Omni dataset. Mechanistic analysis uncovers a Mid-layer Dissolution of refusal signals, driven mainly by shrinkage in refusal-vector magnitude under cross-modal inputs. The work identifies a modality-invariant pure refusal direction via SVD and introduces OmniSteer, an adaptive, layer-wise steering mechanism that uses lightweight adapters to apply this direction dynamically. Across three OLLMs and eight datasets, OmniSteer raises the Refusal Success Rate from 69.9% to 91.2% while preserving the Benign Acceptance Rate and general capabilities. The approach provides mechanistic insight into cross-modal safety and offers a practical, plug-in tool for safer omni-modal systems, with potential for integration with training-time defenses in future work.
Abstract
Omni-modal Large Language Models (OLLMs) greatly expand LLMs' multimodal capabilities but also introduce cross-modal safety risks. However, a systematic understanding of vulnerabilities in omni-modal interactions is still lacking. To bridge this gap, we establish a modality-semantics decoupling principle and construct the AdvBench-Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid-layer Dissolution phenomenon driven by shrinkage in refusal-vector magnitude, alongside the existence of a modality-invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition (SVD) and propose OmniSteer, which uses lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also preserves general capabilities across all modalities. Our code is available at: https://github.com/zhrli324/omni-safety-research.
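
As an illustration only, and not the authors' released implementation, the following NumPy sketch shows one way a shared direction could be extracted from per-modality refusal vectors with SVD and then added to a hidden state with a scalar strength. The function names, the synthetic data, and the fixed `alpha` are assumptions made for this example. In OmniSteer the intervention intensity is produced by lightweight adapters, which are not modeled here.

```python
import numpy as np


def shared_refusal_direction(per_modality_vectors):
    """Return a unit vector capturing the dominant direction shared by the rows.

    Each row is assumed to be a refusal vector measured for one modality
    (hypothetical input format for this sketch).
    """
    M = np.asarray(per_modality_vectors, dtype=float)
    # Thin SVD: the first right singular vector is the direction in hidden
    # space that explains the most energy across all rows.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    v = vt[0]
    # The sign of a singular vector is arbitrary; orient it to agree with
    # the mean of the input rows.
    if np.dot(v, M.mean(axis=0)) < 0:
        v = -v
    return v / np.linalg.norm(v)


def steer(hidden, direction, alpha):
    """Add `alpha` times the unit direction to a hidden state."""
    return hidden + alpha * direction


# Synthetic demo: three "modalities" share one direction at different
# magnitudes (mimicking magnitude shrinkage), plus small noise.
rng = np.random.default_rng(0)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
scales = np.array([1.0, 0.6, 0.3])
vectors = scales[:, None] * true_dir + 0.01 * rng.normal(size=(3, dim))

v = shared_refusal_direction(vectors)
cosine = float(v @ true_dir)  # alignment with the planted direction

# Steering a zero hidden state by alpha moves its projection on v by alpha.
steered = steer(np.zeros(dim), v, alpha=2.0)
projection = float(steered @ v)
```

In this toy setting the recovered direction stays close to the planted one even though the per-modality magnitudes differ, which is the intuition behind separating a direction from its magnitude.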
