HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing
Xuenan Xu, Yiming Ren, Liwei Liu, Wen Wu, Baoxiang Li, Chaochao Lu, Shuai Wang, Chao Zhang
TL;DR
HoliAntiSpoof addresses a gap in speech anti-spoofing research, which typically treats spoofing detection as a binary task. It uses an audio large language model (ALLM) to perform holistic spoofing analysis that unifies authenticity, spoofing method, localized regions, and semantic influence into a single text-generation objective. The work introduces DailyTalkEdit and extends PartialEdit with semantic annotations to support reasoning about the semantic effects of spoofing, and it demonstrates that an ALLM-based framework can outperform conventional baselines in both in-domain and out-of-domain evaluations. The model is trained via supervised fine-tuning and enhanced with in-context learning to improve generalization to unseen domains and languages, showing promising zero-shot adaptation capabilities. The contributions are a holistic analysis framework, two semantics-focused datasets, and evidence that spoofing-oriented features complement semantically rich ALLM representations, enabling more interpretable and robust speech security analysis. Data and code are publicly available to researchers and practitioners in speech security and safety.
Abstract
Recent advances in speech synthesis and editing have made speech spoofing detection increasingly challenging. However, most existing methods treat spoofing detection as binary classification, overlooking the fact that diverse spoofing techniques manipulate multiple, coupled speech attributes and ignoring the semantic effects of those manipulations. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
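To make the idea of a unified text generation target concrete, here is a minimal sketch of how the four aspects named above (authenticity, spoofing method, localized regions, semantic influence) could be serialized into one string for a model to generate. The schema, the field names, the JSON encoding, and the helper functions are all assumptions chosen for illustration; the abstract does not specify the actual output format used by HoliAntiSpoof.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class SpoofAnnotation:
    """Hypothetical label schema; field names are illustrative, not the paper's."""
    authenticity: str                          # "real" or "spoof"
    method: Optional[str] = None               # e.g. "tts" or "editing"
    regions: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)
    semantic_influence: Optional[str] = None   # free-text description of the effect


def to_target_text(ann: SpoofAnnotation) -> str:
    """Serialize all four fields into a single string usable as a generation target."""
    return json.dumps(asdict(ann), sort_keys=True)


def from_generated_text(text: str) -> SpoofAnnotation:
    """Parse a generated string back into the structured annotation."""
    data = json.loads(text)
    # JSON has no tuple type, so restore (start, end) pairs explicitly.
    data["regions"] = [tuple(region) for region in data["regions"]]
    return SpoofAnnotation(**data)


example = SpoofAnnotation(
    authenticity="spoof",
    method="editing",
    regions=[(1.2, 2.5)],
    semantic_influence="An inserted negation reverses the speaker's answer.",
)
target = to_target_text(example)
```

With a target of this kind, a binary detector's single label becomes just one field of the output, and the remaining fields can be scored separately after parsing the generated text.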
