HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing

Xuenan Xu, Yiming Ren, Liwei Liu, Wen Wu, Baoxiang Li, Chaochao Lu, Shuai Wang, Chao Zhang

TL;DR

Most spoofing research treats manipulation detection as a binary task. HoliAntiSpoof instead uses an audio large language model (ALLM) to perform holistic spoofing analysis, unifying authenticity, spoofing method, localized regions, and semantic influence in a single text-generation objective. It introduces DailyTalkEdit and extends PartialEdit with semantic annotations to support reasoning about the semantic effects of spoofing, and it shows that an ALLM-based framework can outperform conventional baselines in both in-domain and out-of-domain evaluations. The model is trained via supervised fine-tuning and enhanced with in-context learning to improve generalization to unseen domains and languages, showing promising zero-shot adaptation capabilities. The work contributes a holistic framework, two semantically annotated datasets, and evidence that spoofing-oriented features complement semantic-rich ALLM representations, enabling more interpretable and robust speech security analysis. Data and code are publicly available for researchers and practitioners in speech security and safety.

Abstract

Recent advances in speech synthesis and editing have made speech spoofing detection increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
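
To make the "unified text generation" framing concrete, the following is a minimal sketch of how the four outputs named in the TL;DR (authenticity, spoofing method, localized regions, semantic influence) could be serialized into one text target and parsed back from a model's output. The class name, field names, label values, and the choice of JSON are all assumptions made for illustration; the page does not specify the paper's actual output format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class SpoofAnalysis:
    """Hypothetical container for one utterance's holistic analysis."""
    authenticity: str                          # assumed labels: "real" or "fake"
    spoof_method: Optional[str] = None         # assumed labels, e.g. "tts", "cut_and_paste"
    regions: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)
    semantic_influence: Optional[str] = None   # free-text description of the semantic effect


def to_target_text(analysis: SpoofAnalysis) -> str:
    """Serialize the analysis into a single string usable as a generation target."""
    return json.dumps(asdict(analysis), sort_keys=True)


def from_generated_text(text: str) -> SpoofAnalysis:
    """Parse a generated string back into the structured form."""
    data = json.loads(text)
    return SpoofAnalysis(
        authenticity=data["authenticity"],
        spoof_method=data.get("spoof_method"),
        # JSON has no tuple type, so regions come back as lists; restore tuples.
        regions=[tuple(r) for r in data.get("regions", [])],
        semantic_influence=data.get("semantic_influence"),
    )
```

Under this sketch, a single generated string carries every subtask's answer, so one decoding pass yields the real/fake decision together with the method, regions, and semantic description.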

Paper Structure (34 sections, 4 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Holistic spoofing analysis incorporates comprehensive speech perception and understanding to perform multiple subtasks beyond real/fake classification.
  • Figure 2: Overview of HoliAntiSpoof. The ALLM is first fine-tuned to generate structured text that includes both signal-level and semantic-level spoofing analysis. The incorporation of spoofing-oriented features is also explored. The LLM is then further fine-tuned to enable in-context learning (ICL).
  • Figure 3: The data construction pipeline of the DailyTalkEdit dataset and the annotation of spoofing semantic influence. A dual-agent workflow makes contextually coherent modifications to dialogues in DailyTalk for subsequent TTS and cut-and-paste spoofing. The spoofed text is then fed to Gemini to annotate its potential impact.
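
The cut-and-paste step mentioned in the Figure 3 caption can be illustrated with a minimal sketch: a span of the original waveform is swapped for a replacement segment (for example, synthesized speech for the edited words), and the spoofed region is recorded in seconds so it can serve as a localization label. The function name, the plain-list waveform representation, and the absence of any boundary smoothing are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def cut_and_paste(
    original: List[float],
    replacement: List[float],
    start: int,
    end: int,
    sample_rate: int,
) -> Tuple[List[float], Tuple[float, float]]:
    """Replace original[start:end] with `replacement` (hypothetical helper).

    Returns the spliced waveform and the spoofed region as (start_s, end_s)
    measured on the spliced waveform.
    """
    if not (0 <= start <= end <= len(original)):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(original)")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    # Keep the audio before and after the cut; insert the new segment between.
    spoofed = original[:start] + replacement + original[end:]

    # The spoofed region spans exactly the inserted samples.
    region = (start / sample_rate, (start + len(replacement)) / sample_rate)
    return spoofed, region
```

A real pipeline would likely also match loudness and smooth the splice boundaries; those steps are omitted here to keep the region bookkeeping visible.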