CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-spoofing Countermeasures

Xueping Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li

TL;DR

This work addresses component-level audio spoofing by introducing the CompSpoof dataset and a separation-enhanced joint learning framework. The approach first separates speech and environmental sound with a UNet-based module, then applies a dedicated anti-spoofing model to each component. A joint objective trains the separation and detection stages together so that the separated signals retain spoof-relevant information, and the per-component results are combined into a final five-class decision. Experiments show that the separation-enhanced joint learning framework (SEF+JL) significantly outperforms the baselines, especially on mixed-content cases, which supports the need for per-component detection and joint optimization. The publicly released dataset and code enable research on fine-grained, component-wise defenses in realistic audio security scenarios.

Abstract

Component-level audio spoofing (Comp-Spoof) is a new form of audio manipulation in which only specific components of a signal, such as speech or environmental sound, are forged or substituted while the other components remain genuine. Existing anti-spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component-level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation-enhanced joint learning framework that separates the audio components and applies an anti-spoofing model to each one. The separation and anti-spoofing models are trained jointly so that the separated signals preserve the information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating the components and the importance of detecting spoofing in each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.
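
The joint learning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of mean squared error as the separation loss, the binary bona fide/spoof heads, and the weights `w_sep` and `w_cls` are all assumptions made for illustration. It only shows how a separation term and two per-component detection terms might be summed into one training objective. The final five-class decision is omitted because the class definitions are not given here.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def mse(estimate, reference):
    """Mean squared error between a separated signal and its reference."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)

def joint_loss(sep_speech, ref_speech, sep_env, ref_env,
               speech_logits, speech_label, env_logits, env_label,
               w_sep=1.0, w_cls=1.0):
    """Hypothetical joint objective: separation loss plus per-component
    anti-spoofing losses. The loss types and weights are assumptions."""
    # Separation term: how well each separated component matches its reference.
    separation = mse(sep_speech, ref_speech) + mse(sep_env, ref_env)
    # Detection term: one bona fide/spoof classifier per component.
    detection = (cross_entropy(speech_logits, speech_label)
                 + cross_entropy(env_logits, env_label))
    return w_sep * separation + w_cls * detection

# Toy example: perfect separation, undecided classifiers (label 0 = bona fide).
loss = joint_loss([0.1, 0.2], [0.1, 0.2], [0.0, 0.3], [0.0, 0.3],
                  [0.0, 0.0], 0, [0.0, 0.0], 0)
print(loss)
```

Because both terms feed one scalar, gradients from the detection losses would also reach the separation module in a real training setup. That is one plausible way for separation to be steered toward keeping spoof-relevant cues.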

Paper Structure

This paper contains 10 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the baseline and proposed separation-enhanced joint learning framework. "$\rightarrow$" illustrates the joint learning data flow between the separation and anti-spoofing models.