A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Xuechen Liu; Xin Wang; Junichi Yamagishi

TL;DR

The paper addresses the problem that spoofing countermeasures (CMs) trained on short, single-speaker audio struggle in real-world, long-form settings with multiple speakers and diverse audio processing. It presents a pipeline that creates long-form audio by concatenating $N=10$ short samples, $M\in [0,10]$ of which are spoofed, yielding a genuine ratio of $(N-M):M$, and evaluates AASIST-based CMs on four long-form evaluation sets derived from the ASVspoof2019 LA data. Models trained only on short-form data perform poorly on long-form data (EER exceeding $9\%$ without processing and $45\%$ with processing), while adding long-form training data substantially improves performance (e.g., EER below $6\%$ at a $0:10$ genuine ratio for certain models). The work highlights the need for long-form, multi-speaker, and processing-diverse training data to build robust, real-world audio deepfake detectors, and suggests directions for spoof localization and further dataset development. Overall, the study moves spoofing detection toward practical use by narrowing the gap between controlled datasets and real-world audio with longer duration and overlapping segments.
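The concatenation step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_long_form`, the random ordering of clips, and the representation of each clip as a plain list of samples are all assumptions made for illustration. The genuine ratio is written as genuine:spoof, following the Figure 2 caption, where 10:0 denotes fully genuine audio.

```python
import random


def build_long_form(genuine_pool, spoof_pool, n_total=10, n_spoof=3, seed=0):
    """Concatenate n_total short clips, n_spoof of which are spoofed.

    Each clip is a list of float samples (hypothetical representation).
    Returns (waveform, labels, genuine_ratio), where labels[i] marks the
    i-th source clip as "genuine" or "spoof".
    """
    if not 0 <= n_spoof <= n_total:
        raise ValueError("n_spoof must lie in [0, n_total]")
    rng = random.Random(seed)
    n_genuine = n_total - n_spoof

    # Draw clips (with replacement) from each pool and tag them.
    clips = [(rng.choice(genuine_pool), "genuine") for _ in range(n_genuine)]
    clips += [(rng.choice(spoof_pool), "spoof") for _ in range(n_spoof)]

    # Assumption: clip order within the long-form sample is random.
    rng.shuffle(clips)

    waveform = [sample for clip, _ in clips for sample in clip]
    labels = [label for _, label in clips]
    return waveform, labels, f"{n_genuine}:{n_spoof}"


# Toy usage with constant-valued "clips" of four samples each.
genuine = [[0.1] * 4, [0.2] * 4]
spoof = [[-0.1] * 4]
wave, labels, ratio = build_long_form(genuine, spoof, n_total=10, n_spoof=3)
```

Keeping the per-clip labels alongside the waveform is a design choice of this sketch: it preserves segment-level ground truth, which is what a spoof localization task would need.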

Abstract

Audio spoofing detection has become increasingly important due to the rise in real-world cases. Current spoofing detectors, referred to as spoofing countermeasures (CMs), are mainly trained and evaluated on short audio waveforms containing a single speaker. This study explores spoofing detection in more realistic scenarios, where the audio is long in duration and features multiple speakers and complex acoustic conditions. We test the widely used AASIST under this challenging scenario, examining how variations in duration, speaker presence, and acoustic complexity affect CM performance. Our work reveals key issues with current methods and suggests preliminary ways to improve them. We aim to make spoofing detection more applicable to in-the-wild scenarios. This research serves as an important step towards developing detection systems that can handle the challenges of audio spoofing in real-world applications.

Paper Structure (8 sections, 2 figures, 1 table)

Introduction
Long-form Spoofing Audio Generation
Processing Pipeline
Differences from related databases
Experimental Setup
Results & Discussion
Conclusion
Acknowledgements

Figures (2)

Figure 1: Audio generation and partitioning pipeline from short-duration audio to long-form audio. This can be applied to any audio dataset that contains genuine and spoof partitions. In this study, $N = 10$ and $M \in [0, 10]$. The compression codecs and the additional noise and reverberation effects are listed in Tab. 1.
Figure 2: Performance of trained systems on generated long-form audio with various audio processing and overlap between adjacent source audio segments. The "audio process" in the figure denotes neural codec compression plus added noise and reverberation. The x-axis of each panel shows the genuine ratio of the long-form audio. A genuine ratio of 10:0 corresponds to fully genuine audio.
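The "overlap between adjacent source audio segments" mentioned in the Figure 2 caption can be pictured with the sketch below. It assumes that overlap means each segment starts a fixed number of samples before the previous one ends and that the overlapping samples are summed; the source does not spell out the exact mixing rule, so the function `concat_with_overlap` and its `overlap` parameter are hypothetical.

```python
def concat_with_overlap(clips, overlap):
    """Join clips so that each one starts `overlap` samples before the
    previous one ends. Overlapping samples are summed (assumed mixing rule).

    clips   -- list of clips, each a list of numeric samples
    overlap -- number of overlapping samples between adjacent clips
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    out = []
    for clip in clips:
        # Start position of this clip in the output; never before sample 0.
        start = max(len(out) - overlap, 0)
        for i, sample in enumerate(clip):
            pos = start + i
            if pos < len(out):
                out[pos] += sample   # overlapped region: mix by summing
            else:
                out.append(sample)   # past the end: plain concatenation
    return out


# Two four-sample clips overlapping by two samples.
mixed = concat_with_overlap([[1, 1, 1, 1], [2, 2, 2, 2]], overlap=2)
```

With `overlap=0` the function reduces to plain concatenation, which corresponds to the no-overlap condition.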
