A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection
Xuechen Liu, Xin Wang, Junichi Yamagishi
TL;DR
Spoofing countermeasures (CMs) trained on short, single-speaker audio struggle in real-world, long-form settings with multiple speakers and diverse processing. The paper presents a pipeline that creates long-form spoof audio by concatenating $N=10$ short samples, of which $M\in [0,10]$ are spoof segments, yielding a genuine ratio $M:(N-M)$, and evaluates AASIST-based CMs on four long-form evaluation sets derived from the ASVspoof2019 LA data. Models trained only on short-form data degrade sharply on long-form data ($EER>9\%$ without processing and $EER>45\%$ with processing), while incorporating long-form training data substantially improves performance (e.g., $EER<6\%$ at a $0:10$ genuine ratio for certain models). The work highlights the need for long-form, multi-speaker, and processing-diverse training data to build robust, real-world audio deepfake detectors, and it suggests directions for spoof localization and further dataset development. Overall, the study helps bridge the gap between controlled datasets and real-world audio with longer duration and overlap.
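The concatenation step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `build_long_form`, the random ordering of segments, and the representation of waveforms as plain lists of samples are all assumptions made for illustration, and the sketch omits speaker selection and any audio processing.

```python
import random


def build_long_form(bona_fide, spoofed, n_segments=10, n_spoof=3, seed=0):
    """Concatenate n_segments short clips, n_spoof of which are spoofed.

    bona_fide and spoofed are lists of waveforms (each a list of samples).
    Returns the long-form waveform and a list of (start, end, is_spoof)
    tuples giving each segment's sample range and label.
    Hypothetical helper: segment order is shuffled, which is an assumption.
    """
    if not 0 <= n_spoof <= n_segments:
        raise ValueError("n_spoof must lie in [0, n_segments]")

    rng = random.Random(seed)
    # Draw clips without replacement from each pool and tag them with a label.
    chosen = [(clip, 1) for clip in rng.sample(spoofed, n_spoof)]
    chosen += [(clip, 0) for clip in rng.sample(bona_fide, n_segments - n_spoof)]
    rng.shuffle(chosen)

    audio, segments, start = [], [], 0
    for clip, is_spoof in chosen:
        audio.extend(clip)
        segments.append((start, start + len(clip), is_spoof))
        start += len(clip)
    return audio, segments


if __name__ == "__main__":
    # Toy data: constant-valued "waveforms" of different lengths.
    bona = [[0.0] * (i + 1) for i in range(10)]
    spoof = [[1.0] * (i + 1) for i in range(10)]
    audio, segments = build_long_form(bona, spoof, n_segments=10, n_spoof=4)
    print(len(audio), segments)
```

The per-segment boundaries returned alongside the waveform are kept because they would be needed for the spoof localization direction the summary mentions; a utterance-level label can be derived from them if only detection is of interest.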
Abstract
Audio spoofing detection has become increasingly important due to the rise in real-world cases. Current spoofing detectors, referred to as spoofing countermeasures (CMs), are mainly trained on and designed for audio waveforms with a single speaker and short duration. This study explores spoofing detection in more realistic scenarios, where the audio is long in duration and features multiple speakers and complex acoustic conditions. We test the widely used AASIST under this challenging scenario, examining how variations in duration, speaker presence, and acoustic complexity affect CM performance. Our work reveals key issues with current methods and suggests preliminary ways to improve them. We aim to make spoofing detection more applicable to in-the-wild scenarios. This research serves as an important step towards developing detection systems that can handle the challenges of audio spoofing in real-world applications.
