Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System

Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik

TL;DR

The paper tackles robust audio deepfake detection under diverse, real-world conditions in the context of the SAFE Challenge, using a multilingual dataset integration strategy. It couples SSL front-ends (notably WavLM Large and MAE-AST Frame) with the AASIST back-end and uses RawBoost augmentation to improve cross-domain generalization. Through four iterative experiments integrating six complementary datasets and varying audio lengths, the approach achieves top-tier SAFE Challenge performance (2nd–3rd place) and strong ITW generalization (EER reduced from 35.61% to 8.42%). The findings highlight the value of multilingual, multi-domain training for detecting unmodified, processed, and laundered audio, while acknowledging that laundered audio remains a persistent challenge requiring further investigation.

Abstract

The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach incorporates a WavLM Large front-end with RawBoost augmentation, trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. Through extensive experimentation with different SSL front-ends, three training data versions, and two audio lengths, we achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness.
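
The generalization result quoted in the summary above is stated as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how such a figure can be computed from detector scores, the snippet below sweeps every observed score as a threshold. It is an illustration only, not the authors' evaluation code: the function name `compute_eer` and the convention that higher scores mean "more likely bona fide" are assumptions.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'bona fide'.

    Hypothetical helper for illustration; not taken from the paper.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score observed in either class.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores: one bona fide sample (0.4) and one spoof (0.6) overlap.
    eer, thr = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With only a finite set of scores the two error curves may not cross exactly, so this sketch averages them at the closest point; evaluation toolkits often interpolate instead, which can give slightly different values.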

Paper Structure

This paper contains 22 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: UMAP visualizations of SSL model representations for ASVspoof 2019 LA eval data. Top: Comparison of MAE-AST Frame, WavLM Large, and SSAST Patch Base models. Bottom: Additional SSL model comparisons including NPC 960hr, Mockingjay, and XLSR-53. Green represents bona fide samples; other colors represent various deepfake types.