Continuous Learning of Transformer-based Audio Deepfake Detection
Tuan Duy Nguyen Le, Kah Kuan Teh, Huy Dat Tran
TL;DR
This work introduces an Audio Spectrogram Transformer (AST) framework for high-accuracy audio deepfake detection, trained on over 2 million synthetic samples with diverse augmentations to ensure robustness to distortions and low-quality audio. A novel continuous-learning plugin, combining AST embeddings with gradient-boosted decision trees, enables rapid adaptation to new fake types using minimal labeled data, outperforming traditional fine-tuning. The approach achieves state-of-the-art results on standard benchmarks (e.g., ASVspoof 2019) and demonstrates improved generalization to unseen methods and degraded audio, with substantial gains observed through the continuous-learning pipeline. The proposed methodology offers a practical path for industrial deployment by maintaining performance while quickly incorporating emerging deepfake techniques via semi-supervised labeling and targeted fine-tuning.
Abstract
This paper proposes a novel framework for audio deepfake detection with two main objectives: i) attaining the highest possible accuracy on available fake data, and ii) performing effective continuous learning on new fake data in a few-shot manner. Specifically, we build a large audio deepfake collection using various deep audio generation methods. The data is further augmented to cover variations caused by compression, far-field recording, noise, and other distortions. We then adopt the Audio Spectrogram Transformer as the audio deepfake detection model. The resulting model achieves promising performance on various benchmark datasets. Furthermore, we present a continuous learning plugin module that updates the trained model effectively using as few labeled data points of the new fake type as possible. The proposed method outperforms the conventional direct fine-tuning approach while using far fewer labeled data points.
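
The plugin idea summarized in the TL;DR (AST embeddings fed to gradient-boosted decision trees) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings come from a frozen AST encoder and replaces them with simulated Gaussian vectors so that the example runs without audio data or model weights. The embedding size, the number of shots, the class separation, the scikit-learn classifier, and the helper names `simulate_embeddings` and `fit_plugin` are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def simulate_embeddings(n, dim, shift, rng):
    """Stand-in for frozen AST embeddings (assumption): Gaussian vectors
    whose mean depends on the class (0.0 = bona fide, shifted = new fake)."""
    return rng.normal(loc=shift, scale=1.0, size=(n, dim))


def fit_plugin(embeddings, labels, seed=0):
    """Fit a small gradient-boosted tree classifier on top of fixed embeddings.
    Hyperparameters are illustrative, not taken from the paper."""
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=seed)
    clf.fit(embeddings, labels)
    return clf


rng = np.random.default_rng(0)
dim = 32       # hypothetical embedding size, kept small for speed
n_shot = 20    # hypothetical number of labeled examples per class

# Few-shot training set: a handful of bona fide and new-fake embeddings.
X_train = np.vstack([
    simulate_embeddings(n_shot, dim, 0.0, rng),
    simulate_embeddings(n_shot, dim, 2.0, rng),
])
y_train = np.array([0] * n_shot + [1] * n_shot)

# Larger held-out set drawn from the same simulated distributions.
n_test = 200
X_test = np.vstack([
    simulate_embeddings(n_test, dim, 0.0, rng),
    simulate_embeddings(n_test, dim, 2.0, rng),
])
y_test = np.array([0] * n_test + [1] * n_test)

plugin = fit_plugin(X_train, y_train)
accuracy = float((plugin.predict(X_test) == y_test).mean())
print(f"held-out accuracy on simulated embeddings: {accuracy:.3f}")
```

Because the encoder stays fixed in this sketch, only the lightweight tree model is retrained when a new fake type appears. That is one plausible reason such a plugin could need fewer labels than fine-tuning the full transformer. The simulated data says nothing about real-world accuracy.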
