Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models
Lam Pham, Phat Lam, Truong Nguyen, Huyen Nguyen, Alexander Schindler
TL;DR
This work tackles deepfake audio detection in the Internet of Sounds (IoS) by evaluating a broad set of spectrogram-based features and three deep learning paradigms (end-to-end training, transfer learning, and audio embeddings) unified under an ensemble framework. Six spectrogram variants derived from STFT, CQT, and WT with multiple auditory filters are combined with delta features to form compact 64×64 representations, enabling diverse pattern learning. Among end-to-end models, CNNs perform best, while pre-trained image networks and audio-embedding models (Whisper, Seamless, Speechbrain, Pyannote) contribute complementary strengths; ensembles using mean fusion across segments and models yield the strongest results. On the ASVspoof 2019 Logical Access data, the proposed ensemble achieves an EER of $0.03$ and an AUC of $0.994$, approaching state-of-the-art performance and highlighting the practical viability of spectrogram-centric ensembles for robust deepfake audio detection.
Abstract
In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train our proposed baseline models directly on the spectrograms: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). The second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then fed to a multilayer perceptron (MLP) model that classifies each audio sample as fake or real. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selected spectrograms and deep learning approaches to enhance the task of audio deepfake detection.
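
The fusion and scoring steps mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mean_fusion` and `equal_error_rate`, the `(n_models, n_segments)` input layout, and the convention that a higher score means "more likely fake" are all assumptions made for illustration. The EER is approximated at the threshold where the miss rate and the false-alarm rate are closest.

```python
import numpy as np


def mean_fusion(segment_probs):
    """Fuse per-segment fake probabilities from several models (hypothetical helper).

    segment_probs: array-like of shape (n_models, n_segments); each entry is
    one model's probability that one segment of an utterance is fake.
    Returns a single utterance-level score.
    """
    probs = np.asarray(segment_probs, dtype=float)
    per_segment = probs.mean(axis=0)   # average across models
    return float(per_segment.mean())   # then average across segments


def equal_error_rate(scores, labels):
    """Approximate the EER, assuming label 1 = fake and higher score = more fake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake = scores[labels == 1]
    real = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "flag nothing as fake" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A sample is flagged as fake when score >= threshold.
    miss_rate = np.array([(fake < t).mean() for t in thresholds])
    false_alarm = np.array([(real >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(miss_rate - false_alarm)))
    return float((miss_rate[idx] + false_alarm[idx]) / 2.0)


if __name__ == "__main__":
    # Two hypothetical models scoring three segments of one utterance.
    print(mean_fusion([[0.9, 0.8, 0.7], [0.7, 0.6, 0.5]]))
    # Toy utterance-level scores with ground-truth labels (1 = fake).
    print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

Because every model contributes the same number of segments in this layout, averaging over models and then over segments is equivalent to a single mean over all entries. The two-step form is kept only to mirror the "across segments and models" description.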
