Deepfake Detection of Singing Voices With Whisper Encodings
Falguni Sharma, Priyanka Gupta
TL;DR
This work tackles singing voice deepfake detection (SVDD) by using noise-variant encodings from the Whisper ASR model as discriminative features. The authors report that Whisper encodings outperform standard spectral features on both isolated vocals and mixtures, with a ResNet34 classifier achieving the best results across Whisper model sizes. A key insight is that although Whisper is noise-robust for ASR, its encodings are not noise-invariant: they carry noise-conditioned information that helps distinguish bonafide from deepfake singing. Unseen languages (T04) remain a challenge. The findings point toward more robust SVDD for music and a stronger defense against deepfake singing technologies in real-world applications.
Abstract
The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system that uses noise-variant encodings of OpenAI's Whisper model. Counter-intuitive as it may sound, even though the Whisper model is known to be noise-robust, its encodings are rich in non-speech information and are noise-variant. This motivates us to evaluate Whisper encodings as feature representations for the SVDD task. Accordingly, we perform the SVDD task on vocals and mixtures, and report performance in % EER across varying Whisper model sizes and two classifiers, CNN and ResNet34, under different testing conditions.
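For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate (EER) can be computed from detector scores. It is not the authors' evaluation code: the function name `compute_eer`, the convention that higher scores mean "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two lists of detector scores.

    Assumed convention: a higher score means "more likely bonafide",
    and a trial is accepted when score >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))

    best = None  # (gap between error rates, eer estimate, threshold)
    for t in thresholds:
        # False acceptance rate: deepfakes accepted as bonafide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bonafide trials rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # The EER is taken where the two error rates are closest.
            best = (gap, (far + frr) / 2, t)

    return best[1], best[2]


# Toy scores, invented purely for illustration.
bonafide = [0.9, 0.7, 0.4, 0.8]
spoof = [0.1, 0.3, 0.6, 0.2]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.2f} % at threshold {threshold}")
```

A lower % EER indicates better separation between bonafide and deepfake singing. With finite score sets the two error rates may never be exactly equal, so this sketch reports their average at the threshold where they are closest.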
