How Does Instrumental Music Help SingFake Detection?
Xuanjun Chen, Chia-Yu Hu, I-Ming Lin, Yi-Cheng Lin, I-Hsiang Chiu, You Zhang, Sung-Feng Huang, Yi-Hsuan Yang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
TL;DR
This work examines how instrumental accompaniment affects SingFake detection by combining behavioral analyses across multiple backbones with representational probing of the encoders. It shows that instrumental music serves mainly as data augmentation rather than as a source of intrinsic musical cues, and that models rely predominantly on low-frequency vocal information. Fine-tuning for SingFake detection shifts representations toward speaker-specific features while diminishing content, paralinguistic, and semantic capabilities. These findings offer practical guidance for building more interpretable and robust SingFake detectors that depend less on superficial cues and are less sensitive to accompaniment artifacts.
Abstract
Although many models exist to detect singing voice deepfakes (SingFake), how these models operate, particularly with instrumental accompaniment, is unclear. We investigate how instrumental music affects SingFake detection from two perspectives. To investigate the behavioral effect, we test different backbones, unpaired instrumental tracks, and frequency subbands. To analyze the representational effect, we probe how fine-tuning alters encoders' speech and music capabilities. Our results show that instrumental accompaniment acts mainly as data augmentation rather than providing intrinsic cues (e.g., rhythm or harmony). Furthermore, fine-tuning increases reliance on shallow speaker features while reducing sensitivity to content, paralinguistic, and semantic information. These insights clarify how models exploit vocal versus instrumental cues and can inform the design of more interpretable and robust SingFake detection systems.
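To make the two behavioral manipulations named in the abstract concrete, here is a minimal sketch of (a) mixing a vocal with an unpaired instrumental track and (b) restricting a signal to one frequency subband. This is an illustration under stated assumptions, not the authors' pipeline: the function names `mix_at_snr` and `keep_subband`, the use of a vocal-to-instrumental ratio in dB, and the FFT-mask filter are all hypothetical choices, since this section does not specify how mixing or band-limiting was done.

```python
import numpy as np

def mix_at_snr(vocal, instrumental, snr_db):
    """Add an (unpaired) instrumental track to a vocal at a target
    vocal-to-instrumental power ratio in dB. Hypothetical helper."""
    n = len(vocal)
    # Loop and trim the instrumental so both signals have the same length.
    reps = int(np.ceil(n / len(instrumental)))
    inst = np.tile(instrumental, reps)[:n]
    vocal_power = np.mean(vocal ** 2)
    inst_power = np.mean(inst ** 2) + 1e-12  # guard against silent tracks
    # Choose the gain so that 10*log10(vocal_power / scaled_inst_power) == snr_db.
    gain = np.sqrt(vocal_power / (inst_power * 10 ** (snr_db / 10)))
    return vocal + gain * inst

def keep_subband(signal, sample_rate, low_hz, high_hz):
    """Zero every FFT bin outside [low_hz, high_hz) and return the
    time-domain result. A crude brick-wall filter, assumed for illustration."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs < high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))
```

With helpers like these, one could feed a detector the same vocal paired with different instrumentals, or with only its low-frequency band retained, and compare the scores. A production setup would more likely use a proper filter design rather than a brick-wall FFT mask, which introduces ringing.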
