Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Abstract

Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
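The bandwidth argument above follows directly from the Nyquist theorem: audio sampled at rate `sr` can only represent frequencies up to `sr / 2`. A minimal sketch (the helper name is illustrative, the band edges match those stated in the paper):

```python
# Nyquist theorem: a signal sampled at sample_rate_hz can only
# represent frequency content up to sample_rate_hz / 2.
def nyquist_hz(sample_rate_hz: float) -> float:
    return sample_rate_hz / 2.0

# Conventional detectors: 16 kHz audio covers only 0-8 kHz.
assert nyquist_hz(16_000) == 8_000.0

# High-resolution audio: 44.1 kHz covers the full 0-22.05 kHz range.
assert nyquist_hz(44_100) == 22_050.0

# Bandwidth discarded by resampling 44.1 kHz audio down to 16 kHz.
lost_bandwidth_hz = nyquist_hz(44_100) - nyquist_hz(16_000)
print(lost_bandwidth_hz)  # 14050.0
```

Any synthesis artifact living in that discarded 8-22.05 kHz band is unrecoverable once the audio has been resampled to 16 kHz, which is why the study works on the original 44.1 kHz signal.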

Paper Structure

This paper contains 21 sections, 4 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparison of audio spectral coverage under different sampling rates. Existing systems typically process 16 kHz sampled audio, restricting them to the speech-critical band (0–8 kHz) and discarding high-frequency details. In contrast, our approach utilizes 44.1 kHz audio to cover the full spectral range (0–22.05 kHz). This preserves extended harmonics and breath textures essential for detecting sophisticated singing forgeries that are mathematically invisible at lower sampling rates.
  • Figure 2: The overview of our proposed Sing-HiResNet framework. The framework is implemented in two stages: Phase 1 establishes the backbone for fullband and subband expert models, while Phase 2 facilitates their integration through various joint fusion processes.
  • Figure 3: EER (%) results across two categorization schemes. The left columns present a method-centric view to highlight frequency impact, while the right columns provide a condition-centric comparison of integration strategies across Test A and Test B.
  • Figure 4: Grad-CAM visualizations of expert and distilled models for bonafide and deepfake samples, featuring (a) single-teacher (Low) and (b) dual-teacher (Low/Mid-High) distillation. White dashed lines mark the corresponding subband boundaries.
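The fullband-subband design described in Figure 2 can be sketched as a frequency-axis split: the fullband branch sees the whole spectrogram for global context, while each subband expert sees only its slice. The two-band split, the 2048-point FFT size, and the 8 kHz band edge below are illustrative assumptions for the sketch, not the paper's exact configuration:

```python
import numpy as np

def split_bands(spec: np.ndarray, edge_bin: int):
    """Split a (freq_bins, frames) spectrogram into low and mid-high subbands."""
    low = spec[:edge_bin, :]       # low band, e.g. roughly 0-8 kHz
    mid_high = spec[edge_bin:, :]  # remaining high-frequency content
    return low, mid_high

# Toy magnitude spectrogram: 1025 frequency bins
# (2048-point FFT at 44.1 kHz), 100 time frames.
rng = np.random.default_rng(0)
fullband = np.abs(rng.standard_normal((1025, 100)))

# FFT bin closest to an 8 kHz band edge: 8000 / 22050 * 1024 ~ 372.
low, mid_high = split_bands(fullband, edge_bin=372)

# Fullband branch keeps everything; subband experts get their slices.
print(fullband.shape, low.shape, mid_high.shape)
# (1025, 100) (372, 100) (653, 100)
```

Phase 1 of the framework would train the fullband model and each subband expert on their respective inputs; Phase 2 then fuses their representations, which is where the complementary high-frequency cues enter the final decision.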