SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv

TL;DR

We address the generalization gap in audio deepfake detection caused by spectral bias, where high-frequency artifacts are underutilized. Our approach, SONAR, uses a frequency-guided dual-path architecture that separately encodes low-frequency content and high-frequency residuals, with learnable SRM filters and frequency cross-attention, trained via a Jensen–Shannon divergence loss $L_{JS}$ to align real LF–HF distributions and separate fake ones. SONAR achieves state-of-the-art single-run performance on ASVspoof 2021 (LA/DF) and In-The-Wild benchmarks, while converging in as few as 4–12 epochs, and remains robust to common codecs and bandwidth shifts. The framework is architecture-agnostic and can be integrated into other models or modalities where subtle high-frequency cues drive decisions, turning a fundamental spectral-bias flaw into a practical detection signal.
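To make the role of the Jensen–Shannon term concrete, the following is a minimal sketch and not the paper's actual $L_{JS}$. It assumes that LF and HF embeddings are turned into distributions with a softmax, and that the fake-pair term is $\ln 2 - \mathrm{JS}$ (the JS divergence in nats is bounded by $\ln 2$). The function names `softmax`, `js_divergence`, and `js_alignment_loss` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax; maps embeddings to probability vectors."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence (natural log) between p and q."""
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def js_alignment_loss(z_lf, z_hf, is_real):
    """Assumed loss form: low JS for real LF-HF pairs, high JS for fake pairs.

    z_lf, z_hf: (batch, dim) embeddings from the two paths.
    is_real:    (batch,) boolean labels.
    """
    js = js_divergence(softmax(z_lf), softmax(z_hf))
    # Real pairs are penalized by JS itself; fake pairs by its gap to ln 2.
    per_sample = np.where(is_real, js, np.log(2.0) - js)
    return per_sample.mean()
```

Under these assumptions, a real pair with identical LF and HF distributions contributes zero loss, while a fake pair with identical distributions contributes the maximum penalty of $\ln 2$.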

Abstract

Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same encoder, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to model long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and In-The-Wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR provides a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be integrated into any model or modality where subtle high-frequency cues are decisive.
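The high-pass front end of the HF path can be illustrated with a small NumPy sketch. It follows the Figure 3 caption (a bank of $M$ SRM-inspired filters, concatenation, then a $1 \times 1$ convolution), but the specifics are assumptions: the initial kernels are 1-D finite-difference stencils chosen for illustration, the "value constraint" is modeled as a zero-sum projection, and the names `INIT_KERNELS`, `constrain_high_pass`, and `rich_feature_extractor` are hypothetical. In a real model the kernels and mixing weights would be trainable parameters.

```python
import numpy as np

# Assumed 1-D analogues of SRM residual kernels: finite differences of
# order 1, 2 and 3. Each sums to zero, so it has zero gain at DC.
INIT_KERNELS = [
    np.array([-1.0, 1.0]),
    np.array([1.0, -2.0, 1.0]),
    np.array([-1.0, 3.0, -3.0, 1.0]),
]

def constrain_high_pass(kernel):
    """Assumed constraint: project the kernel onto the zero-sum set."""
    return kernel - kernel.mean()

def rich_feature_extractor(x, kernels, mix_weights):
    """Map a waveform x of shape (T,) to a noise residual of shape (T,).

    kernels:     list of M 1-D filter kernels.
    mix_weights: (M,) weights of the 1x1 convolution (a per-sample channel mix).
    """
    residuals = np.stack([
        np.convolve(x, constrain_high_pass(k), mode="same") for k in kernels
    ])  # (M, T): stacked outputs of the filter bank
    return mix_weights @ residuals  # weighted sum over the M channels

x = np.ones(64)  # a constant (pure DC) signal
x_noise = rich_feature_extractor(x, INIT_KERNELS, np.full(3, 1.0 / 3.0))
```

Because every constrained kernel sums to zero, a constant input yields a zero residual away from the signal edges, which is the behavior expected of a high-pass path.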

Paper Structure

This paper contains 29 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: SONAR overview. Audio is processed in parallel by the Content Feature Extractor (CFE) and the Noise Feature Extractor (NFE). Their embeddings are fused via cross-attention (CA) and classified as real/fake.
  • Figure 2: Low–high frequency structure reveals spoofing artifacts. (a) Pearson correlation between low- (0–4 kHz) and high-frequency (7–8 kHz) bands shows real speech with strong co-modulation ($r \approx 0.6$), while fakes collapse toward zero or negative values. (b) The energy difference $\Delta E = E_{\text{HF}} - E_{\text{LF}}$ is systematically shifted for fakes across corpora, exposing a consistent HF/LF imbalance. These second-order cues motivate SONAR’s distributional alignment objective.
  • Figure 3: Rich Feature Extractor (RFE). Audio $x$ is processed by a bank of $M$ SRM-inspired filters, concatenated, and passed through a $1 \times 1$ learnable convolution layer to produce the noise residual representation $x_{\text{noise}}$.
  • Figure 4: Latent representation analysis of SONAR. (a) t-SNE shows that SONAR’s dual-path embeddings separate real and fake audio more distinctly than the baseline. (b) Cosine similarity histograms confirm that real speech preserves LF–HF coupling, while fakes exhibit disjoint embeddings.
  • Figure 5: Impact of resampling on detection accuracy. Equal Error Rate (EER) rises as the sampling rate (SR) of the test set is lowered, confirming that the model relies on high-frequency artifacts introduced during deepfake synthesis.
  • ...and 1 more figure
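The two statistics in the Figure 2 caption can be approximated with a short NumPy sketch. The caption gives only the bands (0–4 kHz and 7–8 kHz) and the definition $\Delta E = E_{\text{HF}} - E_{\text{LF}}$. Everything else is an assumption: a 16 kHz sampling rate, frame-wise log band energies from a Hann-windowed STFT, Pearson correlation taken across frames, and $\Delta E$ measured in dB. The function names `band_log_energies` and `lf_hf_statistics` are hypothetical.

```python
import numpy as np

def band_log_energies(x, sr=16000, n_fft=512, hop=256,
                      lf_band=(0.0, 4000.0), hf_band=(7000.0, 8000.0)):
    """Per-frame log energy (dB) of a mono waveform in a low and a high band."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, bins)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)         # bin centers in Hz

    def band_db(lo, hi):
        mask = (freqs >= lo) & (freqs <= hi)
        return 10.0 * np.log10(power[:, mask].sum(axis=1) + 1e-12)

    return band_db(*lf_band), band_db(*hf_band)

def lf_hf_statistics(x, sr=16000):
    """Return (Pearson r between band energies, Delta E = E_HF - E_LF in dB)."""
    e_lf, e_hf = band_log_energies(x, sr=sr)
    r = np.corrcoef(e_lf, e_hf)[0, 1]     # co-modulation across frames
    delta_e = e_hf.mean() - e_lf.mean()   # mean HF/LF energy imbalance
    return r, delta_e

# Toy input: white noise under a slow shared amplitude envelope, so both
# bands rise and fall together.
rng = np.random.default_rng(0)
t = np.arange(2 * 16000) / 16000.0
x = (1.0 + 0.9 * np.sin(2.0 * np.pi * 2.0 * t)) * rng.standard_normal(t.size)
r, delta_e = lf_hf_statistics(x)
```

On this toy signal the shared envelope produces a strongly positive correlation, and the narrower HF band gives a negative $\Delta E$. Real and fake speech would need to be compared on actual recordings to reproduce the trends the caption reports.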