SIGNL: A Label-Efficient Audio Deepfake Detection System via Spectral-Temporal Graph Non-Contrastive Learning

Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna

TL;DR

The paper tackles robust audio deepfake detection when labeled data is scarce. It proposes SIGNL, a dual-view graph framework that builds spectral and temporal graphs from a visual representation of the audio (such as a spectrogram), pre-trains two vision graph convolutional (GC) encoders with a non-contrastive objective on augmented graph pairs, and then fine-tunes them on a small labeled set. SIGNL consistently outperforms both supervised and self-supervised baselines across four benchmarks; with only 5% of the labels it reaches 7.88% EER on ASVspoof 2021 DF and 3.95% EER on ASVspoof 5. It also shows cross-domain generalization, robustness to perturbations, and resilience to ultra-realistic attacks after limited fine-tuning. The work highlights the practical value of leveraging unlabeled audio through spectral-temporal graph representations and non-contrastive learning for deployment in security-critical scenarios.
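To make the dual-view idea concrete, the following is a minimal sketch of how a spectrogram might be turned into a spectral graph and a temporal graph. It is an illustration under stated assumptions, not the authors' pipeline: the pooling of bands and segments into node features, the Euclidean k-nearest-neighbor rule, and the names `knn_graph` and `spectral_temporal_graphs` are all hypothetical choices made here for clarity.

```python
import numpy as np


def knn_graph(nodes: np.ndarray, k: int) -> np.ndarray:
    """Return a (num_nodes, k) array holding each node's k nearest neighbors.

    Assumption: neighbors are chosen by Euclidean distance between node features.
    """
    nodes = np.asarray(nodes, dtype=float)
    # Pairwise squared distances between all node feature vectors.
    diff = nodes[:, None, :] - nodes[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def spectral_temporal_graphs(spec: np.ndarray, n_patches: int, k: int):
    """Build two graph views from a spectrogram of shape (freq_bins, time_frames).

    Assumption: each node is a mean-pooled band (spectral view) or
    a mean-pooled segment (temporal view).
    """
    # Spectral view: split the frequency axis into bands; one node per band.
    freq_nodes = np.stack(
        [band.mean(axis=0) for band in np.array_split(spec, n_patches, axis=0)]
    )
    # Temporal view: split the time axis into segments; one node per segment.
    time_nodes = np.stack(
        [seg.mean(axis=1) for seg in np.array_split(spec, n_patches, axis=1)]
    )
    return (freq_nodes, knn_graph(freq_nodes, k)), (time_nodes, knn_graph(time_nodes, k))


# Usage with a random stand-in for a real spectrogram.
rng = np.random.default_rng(0)
spec = rng.random((64, 100))
(freq_nodes, freq_edges), (time_nodes, time_edges) = spectral_temporal_graphs(
    spec, n_patches=8, k=3
)
```

Here `n_patches` and `k` play the role of the number of patches and the number of neighbors per node; each view yields a node-feature matrix and a neighbor-index table that a graph encoder could consume.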

Abstract

Audio deepfake detection is increasingly important as synthetic speech becomes more realistic and accessible. Recent methods, including those using graph neural networks (GNNs) to model frequency and temporal dependencies, show strong potential but need large amounts of labeled data, which limits their practical use. Label-efficient alternatives like graph-based non-contrastive learning offer a potential solution, as they can learn useful representations from unlabeled data without using negative samples. However, current graph non-contrastive approaches are built for single-view graph representations and cannot be directly used for audio, which has unique spectral and temporal structures. Bridging this gap requires dual-view graph modeling suited to audio signals. In this work, we introduce SIGNL (Spectral-temporal vIsion Graph Non-contrastive Learning), a label-efficient expert system for detecting audio deepfakes. SIGNL operates on the visual representation of audio, such as spectrograms or other time-frequency encodings, transforming them into spectral and temporal graphs for structured feature extraction. It then employs graph convolutional encoders to learn complementary frequency-time features, effectively capturing the unique characteristics of audio. These encoders are pre-trained using a non-contrastive self-supervised learning strategy on augmented graph pairs, enabling effective representation learning without labeled data. The resulting encoders are then fine-tuned on minimal labeled data for downstream deepfake detection. SIGNL achieves strong performance on multiple audio deepfake detection benchmarks, including 7.88% EER on ASVspoof 2021 DF and 3.95% EER on ASVspoof 5 using only 5% labeled data. It also generalizes well to unseen conditions, reaching 10.16% EER on the In-The-Wild dataset when trained on CFAD.
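The abstract does not spell out the exact non-contrastive objective, so the sketch below only illustrates the general idea of learning from augmented pairs without negative samples. It assumes a cosine-alignment loss of the kind used in BYOL-style methods; the function name `alignment_loss` is hypothetical, and the mechanisms such methods rely on to avoid representation collapse (for example stop-gradient or a momentum target) are omitted because plain NumPy has no gradients.

```python
import numpy as np


def alignment_loss(view_a: np.ndarray, view_b: np.ndarray) -> float:
    """Mean of 2 - 2*cos(a, b) over a batch of paired embeddings.

    Assumption: a BYOL-style alignment term, not necessarily the paper's loss.
    The loss is 0 when paired embeddings point the same way and 4 when they
    are opposite. No negative pairs are involved.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    cosine = np.sum(a * b, axis=1)
    return float(np.mean(2.0 - 2.0 * cosine))


# Usage: embeddings of two augmented views of the same four graphs.
rng = np.random.default_rng(0)
emb_view_1 = rng.normal(size=(4, 16))
emb_view_2 = emb_view_1 + 0.1 * rng.normal(size=(4, 16))  # lightly perturbed view
loss = alignment_loss(emb_view_1, emb_view_2)
```

Minimizing such a term pulls the two views of each sample together, which is why it needs only positive pairs.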

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 11 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of the SIGNL framework. Stage 1 - Spectral-temporal graph generation: Audio data is converted into spectral and temporal graphs to capture complementary frequency-time structure. Stage 2 - Non-contrastive graph pre-training: Graph encoders are trained to maximize similarity between augmented graph pairs without labels. Stage 3 - Downstream training: Pre-trained encoders are fine-tuned on a small labeled set for audio deepfake detection.
  • Figure 2: Vision GC Encoder.
  • Figure 3: Similarity of the pair embeddings before and after the projection head $g$.
  • Figure 4: EER (%) comparison for SIGNL across different combinations of the number of patches ($N$) and the number of nodes' neighbors ($K$) in the full-label scenario. The line plots represent the mean EER values, while the shaded areas indicate the standard deviation. Lower values: better performance ($\downarrow$).
  • Figure 5: EER (%) comparison for SIGNL across different visual representations of audio in the full-label scenario. The bar charts represent the mean, while the error bars indicate the standard deviation. Lower values: better performance ($\downarrow$).
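Since the figures and headline results are reported as equal error rate (EER), a short sketch of how EER can be computed from detector scores may help. This is a simple threshold-sweep approximation, not the official scoring tool of any benchmark; the function name `equal_error_rate` and the convention that label 1 is the positive class with higher scores are assumptions made for this example.

```python
import numpy as np


def equal_error_rate(scores, labels) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumption: label 1 is the positive class and higher scores mean
    "more likely positive". Returns a fraction in [0, 1].
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    thresholds = np.unique(scores)  # sorted unique candidate thresholds
    # False acceptance: negatives scored at or above the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])
    # False rejection: positives scored below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Usage: multiply by 100 to express the result as a percentage.
eer_percent = 100.0 * equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

Lower is better: an EER of 0% means some threshold separates the two classes perfectly.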