Physics-Guided Deepfake Detection for Voice Authentication Systems

Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari

TL;DR

The paper tackles two threats to voice authentication at the network edge, deepfake audio and control-plane poisoning in federated learning, by proposing a physics-guided, uncertainty-aware detection framework. It combines interpretable physics-based speech features with self-supervised learning (SSL) embeddings through orthogonal fusion, and employs a hybrid detector (ViT, GNN, LightGBM) whose outputs are calibrated via Bayesian uncertainty estimation for edge deployment. On the ASVspoof 2019/2021 LA and PA benchmarks, the approach reports strong detection performance with interpretable uncertainty signals and edge latency practical for real-world screening. This integrated, trust-aware architecture advances secure, distributed voice authentication in adversarial, privacy-preserving environments.

Abstract

Voice authentication systems deployed at the network edge face dual threats: a) sophisticated deepfake synthesis attacks and b) control-plane poisoning in distributed federated learning protocols. We present a framework coupling physics-guided deepfake detection with uncertainty-aware edge learning. The framework fuses interpretable physics features modeling vocal tract dynamics with representations from a self-supervised learning module. The representations are then processed by a multi-modal ensemble architecture, followed by a Bayesian ensemble that provides uncertainty estimates. Incorporating physics-based characterization and uncertainty estimates of audio samples allows the proposed framework to remain robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.

Paper Structure

This paper contains 9 sections, 8 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: System architecture overview showing the five-module pipeline for physics-guided deepfake detection. Raw audio (16 kHz, 3-second segments) flows through: (1) Physics Feature Extraction, (2) SSL Backbone (frozen WavLM-Large), (3) Orthogonal Feature Fusion, (4) Hybrid Detection Backbone (ViT + GNN + Gradient Boosting), and (5) Bayesian Uncertainty Quantification, producing calibrated genuine/deepfake classifications of audio samples with uncertainty estimates.
  • Figure 2: Empirical cumulative distribution functions (ECDFs) for the temporal-frequency variation feature on ASVspoof 2019 LA. The left shift for deepfake-generated audio is pronounced, resulting in a Kolmogorov-Smirnov (KS) distance $D \approx 0.296$ and a univariate ROC-AUC of 0.697. This indicates that the feature is discriminative on its own, independent of the neural backbone.
  • Figure 3: ECDFs for the embedding mean velocity magnitude feature on ASVspoof 2019 PA. The deepfake distribution (red) is visibly shifted left relative to the genuine distribution (blue), yielding a KS distance $D \approx 0.292$. This demonstrates the feature's discriminative power.
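
The two statistics quoted in the captions for Figures 2 and 3, the KS distance between ECDFs and the univariate ROC-AUC, can be computed for any single scalar feature. The following is a minimal sketch of that computation, not the authors' code. The function names (`ecdf`, `ks_distance`, `univariate_auc`) are hypothetical, and the data are synthetic Gaussian stand-ins chosen only to mimic a left-shifted deepfake distribution, not values from ASVspoof.

```python
import numpy as np

def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the sample values that are <= each grid point
    return np.searchsorted(sample, grid, side="right") / sample.size

def ks_distance(a, b):
    """Two-sample KS distance: max over x of |F_a(x) - F_b(x)|."""
    # ECDFs are step functions, so the maximum gap occurs at a sample point
    grid = np.concatenate([a, b])
    return float(np.max(np.abs(ecdf(a, grid) - ecdf(b, grid))))

def univariate_auc(pos, neg):
    """ROC-AUC of one feature: P(pos > neg) + 0.5 * P(pos == neg)."""
    pos = np.asarray(pos, dtype=float)
    neg_sorted = np.sort(np.asarray(neg, dtype=float))
    less = np.searchsorted(neg_sorted, pos, side="left")         # neg < pos_i
    less_equal = np.searchsorted(neg_sorted, pos, side="right")  # neg <= pos_i
    ties = less_equal - less
    return float((less.sum() + 0.5 * ties.sum()) / (pos.size * neg_sorted.size))

# Synthetic stand-in feature values (an assumption, NOT data from the paper):
# the deepfake distribution is shifted left relative to the genuine one.
rng = np.random.default_rng(0)
genuine = rng.normal(loc=0.0, scale=1.0, size=2000)
deepfake = rng.normal(loc=-0.7, scale=1.0, size=2000)

d = ks_distance(genuine, deepfake)
auc = univariate_auc(genuine, deepfake)  # genuine treated as the positive class
print(f"KS distance D = {d:.3f}, univariate ROC-AUC = {auc:.3f}")
```

Here genuine speech is treated as the positive class, so a left-shifted deepfake distribution gives an AUC above 0.5. Both statistics depend only on the ordering of feature values, which is why they can characterize a feature without reference to any downstream classifier.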