Physics-Guided Deepfake Detection for Voice Authentication Systems
Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari
TL;DR
The paper tackles two threats to network-edge voice authentication, deepfake audio and control-plane poisoning in federated learning, by proposing a physics-guided, uncertainty-aware detection framework. It combines interpretable physics-based speech features with self-supervised learning (SSL) embeddings through orthogonal fusion, then passes the fused representation to a hybrid detector (ViT, GNN, LightGBM) whose outputs are calibrated with Bayesian uncertainty estimates for robust edge deployment. On the ASVspoof 2019/2021 LA and PA benchmarks, the approach achieves strong detection performance with interpretable uncertainty signals and demonstrates practical edge latency suitable for real-world screening. This integrated, trust-aware architecture advances secure, distributed voice authentication in adversarial, privacy-preserving environments.
Abstract
Voice authentication systems deployed at the network edge face dual threats: (a) sophisticated deepfake synthesis attacks and (b) control-plane poisoning in distributed federated learning protocols. We present a framework that couples physics-guided deepfake detection with uncertainty-aware edge learning. The framework fuses interpretable physics features that model vocal tract dynamics with representations from a self-supervised learning module. The fused representations are processed by a Multi-Modal Ensemble Architecture, followed by a Bayesian ensemble that provides uncertainty estimates. By combining physics-based evaluation of speech characteristics with per-sample uncertainty estimates, the proposed framework remains robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.
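To illustrate what an ensemble-based uncertainty estimate can look like, the following minimal sketch turns per-member spoof probabilities into a mean score plus an uncertainty decomposition. It is an assumption-laden illustration, not the authors' estimator: the abstract does not specify how the Bayesian ensemble computes uncertainty, so the entropy-based split used here (total, aleatoric, epistemic) is a common generic choice, and the names `binary_entropy` and `ensemble_uncertainty` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def ensemble_uncertainty(member_probs):
    """Summarise spoof probabilities produced by ensemble members.

    Assumption: each member outputs P(spoof) for the same audio sample.
    Returns the mean probability and an entropy-based decomposition:
      total     = entropy of the averaged prediction
      aleatoric = average entropy of the individual members
      epistemic = total - aleatoric (disagreement between members)
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    n = len(member_probs)
    mean_p = sum(member_probs) / n
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / n
    # Clamp to zero to absorb floating-point round-off.
    epistemic = max(total - aleatoric, 0.0)
    return {
        "mean_prob": mean_p,
        "total": total,
        "aleatoric": aleatoric,
        "epistemic": epistemic,
    }


# Members that agree yield near-zero epistemic uncertainty;
# members that disagree yield a clearly positive value.
agree = ensemble_uncertainty([0.9, 0.9, 0.9])
disagree = ensemble_uncertainty([0.1, 0.9, 0.5])
```

In a screening setting, a high epistemic value could be used to flag a sample for secondary review rather than accepting or rejecting it outright; any such threshold would have to be tuned on held-out data.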
