Parallel Stacked Aggregated Network for Voice Authentication in IoT-Enabled Smart Devices

Awais Khan; Ijaz Ul Haq; Khalid Mahmood Malik

TL;DR

This work addresses the vulnerability of IoT voice authentication to multiple spoofing attacks by introducing PSA-Net, a lightweight, raw-audio, unified anti-spoofing framework. It leverages a ResNeXt-inspired, split-transform-merge architecture with SE blocks to extract robust embeddings directly from audio, improving generalization against replay and voice-cloning attacks while remaining feasible for edge devices. Extensive experiments on ASVspoof2019/2021, PartialSpoof, and VSDC show PSA-Net achieves leading or competitive EER and t-DCF scores, with strong performance even under unified training conditions and unseen attacks. The results demonstrate practical impact for secure, low-footprint voice authentication in IoT environments, including favorable inference times and model sizes for edge deployment.
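The equal error rate (EER) cited above is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of how it can be estimated from detector scores. The function name, the convention that a higher score means "more likely bona fide", and the toy scores are assumptions for illustration; they are not taken from the paper or from its evaluation tooling.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER and its threshold by sweeping all observed scores.

    Assumption: a higher score means the detector believes the audio is bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): one bona fide and one spoofed sample overlap.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(bonafide, spoofed)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

A lower EER indicates better separation between bona fide and spoofed audio. Evaluation toolkits typically interpolate between thresholds; this sketch only picks the observed threshold at which the two error rates are closest.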

Abstract

Voice authentication on IoT-enabled smart devices has gained prominence in recent years due to increasing concerns over user privacy and security. Current authentication systems are vulnerable to voice-spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive authentication systems and enable fraudulent activities such as impersonation, unauthorized access, and financial fraud. Existing solutions are often designed to tackle a single type of attack, which compromises their performance against unseen attacks. Existing unified voice anti-spoofing solutions, on the other hand, are not designed specifically for IoT; their complex architectures prevent deployment on IoT-enabled smart devices. Additionally, most of these unified solutions exhibit significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA-Net), a lightweight framework designed as an anti-spoofing defense system for voice-controlled smart IoT devices. PSA-Net processes raw audio directly, eliminating the need for dataset-dependent handcrafted features or pre-computed spectrograms. Furthermore, PSA-Net employs a split-transform-aggregate approach: it segments utterances, extracts intrinsic differentiable embeddings through convolutions, and aggregates them to distinguish legitimate from spoofed audio. In contrast to existing deep ResNet-oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances PSA-Net's ability to generalize across diverse attacks. The results show that PSA-Net achieves more consistent performance across different attacks than current anti-spoofing solutions.
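To make the role of cardinality concrete, the sketch below implements a toy split-transform-aggregate block in plain Python. It follows the general ResNeXt formulation, in which the input is sent through several parallel paths whose outputs are summed and added back to the input through a shortcut. The function names, kernel values, and single-channel layout are assumptions for illustration only; this is not the PSA-Net layer configuration, which the abstract does not specify.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding so output length equals input length."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def relu(values):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in values]


def aggregated_block(signal, kernels):
    """Toy aggregated block; cardinality is the number of parallel paths.

    Assumption: each path is a single convolution followed by ReLU.
    """
    # Split and transform: every path sees the same input with its own kernel.
    paths = [relu(conv1d_same(signal, kernel)) for kernel in kernels]
    # Aggregate: sum the path outputs element-wise.
    merged = [sum(vals) for vals in zip(*paths)]
    # Residual shortcut: add the input back to the aggregated output.
    return [x + m for x, m in zip(signal, merged)]


# Hypothetical raw-audio samples and two paths (cardinality = 2).
samples = [1.0, 2.0, 3.0, 4.0]
kernels = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
print(aggregated_block(samples, kernels))
```

Increasing the number of kernels raises the cardinality without making the block deeper or each path wider, which is the extra design dimension the abstract refers to.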
