Parallel Stacked Aggregated Network for Voice Authentication in IoT-Enabled Smart Devices

Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik

TL;DR

This work addresses the vulnerability of IoT voice authentication to multiple spoofing attacks by introducing PSA-Net, a lightweight, raw-audio, unified anti-spoofing framework. It leverages a ResNeXt-inspired, split-transform-merge architecture with SE blocks to extract robust embeddings directly from audio, improving generalization against replay and voice-cloning attacks while remaining feasible for edge devices. Extensive experiments on ASVspoof2019/2021, PartialSpoof, and VSDC show PSA-Net achieves leading or competitive EER and t-DCF scores, with strong performance even under unified training conditions and unseen attacks. The results demonstrate practical impact for secure, low-footprint voice authentication in IoT environments, including favorable inference times and model sizes for edge deployment.

Abstract

Voice authentication on IoT-enabled smart devices has gained prominence in recent years due to increasing concerns over user privacy and security. Current authentication systems are vulnerable to different voice-spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive authentication systems and enable fraudulent activities (e.g., impersonation, unauthorized access, and financial fraud). Existing solutions are often designed to tackle a single type of attack, leading to compromised performance against unseen attacks. On the other hand, existing unified voice anti-spoofing solutions, not designed specifically for IoT, possess complex architectures and thus cannot be deployed on IoT-enabled smart devices. Additionally, most of these unified solutions exhibit significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA-Net), a lightweight framework designed as an anti-spoofing defense system for voice-controlled smart IoT devices. PSA-Net processes raw audio directly and eliminates the need for dataset-dependent handcrafted features or pre-computed spectrograms. Furthermore, PSA-Net employs a split-transform-aggregate approach, which involves the segmentation of utterances, the extraction of intrinsic differentiable embeddings through convolutions, and their aggregation to distinguish legitimate from spoofed audio. In contrast to existing deep ResNet-oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances PSA-Net's ability to generalize across diverse attacks. The results show that PSA-Net achieves more consistent performance across different attacks than current anti-spoofing solutions.
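
The split-transform-aggregate idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a ResNeXt-style reading in which the input is sent through several parallel convolution branches (the cardinality), each branch is transformed independently, and the branch outputs are summed and added back to the input through a residual shortcut. The function names (`conv1d_same`, `aggregated_block`), the kernel size of 3, the random weights, and the single-channel input are all hypothetical choices made for illustration.

```python
import random


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding,
    so the output has the same length as the input."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def aggregated_block(signal, branch_kernels):
    """Assumed form of one aggregated block: y = x + sum_i relu(conv(x, w_i)).

    len(branch_kernels) plays the role of the cardinality.
    """
    # Split + transform: every branch sees the same input, with its own kernel.
    branches = [relu(conv1d_same(signal, k)) for k in branch_kernels]
    # Aggregate: sum the branch outputs position by position.
    merged = [sum(vals) for vals in zip(*branches)]
    # Residual shortcut: add the block input back.
    return [x + m for x, m in zip(signal, merged)]


random.seed(0)
cardinality = 4  # hypothetical value, matching the 4 cardinalities in Figure 3
kernels = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(cardinality)]
waveform = [random.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for raw audio samples

output = aggregated_block(waveform, kernels)
print(len(output))  # 16: the block preserves the input length
```

In a trained network the kernels would be learned and each branch would hold several channels and layers; the sketch only shows how cardinality adds parallel paths without changing the output shape.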

Paper Structure

This paper contains 31 sections, 12 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: The workflow of voice-spoofing attacks and the corresponding defense mechanisms safeguarding IoT-enabled smart devices. (a) Replay Attack Detection: A standard system trained on recorded physical access attacks (replays) effectively detects attempted replayed commands. (b) Voice Cloning Detection: A dedicated system trained on text-to-speech (TTS) and voice conversion data to identify spoofing from logical access. (c) Unified Detection: A single system trained to detect all types of spoofing attempts. This system ensures that only verified commands reach the IoT device and that commands are executed only within authorized access.
  • Figure 2: The internal architecture of the proposed parallel-stacked aggregated network (PSA-Net). Raw audio passes through three blocks of convolutional layers before it is split into various cardinalities. Residual connections are used between the layers, followed by global max pooling. The architectural design includes three Conv1D and five SE-PSA blocks before reaching the fully connected layer for real vs. spoof classification.
  • Figure 3: Intra-architecture of SE-PSA blocks with 4 cardinalities and pre-activation convolutions. The same intra-architecture is repeated 5 times for each block of the proposed PSA network.
  • Figure 4: The internal architecture for addressing the vanishing gradient via spatial dropout. (a) A standard DNN, with processing and activation of all neurons, without any selection or drop. (b) A standard neural network with spatial dropout, which results in the selection of required neurons with more relevant embeddings as mentioned in lee2020revisiting.
  • Figure 5: Aggregated feature map extraction with the Squeeze and Excitation block. The SE block includes spatial dropout applied before every global average layer of each SE-PSA block.
  • ...and 2 more figures
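
The Squeeze and Excitation step named in the Figure 5 caption can be sketched as follows. This is a generic SE recalibration under stated assumptions, not the paper's code: the squeeze is taken to be global average pooling per channel, the excitation is assumed to be two small fully connected layers (ReLU, then sigmoid), and the spatial dropout mentioned in the caption is left out, since dropout is an identity operation at inference time. The function names, layer sizes, and weights are hypothetical.

```python
import math


def squeeze(feature_map):
    """Global average pooling: reduce each channel to a single scalar."""
    return [sum(channel) / len(channel) for channel in feature_map]


def dense(vector, weights, biases):
    """Fully connected layer; weights[o][i] connects input i to output o."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Rescale every channel by a gate in (0, 1) computed from all channels."""
    z = squeeze(feature_map)                              # squeeze
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]      # bottleneck + ReLU
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]   # one gate per channel
    return [[g * v for v in channel] for g, channel in zip(gates, feature_map)]


# Two channels with two time steps each; a bottleneck of one hidden unit.
features = [[1.0, 3.0], [2.0, 2.0]]
w1, b1 = [[0.5, -0.5]], [0.0]        # hypothetical weights: 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]  # hypothetical weights: 1 hidden unit -> 2 gates

recalibrated = squeeze_excite(features, w1, b1, w2, b2)
print(len(recalibrated), len(recalibrated[0]))  # 2 2: the shape is unchanged
```

The gates depend on the pooled statistics of all channels, so informative channels can be emphasized and weak ones suppressed while the feature map keeps its shape.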