HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System

Zhisheng Zhang, Pengyang Huang

TL;DR

HiddenSpeaker embeds imperceptible perturbations in training speech samples, rendering them unlearnable for deep-learning-based speaker verification systems that rely on large-scale speaker data for efficient training.

Abstract

In recent years, remarkable advances in deep neural networks have brought tremendous convenience. However, training a highly effective model requires a large quantity of samples, which creates serious potential threats such as unauthorized exploitation and privacy leakage. In response, we propose a framework named HiddenSpeaker, which embeds imperceptible perturbations in training speech samples and renders them unlearnable for deep-learning-based speaker verification systems that rely on large-scale speaker data for efficient training. HiddenSpeaker uses a simplified error-minimizing method named Single-Level Error-Minimizing (SLEM) to generate specific and effective perturbations. Additionally, a hybrid objective function is employed for human perceptual optimization, ensuring that the perturbation is imperceptible to human listeners. We conduct extensive experiments on multiple state-of-the-art (SOTA) models in the speaker verification domain to evaluate HiddenSpeaker. Our results demonstrate that HiddenSpeaker not only deceives the model with unlearnable samples but also enhances the imperceptibility of the perturbations, and that it transfers well across different models.
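To make the idea of error-minimizing noise concrete, the following is a minimal sketch of the generic technique, not the paper's SLEM procedure, whose details are not given here. It assumes a fixed toy logistic-regression model in place of a speaker verification network, and the names `error_minimizing_noise`, `eps`, `step`, and `iters` are illustrative assumptions. The perturbation is updated by signed gradient steps that lower the training loss and is clipped to a small amplitude bound.

```python
import math


def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a linear model on one sample; returns (loss, p)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, p


def sign(v):
    return (v > 0) - (v < 0)


def error_minimizing_noise(w, b, x, y, eps=0.05, step=0.01, iters=20):
    """Find delta with max |delta_i| <= eps that LOWERS the loss on (x, y).

    The model (w, b) stays fixed; only the perturbation is optimized.
    This is an assumption made for the sketch, not a statement about SLEM.
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_pert = [xi + di for xi, di in zip(x, delta)]
        _, p = logistic_loss(w, b, x_pert, y)
        # Gradient of the loss with respect to the input is (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        # Step against the gradient (loss goes down), then clip to the bound.
        delta = [max(-eps, min(eps, di - step * sign(gi)))
                 for di, gi in zip(delta, grad)]
    return delta


# Toy stand-ins for a trained model and one training sample (hypothetical values).
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.1, 0.4, -0.2], 1

delta = error_minimizing_noise(w, b, x, y)
loss_clean, _ = logistic_loss(w, b, x, y)
loss_protected, _ = logistic_loss(w, b, [xi + di for xi, di in zip(x, delta)], y)
```

Because the perturbed sample already yields a lower loss, a model trained on it has less incentive to learn the underlying signal, which is the intuition behind unlearnable examples.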

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: When users upload unprotected audio files to the internet, those files become accessible for training purposes. The Single-Level Error-Minimizing method used by the HiddenSpeaker system injects noise into raw audio to render these audio datasets unlearnable, thereby disrupting model training.
  • Figure 2: The HiddenSpeaker workflow operates in two phases. First, SLEM noise is embedded into the audio that needs protection. Then a PHL function optimizes this noise, taking both STFT and STOI into account, so that the perturbation remains inaudible.
  • Figure 3: EER and minDCF values over epochs when different types of noise are added to the complete VoxCeleb1 dataset for ECAPA-TDNN model training, with clean VoxCeleb1 samples (no added noise) as the control group.
  • Figure 4: Visual comparison of the original audio waveform and the HiddenSpeaker-protected waveform. There is no clear visible difference between the two waveforms.
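
Figure 3 reports EER and minDCF, the two standard speaker verification metrics. The sketch below shows one common way to compute them from trial scores, assuming a higher score means "same speaker". The function names, the toy scores, and the cost parameters (`p_target=0.05`, `c_miss=c_fa=1.0`) are illustrative assumptions; the settings used in the paper are not stated here.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def candidate_thresholds(target_scores, nontarget_scores):
    # Every observed score, plus +inf so "reject everything" is also considered.
    return sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    best_gap, best_eer = None, None
    for t in candidate_thresholds(target_scores, nontarget_scores):
        far, frr = error_rates(target_scores, nontarget_scores, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def min_dcf(target_scores, nontarget_scores,
            p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all candidate thresholds."""
    # Normalize by the cost of the best trivial system (accept all or reject all).
    norm = min(c_miss * p_target, c_fa * (1 - p_target))
    costs = []
    for t in candidate_thresholds(target_scores, nontarget_scores):
        far, frr = error_rates(target_scores, nontarget_scores, t)
        cost = c_miss * p_target * frr + c_fa * (1 - p_target) * far
        costs.append(cost / norm)
    return min(costs)


# Toy trial scores (hypothetical): same-speaker and different-speaker pairs.
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]

eer = equal_error_rate(target, nontarget)
dcf = min_dcf(target, nontarget)
```

For a model trained on protected data, both metrics would be expected to stay high across epochs compared with the clean control group, since the model fails to learn discriminative speaker features.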