Adversarial Representation Learning for Robust Privacy Preservation in Audio

Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen

TL;DR

The paper addresses the privacy risk in cloud-based audio analytics by learning latent audio representations that suppress evidence of speech presence while preserving targeted sound event detection (SED). It introduces Robust Discriminative Adversarial Learning (RDAL), a framework that couples a feature extractor $F$ and a sound-event classifier $C$ with an adversarial speech classifier $D$, and periodically replaces the weights of $D$ with those of $D^\tau$, a speech classifier trained in a supervised manner, to prevent open-set leakage. An optional masking U-Net (RDAL+M) further isolates speech components before feature extraction. Empirical results show that RDAL substantially reduces an attacker’s ability to identify speech and gender from latent features, with RDAL+M achieving near-random privacy levels while maintaining SED performance. The work highlights a practical path to privacy-preserving audio processing and proposes a robust training protocol to counter classifier drift and leakage across unseen speech attributes.

Abstract

Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings. The proposed method trains a model to generate invariant latent representations of speech-containing audio recordings that cannot be distinguished from non-speech recordings by a speech classifier. The novelty of our work is in the optimization algorithm, where the speech classifier's weights are regularly replaced with the weights of classifiers trained in a supervised manner. This increases the discrimination power of the speech classifier constantly during the adversarial training, motivating the model to generate latent representations in which speech is not distinguishable, even using new speech classifiers trained outside the adversarial training loop. The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method, demonstrating a significant reduction in privacy violations compared to the baseline approach. Additionally, we show that the prior adversarial method is practically ineffective for this purpose.
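
The weight-replacement idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear feature extractor, the logistic classifiers, the gradient-reversal style objective, the toy data, and all names and hyperparameters (`train_rdal_sketch`, `fit_logistic`, `lam`, `tau`, `lr`) are assumptions chosen to keep the example small. Only the overall pattern follows the abstract: $F$ and $C$ are trained against a speech classifier $D$, and every `tau` epochs the weights of $D$ are replaced by those of a classifier trained in a supervised manner on the current latent features.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(z, y, steps=200, lr=0.5):
    """Train a fresh logistic classifier on fixed features z (supervised)."""
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        g = (sigmoid(z @ w + b) - y) / len(y)  # gradient of BCE w.r.t. logits
        w -= lr * (z.T @ g)
        b -= lr * g.sum()
    return w, b

def train_rdal_sketch(X, y_event, y_speech, latent_dim=4,
                      epochs=20, tau=5, lam=1.0, lr=0.1, seed=0):
    """Toy adversarial loop with periodic replacement of D (all names hypothetical)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], latent_dim))  # linear stand-in for F
    wc, bc = np.zeros(latent_dim), 0.0  # sound-event classifier C
    wd, bd = np.zeros(latent_dim), 0.0  # adversarial speech classifier D
    n, replacements = len(X), 0
    for epoch in range(epochs):
        z = X @ W
        gc = (sigmoid(z @ wc + bc) - y_event) / n   # dL_C / dlogits
        gd = (sigmoid(z @ wd + bd) - y_speech) / n  # dL_D / dlogits
        # F descends the event loss and ascends the speech loss (assumed objective).
        grad_W = X.T @ (np.outer(gc, wc) - lam * np.outer(gd, wd))
        # C and D each descend their own loss.
        grad_wc, grad_bc = z.T @ gc, gc.sum()
        grad_wd, grad_bd = z.T @ gd, gd.sum()
        W -= lr * grad_W
        wc -= lr * grad_wc
        bc -= lr * grad_bc
        wd -= lr * grad_wd
        bd -= lr * grad_bd
        if (epoch + 1) % tau == 0:
            # Replace D with a classifier trained from scratch on the current latents.
            wd, bd = fit_logistic(X @ W, y_speech)
            replacements += 1
    return W, (wc, bc), (wd, bd), replacements

# Toy data: feature 0 drives the "sound event" label, feature 1 the "speech" label.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y_event = (X[:, 0] > 0).astype(float)
y_speech = (X[:, 1] > 0).astype(float)
W, C_params, D_params, n_swaps = train_rdal_sketch(X, y_event, y_speech)
print("D was replaced", n_swaps, "times")
```

The design point this sketch tries to capture is that $D$ is never allowed to lag behind the feature extractor: each replacement restarts the adversary from a classifier fitted to the latest latent features, which is the role the abstract assigns to the supervised classifiers.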

Paper Structure (14 sections, 5 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of the problem setup where speech privacy is compromised during the transmission of acoustic features to a cloud platform.
  • Figure 2: Schematic diagram of the proposed method. $F$, $C$, $D$, and $D^\tau$ are neural networks and $\mathcal{L}$ denotes the different loss terms employed in our method. The solid lines illustrate the regular forward pass. The dashed line is activated after $\tau$ epochs. Finally, the dotted lines represent the backpropagation of each specific error w.r.t. the associated parameters.
  • Figure 3: ROC curves for each method, showing the privacy preservation results on the SAD task reported in the results table.
  • Figure 4: Comparison of latent features obtained by RDAL's $F$ (right) and supervised training of $F$ for sound events and speech (left). Sound events are color-coded with 12 different colors, while speech and non-speech samples are marked with "o" and "x" respectively.
  • Figure 5: Gaussian kernel density estimates of the attacker model's predicted probabilities on the test data, using the latent features of the baseline (left), RDAL (middle), and RDAL+M (right) methods.
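
Density curves of the kind described for Figure 5 can be produced with a Gaussian kernel density estimate. The following is a minimal sketch using only the standard library; the `scores` values are made-up placeholders rather than results from the paper, and the bandwidth of 0.1 is an arbitrary assumption.

```python
import math

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]

# Hypothetical attacker output probabilities (placeholders, not data from the paper).
scores = [0.05, 0.10, 0.12, 0.48, 0.52, 0.90, 0.93]
step = 0.01
grid = [i * step for i in range(-100, 201)]  # evaluate on [-1, 2] to cover the tails
density = gaussian_kde(scores, grid, bandwidth=0.1)

# Riemann-sum check that the estimate integrates to roughly one.
area = sum(density) * step
print(round(area, 3))
```

When an attacker cannot separate speech from non-speech, its predicted probabilities would be expected to bunch together rather than split into two well-separated modes, which is what such a density plot makes visible.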