Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning

Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen

TL;DR

This work tackles privacy leakage in passive acoustic monitoring by preventing the latent representations from encoding speech presence. It introduces RDAL-M, a framework that fuses a source separation network with robust discriminative adversarial learning, using a gradient reversal layer and a periodically updated auxiliary discriminator to maximize privacy while preserving a sound-event detection task. The method is evaluated on mixtures of FSD50K sound events with LibriSpeech speech, using ablations against baselines and showing that a fixed masking strategy can reach near-random privacy levels without hurting utility, while a learnable mask can further improve privacy at a modest cost to performance. The results indicate that jointly leveraging source separation and adversarial learning yields stronger privacy preservation than either technique alone, with potential for practical deployment in privacy-sensitive audio sensing systems.

Abstract

Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose integrating two commonly used approaches to privacy preservation: source separation and adversarial representation learning. The proposed system learns a latent representation of audio recordings that prevents speech recordings from being distinguished from non-speech recordings. First, the source separation network filters out some of the privacy-sensitive data; then, during adversarial learning, the system learns a privacy-preserving representation of the filtered signal. We demonstrate the effectiveness of the proposed method by comparing it against systems without source separation, without adversarial learning, and without both. Overall, our results suggest that the proposed system significantly improves speech privacy preservation compared with using source separation or adversarial learning alone, while maintaining good performance in the acoustic monitoring task.
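
The gradient reversal layer mentioned in the TL;DR can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical scalar feature extractor `F(x) = w * x` and a scalar speech discriminator `D(f) = sigmoid(v * f)` trained with binary cross-entropy, and it covers only the adversarial branch (the sound event classifier loss is left out). The names `grl_step`, `bce`, `lam`, and `lr` are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(w, v, x, is_speech):
    """Binary cross-entropy of the toy discriminator on one example."""
    p = sigmoid(v * w * x)
    return -(is_speech * math.log(p) + (1 - is_speech) * math.log(1 - p))


def grl_step(w, v, x, is_speech, lam=1.0, lr=0.1):
    """One SGD step on the adversarial branch with gradient reversal.

    w: weight of the toy feature extractor, F(x) = w * x (assumption)
    v: weight of the toy discriminator, D(f) = sigmoid(v * f) (assumption)
    lam: scale applied to the reversed gradient (assumption)
    """
    f = w * x                      # latent feature
    p = sigmoid(v * f)             # predicted probability of speech
    dlogit = p - is_speech         # d(BCE) / d(logit)
    grad_v = dlogit * f            # discriminator gradient, left unchanged
    grad_f = dlogit * v            # gradient arriving at the feature
    grad_w = (-lam * grad_f) * x   # reversal: sign flipped before reaching F
    return w - lr * grad_w, v - lr * grad_v
```

In this toy setting the discriminator weight moves to reduce its own loss, while the feature extractor weight moves in the opposite direction, so the feature becomes less informative about speech presence. Setting `lam=0.0` switches the adversarial signal to the feature extractor off.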

Paper Structure (10 sections, 6 equations, 2 figures, 2 tables)

This paper contains 10 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Schematic diagram of RDAL-M. $M$ is the source separation network, $F$ is the feature extractor, $C$ is the sound event classifier, $D$ is the speech discriminator on the adversarial branch, and $D'$ is a second speech discriminator that is activated only once every $P$ epochs. $\mathcal{L}$ denotes the different losses. Solid lines show the forward pass. The dashed line shows the forward pass to $D'$, which occurs only once every $P$ epochs. Dotted lines show backpropagation from the losses to the corresponding weights. The dotted arrows with empty heads from $F$ to $M$ represent the backward pass in the learnable mask approach. In the fixed mask approach, there is no backpropagation from $F$ to $M$, and the parameters of the pre-trained $M$ are kept fixed while RDAL-M is trained.
  • Figure 2: ROC curves of the methods compared in the results table. FL stands for feature learning.
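
The training schedule described in the Figure 1 caption can be summarised as a small helper. This is a minimal sketch under stated assumptions, not the authors' code: the function name `epoch_plan` and its keys are hypothetical, epochs are assumed to be counted from 1, and "every $P$ epochs" is read as "whenever the epoch number is a multiple of $P$".

```python
def epoch_plan(epoch, period, learnable_mask):
    """Return which optional parts of the pipeline are active in an epoch.

    epoch: 1-based epoch index (assumption)
    period: the interval P from the Figure 1 caption
    learnable_mask: True for the learnable mask approach, False for fixed
    """
    return {
        # In the fixed mask approach the pre-trained M is frozen, so it only
        # receives gradients from F in the learnable mask approach.
        "update_M": learnable_mask,
        # D' sees a forward pass only on every P-th epoch (assumed reading).
        "run_D_prime": epoch > 0 and epoch % period == 0,
    }
```

For example, with `period=5` and the fixed mask approach, `epoch_plan(10, 5, False)` marks $D'$ as active and $M$ as frozen, while `epoch_plan(3, 5, True)` marks $M$ as trainable and $D'$ as inactive.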