Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning
Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen
TL;DR
This work tackles privacy leakage in passive acoustic monitoring by preventing the latent representations from encoding speech presence. It introduces RDAL-M, a framework that combines a source separation network with robust discriminative adversarial learning, using a gradient reversal layer and a periodically updated auxiliary discriminator to maximize privacy while preserving performance on a sound-event detection task. The method is evaluated on mixtures of FSD50K sound events with LibriSpeech speech and compared against ablated baselines. A fixed masking strategy can reach near-random privacy levels without hurting utility, while a learnable mask can further improve privacy at a modest cost to performance. The results indicate that jointly leveraging source separation and adversarial learning yields stronger privacy preservation than either technique alone, with potential for practical deployment in privacy-sensitive audio sensing systems.
Abstract
Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose integrating two approaches commonly used in privacy preservation: source separation and adversarial representation learning. The proposed system learns a latent representation of audio recordings from which speech and non-speech recordings cannot be differentiated. First, the source separation network filters out some of the privacy-sensitive data; then, during the adversarial learning process, the system learns a privacy-preserving representation of the filtered signal. We demonstrate the effectiveness of the proposed method by comparing it against systems without source separation, without adversarial learning, and without either. Overall, our results suggest that the proposed system can significantly improve speech privacy preservation compared with using source separation or adversarial learning alone, while maintaining good performance in the acoustic monitoring task.
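To make the adversarial step more concrete, the following is a minimal sketch of a gradient reversal layer, the mechanism named in the summary above. It is not the authors' RDAL-M implementation: the scalar "encoder" weight `w`, the scalar "discriminator" weight `v`, the reversal strength `lam`, and all function names are hypothetical and chosen only for illustration. The layer passes values through unchanged in the forward pass and flips the sign of the gradient in the backward pass, so an ordinary gradient-descent step on the encoder increases the speech discriminator's loss instead of decreasing it.

```python
import math


class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength, not a value from the paper

    def forward(self, values):
        return list(values)

    def backward(self, grads):
        return [-self.lam * g for g in grads]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def adversary_loss(w, v, x, y):
    """Binary cross-entropy of a toy speech discriminator on the latent z = w * x."""
    p = sigmoid(v * (w * x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def encoder_grad_through_grl(w, v, x, y, grl):
    """Gradient reaching the toy encoder weight w after passing through the layer."""
    z = grl.forward([w * x])[0]
    p = sigmoid(v * z)
    dloss_dz = (p - y) * v                    # gradient of the loss w.r.t. the latent
    reversed_grad = grl.backward([dloss_dz])[0]
    return reversed_grad * x                  # chain rule back to the encoder weight


# One plain gradient-descent step on the encoder (toy numbers, for illustration only).
grl = GradientReversal(lam=1.0)
w, v, x, y = 1.0, 2.0, 1.0, 1
w_new = w - 0.1 * encoder_grad_through_grl(w, v, x, y, grl)
```

In this toy setting the updated weight `w_new` makes the discriminator's loss on the same example larger than before, which is the intended effect: the encoder is pushed toward representations that the speech discriminator finds harder to classify, while the discriminator itself would still be trained with the unreversed gradient.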
