Exploring Federated Self-Supervised Learning for General Purpose Audio Understanding
Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Lan Ma, Jiajun Shen
TL;DR
The paper tackles privacy-preserving, scalable general-purpose audio understanding by combining Federated Learning (FL) with Self-supervised Learning (SSL). It introduces a novel Federated SSL (F-SSL) framework, FASSL, to identify optimal global models during FL pretraining on non-iid data. It systematically compares predictive (e.g., ACOP) and feature-matching (e.g., SimCLR, Barlow Twins) SSL under cross-device FL and evaluates several aggregation strategies, including backbone-only transmission. Key findings: F-SSL can match centralized SSL on average, predictive SSL excels on semantic audio tasks, feature-matching SSL suits non-semantic tasks, and FASSL enables task-aware global model selection. These results point to practical ways of deploying general-purpose audio SSL at scale while preserving user privacy, and to transmission and aggregation design choices for different downstream tasks.
Abstract
Integrating Federated Learning (FL) with Self-supervised Learning (SSL) offers a synergistic way to exploit audio data for general-purpose audio understanding without compromising user data privacy. However, few efforts have been made to investigate SSL models in the FL regime for general-purpose audio understanding, especially when the training data is generated by large-scale heterogeneous audio sources. In this paper, we evaluate the performance of feature-matching and predictive audio-SSL techniques when integrated into large-scale FL settings simulated with non-independent and identically distributed (non-iid) data. We propose a novel Federated SSL (F-SSL) framework, dubbed FASSL, that enables learning intermediate feature representations from large-scale decentralized heterogeneous clients holding unlabelled audio data. Our study finds that audio F-SSL approaches perform on par with centralized audio-SSL approaches on the audio-retrieval task. Extensive experiments demonstrate the effectiveness and significance of FASSL, as it assists in obtaining the optimal global model for state-of-the-art FL aggregation methods.
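
To make the idea of aggregation with backbone-only transmission concrete, the following is a minimal sketch and not the FASSL procedure itself, which is not specified in this summary. It assumes a plain FedAvg-style average weighted by each client's local sample count, and a hypothetical naming scheme in which shared encoder parameters carry the prefix `backbone.` while SSL heads stay on the client. The helper names `backbone_only` and `fedavg`, and the toy parameter values, are illustrative only.

```python
from typing import Dict, List

# Parameter name -> flattened weights. A real system would use tensors;
# plain lists keep this sketch dependency-free.
Params = Dict[str, List[float]]


def backbone_only(params: Params, prefix: str = "backbone.") -> Params:
    """Select the parameters a client transmits (assumed 'backbone.' prefix)."""
    return {name: w for name, w in params.items() if name.startswith(prefix)}


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    aggregated: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        aggregated[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two toy clients with different amounts of local (unlabelled) data.
clients = [
    {"backbone.w": [1.0, 2.0], "head.w": [9.0]},
    {"backbone.w": [3.0, 6.0], "head.w": [-9.0]},
]
sizes = [1, 3]

# Only backbone parameters reach the server; heads never leave the client.
global_backbone = fedavg([backbone_only(c) for c in clients], sizes)
print(global_backbone)
```

In this sketch the second client holds three times as much data, so the aggregated backbone lies closer to its weights, and the `head.w` entries are absent from the global model because they were never transmitted.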
