Exploring Federated Self-Supervised Learning for General Purpose Audio Understanding

Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Lan Ma, Jiajun Shen

TL;DR

The paper tackles privacy-preserving, scalable general-purpose audio understanding by combining Federated Learning (FL) with Self-supervised Learning (SSL). It introduces a novel Federated SSL framework, FASSL, to identify optimal global models during FL pretraining on non-iid data. It systematically compares predictive (e.g., ACOP) and feature-matching (e.g., SimCLR, Barlow Twins) SSL under cross-device FL and evaluates various aggregation strategies, including backbone-only transmission. Key findings show that federated SSL can match centralized SSL on average, that predictive SSL excels on semantic audio tasks, and that feature-matching SSL suits non-semantic tasks, with FASSL enabling task-aware global model selection. These results highlight practical pathways to deploy general-purpose audio SSL at scale while preserving user privacy, and point to design choices in transmission and aggregation for different downstream tasks.

Abstract

The integration of Federated Learning (FL) and Self-supervised Learning (SSL) offers a unique and synergistic combination to exploit audio data for general-purpose audio understanding without compromising user data privacy. However, few efforts have been made to investigate SSL models in the FL regime for general-purpose audio understanding, especially when the training data is generated by large-scale heterogeneous audio sources. In this paper, we evaluate the performance of feature-matching and predictive audio-SSL techniques when integrated into large-scale FL settings simulated with non-independent and identically distributed (non-iid) data. We propose a novel Federated SSL (F-SSL) framework, dubbed FASSL, that enables learning intermediate feature representations from large-scale decentralized heterogeneous clients holding unlabelled audio data. Our study finds that audio F-SSL approaches perform on par with centralized audio-SSL approaches on the audio-retrieval task. Extensive experiments demonstrate the effectiveness and significance of FASSL, as it assists in obtaining the optimal global model for state-of-the-art FL aggregation methods.
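
To make the idea of server-side aggregation concrete, the following is a minimal sketch of sample-weighted parameter averaging in which clients send only their backbone weights. It is an illustration under stated assumptions, not the paper's implementation: the text above does not say which aggregation rules were evaluated, so a FedAvg-style average is used here only because it is a familiar baseline. The parameter names (`backbone.w`, `head.w`), the `backbone.` prefix convention, and the helper functions are all hypothetical.

```python
from typing import Dict, List

# Assumption: each model is a mapping from parameter name to a flat list of weights.
Params = Dict[str, List[float]]


def backbone_only(params: Params, prefix: str = "backbone.") -> Params:
    """Keep only the parameters a client would transmit (hypothetical naming)."""
    return {name: values for name, values in params.items() if name.startswith(prefix)}


def fedavg(client_params: List[Params], num_samples: List[int]) -> Params:
    """Sample-weighted average of client parameters (FedAvg-style baseline)."""
    total = sum(num_samples)
    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, num_samples)) / total
            for i in range(length)
        ]
    return averaged


# Two toy clients; the SSL projection head stays local and is never averaged.
clients = [
    {"backbone.w": [1.0, 2.0], "head.w": [9.0]},
    {"backbone.w": [3.0, 6.0], "head.w": [-9.0]},
]
sizes = [1, 3]  # number of local audio clips per client (made-up values)

global_backbone = fedavg([backbone_only(c) for c in clients], sizes)
print(global_backbone)
```

In this sketch the client with three samples contributes three times the weight of the client with one, and the `head.w` entries never reach the server.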

Paper Structure

This paper contains 10 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: System overview. Stage 1 (left) represents the FL audio-SSL pretraining. The downstream task, audio retrieval, is depicted in Stage 2 (right). Note that the Stage 1 and Stage 2 data come from different sources.