Where's That Voice Coming? Continual Learning for Sound Source Localization

Yang Xiao, Rohan Kumar Das

TL;DR

The paper tackles catastrophic forgetting in sound source localization when acoustic configurations change. It introduces CL-SSL, an exemplar-free continual learning framework that creates task-specific sub-networks with lateral connections and a gap-aware scaling mechanism, enabling adaptation across environments without storing past data. The SSL backbone SSNet processes complex STFT features into a spatial spectrum, while a new sub-network is created per task and connected to prior ones via adapters, with a fixed 181-class memory head. Across simulated LibriSpeech-based two-mic data and LOCATA, CL-SSL achieves high accuracy with a modest parameter increase (≈1.6M) and outperforms non-CL baselines, approaching, and sometimes exceeding, joint training in challenging settings.
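To make the "one sub-network per task, linked to earlier ones through adapters" idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the class names `SubNetwork` and `ProgressiveModel`, the single ReLU layer per column, the scalar adapter gains, and the `nudge` stand-in for a training step are all assumptions made for illustration. The gap-aware scaling mechanism and the 181-class head are not modeled, since the summary does not specify them.

```python
import random


class SubNetwork:
    """One task-specific column; a single ReLU layer stands in for a real sub-network."""

    def __init__(self, in_dim, hid_dim, rng):
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                  for _ in range(hid_dim)]
        self.frozen = False

    def forward(self, x):
        return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in self.w]

    def nudge(self, step):
        """Hypothetical stand-in for a training update; a frozen column ignores it."""
        if self.frozen:
            return False
        self.w = [[wi + step for wi in row] for row in self.w]
        return True


class ProgressiveModel:
    def __init__(self, in_dim, hid_dim, seed=0):
        self.in_dim, self.hid_dim = in_dim, hid_dim
        self.rng = random.Random(seed)
        self.columns = []   # one SubNetwork per task, in arrival order
        self.adapters = []  # adapters[t][k]: assumed scalar gain from column k into column t

    def add_task(self):
        """Freeze every existing column, then append a fresh one with lateral adapters."""
        for col in self.columns:
            col.frozen = True
        self.adapters.append([0.1] * len(self.columns))
        self.columns.append(SubNetwork(self.in_dim, self.hid_dim, self.rng))
        return len(self.columns) - 1

    def hidden(self, x, task):
        """Features for `task`: its own column plus adapter-scaled earlier columns."""
        h = self.columns[task].forward(x)
        for k, gain in enumerate(self.adapters[task]):
            prev = self.columns[k].forward(x)
            h = [a + gain * b for a, b in zip(h, prev)]
        return h


model = ProgressiveModel(in_dim=4, hid_dim=3)
x = [0.2, -0.1, 0.4, 0.3]
t0 = model.add_task()
before = model.hidden(x, t0)
t1 = model.add_task()
model.columns[t1].nudge(0.05)  # only the newest column accepts updates
model.columns[t0].nudge(0.05)  # ignored: column 0 is frozen
after = model.hidden(x, t0)
print(before == after)  # True: task 0 features are unchanged by learning task 1
```

Because earlier columns are frozen and a task only reads from columns that existed when it was added, learning a new task cannot alter the features of an old one. That is the sense in which this kind of design avoids forgetting without storing past data, at the cost of extra parameters per task.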

Abstract

Sound source localization (SSL) is essential for many speech-processing applications. Deep learning models have achieved high performance, but often fail when the training and inference environments differ. Adapting SSL models to dynamic acoustic conditions faces a major challenge: catastrophic forgetting. In this work, we propose an exemplar-free continual learning strategy for SSL (CL-SSL) to address such a forgetting phenomenon. CL-SSL applies task-specific sub-networks to adapt across diverse acoustic environments while retaining previously learned knowledge. It also uses a scaling mechanism to limit parameter growth, ensuring consistent performance across incremental tasks. We evaluated CL-SSL on simulated data with varying microphone distances and real-world data with different noise levels. The results demonstrate CL-SSL's ability to maintain high accuracy with minimal parameter increase, offering an efficient solution for SSL applications.

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: (a) Architecture of SSNet (b) Proposed CL-SSL framework.
  • Figure 2: Comparative performance in ACC (%) of various SSL methods after learning each microphone spacing, for three tolerance levels. Tasks T1-T5 correspond to microphone distances from 5 to 9 cm.
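
The ACC metric at a given tolerance can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: it assumes the 181 output classes map to azimuths 0° to 180° in 1° steps, that the prediction is the arg-max of the spatial spectrum, and that a prediction counts as correct when its absolute error is within the tolerance. The function names and the tolerance values used in the example are placeholders.

```python
def predicted_angle(spectrum):
    """Arg-max class index, read as an azimuth in degrees (assumed 1 degree per class)."""
    if len(spectrum) != 181:
        raise ValueError("expected a 181-bin spatial spectrum")
    return max(range(len(spectrum)), key=spectrum.__getitem__)


def accuracy_within_tolerance(pred_deg, true_deg, tol_deg):
    """Percentage of predictions whose absolute angular error is at most tol_deg."""
    if len(pred_deg) != len(true_deg) or not pred_deg:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(abs(p - t) <= tol_deg for p, t in zip(pred_deg, true_deg))
    return 100.0 * hits / len(pred_deg)


# Toy spectrum with a single peak at class 90.
spectrum = [0.0] * 181
spectrum[90] = 1.0
peak = predicted_angle(spectrum)

preds = [peak, 47, 120, 10]
truth = [92, 40, 120, 30]   # absolute errors: 2, 7, 0, 20 degrees
for tol in (5, 10, 15):     # placeholder tolerance values
    print(tol, accuracy_within_tolerance(preds, truth, tol))
```

With these toy values, two of four predictions fall within 5° (50%) and three of four within 10° or 15° (75%), which shows how the same predictions can score differently at each tolerance level.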