Where's That Voice Coming? Continual Learning for Sound Source Localization
Yang Xiao, Rohan Kumar Das
TL;DR
The paper tackles catastrophic forgetting in sound source localization (SSL) when the acoustic configuration changes. It introduces CL-SSL, an exemplar-free continual learning framework that creates task-specific sub-networks with lateral connections and a gap-aware scaling mechanism, enabling adaptation across environments without storing past data. The SSL backbone, SSNet, maps complex STFT features to a spatial spectrum; for each new task, a new sub-network is created and connected to the earlier ones via adapters, with a fixed 181-class memory head. On simulated two-microphone data built from LibriSpeech and on LOCATA, CL-SSL achieves high accuracy with a modest parameter increase (≈1.6M) and outperforms non-CL baselines, approaching, and sometimes exceeding, joint training in challenging settings.
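The per-task sub-networks and lateral adapters described above can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the single hidden layer, the ReLU, the linear adapters, the layer sizes, and the names `Column`, `CLSketch`, and `decode_doa` are all hypothetical, and the gap-aware scaling mechanism is not modeled. The 181 output classes come from the summary; reading them as directions of arrival from 0 to 180 degrees in 1-degree steps is an assumption.

```python
import numpy as np

N_CLASSES = 181  # from the summary; assumed to index DoA 0..180 degrees


def relu(x):
    return np.maximum(x, 0.0)


class Column:
    """One task-specific sub-network (hypothetical single hidden layer)."""

    def __init__(self, rng, in_dim, hid_dim, n_prev):
        self.W = rng.standard_normal((in_dim, hid_dim)) * 0.1
        # One lateral adapter per earlier column (assumed to be linear maps).
        self.adapters = [
            rng.standard_normal((hid_dim, hid_dim)) * 0.1 for _ in range(n_prev)
        ]


class CLSketch:
    """Grows one column per task; earlier columns are never modified."""

    def __init__(self, in_dim=64, hid_dim=32, seed=0):
        self.rng = np.random.default_rng(seed)
        self.in_dim, self.hid_dim = in_dim, hid_dim
        self.columns = []
        # Fixed-size output head shared by all tasks (an assumption here).
        self.head = self.rng.standard_normal((hid_dim, N_CLASSES)) * 0.1

    def add_task(self):
        n_prev = len(self.columns)
        self.columns.append(Column(self.rng, self.in_dim, self.hid_dim, n_prev))

    def forward(self, x, task_id):
        """Run columns 0..task_id; column t also reads earlier hidden states."""
        hiddens = []
        for t in range(task_id + 1):
            col = self.columns[t]
            pre = x @ col.W
            for h_prev, adapter in zip(hiddens, col.adapters):
                pre = pre + h_prev @ adapter  # lateral connection
            hiddens.append(relu(pre))
        return hiddens[-1] @ self.head  # spatial-spectrum-like scores


def decode_doa(scores):
    """Pick the peak of the score vector as the class index (assumed degrees)."""
    return np.argmax(scores, axis=-1)


model = CLSketch()
model.add_task()                      # task 0
x = np.random.default_rng(1).standard_normal((4, 64))
before = model.forward(x, task_id=0)
model.add_task()                      # task 1 adds a column plus one adapter
after = model.forward(x, task_id=0)
print(np.array_equal(before, after), decode_doa(model.forward(x, task_id=1)))
```

In this sketch, the output for task 0 depends only on column 0 and the head, so adding a column for task 1 leaves it unchanged. That is the sense in which a grow-and-freeze design can retain earlier behavior without replaying stored data, provided the earlier columns and the head are kept frozen during later training.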
Abstract
Sound source localization (SSL) is essential for many speech-processing applications. Deep learning models have achieved high performance, but they often fail when the training and inference environments differ. Adapting SSL models to dynamic acoustic conditions faces a major challenge: catastrophic forgetting. In this work, we propose an exemplar-free continual learning strategy for SSL (CL-SSL) to address this forgetting. CL-SSL uses task-specific sub-networks to adapt across diverse acoustic environments while retaining previously learned knowledge. It also uses a scaling mechanism to limit parameter growth, ensuring consistent performance across incremental tasks. We evaluate CL-SSL on simulated data with varying microphone distances and on real-world data with different noise levels. The results demonstrate CL-SSL's ability to maintain high accuracy with minimal parameter increase, offering an efficient solution for SSL applications.
