TF-Mamba: A Time-Frequency Network for Sound Source Localization
Yang Xiao, Rohan Kumar Das
TL;DR
This work tackles sound source localization (SSL), estimating source directions from multi-channel audio with a Time-Frequency Mamba (TF-Mamba) architecture. TF-Mamba fuses temporal and spectral cues through Bidirectional Mamba blocks, enabling efficient modeling of long-range dependencies for moving sources. In evaluations on simulated data and the LOCATA dataset, TF-Mamba outperforms modern baselines with high accuracy and low localization error, which points to its practical value for SSL in real-world environments. The study also reports the first successful application of a state-space model to SSL, suggesting potential for broader adoption in spatial audio tasks.
Abstract
Sound source localization (SSL) determines the position of sound sources from multi-channel audio. It is commonly used as a front end to improve speech enhancement and separation. Extracting spatial features is crucial for SSL, especially in challenging acoustic environments. Recently, a sequence-modeling architecture called Mamba has demonstrated notable performance across various sequence-based modalities. This study introduces Mamba to the SSL task. We use a Mamba-based model to extract spatial features from speech signals by fusing time and frequency information, and we build an SSL system on it called TF-Mamba. In this system, Bidirectional Mamba blocks handle both the time-wise and the frequency-wise processing. We conduct experiments on simulated and real-world datasets, and the results show that TF-Mamba significantly outperforms other advanced methods. The code will be publicly released in due course.
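To make the idea of time-wise and frequency-wise bidirectional processing concrete, here is a minimal NumPy sketch. It is not the authors' implementation: a fixed-decay linear recurrence stands in for a Mamba block (a real Mamba layer uses learned, input-dependent state-space parameters), and the function names, the `decay` parameter, the residual connections, the time-then-frequency ordering, and the `(time, freq, channels)` feature layout are all assumptions made for illustration.

```python
import numpy as np

def linear_scan(x, decay=0.9):
    """Causal recurrence h[t] = decay * h[t-1] + (1 - decay) * x[t] along axis 0.

    Hypothetical stand-in for a state-space layer; not a real Mamba block.
    """
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, axis, decay=0.9):
    """Run the scan forward and backward along `axis` and sum the two passes."""
    x = np.moveaxis(x, axis, 0)
    fwd = linear_scan(x, decay)
    bwd = linear_scan(x[::-1], decay)[::-1]  # scan the reversed sequence, then flip back
    return np.moveaxis(fwd + bwd, 0, axis)

def tf_block(features, decay=0.9):
    """Apply a bidirectional scan over time, then over frequency, with residuals.

    features: array of shape (time, freq, channels) -- an assumed layout.
    """
    y = features + bidirectional_scan(features, axis=0, decay=decay)  # time-wise
    y = y + bidirectional_scan(y, axis=1, decay=decay)                # frequency-wise
    return y

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 32, 4))  # 50 frames, 32 frequency bins, 4 channels
out = tf_block(feats)
print(out.shape)
```

Because each axis is scanned in both directions, every output element depends on context from both earlier and later frames and from both lower and higher frequency bins, which is the property the bidirectional design is meant to provide.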
