TF-Mamba: A Time-Frequency Network for Sound Source Localization

Yang Xiao, Rohan Kumar Das

TL;DR

This work tackles sound source localization (SSL), estimating source directions from multi-channel audio with a novel Time-Frequency Mamba (TF-Mamba) architecture. TF-Mamba fuses temporal and spectral cues through Bidirectional Mamba blocks, enabling efficient modeling of long-range dependencies for moving sources. Evaluations on simulated data and LOCATA show that TF-Mamba outperforms modern baselines in accuracy and localization error, which points to its practical value for SSL in real-world environments. The study also demonstrates the first successful application of a state-space model to SSL, suggesting strong potential for broader adoption in spatial audio tasks.

Abstract

Sound source localization (SSL) determines the position of sound sources using multi-channel audio data. It is commonly used to improve speech enhancement and separation. Extracting spatial features is crucial for SSL, especially in challenging acoustic environments. Recently, a novel structure referred to as Mamba has demonstrated notable performance across various sequence-based modalities. This study introduces Mamba to SSL tasks. We use a Mamba-based model to analyze spatial features from speech signals by fusing time and frequency features, and we develop an SSL system called TF-Mamba. This system integrates time and frequency fusion, with Bidirectional Mamba handling both time-wise and frequency-wise processing. We conduct experiments on simulated and real datasets. The experiments show that TF-Mamba significantly outperforms other advanced methods. The code will be publicly released in due course.

Paper Structure (19 sections, 2 equations, 1 figure, 3 tables)

This paper contains 19 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Architecture of (a) the proposed TF-Mamba network and (b) Bidirectional Mamba (BiMamba) layer. Each TF-Mamba block includes a temporal Mamba (T-BiMamba) and frequency Mamba (F-BiMamba) layer, with skip connections to prevent information loss. "S" denotes the SiLU activation and "Linear" indicates linear projection.
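
The block structure in the Figure 1 caption (a temporal pass, a frequency pass, and skip connections around each) can be illustrated with a minimal sketch. It is not the authors' implementation. A fixed-decay linear recurrence stands in for each Mamba layer, the forward and backward passes are combined by summation, and the temporal pass runs before the frequency pass; all three choices are assumptions made for illustration. The names `scan`, `bidirectional`, and `tf_block` are hypothetical, and the SiLU activations and linear projections from the figure are omitted.

```python
def scan(seq, decay=0.5):
    """Causal stand-in for a Mamba layer (assumption): h_t = decay * h_{t-1} + x_t."""
    out, h = [], 0.0
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional(seq):
    """Scan the sequence forward and backward, then sum the two passes (assumed fusion)."""
    fwd = scan(seq)
    bwd = scan(seq[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def transpose(grid):
    """Swap the two axes of a list-of-lists grid."""
    return [list(col) for col in zip(*grid)]


def add(a, b):
    """Element-wise sum of two equally shaped grids (used for skip connections)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tf_block(spec):
    """Apply a temporal pass, then a frequency pass, to spec[t][f].

    spec holds one feature value per time frame t and frequency bin f.
    """
    # Temporal pass: each frequency bin is treated as a sequence over time.
    by_freq = transpose(spec)                                   # [f][t]
    t_out = transpose([bidirectional(row) for row in by_freq])  # back to [t][f]
    t_out = add(spec, t_out)                                    # skip connection

    # Frequency pass: each time frame is treated as a sequence over frequency.
    f_out = [bidirectional(frame) for frame in t_out]
    return add(t_out, f_out)                                    # skip connection


if __name__ == "__main__":
    # A single impulse at time frame 0, frequency bin 0.
    spec = [[1.0, 0.0],
            [0.0, 0.0]]
    for frame in tf_block(spec):
        print(frame)
```

In this toy input, the impulse reaches every cell of the output, including the one that shares neither its time frame nor its frequency bin. This is the effect of chaining a pass along time with a pass along frequency.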