Moving Speaker Separation via Parallel Spectral-Spatial Processing

Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

TL;DR

A dual-branch parallel spectral-spatial (PS2) architecture processes spectral and spatial features in separate parallel streams and outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) in moving-speaker scenarios.

Abstract

Multi-channel speech separation in dynamic environments is challenging because time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
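
The abstract reports results as SI-SDR and SI-SDR improvement but does not spell out the metric. The following is a minimal sketch of the commonly used definition (project the estimate onto the reference, then compare target and residual energy), not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the mean removal, and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Assumed (common) definition: the estimate is projected onto the
    reference, and the energy of that projection is compared with the
    energy of the remaining residual.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score (assumption).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference towards the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant. Under this definition, an "improvement of over 13 dB" means the separated output scores more than 13 dB higher than the input mixture against the same reference.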

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visualization of temporal evolution in spectral and spatial features for moving versus static sound source scenarios. Top row: magnitude spectrogram (left) and inter-channel time difference (ITD) (right) of a moving speaker ($0.8$ m/s); Bottom row: corresponding visualizations for a static speaker positioned at the initial point of the moving source trajectory. The magnitude spectrograms use the first microphone channel of noise-free reverberant speech (RT60: $250$ ms) simulated in a $10$ m $\times$ $10$ m $\times$ $3$ m room with two microphones spaced by $15$ cm at $16$ kHz sampling rate. The ITD is visualized by computing the time difference between the two channels for each time-frequency bin. For each channel, the complex-valued spectrogram is normalized by its magnitude, yielding a phase-only spectrogram. The inter-channel phase difference is then computed for each bin and transformed into a time delay.
  • Figure 2: The illustration of the proposed PS2 system.
  • Figure 3: The architecture of the proposed PS2 system.
  • Figure 4: The core components of the PS2 system: (a) spectral branch, (b) spatial branch, and (c) cross-attention fusion module.
  • Figure 5: Multi-conditional evaluation results under different acoustic conditions. SI-SDR performance across varying (a) reverberation times (RT60), (b) input SNR levels, (c) source movement speeds, (d) inter-source angles, and (e) signal durations.
  • ...and 1 more figures
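
The ITD computation described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code. The caption does not state the conversion formula, so the sketch assumes the delay is the phase difference divided by $2\pi f$. The helper names `stft` and `itd_map`, the Hann window, and the FFT and hop sizes are also assumptions.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Simple Hann-windowed STFT; returns an array of shape (frames, bins)."""
    x = np.asarray(x, dtype=np.float64)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def itd_map(x1, x2, fs=16000, n_fft=512, hop=256, eps=1e-12):
    """Per-bin inter-channel time difference in seconds.

    Positive values mean channel 2 lags channel 1. The DC bin is NaN
    because a delay cannot be derived from a phase at 0 Hz.
    """
    X1 = stft(x1, n_fft, hop)
    X2 = stft(x2, n_fft, hop)
    # Normalize each channel by its magnitude: phase-only spectrograms.
    P1 = X1 / (np.abs(X1) + eps)
    P2 = X2 / (np.abs(X2) + eps)
    # Inter-channel phase difference per bin, wrapped to [-pi, pi].
    ipd = np.angle(P1 * np.conj(P2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    itd = np.full(ipd.shape, np.nan)
    # Assumed conversion from phase difference to time delay.
    itd[:, 1:] = ipd[:, 1:] / (2.0 * np.pi * freqs[1:])
    return itd, freqs
```

Since the phase difference is wrapped, the resulting delay is only unambiguous at lower frequencies. Assuming a speed of sound of about 343 m/s, the 15 cm spacing in the caption bounds the direct-path delay at roughly 0.44 ms, so wrapping can begin above roughly 1.1 kHz.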