SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

Hiroshi Sato; Takafumi Moriya; Masato Mimura; Shota Horiguchi; Tsubasa Ochiai; Takanori Ashihara; Atsushi Ando; Kentaro Shinayama; Marc Delcroix

TL;DR

This work tackles real-time target speaker extraction (TSE) by integrating state space modeling (SSM) into Conv-TasNet-based TSE, forming SpeakerBeam-SS. S4D blocks let the model capture long-range temporal dependencies with fewer dilated convolutional layers. A longer window and shift in the frontend encoder cut the computational cost further, and over-parameterizing the encoder compensates for the resulting performance loss. Together these changes reduce the real-time factor by about 78% relative to the conventional causal Conv-TasNet-based TSE without sacrificing SDR or DNSMOS, and the model outperforms existing real-time TSE architectures. The results highlight the practicality of SSM for real-time audio separation and suggest extending SSM to other TSE models in future work.

Abstract

Real-time target speaker extraction (TSE) aims to extract the desired speaker's voice from an observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging because the computational complexity must be low enough for real-time operation. This work introduces a new architecture for Conv-TasNet-based TSE built on state space modeling (SSM), which has been shown to model long-term dependencies effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependencies in Conv-TasNet, which reduces model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the resulting performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% relative to the conventional causal Conv-TasNet-based TSE while matching its performance.
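To illustrate why a state space layer suits streaming use, the following is a minimal sketch of a generic diagonal SSM, not the paper's S4D block. It assumes zero-order-hold discretization, and every parameter value (state size, step size `dt`, the entries of `a`, `b`, `c`) is a hypothetical choice made only for illustration. The sketch checks that a sample-by-sample recurrence, which carries only a small state vector, yields the same output as a causal convolution with the layer's long kernel.

```python
import cmath
import math


def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of a diagonal SSM (one scalar per state)."""
    a_bar = [cmath.exp(dt * a_n) for a_n in a]
    b_bar = [(ab - 1.0) / a_n * b_n for ab, a_n, b_n in zip(a_bar, a, b)]
    return a_bar, b_bar


def ssm_stream(u, a_bar, b_bar, c):
    """Process the input one sample at a time, keeping only the state vector."""
    x = [0j] * len(a_bar)
    y = []
    for u_k in u:
        x = [ab * x_n + bb * u_k for ab, bb, x_n in zip(a_bar, b_bar, x)]
        y.append(sum(c_n * x_n for c_n, x_n in zip(c, x)).real)
    return y


def ssm_kernel(length, a_bar, b_bar, c):
    """Equivalent convolution kernel: K[l] = Re(sum_n c_n * a_bar_n**l * b_bar_n)."""
    return [
        sum(c_n * ab ** l * bb for c_n, ab, bb in zip(c, a_bar, b_bar)).real
        for l in range(length)
    ]


def causal_conv(u, k):
    """Direct causal convolution, used here only as a reference."""
    return [sum(k[l] * u[t - l] for l in range(t + 1)) for t in range(len(u))]


# Hypothetical parameters for illustration only (not taken from the paper).
a = [complex(-0.5, math.pi * n) for n in range(4)]
b = [1.0] * 4
c = [0.5, -0.25, 0.125, 0.3]
a_bar, b_bar = discretize_zoh(a, b, dt=0.1)

u = [1.0, 0.0, -0.5, 2.0, 0.3, 0.0, 1.5, -1.0]
y_stream = ssm_stream(u, a_bar, b_bar, c)
y_conv = causal_conv(u, ssm_kernel(len(u), a_bar, b_bar, c))

# Largest absolute difference between the two computations.
print(max(abs(p - q) for p, q in zip(y_stream, y_conv)))
```

In the recurrent form the cost per sample depends on the state size rather than on how far back the kernel reaches, which is one way to see how a single layer can cover a long temporal context.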

Paper Structure (13 sections, 7 equations, 2 figures, 2 tables)

Introduction
Related work
Background
Conv-TasNet-based TSE
State Space Modeling
Proposed method
Experiments
Experimental setup
Dataset
System configuration and training procedure
Evaluation details
Experimental results
Conclusion

Figures (2)

Figure 1: Overview of the proposed SpeakerBeam-SS architecture. (a) shows the overall structure and (b) shows the details of the S4D block. The dropout layer is omitted from the figure. $d$ refers to the dilation of 1-D convolutional blocks.
Figure 2: The relationship between RTF and enhancement performance for various numbers of filters $N$ and window sizes $L$ in the frontend encoder.
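As a rough illustration of why the window size $L$ affects RTF, the sketch below counts how many encoder frames the rest of the network must process per second of audio. The 16 kHz sample rate, the window sizes, and the 50% overlap (shift = window / 2) are assumptions chosen for illustration and are not taken from this page. The helper `real_time_factor` uses the usual definition of RTF as processing time divided by audio duration.

```python
def num_frames(n_samples, window, shift):
    """Number of full encoder frames produced from n_samples (no padding)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means processing is faster than real time."""
    return processing_seconds / audio_seconds


sample_rate = 16000  # assumed sample rate in Hz
for window in (16, 32, 64):  # hypothetical window sizes in samples
    shift = window // 2  # assumed 50% overlap
    frames = num_frames(sample_rate, window, shift)
    print(f"window={window} shift={shift} frames per second={frames}")
```

Doubling the window and shift roughly halves the number of frames, so the layers after the encoder run about half as often per second of input.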
