SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

TL;DR

This work tackles real-time target speaker extraction (TSE) by integrating state space modeling (SSM) into Conv-TasNet, yielding SpeakerBeam-SS. S4D blocks provide long-range temporal modeling, so fewer dilated convolutional layers are needed. The frontend encoder uses a longer window and shift to cut computation further, and over-parameterizing the encoder compensates for the resulting performance loss. The approach reduces the real-time factor by about 78% relative to the conventional causal Conv-TasNet-based TSE without sacrificing SDR or DNSMOS, and it outperforms existing real-time TSE architectures. The results highlight the practicality of SSM for real-time audio separation and suggest extending SSM to other TSE models in future work.
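
To make the streaming appeal of a diagonal state space layer concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: it assumes an S4D-style diagonal state matrix with zero-order-hold discretization, and the initialization, state size, step size, and all function names are illustrative assumptions. The sketch checks that the recurrent form, which keeps only a fixed-size state per sample, gives the same output as a long causal convolution over the whole input.

```python
import cmath
import math
import random

def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of one diagonal SSM mode."""
    a_bar = cmath.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_stream(u, a_bar, b_bar, c):
    """Recurrent (streaming) form: fixed-size state, one update per sample."""
    state = [0j] * len(a_bar)
    y = []
    for u_k in u:
        # x_k = a_bar * x_{k-1} + b_bar * u_k, element-wise (diagonal A)
        state = [ab * x + bb * u_k for ab, bb, x in zip(a_bar, b_bar, state)]
        # y_k = Re(sum_n c_n * x_k[n])
        y.append(sum(cn * x for cn, x in zip(c, state)).real)
    return y

def ssm_kernel(a_bar, b_bar, c, length):
    """Equivalent kernel K[m] = Re(sum_n c_n * a_bar_n**m * b_bar_n)."""
    return [
        sum(cn * ab ** m * bb for cn, ab, bb in zip(c, a_bar, b_bar)).real
        for m in range(length)
    ]

def causal_conv(u, kernel):
    """Direct causal convolution y_k = sum_{m<=k} K[m] * u[k-m]."""
    return [sum(kernel[m] * u[k - m] for m in range(k + 1)) for k in range(len(u))]

random.seed(0)
n_state = 8  # illustrative state size (assumption)
# Diagonal state matrix with negative real parts (S4D-Lin-like, assumption).
a = [complex(-0.5, math.pi * n) for n in range(n_state)]
b = [1.0 + 0j] * n_state
c = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_state)]

pairs = [discretize_zoh(an, bn, dt=0.01) for an, bn in zip(a, b)]
a_bar = [p[0] for p in pairs]
b_bar = [p[1] for p in pairs]

u = [random.gauss(0, 1) for _ in range(64)]
y_stream = ssm_stream(u, a_bar, b_bar, c)
y_conv = causal_conv(u, ssm_kernel(a_bar, b_bar, c, len(u)))

# Largest difference between the two forms; should be near machine precision.
max_err = max(abs(p - q) for p, q in zip(y_stream, y_conv))
print(max_err)
```

The recurrent form is what suits frame-by-frame inference: each new sample updates a fixed-size state, so the cost per step does not grow with the length of the context being modeled.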

Abstract

Real-time target speaker extraction (TSE) aims to extract the desired speaker's voice from an observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging because the computational complexity must be low enough for real-time operation. This work introduces a new architecture for Conv-TasNet-based TSE based on state space modeling (SSM), which has been shown to model long-term dependencies effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependency in Conv-TasNet, which reduces model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% from the conventional causal Conv-TasNet-based TSE while matching its performance.
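
The two cost levers in the abstract can be illustrated with simple arithmetic. The sketch below is a hedged illustration, not the paper's configuration: the kernel size, block counts, sample rate, shifts, and baseline real-time factor are all assumed values, and the function names are hypothetical. It shows how the receptive field of a stack of dilated convolutions grows with depth, how a larger encoder shift lowers the number of frames the separator must process per second, and what a 78% reduction in real-time factor (processing time divided by audio duration) means numerically.

```python
def tcn_receptive_field(kernel_size, num_blocks, num_repeats):
    """Receptive field in frames of stacked dilated 1-D convolutions.

    Assumes dilations 1, 2, ..., 2**(num_blocks - 1) within each repeat,
    as in the usual Conv-TasNet temporal convolutional network.
    """
    per_repeat = (kernel_size - 1) * (2 ** num_blocks - 1)
    return 1 + num_repeats * per_repeat

def frames_per_second(sample_rate, shift):
    """Number of encoder frames produced per second of audio."""
    return sample_rate / shift

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means processing keeps up with the incoming audio."""
    return processing_seconds / audio_seconds

# Deeper stacks are the conventional way to see further back in time.
deep = tcn_receptive_field(kernel_size=3, num_blocks=8, num_repeats=3)
shallow = tcn_receptive_field(kernel_size=3, num_blocks=4, num_repeats=3)
print(deep, shallow)  # 1531 91

# A larger shift means fewer frames for every later layer to process.
small_shift = frames_per_second(sample_rate=16000, shift=8)
large_shift = frames_per_second(sample_rate=16000, shift=32)
print(small_shift, large_shift)  # 2000.0 500.0

# A 78% RTF reduction applied to a hypothetical baseline RTF of 0.5.
baseline_rtf = real_time_factor(processing_seconds=5.0, audio_seconds=10.0)
reduced_rtf = baseline_rtf * (1.0 - 0.78)
print(round(reduced_rtf, 3))
```

With fewer dilated layers the convolutional stack alone sees far less context, which is the gap the abstract says the SSM component fills.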

Paper Structure (13 sections, 7 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overview of the proposed SpeakerBeam-SS architecture. (a) shows the overall structure and (b) shows the details of the S4D block. The dropout layer is omitted from the figure. $d$ refers to the dilation of 1-D convolutional blocks.
  • Figure 2: The relationship between RTF and enhancement performance for various numbers of filters $N$ and window sizes $L$ in the frontend encoder.