TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

Mohan Xu, Kai Li, Guo Chen, Xiaolin Hu

TL;DR

TIGER targets low-latency, resource-efficient speech separation with a time-frequency interleaved architecture that combines band-split processing with two attention modules, multi-scale selective attention (MSA) and full-frequency-frame attention (F$^3$A), inside a separator whose blocks share parameters. It pairs this lightweight model with EchoSet, a realism-focused dataset that spans noisy, reverberant, and occluded environments to better approximate real-world use. Empirical results show that TIGER achieves competitive SDRi/SI-SDRi with far fewer parameters and MACs than SOTA models (reductions of 94.3% and 95.3%, respectively, against TF-GridNet) and that it generalizes better to real-world data. These findings, validated on both standard benchmarks and a realism-enhanced dataset, suggest that TIGER is a viable approach for edge-friendly speech separation without sacrificing performance.

Abstract

In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet generalized better to data collected in the physical world than models trained on other datasets, which validated the practical value of EchoSet. On EchoSet and real-world data, TIGER surpasses the state-of-the-art (SOTA) model TF-GridNet while reducing the number of parameters by 94.3% and the MACs by 95.3%.

Paper Structure (25 sections, 10 equations, 5 figures, 12 tables)

This paper contains 25 sections, 10 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The overall pipeline of TIGER. We focus on scenarios with only two speakers.
  • Figure 2: The separator of TIGER consists of several FFI blocks that share parameters. Residual connections are used to retain original features and reduce learning difficulty.
  • Figure 3: The structure of the MSA module and the F$^3$A module. The structures of frequency and frame paths are the same.
  • Figure 4: SI-SDRi results of different models on the real-world data. Models were trained on Libri2Mix, LRS2-2Mix, and EchoSet, respectively.
  • Figure 5: Comparison of the spectrograms of the ground truth, audio separated by TIGER and by TF-GridNet.
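
The results above are reported in SI-SDRi (scale-invariant signal-to-distortion ratio improvement). As a reading aid, the following is a minimal sketch of how that metric is commonly computed, not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the mean removal step are assumptions made for illustration; signals are plain Python lists of samples.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Hypothetical helper: mean removal and the eps stabilizer are assumptions.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: the optimally scaled target.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [alpha * b for b in t]
    # Everything left over counts as distortion.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target) + eps
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy example: the interferer is orthogonal to the target.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
improvement = si_sdr_improvement(estimate, mixture, target)  # about 20 dB
```

In this toy case the estimate keeps one tenth of the interferer's amplitude, so the residual energy drops by a factor of 100 and the improvement is about 20 dB. Because the target is rescaled by the projection, multiplying the estimate by a constant leaves the score essentially unchanged.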