Study of Lightweight Transformer Architectures for Single-Channel Speech Enhancement

Haixin Zhao, Nilesh Madhu

TL;DR

Edge-constrained speech enhancement demands high performance at low compute cost. The authors introduce LCT-GAN, a lightweight, causal, transformer-based GAN that uses a Frequency-Time-Frequency (FTF) bottleneck to capture global dependencies. It is trained with a multi-resolution loss plus adversarial losses from discriminators, which add no inference cost. The generator estimates a magnitude mask, a compressed-domain ideal ratio mask (IRM) with $c=0.3$, enabling effective speech enhancement while preserving phase. Empirical results on Voicebank+Demand and DNS3 show that LCT-GAN achieves state-of-the-art performance among lightweight models with about 6% of DeepFilterNet2’s parameters and competitive MACs, with discriminators and PCS further boosting perceptual quality. The work enables practical low-latency deployment on edge devices and opens avenues for further efficiency gains.
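As a rough illustration of magnitude masking with phase preservation, the sketch below applies a mask to a power-law-compressed magnitude and reuses the noisy phase. The exact mask definition is not given in this summary, so the power-law form `|X|**c`, the function name `apply_compressed_mask`, and the random stand-in for the generator output are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def apply_compressed_mask(noisy_stft, mask, c=0.3):
    """Scale the compressed magnitude |X|**c by a mask and reuse the noisy phase.

    Assumption: 'compressed domain' means power-law compression with exponent c.
    """
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    compressed = magnitude ** c                # compress the noisy magnitude
    enhanced_compressed = mask * compressed    # mask acts in the compressed domain
    enhanced_magnitude = enhanced_compressed ** (1.0 / c)  # undo the compression
    return enhanced_magnitude * np.exp(1j * phase)         # noisy phase is kept

# Toy (freq, time) spectrogram and a hypothetical mask in place of the generator output.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
mask = rng.uniform(0.1, 1.0, size=noisy.shape)
enhanced = apply_compressed_mask(noisy, mask)
```

Under this assumed form, a mask value $m$ in the compressed domain corresponds to a gain of $m^{1/c}$ on the linear magnitude, so with $c=0.3$ the mask suppresses noise more strongly than the same value applied directly to the magnitude.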

Abstract

In speech enhancement, achieving state-of-the-art (SotA) performance while adhering to the computational constraints on edge devices remains a formidable challenge. Networks integrating stacked temporal and spectral modelling effectively leverage improved architectures such as transformers; however, they inevitably incur substantial computational complexity and model expansion. Through systematic ablation analysis on transformer-based temporal and spectral modelling, we demonstrate that the architecture employing streamlined Frequency-Time-Frequency (FTF) stacked transformers efficiently learns global dependencies within causal context, while avoiding considerable computational demands. Utilising discriminators in training further improves learning efficacy and enhancement without introducing additional complexity during inference. The proposed lightweight, causal, transformer-based architecture with adversarial training (LCT-GAN) yields SotA performance on instrumental metrics among contemporary lightweight models, but with far less overhead. Compared to DeepFilterNet2, the LCT-GAN requires only 6% of the parameters, at similar complexity and performance. Against CCFNet+(Lite), LCT-GAN saves 9% in parameters and 10% in multiply-accumulate operations while yielding improved performance. Further, the LCT-GAN even outperforms more complex, common baseline models on widely used test datasets.

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The architecture of the proposed LCT-GAN model, with the 'generator' and the discriminators distinguished by dotted lines. The generator is a predictive network that estimates masks for denoising. $L_\mathrm{adv\_gen}$ and $L_\mathrm{adv\_dis}$ denote the adversarial loss components for training the generator and the discriminator, respectively. $G()$ denotes the generator (LCT), while $D()$ denotes the discriminator. A multi-resolution loss, $L_\mathrm{multi\_res}$, is employed for generator training. Skip connections are implemented by point-wise convolutional layers. The input tensor dimensions for each transformer are explicitly indicated.
  • Figure 2: Schematic diagram of the information-exploitation flow in the proposed efficient FTF-transformer structure. TF bins are represented by small blocks. Information flows from the TF bins of previous and current frames to a given TF bin (marked by a cross) are denoted by coloured arrows. The x- and y-axes represent the time and frequency dimensions, respectively.
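The information flow described for Figure 2 can be sketched as three attention passes: across frequency within each frame, causally across time for each frequency bin, then across frequency again. The code below is a minimal sketch under stated assumptions, not the authors' architecture: it uses single-head attention with identity projections (no learned weights), plain residual connections, and hypothetical names `self_attention` and `ftf_block`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    """Single-head self-attention over axis -2 of x, shape (..., seq, dim).

    Assumption: queries, keys and values are x itself (no learned projections).
    """
    dim = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(dim)   # (..., seq, seq)
    if causal:
        seq = x.shape[-2]
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)       # block attention to later positions
    return softmax(scores, axis=-1) @ x

def ftf_block(x):
    """x has shape (time, freq, dim): frequency, causal time, frequency attention."""
    x = x + self_attention(x)                  # F: mix frequency bins within each frame
    xt = np.swapaxes(x, 0, 1)                  # (freq, time, dim)
    xt = xt + self_attention(xt, causal=True)  # T: each frame sees only itself and the past
    x = np.swapaxes(xt, 0, 1)                  # back to (time, freq, dim)
    x = x + self_attention(x)                  # F: spread the temporal context over frequency
    return x

rng = np.random.default_rng(0)
features = rng.standard_normal((6, 8, 4))      # 6 frames, 8 frequency bins, 4 channels
out = ftf_block(features)

# Perturbing frames 4 and 5 must leave the outputs of frames 0..3 unchanged.
perturbed = features.copy()
perturbed[4:] += 1.0
out_perturbed = ftf_block(perturbed)
```

Because the frequency passes operate within a single frame and the time pass is causally masked, every output bin depends only on bins of the current and earlier frames, which matches the flow the caption describes.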