Low-latency Speech Enhancement via Speech Token Generation

Huaying Xue, Xiulian Peng, Yan Lu

TL;DR

The paper tackles robustness and latency in speech enhancement under unseen real-world noises by formulating enhancement as conditional generation of clean-speech tokens. It introduces a conditional generative framework that encodes clean speech into discrete TF-Codec tokens and autoregressively generates these tokens from noisy input, guided by an explicit-alignment scheme that enables pure next-token prediction. Leveraging a single-stage, group-quantized TF-Codec-based generator, the approach achieves low latency while maintaining high speech quality. Experiments on synthetic and real DNS data show improved perceptual quality and intelligibility over a TFNet baseline, with ablations confirming the benefits of explicit alignment and autoregressive design for robustness and temporal coherence.

Abstract

Existing deep learning based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from the noisy signal. However, this heavy dependence on data limits generalization to unseen, complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by the acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that use multiple stages to generate speech codes, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
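
To make the contrast between prefix-style conditioning and explicit alignment concrete, here is a minimal sketch of one plausible reading of the abstract. It is not the paper's exact input construction: the `BOS` symbol, the function names, the string tokens, and the idea of pairing noisy frame t with the previous clean token are all illustrative assumptions. `predict_next` stands in for a trained model.

```python
from typing import Callable, List, Tuple

BOS = "<bos>"  # hypothetical start symbol for the clean-token stream

Step = Tuple[str, str]  # (noisy frame t, clean token t-1)


def prefix_layout(noisy: List[str], clean: List[str]) -> List[str]:
    """Prefix-style conditioning: all noisy frames come first, then the clean tokens."""
    return noisy + [BOS] + clean


def aligned_layout(noisy: List[str], clean: List[str]) -> List[Step]:
    """Aligned conditioning (assumed form): step t pairs noisy frame t with clean token t-1."""
    if len(noisy) != len(clean):
        raise ValueError("noisy and clean sequences must have the same length")
    previous_clean = [BOS] + clean[:-1]
    return list(zip(noisy, previous_clean))


def generate_streaming(
    noisy: List[str], predict_next: Callable[[List[Step]], str]
) -> List[str]:
    """Emit one clean token per incoming noisy frame, using only past context."""
    history: List[Step] = []
    generated: List[str] = []
    previous = BOS
    for frame in noisy:
        history.append((frame, previous))
        previous = predict_next(history)  # pure next-token prediction
        generated.append(previous)
    return generated


if __name__ == "__main__":
    noisy = ["n1", "n2", "n3"]
    clean = ["c1", "c2", "c3"]
    print(prefix_layout(noisy, clean))
    print(aligned_layout(noisy, clean))
    # Toy stand-in for a model: copy the frame index of the latest noisy frame.
    print(generate_streaming(noisy, lambda h: "c" + h[-1][0][1:]))
```

In this sketch the aligned layout has one step per frame, so a token can be emitted as soon as its frame arrives, and the sequence length grows with the input rather than doubling it. In the basic prefix layout, by contrast, the first clean token only appears after the full noisy prefix.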

Paper Structure (18 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The proposed generative framework for speech enhancement.
  • Figure 2: The proposed framework with the explicit-alignment scheme for generation. The noisy token is extracted by a noisy feature extractor. The clean token is extracted from the pre-trained TF-Codec and managed in a group-VQ manner. (a) The proposed explicit-alignment based generative model. (b) A typical prefix-based conditional generative model.
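
The Figure 2 caption says clean tokens are handled "in a group-VQ manner". As a rough illustration of what group vector quantization generally means, the sketch below splits a latent vector into equal groups and quantizes each group against its own codebook, yielding one token per group. The group count, codebook contents, and function names are hypothetical, and TF-Codec's actual quantizer is not described in this excerpt.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_index(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean distance)."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda k: squared_distance(x, codebook[k]))


def group_quantize(latent: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Split the latent into len(codebooks) equal chunks and quantize each one separately."""
    num_groups = len(codebooks)
    if num_groups == 0 or len(latent) % num_groups != 0:
        raise ValueError("latent length must be a positive multiple of the group count")
    width = len(latent) // num_groups
    tokens = []
    for g, codebook in enumerate(codebooks):
        chunk = latent[g * width:(g + 1) * width]
        tokens.append(nearest_index(chunk, codebook))
    return tokens


def group_dequantize(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Rebuild an approximate latent by concatenating the selected codebook entries."""
    latent: List[float] = []
    for token, codebook in zip(tokens, codebooks):
        latent.extend(codebook[token])
    return latent


if __name__ == "__main__":
    # Two groups, each with a tiny illustrative codebook of 2-dimensional entries.
    codebooks = [
        [(0.0, 0.0), (1.0, 1.0)],
        [(0.0, 1.0), (1.0, 0.0)],
    ]
    tokens = group_quantize([0.9, 0.8, 0.1, 0.7], codebooks)
    print(tokens)
    print(group_dequantize(tokens, codebooks))
```

With this kind of scheme, each frame is represented by a small tuple of tokens, one per group, and a generator would need to predict all of them for each frame.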