A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion Model

Yang Xiang, Canan Huang, Desheng Hu, Jingguang Tian, Xinhui Hu, Chao Zhang

TL;DR

To address limitations of conventional SE methods that directly estimate full speech signals, this work introduces SISE, which factorizes speech into semantic $Z_s$ and acoustic $Z_a$ attributes using a semantic-based codec and a diffusion-based generator. The model first predicts the semantic attribute from noisy input and then generates the corresponding acoustic attribute conditioned on the semantic token, enabling hierarchical, stepwise reconstruction via a decoder. Experiments on DNS-based noisy data and SeedTTS tasks show SISE outperforms state-of-the-art SE baselines in speech quality (DNSMOS) and enhances downstream TTS speaker similarity, particularly in far-field conditions. This approach suggests that incorporating semantic information and hierarchical diffusion-based generation yields more robust SE and potential gains for related speech processing tasks.
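The two-stage flow described above can be sketched roughly as follows. This is only an illustration of the data flow, not the authors' implementation: the function names `predict_semantic`, `generate_acoustic`, `decode`, and `enhance`, the list-of-frames representation, and the toy arithmetic standing in for the diffusion models and the codec decoder are all hypothetical placeholders.

```python
from typing import List

Frame = List[float]  # one feature frame (hypothetical representation)


def predict_semantic(noisy: List[Frame]) -> List[Frame]:
    """Stage 1 placeholder: estimate the semantic attribute Z_s from noisy input.

    A real system would run a learned semantic model here; this toy version
    simply keeps the first half of each frame.
    """
    return [frame[: len(frame) // 2] for frame in noisy]


def generate_acoustic(noisy: List[Frame], z_s: List[Frame]) -> List[Frame]:
    """Stage 2 placeholder: generate the acoustic attribute Z_a conditioned on Z_s.

    The conditioning is mimicked by shifting the second half of each noisy
    frame by the mean of the matching semantic frame.
    """
    z_a = []
    for frame, sem in zip(noisy, z_s):
        bias = sum(sem) / len(sem)
        z_a.append([x + bias for x in frame[len(frame) // 2:]])
    return z_a


def decode(z_s: List[Frame], z_a: List[Frame]) -> List[Frame]:
    """Decoder placeholder: combine Z_s and Z_a into an output representation."""
    return [sem + aco for sem, aco in zip(z_s, z_a)]


def enhance(noisy: List[Frame]) -> List[Frame]:
    """Hierarchical pipeline: semantic first, then acoustic, then decode."""
    z_s = predict_semantic(noisy)
    z_a = generate_acoustic(noisy, z_s)
    return decode(z_s, z_a)


if __name__ == "__main__":
    print(enhance([[0.0, 2.0, 1.0, 3.0], [4.0, 4.0, 0.0, 1.0]]))
```

The point of the sketch is the ordering: the acoustic stage never runs until a semantic estimate exists, so the second stage can rely on the first instead of estimating the full signal in one step.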

Abstract

Most current speech enhancement (SE) methods recover clean speech from noisy inputs by directly estimating time-frequency masks or spectra. However, these approaches often neglect the distinct attributes inherent in speech signals, such as semantic content and acoustic details, which can hinder performance in downstream tasks. Moreover, their effectiveness tends to degrade in complex acoustic environments. To overcome these challenges, we propose a novel, semantic information-based, step-by-step factorized SE method using a factorized codec and a diffusion model. Unlike traditional SE methods, our hierarchical modeling of semantic and acoustic attributes enables more robust clean speech recovery, particularly in challenging acoustic scenarios. This method also offers further advantages for downstream TTS tasks. Experimental results demonstrate that our algorithm not only outperforms SOTA baselines in terms of speech quality but also enhances TTS performance in noisy environments.

Paper Structure

This paper contains 6 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the SISE framework with a factorized codec and a factorized diffusion model. The red part is used in both training and inference, while the black part is used only during training.
  • Figure 2: Overview of the training diagram of the factorized diffusion model, which consists of a semantic diffusion, an acoustic diffusion, and a noisy encoder.