DDTSE: Discriminative Diffusion Model for Target Speech Extraction

Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian

TL;DR

DDTSE tackles target speech extraction (TSE) in multi-speaker noisy environments by combining the diffusion forward process with a discriminative reconstruction objective. It introduces a two-stage training strategy and provides two inference modes (DDTSE-only and X+DDTSE) that trade off speech quality against speed, achieving up to 3x faster inference than conventional diffusion methods while improving perceptual quality. The approach outperforms discriminative and diffusion baselines in both multi- and single-speaker scenarios, and it can be plugged into existing discriminative TSE pipelines without retraining them. The work moves diffusion-based TSE toward faster, high-fidelity extraction by uniting generative diffusion concepts with discriminative optimization and flexible inference strategies.

Abstract

Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, target speech extraction under multi-speaker noisy conditions remains relatively unexplored. Moreover, the superior quality of diffusion methods typically comes at the cost of slower inference. In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and use a reconstruction loss similar to that of discriminative methods. Furthermore, we devise a two-stage training strategy to emulate the inference process during model training. DDTSE works as a standalone system and can also further improve the performance of discriminative models without additional retraining. Experimental results demonstrate that DDTSE not only achieves higher perceptual quality but also runs inference 3 times faster than the conventional diffusion model.

Paper Structure (22 sections, 10 equations, 3 figures, 5 tables, 4 algorithms)

Figures (3)

  • Figure 1: Comparison of a score-based diffusion model, a discriminative model, and our proposed model. The x-axis represents the timestep. (a) and (b) are the forward and reverse processes of the score-based diffusion model [lemercier2023storm, richter2023speech]. (c) is the inference process of the discriminative method with one-step prediction. (d) is the inference process of our proposed DDTSE-only mode. The solid gray line is the model prediction at each step. The dashed gray line and the dotted circles are the results obtained by adding noise according to Eqs. \ref{eq:mu} and \ref{eq:sigma}.
  • Figure 2: The overall architecture of DDTSE. Left: The model architecture. Right: The (up/down sample) residual block in this model.
  • Figure 3: Comparison of DNSMOS distributions between X+DDTSE and the corresponding discriminative model X (DPCCN and NCSN++) in noisy and clean scenarios. Higher values are better.
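
The DDTSE-only inference loop described in the Figure 1(d) caption (predict the clean signal, then re-noise it to the next timestep) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the equations labelled eq:mu and eq:sigma are not reproduced on this page, so the sketch assumes the closed-form mean and standard deviation of the forward process used in the cited score-based speech enhancement work. The names `mean`, `sigma`, `predict_clean`, and `ddtse_only_inference`, the hyperparameter values, and the step count are all hypothetical. The sketch also uses real-valued arrays and omits the target-speaker conditioning that a TSE model needs.

```python
import numpy as np

# Assumed hyperparameters; illustrative values, not taken from the paper.
GAMMA, SIGMA_MIN, SIGMA_MAX = 1.5, 0.05, 0.5


def mean(x0, y, t):
    """Assumed mean of the perturbed state: moves from x0 (t=0) toward the mixture y."""
    w = np.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y


def sigma(t):
    """Assumed noise standard deviation at time t; equals 0 at t=0."""
    log_ratio = np.log(SIGMA_MAX / SIGMA_MIN)
    growth = (SIGMA_MAX / SIGMA_MIN) ** (2 * t) - np.exp(-2 * GAMMA * t)
    var = SIGMA_MIN**2 * growth * log_ratio / (GAMMA + log_ratio)
    return np.sqrt(var)


def ddtse_only_inference(predict_clean, y, num_steps=3, t_max=1.0, rng=None):
    """Alternate clean-signal prediction and re-noising over a few timesteps.

    predict_clean(x_t, y, t) is a hypothetical stand-in for the trained network.
    """
    rng = np.random.default_rng() if rng is None else rng
    ts = np.linspace(t_max, 0.0, num_steps + 1)
    # Start from the mixture plus noise at the largest timestep (an assumption).
    x_t = y + sigma(ts[0]) * rng.standard_normal(y.shape)
    x0_hat = x_t
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Solid line in Figure 1(d): the model's direct estimate of clean speech.
        x0_hat = predict_clean(x_t, y, t)
        # Dashed line: re-noise the estimate to the next, smaller timestep.
        noise = rng.standard_normal(y.shape)
        x_t = mean(x0_hat, y, t_next) + sigma(t_next) * noise
    return x0_hat
```

Because the model predicts the clean signal directly at every step, the number of steps can be kept small, which is one plausible reading of where the reported speedup over a conventional reverse diffusion process comes from. In the X+DDTSE mode, the output of an existing discriminative model X would presumably replace the first prediction; that variant is not shown here.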