Aligning Generative Speech Enhancement with Perceptual Feedback

Haoyang Li, Nana Hou, Yuchen Hu, Jixun Yao, Sabato Marco Siniscalchi, Xuyi Zhuang, Deheng Ye, Wei Yang, Eng Siong Chng

TL;DR

This work tackles the misalignment between token-level objectives and human perceptual preferences in LM-based speech enhancement. It introduces GSEPF, a perceptually aligned two-stage LM framework that uses Direct Preference Optimization guided by a neural MOS predictor (UTMOS) to directly optimize perceptual quality. Experiments on the DNS 2020 test sets show consistent gains in objective perceptual metrics (e.g., DNSMOS, UTMOS, NISQA), and an A/B listening test on naturalness and listening comfort favors the proposed model over the GenSE* baseline (378 vs. 222 votes). By integrating perceptual feedback with a simple, scalable training pipeline, the paper argues for perceptually driven enhancement in LM-based SE and suggests avenues for extending alignment to speaker similarity and multi-objective goals.

Abstract

Language Model (LM)-based speech enhancement (SE) has recently emerged as a promising direction, but existing approaches predominantly rely on token-level likelihood objectives that weakly reflect human perception. This mismatch limits progress, as optimizing signal accuracy does not always improve naturalness or listening comfort. We address this gap by introducing a perceptually aligned LM-based SE approach. Our method applies Direct Preference Optimization (DPO) with UTMOS, a neural MOS predictor, as a proxy for human ratings, directly steering models toward perceptually preferred outputs. This design directly connects model training to perceptual quality and is broadly applicable within LM-based SE frameworks. On the Deep Noise Suppression Challenge 2020 test sets, our approach consistently improves speech quality metrics, achieving relative gains of up to 56%. To our knowledge, this is the first integration of perceptual feedback into LM-based SE and the first application of DPO in the SE domain, establishing a new paradigm for perceptually aligned enhancement with SE.
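The abstract names Direct Preference Optimization but does not spell out the objective. As a minimal sketch, the block below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the trained policy and the frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_pos / logp_neg: log-probability of the preferred / rejected token
    sequence under the policy being trained.
    ref_logp_pos / ref_logp_neg: the same quantities under the frozen
    reference model.
    beta: strength of the implicit KL regularization (assumed value).
    """
    # Implicit reward margin between preferred and rejected outputs.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the preferred output than the reference does.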

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Pipeline to obtain preference pairs $A^+$ and $A^-$ from the reference S2S LM $\pi_\text{ref}$ during training. $A^+$ and $A^-$ are disjoint subsets of $\{\hat{A}^{(n)}\}_{n=1}^N$, each of size $Z$, such that $A^+ \cap A^- = \emptyset$.
  • Figure 2: A/B test on naturalness and listening comfort between GenSE* and GSEPF$_\text{CE+DPO}$. The proposed method received 378 votes vs. 222 for the baseline, winning 23/30 cases.
  • Figure 3: Example case study: GenSE* vs. proposed GSEPF$_\text{CE+DPO}$ (mel-spectrograms). The proposed method reduces artifacts and better preserves speech harmonics compared to the baseline.
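The Figure 1 caption describes two disjoint size-$Z$ subsets $A^+$ and $A^-$ drawn from $N$ sampled candidates. One plausible way to build them is sketched below, under the assumption that candidates are ranked by a MOS-style score and that the top $Z$ and bottom $Z$ are kept. The caption does not state the selection rule, so that rule is an assumption here. `score_fn` is a hypothetical stand-in for a predictor such as UTMOS, not a real API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_preference_sets(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
    z: int,
) -> Tuple[List[T], List[T]]:
    """Split N candidates into preferred and rejected sets of size z.

    Assumed rule: rank by score, take the top z as A+ and the bottom z as A-.
    Requiring 2 * z <= N guarantees the two sets never share a candidate.
    """
    n = len(candidates)
    if z < 1 or 2 * z > n:
        raise ValueError("need 1 <= z and 2 * z <= number of candidates")
    # Highest-scoring candidates first.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:z], ranked[-z:]


# Usage with toy data: each candidate is represented directly by its score.
toy_scores = [3.1, 4.2, 2.5, 3.8]
a_pos, a_neg = select_preference_sets(toy_scores, lambda s: s, z=1)
```

Each element of `a_pos` can then be paired with an element of `a_neg` to form the preferred and rejected examples used in preference training.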