Aligning Generative Speech Enhancement with Perceptual Feedback
Haoyang Li, Nana Hou, Yuchen Hu, Jixun Yao, Sabato Marco Siniscalchi, Xuyi Zhuang, Deheng Ye, Wei Yang, Eng Siong Chng
TL;DR
This work tackles the misalignment between token-level training objectives and human perceptual preferences in LM-based speech enhancement (SE). It introduces GSEPF, a perceptually aligned two-stage LM framework that applies Direct Preference Optimization (DPO), guided by the neural MOS predictor UTMOS, to optimize perceptual quality directly. Experiments on the DNS 2020 test sets show consistent gains in objective perceptual metrics (e.g., DNSMOS, UTMOS, NISQA) and clear subjective improvements, including a user preference advantage in A/B tests. By integrating perceptual feedback into a simple, scalable training pipeline, the paper points to a shift toward perceptually driven enhancement in LM-based SE and suggests extending alignment to speaker similarity and multi-objective goals.
Abstract
Language Model (LM)-based speech enhancement (SE) has recently emerged as a promising direction, but existing approaches rely predominantly on token-level likelihood objectives that only weakly reflect human perception. This mismatch limits progress, because optimizing signal accuracy does not always improve naturalness or listening comfort. We address this gap by introducing a perceptually aligned LM-based SE approach. Our method applies Direct Preference Optimization (DPO) with UTMOS, a neural MOS predictor, as a proxy for human ratings, steering models toward perceptually preferred outputs. This design ties model training directly to perceptual quality and is broadly applicable within LM-based SE frameworks. On the Deep Noise Suppression Challenge 2020 test sets, our approach consistently improves speech quality metrics, achieving relative gains of up to 56%. To our knowledge, this is the first integration of perceptual feedback into LM-based SE and the first application of DPO in the SE domain, establishing a new paradigm for perceptually aligned SE.
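To make the training signal concrete, the following is a minimal sketch of the standard DPO objective applied to one preference pair ranked by a MOS-style score. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how preference pairs are built, so the best-versus-worst selection rule, the function names `select_preference_pair` and `dpo_loss`, the value `beta=0.1`, and the example scores are all hypothetical. In practice the scores would come from a predictor such as UTMOS and the log-probabilities from the policy and a frozen reference LM.

```python
import math
from typing import List, Tuple


def select_preference_pair(scores: List[float]) -> Tuple[int, int]:
    """Return (chosen_index, rejected_index) from per-candidate quality scores.

    Assumption: the highest-scoring candidate is treated as preferred and the
    lowest-scoring one as rejected. The source does not specify this rule.
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidates to form a pair")
    chosen = max(range(len(scores)), key=lambda i: scores[i])
    rejected = min(range(len(scores)), key=lambda i: scores[i])
    return chosen, rejected


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical value; not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin).

    Each argument is the summed token log-probability of a full output
    sequence under the trainable policy or the frozen reference model.
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Placeholder scores for three enhanced candidates of the same noisy input.
candidate_scores = [3.1, 4.2, 2.5]
chosen_idx, rejected_idx = select_preference_pair(candidate_scores)

# Placeholder sequence log-probabilities for the chosen and rejected outputs.
loss = dpo_loss(
    policy_logp_chosen=-1.0,
    policy_logp_rejected=-3.0,
    ref_logp_chosen=-2.0,
    ref_logp_rejected=-2.0,
)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability mass toward the higher-scored output relative to the reference, which is how a perceptual score can shape training without a separately trained reward model.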
