The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement
Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann
TL;DR
This work questions the reliability of optimizing speech enhancement models purely for PESQ, viewing the practice through Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Using a differentiable PESQ loss paired with an SDR term in a reduced NCSN++ framework on the VoiceBank-DEMAND (VB-DMD) dataset, the authors show that a PESQ-maximizing model (the PESQetarian) can reach state-of-the-art PESQ scores while sounding worse to human listeners. They further expose vulnerabilities of PESQ through a "click trick" that artificially inflates the score, and demonstrate that per-utterance oracle PESQ optimization raises PESQ at the expense of SI-SDR and overall perceptual quality. The study advocates multi-metric evaluation, listening tests, and the development of more robust, perceptually aligned evaluation strategies to prevent metric exploitation in speech enhancement.
Abstract
To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not capture. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation. For this, we introduce enhancement models that exploit the widely used PESQ measure. Our "PESQetarian" model achieves 3.82 PESQ on VB-DMD while scoring very poorly in a listening experiment. While this value would imply "state-of-the-art" PESQ performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading. Instead, other metrics should be included in the evaluation and the resulting performance predictions should be confirmed by listening.
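As an illustration of reporting a second metric alongside PESQ, the following is a minimal sketch of SI-SDR (scale-invariant signal-to-distortion ratio), the metric named in the TL;DR. It uses the commonly cited definition with zero-mean signals and is not taken from the authors' code; the function name `si_sdr`, the `eps` stabilizer, and the synthetic test signals are assumptions made for this example. PESQ is not computed here because it requires an external implementation.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (illustrative sketch, not the authors' code)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    # Synthetic stand-ins for clean and enhanced speech (assumption: white noise).
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    noise = rng.standard_normal(16000)

    mild = clean + 0.1 * noise    # lightly distorted estimate
    heavy = clean + 0.5 * noise   # heavily distorted estimate

    print(f"SI-SDR, mild distortion:  {si_sdr(mild, clean):.2f} dB")
    print(f"SI-SDR, heavy distortion: {si_sdr(heavy, clean):.2f} dB")
```

Because SI-SDR measures waveform fidelity rather than predicted perceptual quality, a model that gains PESQ while losing SI-SDR is a signal worth investigating by listening, which is the kind of cross-check the abstract recommends.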
