The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement

Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann

TL;DR

This work questions the reliability of optimizing speech-enhancement models purely for PESQ by applying Goodhart's law to signal processing. Using a differentiable PESQ loss paired with an SDR term in a reduced NCSN++ framework on the VB-DMD dataset, the authors show that a PESQ-maximizing model (the PESQetarian) can achieve state-of-the-art PESQ scores while sounding worse to human listeners. They additionally reveal vulnerabilities of PESQ through a 'click trick' that artificially inflates PESQ, and demonstrate that per-utterance oracle PESQ optimization yields PESQ gains at the expense of SI-SDR and overall perceptual quality, underscoring the ill effects of metric-only optimization. The study advocates for multi-metric evaluation, listening tests, and the development of more robust, perceptually aligned evaluation strategies to prevent metric exploitation in speech enhancement.

Abstract

To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not see. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation. For this, we introduce enhancement models that exploit the widely used PESQ measure. Our "PESQetarian" model achieves 3.82 PESQ on VB-DMD while scoring very poorly in a listening experiment. While the obtained PESQ value of 3.82 would imply "state-of-the-art" PESQ-performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading. Instead, other metrics should be included in the evaluation and the resulting performance predictions should be confirmed by listening.

Paper Structure

This paper contains 12 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Spectrograms of estimates produced by the models for a given utterance, accompanied by the corresponding PESQ values. The spectrograms were padded to allow for visualization of the click produced by the PESQ-SDR model.
  • Figure 2: Results of the listening experiment. Due to the "clicks" introduced by the PESQ-SDR model (see the section on the click trick), the audio was processed to make the utterances audible, as described in the evaluation section.
  • Figure 3: torchPESQ loss, PESQ, and SI-SDR for each iteration of the single-utterance oracle PESQ optimization procedure. Each line represents one utterance. While large gains in PESQ are achieved, SI-SDR degrades consistently and substantially.
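
Figure 3 tracks SI-SDR alongside PESQ, so it may help to recall what SI-SDR measures. The following is a minimal, dependency-free sketch of the commonly used scale-invariant SDR definition; it is not the authors' evaluation code, and the function name `si_sdr`, the `eps` constant, and the synthetic sinusoid signals are illustrative assumptions.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Remove the mean so a DC offset does not influence the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference: the optimally scaled target.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)
    target = [alpha * b for b in r]
    # Everything not explained by the scaled reference counts as distortion.
    residual = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

# Synthetic example (assumed signals, not data from the paper).
n = 200
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
distortion = [math.sin(2 * math.pi * 37 * i / n) for i in range(n)]

mild = [s + 0.1 * d for s, d in zip(reference, distortion)]
strong = [s + 0.5 * d for s, d in zip(reference, distortion)]
rescaled = [2.0 * x for x in mild]

score_mild = si_sdr(mild, reference)          # about 20 dB
score_strong = si_sdr(strong, reference)      # lower: more distortion
score_rescaled = si_sdr(rescaled, reference)  # same as score_mild
```

Because the reference is rescaled before the ratio is taken, changing the gain of the estimate leaves the score untouched, whereas added distortion lowers it. A drop in SI-SDR such as the one reported in Figure 3 therefore indicates that the waveform itself moved away from the clean reference, which is the kind of side effect a PESQ-only evaluation would not reveal.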