Loud-loss: A Perceptually Motivated Loss Function for Speech Enhancement Based on Equal-Loudness Contours

Zixuan Li; Xueliang Zhang; Changjiang Zhao; Shuai Gao; Lei Miao; Zhipeng Yan; Ying Sun; Chong Zhu

TL;DR

This paper addresses the mismatch between traditional mean squared error (MSE) losses and human auditory perception in speech enhancement by introducing Loud-loss, a perceptually motivated loss based on equal-loudness contours. The method works in four stages: it takes the log-power spectrum, partitions it into Mel-scale sub-bands, computes the MSE within each sub-band, and applies psychoacoustic weights to form the final weighted loss. On VoiceBank+DEMAND, replacing MSE with Loud-loss in GTCRN raises WB-PESQ from $2.17$ to $2.93$, and ESTOI also improves. The loss is model-agnostic and applies to both mapping and masking architectures. The results indicate that aligning the training objective with psychoacoustic principles yields perceptual gains and can complement or surpass traditional losses, which makes it a practical option for capacity-constrained speech enhancement systems.
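The four stages described above can be sketched for a single spectral frame as follows. This is only an illustration under stated assumptions, not the authors' implementation: the HTK-style Mel formula, the natural-log power spectrum, the normalization by the sum of the weights, and all function names (`mel_band_edges`, `loud_loss_sketch`) are hypothetical choices, and the band weights are passed in rather than derived from equal-loudness contours.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style Mel formula (an assumption; other Mel variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bins, sample_rate, n_bands):
    """Return n_bands + 1 bin indices that split [0, n_bins) evenly on the Mel scale."""
    assert 1 <= n_bands <= n_bins
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    edges = [round(mel_to_hz(max_mel * b / n_bands) / nyquist * n_bins)
             for b in range(n_bands + 1)]
    # Force strictly increasing edges so that every band holds at least one bin.
    for i in range(1, n_bands + 1):
        cap = n_bins - (n_bands - i)
        edges[i] = min(max(edges[i], edges[i - 1] + 1), cap)
    return edges

def loud_loss_sketch(est_power, ref_power, band_weights, sample_rate=16000, eps=1e-8):
    """Weighted sum of per-band MSEs on the log-power spectrum of one frame."""
    # Stage 1: log-power spectrum (eps avoids log(0)).
    est_lps = [math.log(p + eps) for p in est_power]
    ref_lps = [math.log(p + eps) for p in ref_power]
    # Stage 2: Mel-scale sub-bands.
    edges = mel_band_edges(len(ref_lps), sample_rate, len(band_weights))
    total = 0.0
    for b, w in enumerate(band_weights):
        lo, hi = edges[b], edges[b + 1]
        # Stage 3: MSE inside the sub-band.
        band_mse = sum((e - r) ** 2
                       for e, r in zip(est_lps[lo:hi], ref_lps[lo:hi])) / (hi - lo)
        # Stage 4: psychoacoustic weighting (weights supplied by the caller).
        total += w * band_mse
    # Normalizing by the weight sum is an assumption made for this sketch.
    return total / sum(band_weights)

ref = [1.0] * 16
est = ref[:-1] + [4.0]  # error only in the highest-frequency bin
loss_high_emphasis = loud_loss_sketch(est, ref, [1.0, 1.0, 1.0, 4.0])
loss_low_emphasis = loud_loss_sketch(est, ref, [4.0, 1.0, 1.0, 1.0])
```

With the same high-frequency error, the loss is larger when the weights emphasize the top band, which is the behavior a perceptual weighting is meant to provide.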

Abstract

The mean squared error (MSE) is a ubiquitous loss function for speech enhancement, but it does not reflect perceived auditory quality. MSE causes models to over-emphasize low-frequency components, which have high energy, leading to inadequate modeling of perceptually important high-frequency information. To overcome this limitation, we propose a perceptually weighted loss function grounded in psychoacoustic principles. Specifically, it leverages equal-loudness contours to assign frequency-dependent weights to the reconstruction error, thereby penalizing deviations in a way that aligns with human auditory sensitivity. The proposed loss is model-agnostic and flexible, demonstrating strong generality. Experiments on the VoiceBank+DEMAND dataset show that replacing MSE with our loss in a GTCRN model raises the WB-PESQ score from 2.17 to 2.93, a significant improvement in perceptual quality.
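To give a feel for what frequency-dependent weights of this kind might look like, the sketch below uses the standard A-weighting curve as a stand-in. A-weighting is a different, simpler technique than the paper's contour-based weights: it approximates the inverse of the 40-phon equal-loudness contour, and it is used here only because it has a closed form. The helper names (`a_weighting_db`, `band_weights_from_a_weighting`) and the example center frequencies are hypothetical.

```python
import math

def a_weighting_db(f_hz):
    """A-weighting gain in dB at frequency f_hz (f_hz must be > 0)."""
    f2 = f_hz * f_hz
    num = (12194.0 ** 2) * f2 * f2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    # The +2.0 dB offset normalizes the curve to roughly 0 dB at 1 kHz.
    return 20.0 * math.log10(num / den) + 2.0

def band_weights_from_a_weighting(center_freqs_hz):
    """Convert dB gains to linear weights that sum to 1 (a choice made for this sketch)."""
    gains = [10.0 ** (a_weighting_db(f) / 20.0) for f in center_freqs_hz]
    total = sum(gains)
    return [g / total for g in gains]

# Hypothetical sub-band center frequencies in Hz.
centers = [100.0, 500.0, 1000.0, 3000.0, 6000.0]
weights = band_weights_from_a_weighting(centers)
```

Under this stand-in, the 100 Hz band receives a much smaller weight than the 1 kHz to 3 kHz bands, which counteracts the tendency of plain MSE to be dominated by high-energy low-frequency components.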
