Enhanced Generative Machine Listener

Vishnu Raj; Gouthaman KV; Shiv Gehlot; Lars Villemoes; Arijit Biswas

TL;DR

GMLv2 tackles the need for scalable, uncertainty-aware perceptual audio quality evaluation by modeling MUSHRA scores with a Beta distribution whose parameters $(\alpha,\beta)$ are predicted by a neural network from Gammatone spectrogram features. The Beta-loss framework enforces unimodality, respects the bounded range of MUSHRA scores, and yields both an expected quality score and an interpretable uncertainty estimate. The training data are expanded to include Neural Audio Coding (NAC) datasets alongside traditional codecs, enabling strong generalization across content types and codecs. Empirical results show higher correlation with subjective scores and lower outlier rates than PEAQ, ViSQOL, and GMLv1, indicating robust, calibrated predictions suitable for modern audio coding development. Overall, GMLv2 provides a scalable, uncertainty-aware, reference-based metric that aligns closely with human judgments and supports automated evaluation across diverse audio pipelines.
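
For reference, the quantities mentioned above can be written out with the standard Beta density. The normalization $y = s/100$ of a MUSHRA score $s \in [0, 100]$ is an assumption made here for illustration; the summary above does not state how scores are mapped to the unit interval.

$$p(y \mid \alpha, \beta) = \frac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

$$\mathbb{E}[y] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}[y] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

The mean gives the expected quality and the variance gives a measure of uncertainty. A Beta density is unimodal with an interior mode at $(\alpha-1)/(\alpha+\beta-2)$ when $\alpha > 1$ and $\beta > 1$; how GMLv2 constrains its predicted parameters to achieve unimodality is not specified here.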

Abstract

We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse test sets demonstrate that the proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.
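
To make the idea of a Beta distribution-based loss concrete, the following is a minimal sketch of a Beta negative log-likelihood for a single listener rating, using only the Python standard library. It is not the authors' implementation: the function names `beta_nll` and `beta_summary`, the division of scores by 100, and the clamping constant `eps` are all assumptions introduced for illustration.

```python
import math


def beta_nll(score, alpha, beta, eps=1e-4):
    """Negative log-likelihood of one MUSHRA score (0-100) under Beta(alpha, beta).

    Hypothetical helper: the rescaling by 100 and the eps clamp are assumptions.
    """
    # Map the score to (0, 1) and clamp away from the endpoints,
    # where the log-density can diverge.
    y = min(max(score / 100.0, eps), 1.0 - eps)
    # log B(alpha, beta), computed with log-gamma for numerical stability.
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_pdf = (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log(1.0 - y) - log_norm
    return -log_pdf


def beta_summary(alpha, beta):
    """Return (expected score, standard deviation) on the 0-100 scale."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return 100.0 * mean, 100.0 * math.sqrt(var)


if __name__ == "__main__":
    # Illustrative parameter values only; in a trained model these would be
    # the network's outputs for one reference/test signal pair.
    alpha, beta = 8.0, 2.0
    print(beta_summary(alpha, beta))
    print(beta_nll(80.0, alpha, beta), beta_nll(20.0, alpha, beta))
```

In training, a loss of this kind would typically be averaged over all listener ratings for an item, so that ratings near the predicted mode are cheap and ratings far from it are penalized. At inference time, the mean and standard deviation from `beta_summary` would serve as the predicted score and its uncertainty.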
