MMMOS: Multi-domain Multi-axis Audio Quality Assessment

Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee

TL;DR

MMMOS addresses the limitation of single-score audio quality models by introducing a no-reference framework that estimates four orthogonal axes across speech, music, and environmental sounds. It fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, M2D), explores multiple aggregation strategies and four loss functions, and then ensembles the top models. The approach reduces prediction error ($\text{MSE}$) by 20–30% and improves ordinal consistency (Kendall's $\tau$) by 4–5% on AudioMOS Challenge metrics, with strong performance on Production Complexity and the other axes. The work demonstrates improved robustness and cross-domain generalization by combining diverse audio-domain representations with selective ensembling, offering a practical framework for future cross-domain audio quality prediction.

Abstract

Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ over the baseline, takes first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
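
As a rough illustration of how ensembling interacts with the two reported metrics, the sketch below averages the per-clip predictions of several models and scores the result with mean squared error and Kendall's τ. It rests on several assumptions that the abstract does not confirm: the ensemble is taken to be a plain unweighted mean, τ is computed as the tau-a variant (ties count as neither concordant nor discordant), and the function names and toy scores are hypothetical, not the paper's code or data.

```python
from itertools import combinations
from statistics import mean


def ensemble_mean(predictions_per_model):
    """Unweighted average of per-clip scores across models (assumed ensembling rule)."""
    return [mean(scores) for scores in zip(*predictions_per_model)]


def mean_squared_error(y_true, y_pred):
    """Average squared difference between reference and predicted scores."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def kendall_tau_a(y_true, y_pred):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    pairs = list(combinations(range(len(y_true)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        # Positive product: both rankings order the pair the same way.
        agreement = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Toy, made-up scores for four clips on a single axis.
human = [1.0, 2.0, 3.0, 4.0]
model_a = [2.0, 1.0, 3.0, 5.0]
model_b = [1.0, 3.0, 2.0, 4.0]
ensemble = ensemble_mean([model_a, model_b])

for name, pred in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(name, mean_squared_error(human, pred), kendall_tau_a(human, pred))
```

In this toy case each model misorders a different pair of clips, so averaging cancels the two errors; real gains depend on how correlated the individual models' errors are.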

Paper Structure

This paper contains 17 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Model architecture of MMMOS. BLSTM is an optional component depending on the aggregation method (see the model subsection).