MMMOS: Multi-domain Multi-axis Audio Quality Assessment
Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee
TL;DR
MMMOS addresses the limitation of single-score audio quality models by introducing a no-reference framework that estimates four orthogonal axes across speech, music, and environmental sounds. It fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, M2D), explores three aggregation strategies and four loss functions, and ensembles the top eight models. Relative to the baseline, the approach reduces prediction error ($\text{MSE}$) by 20–30% and improves ordinal consistency (Kendall's $\tau$) by 4–5% on AudioMOS Challenge metrics, including first place on six of eight Production Complexity metrics. The work demonstrates improved robustness and cross-domain generalization by combining diverse audio-domain representations with selective ensembling, offering a practical framework for future cross-domain audio quality prediction.
Abstract
Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS achieves a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ over the baseline, takes first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
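To make the two reported metrics concrete, the following minimal sketch computes mean squared error and Kendall's τ for per-clip scores on a single axis, and shows how averaging the predictions of several models can lower error. It is an illustration only: the abstract does not state the ensembling rule or the τ variant, so simple averaging (`ensemble_mean`) and the tie-ignoring τ-a form (`kendall_tau_a`) are assumptions, and all function names and scores are hypothetical.

```python
from itertools import combinations


def ensemble_mean(model_preds):
    """Average per-clip predictions across models (assumed ensembling rule)."""
    n_models = len(model_preds)
    return [sum(clip_scores) / n_models for clip_scores in zip(*model_preds)]


def mse(pred, target):
    """Mean squared error between predicted and human scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def kendall_tau_a(pred, target):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant (assumed variant).
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(target)), 2):
        sign = (pred[i] - pred[j]) * (target[i] - target[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(target) * (len(target) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical human ratings and model outputs for four clips on one axis.
human = [2.0, 4.0, 6.0, 8.0]
model_a = [3.0, 3.0, 7.0, 9.0]
model_b = [1.0, 5.0, 5.0, 7.0]

fused = ensemble_mean([model_a, model_b])
print("MSE  model_a:", mse(model_a, human), "ensemble:", mse(fused, human))
print("tau  model_a:", kendall_tau_a(model_a, human),
      "ensemble:", kendall_tau_a(fused, human))
```

In this toy data the two models make opposite errors, so averaging cancels them; real ensembles would typically show smaller gains.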
