The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

Wen-Chin Huang; Szu-Wei Fu; Erica Cooper; Ryandhimas E. Zezario; Tomoki Toda; Hsin-Min Wang; Junichi Yamagishi; Yu Tsao

TL;DR

This paper presents the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings, and reports that many of the participating teams outperformed the baseline systems.

Abstract

We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.

Paper Structure (20 sections, 3 figures, 6 tables)

Introduction
Challenge Description
Tracks and datasets
Track 1: MOS prediction for "zoomed-in" systems
Track 2: MOS prediction for singing voice
Track 3: Semi-supervised MOS prediction for noisy, clean, and enhanced speech
Challenge rules and phases
Participants and baseline systems
Results
Evaluation metrics
Track 1 results
Track 2 results
Track 3 results
Analysis of the participating systems
Datasets
...and 5 more sections

Figures (3)

Figure 1: Bar plot of system-level SRCC values of all participants in track 1.
Figure 2: Bar plot of system-level SRCC values of all participants in track 2.
Figure 3: Bar plot of utterance-level LCC values of SIG, BAK and OVRL of all participants in track 3.
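
The figure captions refer to system-level SRCC and utterance-level LCC. As a rough illustration of how the two granularities differ, the sketch below assumes the usual definitions: SRCC is the Spearman rank correlation and LCC is the Pearson linear correlation between true and predicted scores, and "system-level" means the scores are first averaged per system. The function names and the toy scores are hypothetical and are not taken from the challenge's evaluation code.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Linear correlation coefficient (LCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def system_means(system_ids, scores):
    """Average the utterance scores belonging to each system."""
    grouped = defaultdict(list)
    for system_id, score in zip(system_ids, scores):
        grouped[system_id].append(score)
    return {sid: sum(v) / len(v) for sid, v in grouped.items()}


def system_level_srcc(system_ids, true_scores, predicted_scores):
    """SRCC computed on per-system mean scores instead of per utterance."""
    true_means = system_means(system_ids, true_scores)
    pred_means = system_means(system_ids, predicted_scores)
    systems = sorted(true_means)
    return spearman([true_means[s] for s in systems],
                    [pred_means[s] for s in systems])


# Toy data (hypothetical): two utterances from each of three systems.
systems = ["A", "A", "B", "B", "C", "C"]
true_mos = [4.0, 4.5, 3.0, 3.5, 2.0, 2.5]
pred_mos = [3.9, 4.1, 3.2, 3.0, 2.8, 2.6]

utterance_lcc = pearson(true_mos, pred_mos)
system_srcc = system_level_srcc(systems, true_mos, pred_mos)
```

In this toy case the predictor ranks the three systems in the correct order, so the system-level SRCC is perfect even though the utterance-level LCC is not. This is one reason the two granularities are typically reported separately.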
