Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M. Khapra

TL;DR

The paper reassesses subjective evaluation for modern TTS and identifies two key flaws in MUSHRA: reference-matching bias and judgement ambiguity. It introduces two refined variants, MUSHRA-NMR and MUSHRA-DG, and validates them on MANGO, a large Hindi-Tamil dataset of 246,000 ratings from 492 listeners, with additional cross-language validation on English. The proposed variants yield more reliable, fine-grained assessments while preserving system rankings, and MANGO is released to support future metric development. Together, these contributions offer a practical framework for robust TTS evaluation and a more nuanced view of where SOTA systems excel or fail.

Abstract

Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 246,000 human ratings and the first collection of its kind for Indian languages, to aid in analyzing human preferences and developing automatic metrics for evaluating TTS systems.

Paper Structure

This paper contains 36 sections, 2 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Visualization of the MUSHRA score distributions per rater across three systems (FS2, ST2, and VITS), along with the reference (REF) and anchor (ANC), for Hindi. Each boxplot represents one rater's ratings (0-100) for a system across all test utterances. While some boxplots exhibit relatively low variance, several display substantial height, indicating that certain raters assign widely varying scores to the same system across utterances. The variation in the boxplot means across raters suggests a high level of inter-rater variance. Raters are sorted in ascending order of their mean scores for the reference.
  • Figure 2: Rank correlation of mean scores obtained using subsets of listeners and utterances and mean scores obtained using all listeners and utterances in Hindi.
  • Figure 3: MUSHRA scores in Hindi change in magnitude, but the ranking of systems stays the same, when raters who rate the reference $\le \lambda$ for more than $15\%$ of utterances are rejected. R is the number of raters retained.
  • Figure 4: (Left) Correlation between scores from a subset of listeners and all listeners. (Right) Correlation between scores from a subset of utterances and all utterances.
  • Figure 5: Visualization of the six objective and three perceptual dimensions of the MUSHRA-DG test. The objective scores are represented using stacked bars, where multiple error categories are displayed cumulatively rather than as independent percentages. The subjective dimensions are represented using a scatter plot with scores ranging from 0 to 100.
  • ...and 11 more figures
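
The rater-rejection rule in the Figure 3 caption (drop raters who score the reference at or below $\lambda$ on more than 15% of utterances) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the data layout, the example ratings, and the choice of $\lambda = 80$ are all assumptions made for the example.

```python
from collections import defaultdict


def retained_raters(ref_scores, lam, max_fraction=0.15):
    """Return raters who pass the reference check (hypothetical helper).

    ref_scores maps rater_id -> list of scores (0-100) that the rater gave
    to the reference, one per utterance. A rater is rejected when the share
    of reference scores <= lam exceeds max_fraction.
    """
    kept = []
    for rater, scores in ref_scores.items():
        if not scores:
            continue  # no reference ratings: nothing to judge the rater on
        low = sum(1 for s in scores if s <= lam)
        if low / len(scores) <= max_fraction:
            kept.append(rater)
    return sorted(kept)


def mean_system_scores(ratings, raters):
    """Mean score per system, using only ratings from the given raters.

    ratings is a list of (rater_id, system, score) tuples.
    """
    allowed = set(raters)
    totals = defaultdict(lambda: [0.0, 0])  # system -> [sum, count]
    for rater, system, score in ratings:
        if rater in allowed:
            totals[system][0] += score
            totals[system][1] += 1
    return {system: total / n for system, (total, n) in totals.items()}


# Made-up example data: ten reference ratings per rater.
ref_scores = {
    "r1": [95, 90, 100, 92, 98, 91, 97, 99, 94, 96],  # 0/10 low
    "r2": [95, 60, 70, 92, 98, 91, 97, 99, 94, 96],   # 2/10 low -> rejected
    "r3": [95, 60, 100, 92, 98, 91, 97, 99, 94, 96],  # 1/10 low
}
ratings = [
    ("r1", "VITS", 80), ("r2", "VITS", 20), ("r3", "VITS", 70),
    ("r1", "FS2", 50), ("r3", "FS2", 40),
]

kept = retained_raters(ref_scores, lam=80)  # lam=80 is an arbitrary choice
system_means = mean_system_scores(ratings, kept)
print(kept, system_means)
```

Sweeping `lam` over a range of thresholds and comparing the resulting system order for each value is one way to check the rank-invariance that the caption describes.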