SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang, Zhizheng Wu

TL;DR

The paper addresses the bottleneck in TTS intelligibility evaluation where WER and MOS fail to capture real-world comprehension. It proposes SP-MCQA, a Spoken-Passage Multiple-Choice Question Answering framework, and SP-MCQA-Eval, an 8.76-hour NPR-based benchmark. Experiments show that low WER does not guarantee high key-information accuracy and reveal phonetic and text-normalization challenges across SOTA models. This work advocates high-level, life-like evaluation criteria and outlines future directions, including multilingual expansion and Audio LLM-based assessment.

Abstract

The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
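
To make the gap between WER and key-information accuracy concrete, the following is a minimal sketch of word-level WER (word edit distance divided by reference length). It is not the authors' evaluation pipeline: the function name `word_error_rate` and the two example sentences are hypothetical, chosen only to show how a single substituted word can leave WER low while changing the fact a listener would take away.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming Levenshtein distance, computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one word differs, but it is the key fact.
reference = "the company reported a loss of fifteen million dollars in the third quarter"
hypothesis = "the company reported a loss of fifty million dollars in the third quarter"

wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

Here one substitution in thirteen words gives a WER of about 7.7%, yet a comprehension question about the size of the loss would be answered incorrectly. This is the kind of mismatch that a question-answering style of evaluation is meant to surface.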

Paper Structure

This paper contains 11 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: An overview of SP-MCQA. The process involves two main stages: (1) the creation of the SP-MCQA-Eval benchmark dataset, and (2) the pipeline for SP-MCQA evaluation.