Crowdsourced Multilingual Speech Intelligibility Testing

Laura Lechler, Kamil Wojcicki

TL;DR

This work tackles the challenge of scalable, multilingual intelligibility testing for modern generative audio by introducing a crowdsourced assessment framework based on the Diagnostic Rhyme Test (DRT). It details multilingual word lists, test sets, blocks, rewards, pre-screening, procedures, and per-file scoring, and publicly releases the collected data. Four experiments, covering codec conditions in Spanish and English, consistency checks, and cross-language comparisons, show that crowdsourced results track laboratory trends and are highly repeatable, though absolute scores are typically lower under crowdsourcing. The study demonstrates that the method yields meaningful, language- and codec-sensitive intelligibility insights and lays the groundwork for benchmarking and broader multilingual evaluation in real-world, scalable settings.

Abstract

With the advent of generative audio features, there is an increasing need for rapid evaluation of their impact on speech intelligibility. Beyond the existing laboratory measures, which are expensive and do not scale well, there has been comparatively little work on crowdsourced assessment of intelligibility. Standards and recommendations are yet to be defined, and publicly available multilingual test materials are lacking. In response to this challenge, we propose an approach for a crowdsourced intelligibility assessment. We detail the test design, the collection and public release of the multilingual speech data, and the results of our early experiments.

Paper Structure (17 sections, 4 figures)

Figures (4)

  • Figure 1: Laboratory and crowdsourced intelligibility scores for Spanish WB and NB PCMU conditions.
  • Figure 2: Overall intelligibility scores for English obtained for two codecs under laboratory conditions (ITU-T P.807, 2016) and via crowdsourcing.
  • Figure 3: Intelligibility scores for English, per distinctive feature, for two codecs under laboratory (ITU-T P.807, 2016) and crowdsourcing conditions.
  • Figure 4: Overall intelligibility scores obtained for several languages. Asterisks indicate statistical significance at a level of $p < 0.05$.