Crowdsourced Multilingual Speech Intelligibility Testing
Laura Lechler, Kamil Wojcicki
TL;DR
This work addresses the challenge of scalable, multilingual intelligibility testing for modern generative audio by introducing a crowdsourced assessment framework based on the Diagnostic Rhyme Test (DRT). It details the multilingual word lists, test sets, blocks, rewards, pre-screening, procedures, and per-file scoring, and publicly releases the collected data. Four experiments, covering codec evaluations in Spanish and English, consistency checks, and cross-language comparisons, show that crowdsourced results track lab trends and exhibit strong repeatability, though absolute scores are typically lower in the crowdsourced setting. The study shows that the test is sensitive to differences between languages and codecs, and it lays groundwork for benchmarking and broader multilingual evaluation in real-world, scalable settings.
Abstract
With the advent of generative audio features, there is an increasing need for rapid evaluation of their impact on speech intelligibility. Existing laboratory measures are expensive and do not scale well, and there has been comparatively little work on crowdsourced assessment of intelligibility. Standards and recommendations are yet to be defined, and publicly available multilingual test materials are lacking. In response to this challenge, we propose an approach to crowdsourced intelligibility assessment. We detail the test design, the collection and public release of the multilingual speech data, and the results of our early experiments.
