Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri
TL;DR
This work investigates using Speech Foundation Models (SFMs) to automate validation of crowdsourced speech data, addressing the cost-quality trade-off in data collection. By comparing a distance-based validation policy with a feature-rich decision tree and a proposed hybrid two-step method, the study demonstrates that SFMs can reduce validation costs by over 40% while preserving final data quality across French, German, and Korean data. The approach combines CER-, WER-, TER-, and PER-based signals, translations, and grapheme-phoneme conversions to robustly filter samples, and scales to large real-world pipelines (e.g., 11k+ German samples) with substantial labor savings. The findings highlight practical potential for SFMs to enhance the efficiency and scalability of crowdsourced speech corpora, informing future multi-language deployments and model-assisted data curation.
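To make the idea of a distance-based validation policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each crowdsourced recording has a reference text, that an SFM has already produced a transcript of the recording, and that a sample is accepted when the character error rate (CER) between the two falls below a threshold. The function names, the threshold value of 0.1, and the choice to route rejected samples to human review are all illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: the same distance computed over whitespace tokens."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def validate(reference, transcript, threshold=0.1):
    """Hypothetical policy: accept automatically when CER <= threshold,
    otherwise hand the sample to a human validator (an assumption)."""
    if cer(reference, transcript) <= threshold:
        return "accept"
    return "needs_human_review"


if __name__ == "__main__":
    # The transcript would come from an SFM; here it is hard-coded.
    print(validate("bonjour tout le monde", "bonjour tout le monde"))
    print(validate("bonjour tout le monde", "bonsoir a tous"))
```

Under this sketch, any cost saving would come from the share of samples that never reach a human validator; the threshold controls how aggressively that happens and would need to be tuned against held-out human judgments.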
Abstract
While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.
