Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri

TL;DR

This work investigates using Speech Foundation Models (SFMs) to automate the validation of crowdsourced speech data, addressing the cost-quality trade-off in data collection. By comparing a distance-based validation policy with a feature-rich decision tree and a proposed hybrid two-step method, the study demonstrates that SFMs can reduce validation costs by over 40% while preserving final data quality across French, German, and Korean data. The approach combines CER-, WER-, TER-, and PER-based signals, translations, and grapheme-phoneme conversions to robustly filter samples, and scales to large real-world pipelines (e.g., 11k+ German samples) with substantial labor savings. The findings highlight the practical potential of SFMs to enhance the efficiency and scalability of crowdsourced speech corpora, informing future multi-language deployments and model-assisted data curation.
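To make the idea of a distance-based validation policy concrete, here is a minimal Python sketch. It assumes that each crowdsourced recording comes with the text prompt the speaker was asked to read and with a transcript produced by an SFM, and that a recording can be accepted without human review when the word error rate (WER) between the two is low. The function names and the `max_wer` threshold of 0.1 are hypothetical choices for illustration, not values taken from the paper, and the paper's actual policy may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        # Convention chosen for this sketch: empty reference and hypothesis match.
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, using whitespace tokenisation."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate."""
    return error_rate(list(reference), list(hypothesis))


def auto_accept(prompt: str, sfm_transcript: str, max_wer: float = 0.1) -> bool:
    """Hypothetical policy: skip human validation when the WER is small enough."""
    return wer(prompt, sfm_transcript) <= max_wer
```

Under this sketch, recordings for which `auto_accept` returns `False` would still be routed to human validators, which is where the cost saving would come from: only the uncertain samples consume crowdsourced validation effort.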

Abstract

While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.

Paper Structure

This paper contains 20 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Results of validation methods: DW (decision-tree); DW+S (decision-tree + silver labels); distance-based (simple policy); crowdsource (fully crowdsourced); proposed (final policy for experiments in §\ref{sec:applic})
  • Figure 2: Zoom-in on a specific area of performance (F1 scores displayed above each data point)
  • Figure 3: Decision tree graph of DW 5+S method.
  • Figure 4: Crowdsource confusion matrix.
  • Figure 5: Distance-based method confusion matrix.
  • ...and 3 more figures