Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

Farhan Samir; Emily P. Ahn; Shreya Prakash; Márton Soskuthy; Vered Shwartz; Jian Zhu

TL;DR

This work tackles the challenge of quality auditing in large multilingual datasets by introducing the Preference Proportion Test (PPT), a sample-efficient statistical test that uses human preference judgments between gold-standard and model-generated transcripts to identify unreliable language partitions. Through a case study on X-IPAPack, PPT identifies $10$ unreliable language subsets among $L=22$, enabling targeted data cleaning. Training phonetic transcription models on PPT-filtered data yields substantial downstream gains, including a $25.7\%$ relative improvement on out-of-distribution languages, and a notable improvement on Punjabi when related low-quality data are removed. The findings underscore that even large multilingual datasets require careful, language-aware auditing, as universal phonetic transcription models remain sensitive to training data composition and cannot substitute for diverse high-quality data.

Abstract

Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples for a language subset, we're able to identify systematic transcription errors for 10 language subsets in a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our method lays a path forward for systematic and reliable multilingual dataset auditing.
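To make the idea of a preference-based test concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-sided exact binomial test in which the null hypothesis is that annotators prefer the model-generated transcript no more than half the time. The null rate of 0.5, the function names, and the decision rule are illustrative assumptions that do not come from the text above.

```python
from math import comb


def binom_upper_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_language_subset(model_preferred: int, n_samples: int,
                         p_null: float = 0.5, alpha: float = 0.05):
    """One-sided exact binomial test (an assumed stand-in for the PPT).

    model_preferred: number of annotated samples where the model-generated
        transcript was preferred over the dataset's gold transcript.
    Returns (p_value, flagged); flagged is True when the null hypothesis
    "preference rate <= p_null" is rejected at level alpha.
    """
    p_value = binom_upper_tail(n_samples, model_preferred, p_null)
    return p_value, p_value <= alpha


if __name__ == "__main__":
    # Hypothetical annotation outcomes for a 20-sample audit.
    for count in (10, 14, 15):
        p_value, flagged = flag_language_subset(count, 20)
        print(f"{count}/20 prefer model transcript: "
              f"p={p_value:.4f}, flagged={flagged}")
```

Under these assumed settings, a subset where 15 of 20 annotated samples favor the model transcript would be flagged (p ≈ 0.021), while 14 of 20 would not (p ≈ 0.058).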

Paper Structure (37 sections, 6 figures, 7 tables)

Introduction
Related work
Case Study: Auditing X-IPAPack
X-IPAPack Contents
Overview.
Transcript composition.
Transcript normalization.
Preference Proportion Test (PPT)
Problem setup.
Hypothesis testing.
Applying the PPT to X-IPAPack
Selecting languages to audit.
Setting parameters for PPT.
Annotation process.
Identifying unreliable language partitions.
...and 22 more sections

Figures (6)

Figure 1: Multilingual datasets have been reported to contain many quality-control issues, especially for lower-resourced languages. We audit the quality of each language subset in a multilingual dataset by annotating a small sample from it. Specifically, expert annotators choose between the gold-standard transcript and one generated by a baseline model. When the baseline model's transcripts are consistently preferred over the gold standard, we conclude that the language subset should be flagged for re-labeling.
Figure 2: Number of samples that have to be annotated for the Preference Proportion Test (PPT; Sec. \ref{sec:quality-audit}) for different effect sizes, assuming a fixed false positive tolerance of $\alpha=0.05$. We aim to achieve at least $80\%$ statistical power. We use an effect size of $0.3$ for auditing the X-IPAPack dataset; see Sec. \ref{sec:ppt-application}.
Figure 3: Benchmarking the models of Taguchi et al. (2023) (X-axis) and Xu et al. (2021) (Y-axis) on X-IPAPack. Error rates for both models are measured by phonetic feature discrepancies (Mortensen et al., 2016) between the model-generated transcripts and the transcripts in X-IPAPack. In Sec. \ref{sec:quality-audit}, we audit the languages with relatively high error rates (top-right quadrant, in red) to determine whether the errors may be attributed to poor-quality transcripts in X-IPAPack.
Figure 4: Number of samples where the X-IPAPack transcript was preferred to one generated by a phone recognizer. The bottom $10$ languages have a relatively large number of samples where the X-IPAPack transcript is of lesser quality than the phone recognizer transcript, and thus they fail the Preference Proportion Test.
Figure 5: Interface for performing annotations.
...and 1 more figures
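
The trade-off described in the Figure 2 caption (annotation budget versus effect size at $\alpha=0.05$ and $80\%$ power) can be explored with a small power calculation. This is a hedged sketch only: it assumes a one-sided exact binomial test with a null preference rate of 0.5 and an alternative of 0.5 plus the effect size. That formulation may differ from the paper's, so the numbers it produces are not claimed to reproduce the figure. All function names are hypothetical.

```python
from math import comb


def binom_upper_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n: int, p_null: float, alpha: float) -> int:
    """Smallest k with P(X >= k | p_null) <= alpha; n + 1 if none exists."""
    for k in range(n + 1):
        if binom_upper_tail(n, k, p_null) <= alpha:
            return k
    return n + 1


def power(n: int, p_null: float, p_alt: float, alpha: float) -> float:
    """Probability of rejecting the null when the true rate is p_alt."""
    k = critical_count(n, p_null, alpha)
    return binom_upper_tail(n, k, p_alt)


def min_samples(effect_size: float, p_null: float = 0.5, alpha: float = 0.05,
                target_power: float = 0.8, max_n: int = 500):
    """Smallest n reaching target_power, or None if max_n is exceeded.

    Exact binomial power is not monotone in n, so a slightly larger n
    can occasionally fall back below the target.
    """
    p_alt = p_null + effect_size  # assumed definition of "effect size"
    for n in range(1, max_n + 1):
        if power(n, p_null, p_alt, alpha) >= target_power:
            return n
    return None


if __name__ == "__main__":
    for effect in (0.2, 0.3, 0.4):
        print(f"effect size {effect}: n >= {min_samples(effect)}")
    print(f"power with 20 samples at effect size 0.3: "
          f"{power(20, 0.5, 0.8, 0.05):.3f}")
```

Under these assumptions, 20 annotated samples at an effect size of $0.3$ clear the $80\%$ power target, which is consistent with the 20-sample budget mentioned in the abstract.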
