Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Soskuthy, Vered Shwartz, Jian Zhu
TL;DR
This work tackles the challenge of quality auditing in large multilingual datasets by introducing the Preference Proportion Test (PPT), a sample-efficient statistical test that uses human annotators' preferences between gold-standard and model-generated transcripts to identify unreliable language partitions. Through a case study on X-IPAPack, PPT identifies $10$ unreliable language subsets among $L=22$, enabling targeted data cleaning. Training phonetic transcription models on PPT-filtered data yields significant downstream gains, including a $25.7\%$ relative improvement on out-of-distribution languages, and a notable improvement on Punjabi when related low-quality data are removed. The findings underscore that even large multilingual datasets require careful, language-aware auditing, as universal phonetic transcription models remain sensitive to training data composition and cannot substitute for diverse high-quality data.
Abstract
Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples for a language subset, we're able to identify systematic transcription errors for 10 language subsets in a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our method lays a path forward for systematic and reliable multilingual dataset auditing.
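The abstract does not spell out the exact form of the Preference Proportion Test, so the following is only a minimal sketch of one plausible reading: a one-sided exact binomial test on how often annotators prefer the gold-standard transcript over a model-generated one across 20 annotated samples. The function names, the null proportion `p0 = 0.5`, and the significance level `alpha = 0.05` are illustrative assumptions, not values taken from the paper.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def flag_unreliable(
    gold_preferred: int,
    n_samples: int = 20,   # the abstract reports annotating 20 samples per subset
    p0: float = 0.5,       # assumed null preference proportion (hypothetical)
    alpha: float = 0.05,   # assumed significance level (hypothetical)
) -> bool:
    """Flag a language subset when the gold transcript is preferred
    significantly less often than p0 (one-sided exact binomial test)."""
    p_value = binomial_tail(gold_preferred, n_samples, p0)
    return p_value < alpha


# Hypothetical annotation counts: number of samples (out of 20)
# where the annotator preferred the gold-standard transcript.
counts = {"lang_a": 5, "lang_b": 6, "lang_c": 15}
flags = {lang: flag_unreliable(k) for lang, k in counts.items()}
```

Under these assumed settings, a subset is flagged only when the gold transcript wins in 5 or fewer of the 20 comparisons, which illustrates how a small annotation budget can still separate clearly unreliable subsets from acceptable ones.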
