From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Jandad Jahani, Mursal Dawodi, Jawid Ahmad Baktash

TL;DR

This work provides a release-level audit of the Pashto component of Mozilla Common Voice (v24.0, December 2025) that quantifies not just scale but structural maturity in a rapidly growing low-resource ASR resource. It documents growth from under 2 hours to 2,768.7 total hours, of which 975.89 are validated, while revealing pronounced participation inequality (Gini = 0.941) and substantial metadata incompleteness (41.97% of clips with undefined gender, limited age diversity). The analysis links validation throughput, demographic gaps, and sentence-level concentration to practical implications for ASR robustness and fairness, and outlines concrete steps to improve dataset maturity, including expanded validation capacity and broader demographic participation. By offering a transparent, reproducible release-level view, the study informs responsible Pashto ASR development and provides a reference point for similar low-resource, crowdsourced corpora.

Abstract

Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.

Paper Structure (28 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Growth of the Pashto Common Voice corpus across major releases, showing total and validated hours from June 2023 (v14.0) to December 2025 (v24.0) on a logarithmic scale.
  • Figure 2: Lorenz curve of validated clip contributions across contributors in the Pashto Common Voice v24.0 release. The corresponding Gini coefficient (0.941) indicates a highly unequal, long-tail contribution structure. The Gini coefficient was computed over the distribution of validated clip counts per contributor (client_id) in version 24.0.
  • Figure 3: Distribution of validated clip durations in the Pashto Common Voice v24.0 release. Durations are capped at 20 seconds for readability, reflecting the short-utterance design of the Common Voice platform.
  • Figure 4: Cumulative coverage of validated clips by unique sentences in the Pashto Common Voice v24.0 release, illustrating the concentration of recordings across a relatively small subset of prompts.
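
The two concentration measures behind Figures 2 and 4 can be illustrated with a minimal Python sketch. It assumes each validated clip is available as a (client_id, sentence) pair. The toy rows and the function names `gini` and `share_of_items_for_coverage` are hypothetical and not taken from the paper. The source does not state which Gini estimator the authors used, so the rank-weighted formula for sorted counts below is one standard choice, not necessarily theirs.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form for ascending data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def share_of_items_for_coverage(counts, target=0.5):
    """Smallest fraction of items, taken largest first, covering `target` of the total."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    if total == 0:
        return 0.0
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= target * total:
            return k / len(xs)
    return 1.0


# Hypothetical toy data: one (client_id, sentence) pair per validated clip.
clips = [
    ("a", "s1"), ("a", "s2"), ("a", "s3"), ("a", "s1"), ("a", "s2"),
    ("b", "s1"), ("b", "s4"),
    ("c", "s5"),
]

# Validated clip counts per contributor (the quantity behind the Lorenz curve).
clips_per_contributor = Counter(client for client, _ in clips)
# Validated clip counts per unique sentence (the quantity behind the coverage curve).
clips_per_sentence = Counter(sentence for _, sentence in clips)

contributor_gini = gini(clips_per_contributor.values())
sentence_share_50 = share_of_items_for_coverage(clips_per_sentence.values(), 0.5)

print(f"Gini over contributors: {contributor_gini:.3f}")
print(f"Share of unique sentences covering 50% of clips: {sentence_share_50:.1%}")
```

Applied to a real release, the same two counters would be built from the validated clip metadata rather than from an inline list. A Gini near 1 together with a moderate sentence share is the pattern the abstract describes: concentration driven by contributors rather than by prompts.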