YODAS: Youtube-Oriented Dataset for Audio and Speech

Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

TL;DR

YODAS tackles the need for industry-scale, multilingual speech data by introducing a public Creative Commons-licensed YouTube-derived dataset with manual, automatic, and unlabeled subsets totaling over 500k hours in 100+ languages. The authors implement a three-client data collection pipeline (Keyword-based, Channel-based, Download) coordinated by a Master Node to assemble CC-licensed content and subtitles, and perform comprehensive speech and text analyses. They establish monolingual ASR baselines using XLSR representations with CTC on the manual subset, showing strong performance and demonstrating that manual subtitles yield better results than automatic ones, while highlighting the importance of alignment filtering. By providing a scalable resource and baseline results, YODAS enables supervised, weakly supervised, and self-supervised training across many languages, with potential impact on multilingual ASR research and reproducibility; it will be released on HuggingFace.

Abstract

In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech data. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training, while the unlabeled subsets are suited to self-supervised learning. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We describe the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. We then provide a comprehensive analysis of the speech and text contained in the dataset. Finally, we describe speech recognition baselines for the top-15 languages.
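
The collection pipeline summarized in the TL;DR (Keyword-based, Channel-based, and Download workers coordinated by a master node) can be illustrated with a minimal, single-process sketch. Everything here is an assumption made for illustration: the names `MasterNode`, `keyword_worker`, `channel_worker`, `download_worker`, and `run` are hypothetical, the `FAKE_SEARCH` and `FAKE_CHANNELS` tables stand in for YouTube, and the assumed flow (keywords yield channels, channels yield videos, videos are downloaded) is one plausible reading of the architecture rather than the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical stand-ins for YouTube search results and channel listings.
# A real worker would query the platform instead of these tables.
FAKE_SEARCH = {"hello": ["chanA"], "bonjour": ["chanB"]}
FAKE_CHANNELS = {"chanA": ["vid1", "vid2"], "chanB": ["vid2", "vid3"]}


@dataclass
class MasterNode:
    """Holds the work queues and deduplicates channel and video IDs."""
    keywords: deque = field(default_factory=deque)
    channels: deque = field(default_factory=deque)
    videos: deque = field(default_factory=deque)
    seen_channels: set = field(default_factory=set)
    seen_videos: set = field(default_factory=set)

    def add_channel(self, channel_id):
        # Queue a channel only the first time it is reported.
        if channel_id not in self.seen_channels:
            self.seen_channels.add(channel_id)
            self.channels.append(channel_id)

    def add_video(self, video_id):
        # Queue a video only the first time it is reported.
        if video_id not in self.seen_videos:
            self.seen_videos.add(video_id)
            self.videos.append(video_id)


def keyword_worker(master):
    """Take one keyword and report the channels it leads to (assumed flow)."""
    keyword = master.keywords.popleft()
    for channel_id in FAKE_SEARCH.get(keyword, []):
        master.add_channel(channel_id)


def channel_worker(master):
    """Take one channel and report the videos it contains."""
    channel_id = master.channels.popleft()
    for video_id in FAKE_CHANNELS.get(channel_id, []):
        master.add_video(video_id)


def download_worker(master, storage):
    """Take one video and 'transfer' it to external storage (here, a list)."""
    video_id = master.videos.popleft()
    storage.append(video_id)


def run(keywords):
    """Drain each queue in turn and return the IDs that reached storage."""
    master = MasterNode(keywords=deque(keywords))
    storage = []
    while master.keywords:
        keyword_worker(master)
    while master.channels:
        channel_worker(master)
    while master.videos:
        download_worker(master, storage)
    return storage


if __name__ == "__main__":
    print(run(["hello", "bonjour"]))
```

In this sketch the master node owns all deduplication, so a video reachable from several channels is queued for download only once. A distributed version would replace the in-memory deques with networked queues and run the three worker types concurrently.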

Paper Structure (16 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Diagram of our data collection architecture. It incorporates three types of clients: Keyword-based, Channel-based, and Download workers. Each worker fulfills specific tasks and interacts with both the master node and the YouTube platform. The Download worker also handles the transfer of downloaded data to external storage.
  • Figure 2: Language distribution of unique query keywords used in one of our shards (i.e., workers).
  • Figure 3: Total duration (measured in hours) of the manual and automatic subsets. The lower blue bar shows the duration of the manual subset; the upper orange bar shows the automatic subset. The combined duration is shown on top of each bar.
  • Figure 4: Score histogram and scatter plot of the relationship between duration and alignment score in the manual subset.
  • Figure 5: Score histogram and scatter plot of the relationship between duration and alignment score in the automatic subset.
  • ...and 2 more figures