Open the Data! Chuvash Datasets

Nikolay Plotnikov, Alexander Antonov

TL;DR

The paper addresses data scarcity for the Chuvash language by releasing four open datasets: a monolingual text corpus, Chuvash–Russian and Chuvash–English parallel corpora, and an audio corpus. The datasets enable NLP, MT, ASR, and TTS research and are positioned for future multimodal LLM development. Key contributions include 3.9M monolingual sentences, ~1.4M Chuvash–Russian sentence pairs, ~200k Chuvash–English sentence pairs, and an audio corpus of ~38 hours (63 hours when combined with Common Voice). Public release via Hugging Face and a dataset design aligned with Common Voice facilitate integration, comparison, and language preservation in the digital age.

Abstract

In this paper, we introduce four comprehensive datasets for the Chuvash language, aiming to support and enhance linguistic research and technological development for this underrepresented language. These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset. Each dataset is meticulously curated to serve various applications such as machine translation, linguistic analysis, and speech recognition, providing valuable resources for scholars and developers working with the Chuvash language. Together, these datasets represent a significant step towards preserving and promoting the Chuvash language in the digital age.
