RoDia: A New Dataset for Romanian Dialect Identification from Speech

Codrut Rotaru, Nicolae-Catalin Ristea, Radu Tudor Ionescu

TL;DR

The authors believe RoDia is a valuable resource that will stimulate research on the challenges of Romanian dialect identification, and they introduce a set of competitive models to serve as baselines for future research.

Abstract

We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
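
The abstract reports both a macro F1 and a micro F1 score. As a minimal sketch of how these two metrics differ for a five-class, single-label task, the following pure-Python example computes both on toy data. The label names (`r1` to `r5`), the toy predictions, and the helper `f1_scores` are hypothetical and are not taken from the paper or its released code; the authors' evaluation code may differ.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    # Macro F1: compute F1 per class, then take the unweighted mean,
    # so rare classes count as much as frequent ones.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)

    # Micro F1: pool the counts over all classes before computing F1.
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    return macro, micro


# Hypothetical placeholder labels for five dialect classes (toy data only).
labels = ["r1", "r2", "r3", "r4", "r5"]
y_true = ["r1", "r1", "r1", "r1", "r2", "r2", "r3", "r4", "r5", "r5"]
y_pred = ["r1", "r1", "r1", "r2", "r2", "r1", "r3", "r4", "r5", "r3"]

macro, micro = f1_scores(y_true, y_pred, labels)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In a single-label setting like this one, micro F1 coincides with plain accuracy, while macro F1 weights every class equally regardless of how many samples it has. A gap between the two therefore hints at uneven performance across classes.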

Paper Structure (7 sections, 3 figures, 4 tables)

This paper contains 7 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The administrative regions of Romania and the dominant dialect spoken within each region. RoDia is the first benchmark to contain samples representing these five Romanian dialects.
  • Figure 2: Age and gender statistics for the RoDia dataset.
  • Figure 3: Confusion matrix on the test set for the wav2vec 2.0 (Baevski et al., NeurIPS 2020) model.