LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect

Hedi Naouara, Jean-Pierre Lorré, Jérôme Louradour

TL;DR

This paper tackles the challenge of developing ASR for the Tunisian Arabic Dialect, a low-resource and highly code-switched language, by releasing the LinTO audio and textual datasets. It presents a diverse textual corpus with normalization and Arabizi transliteration, and a complementary audio corpus (raw and augmented data) collected from multiple sources, all sampled at 16 kHz and covering code-switching with French and English. A notable contribution is the Voice Conversion Augmentation (VCA) pipeline based on SoftVC VITS, HuBERT, and HiFi-GAN, which, together with transcript alignment and data cleaning, yields a larger, more speaker-diverse training set and improves WER on code-switched content. The paper also reports baseline ASR experiments, demonstrates the limitations of current large models such as Whisper on Tunisian dialects, and establishes a first from-scratch baseline on open LinTO data, highlighting the datasets' potential to spur research and practical development for Tunisian ASR and related dialects.

Abstract

Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect's linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets -- comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to provide qualitative material to build and benchmark ASR systems for the Tunisian Arabic Dialect.

Keywords: Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation

Paper Structure

This paper contains 12 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Details of results across three training conditions on the TunSwitch Code-Switching (CS) test set. Left: Word Error Rates (WER) decomposed into insertion (Ins), deletion (Del) and substitution (Subs) rates. Right: F1, Recall and Precision scores on Latin words. All 95% confidence intervals are computed by bootstrap resampling.
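
The two quantities named in the Figure 1 caption, a WER split into insertion, deletion and substitution rates and a bootstrap 95% confidence interval, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names (`edit_counts`, `wer_breakdown`, `bootstrap_wer_ci`) are hypothetical, and resampling at the utterance level, the percentile method and 1000 resamples are assumptions, since the excerpt does not state the settings used.

```python
import random

def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-edit
    alignment of two word lists (standard Levenshtein dynamic programme)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)      # match, no cost
            else:
                diag = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(diag, deletion, insertion)
    return dp[-1][-1][1:]

def wer_breakdown(pairs):
    """Corpus-level rates over (reference_words, hypothesis_words) pairs."""
    subs = dels = ins = n_ref = 0
    for ref, hyp in pairs:
        s, d, i = edit_counts(ref, hyp)
        subs, dels, ins = subs + s, dels + d, ins + i
        n_ref += len(ref)
    n_ref = max(n_ref, 1)                # guard against an empty reference
    return {"wer": (subs + dels + ins) / n_ref, "subs": subs / n_ref,
            "del": dels / n_ref, "ins": ins / n_ref}

def bootstrap_wer_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for WER, resampling utterances
    with replacement (an assumed setup, not taken from the paper)."""
    rng = random.Random(seed)
    stats = sorted(
        wer_breakdown([rng.choice(pairs) for _ in pairs])["wer"]
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lo, hi
```

Calling `wer_breakdown` on a list of tokenized reference/hypothesis pairs gives the three error rates whose sum is the WER, and `bootstrap_wer_ci` on the same list gives the interval bounds. Summing counts over the corpus before dividing, rather than averaging per-utterance WERs, keeps long and short utterances weighted by their word count.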