Low-resource speech recognition and dialect identification of Irish in a multi-task framework

Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide

TL;DR

This work addresses low-resource Irish ASR and dialect identification by integrating a hybrid CTC/Attention encoder–decoder with Intermediate CTC (InterCTC) in a multi-task setup. It systematically evaluates Conformer and E-branchformer encoders, includes dialect-tagging as an auxiliary task, and employs multi-task language-model shallow fusion to boost performance. The best results come from the E-branchformer Large model with an optimal InterCTC configuration and LM shallow fusion, achieving a DID accuracy of about 81.5% and WER competitive with the TDNN-HMM baseline, while surpassing the previous ECAPA-TDNN results in DID. Overall, the study demonstrates that multi-task InterCTC approaches can meaningfully improve Irish DID and yield competitive ASR performance in a low-resource setting, marking a promising direction for future Irish speech technology.

Abstract

This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder, and the performance of both architectures is compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
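
As a rough illustration of how the pieces named in the abstract typically fit together, the sketch below combines an attention loss, a final-layer CTC loss, and intermediate-layer CTC losses into one training objective, and shows the usual shallow-fusion scoring rule. It follows the generic formulation found in the hybrid CTC/Attention and InterCTC literature; the function names, the default weights, and the averaging over intermediate layers are assumptions for illustration, not values or code taken from this paper.

```python
def hybrid_interctc_loss(att_loss, ctc_loss, inter_ctc_losses,
                         ctc_weight=0.3, interctc_weight=0.5):
    """Combine attention, final-layer CTC, and intermediate CTC losses.

    All weights are hypothetical defaults, not the paper's settings.
    inter_ctc_losses holds one CTC loss per tapped encoder layer; the
    target of each could be the transcript or an auxiliary label
    sequence such as a dialect tag (an assumption here).
    """
    if inter_ctc_losses:
        # Average the losses from the tapped intermediate layers.
        inter = sum(inter_ctc_losses) / len(inter_ctc_losses)
        ctc_total = (1.0 - interctc_weight) * ctc_loss + interctc_weight * inter
    else:
        # Without intermediate taps this reduces to plain hybrid CTC/Attention.
        ctc_total = ctc_loss
    return ctc_weight * ctc_total + (1.0 - ctc_weight) * att_loss


def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Score a hypothesis by adding a weighted LM log-probability."""
    return asr_log_prob + lm_weight * lm_log_prob


if __name__ == "__main__":
    loss = hybrid_interctc_loss(att_loss=2.0, ctc_loss=4.0,
                                inter_ctc_losses=[6.0, 8.0],
                                ctc_weight=0.5, interctc_weight=0.5)
    print(f"combined loss: {loss}")
    print(f"fused score: {shallow_fusion_score(-1.0, -2.0, lm_weight=0.5)}")
```

In practice each loss term would be a tensor produced by a training framework, and the weights would be tuned on held-out data; scalars are used here only to keep the arithmetic visible.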

Paper Structure (21 sections, 12 equations, 7 tables)