Building Robust and Scalable Multilingual ASR for Indian Languages

Arjun Gangwar, Kaousheik Jayakumar, S. Umesh

TL;DR

The paper tackles robust multilingual ASR with language and dialect identification for low-resource Indian languages, under challenge tracks that restrict the use of additional data. It introduces a phonemic Common Label Set (CLS) to exploit phonemic similarities across languages, and a Multi-Decoder architecture that retains the CLS gains while producing native-script output, addressing the error propagation seen in cascaded systems. In ESPnet-based experiments on Track 1 and Track 2 data, the authors report that a multi-decoder approach with character-level MT and ASR initialization improves on cascaded pipelines; on Track 2 it beats the baseline WER/CER in 3 languages and achieves the highest language ID and dialect ID accuracy among participating teams. The approach is a scalable, end-to-end framework that uses phonemic commonalities to improve ASR and ID tasks for multilingual Indian languages with limited external data.

Abstract

This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems focus on adapting ASR systems to better predict the language and dialect of an utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restrict the use of additional data, and developed multilingual systems from scratch. We present a novel training approach using a Multi-Decoder architecture with a phonemic Common Label Set (CLS) as the intermediate representation. It improved performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting the outputs back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).

Paper Structure

This paper contains 10 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: CLS representations of different languages
  • Figure 2: Multi-decoder architecture