Open Source State-Of-the-Art Solution for Romanian Speech Recognition

Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu

TL;DR

This work addresses Romanian ASR by adapting NVIDIA's FastConformer to a low-resource language setting, using a 17-layer encoder with around 110 million parameters and a hybrid CTC-TDT decoder. It trains on a large corpus of approximately 2,636 hours that combines manually and weakly labeled data, and it evaluates multiple decoding strategies, including CTC beam search with a 6-gram language model. The approach yields state-of-the-art WER across seven Romanian benchmarks, with relative improvements of up to 27% on oratory speech and strong gains on spontaneous and dialectal data, while offering practical decoding efficiency. The work also demonstrates the value of weak supervision and of open-sourcing the trained model and recipes to accelerate progress in low-resource Romanian ASR and deployment in real-time settings.

Abstract

In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies, including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
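
The headline metric above, relative WER reduction, can be read as the share of a baseline system's word errors that a new system removes. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the edit distance is the standard word-level Levenshtein distance, and the WER values 10.0 and 7.3 in the usage note are made-up numbers chosen only to illustrate the calculation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, with an illustrative baseline WER of 10.0 and a new WER of 7.3, `relative_wer_reduction(10.0, 7.3)` evaluates to roughly 0.27, which would be reported as a 27% relative reduction even though the absolute gain is only 2.7 points.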

Paper Structure

This paper contains 12 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Comparative analysis of the decoding strategies explored in this work, evaluated in terms of ASR accuracy and inference latency on Romanian speech.