Open Source State-of-the-Art Solution for Romanian Speech Recognition
Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu
TL;DR
This work addresses Romanian ASR by adapting NVIDIA's FastConformer to a low-resource language setting, using a 17-layer encoder with around 110 million parameters and a hybrid CTC-TDT decoder. It trains on a large corpus of approximately 2,636 hours that combines manually and weakly labeled data, and it evaluates multiple decoding strategies, including CTC beam search with a 6-gram language model. The approach yields state-of-the-art WER across seven Romanian benchmarks, with relative improvements of up to 27% on oratory speech and strong gains on spontaneous and dialectal data, while offering practical decoding efficiency. The work also demonstrates the value of weak supervision and of open-sourcing the trained model and recipes to accelerate progress in low-resource Romanian ASR and deployment in real-time settings.
Abstract
In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-and-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies, including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to the previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
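Since the headline result is stated as a relative WER reduction, a minimal sketch of how such a figure is typically computed may help. The function names and the example WER values below are illustrative assumptions, not numbers or code from this work: WER is taken as the word-level edit distance divided by the number of reference words, and the relative reduction is the drop in WER divided by the baseline WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    baseline, improved = 10.0, 7.3
    print(f"relative reduction: {relative_wer_reduction(baseline, improved):.0%}")
```

Under these made-up values, a system that lowers WER from 10.0% to 7.3% achieves a 27% relative reduction, even though the absolute improvement is only 2.7 percentage points.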
