RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks

Alexandra Diaconu; Mădălina Vînaga; Bogdan Alexe

TL;DR

RO-N3WS is a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR). Even limited fine-tuning on its real speech yields substantial word error rate (WER) improvements over zero-shot baselines.

Abstract

We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
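The results above are reported as word error rate (WER). As a rough illustration of the metric, the following sketch computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example; the text normalization the authors apply before scoring is not stated in this section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further
    normalization (casing, punctuation, diacritics are left as-is).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of five reference words.
    print(word_error_rate("știrile de la ora cinci", "știrile de ora cinci"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.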

Paper Structure (28 sections, 3 figures, 11 tables)

Introduction
Related work
Dataset Analysis
Named Entity Density.
Prosodic Variation and Emotional Expressiveness.
RO-N3WS: Construction, Annotation, and Splits
Experimental evaluation
Models Evaluated
Open-Source Models
Commercial Models (Black-Box APIs)
Evaluation metrics
Zero-shot Evaluation
In-domain generalization.
Out-of-distribution robustness.
Supervised fine-tuning scenario
...and 13 more sections

Figures (3)

Figure 1: Recording-duration histograms (in seconds) of collected audio files from ProTV News (left) and Observator News (right).
Figure 2: Recording-duration histograms (in seconds) for out-of-distribution subsets: audiobooks, Romanian films, children’s stories and podcasts.
Figure 3: Learning curves on the ProTV and Antena 1 test sets. WER is reported after fine-tuning Wav2Vec 2.0 and Whisper Small on 5, 10, and the full 17 training chunks of either the same bulletin or the other bulletin. Best viewed in color.
