Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets
Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Tekla Etelka Gráczi, Anna Kohári, Katalin Mády
TL;DR
This work addresses the underrepresentation of Hungarian in ASR by introducing two datasets derived from the BEA corpus: BEA-Large (255 hours from 433 speakers) and BEA-Dialogue (85 hours of conversations). It provides reproducible baselines using publicly available models: a Fast Conformer-CTC model fine-tuned on BEA-Large achieves a WER of 14.18\% and a CER of 4.56\% on spontaneous speech, and diarization error rates (DER) range from 13.05\% to 18.26\%. BEA-Dialogue supports research on conversational ASR and speaker diarization, including speaker-change modeling with serialized output training (SOT) and evaluation with cpWER/cpCER, while Whisper baselines illustrate cross-model transfer potential. Overall, the paper delivers large-scale Hungarian spontaneous and conversational data together with baselines to spur progress, and offers a framework for similar resource-creation efforts in other languages.
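For readers unfamiliar with cpWER, the following is a minimal sketch of the metric as it is commonly defined (concatenated minimum-permutation WER): each speaker's utterances are concatenated, every assignment of hypothesis speakers to reference speakers is scored, and the assignment with the fewest word errors is kept. The function names, the dictionary input format, and the padding of missing speakers with empty transcripts are assumptions made for illustration; this is not the scoring tool used in the paper.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER (illustrative sketch).

    Both arguments map a speaker label to that speaker's list of
    utterances in temporal order (hypothetical input format).
    """
    # Concatenate each speaker's utterances into one word sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Assumption: pad the shorter side with empty transcripts so that
    # every speaker can be paired with something.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Try every speaker assignment and keep the one with fewest errors.
    best_errors = min(
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    ref = {"A": ["hello there", "fine thanks"], "B": ["how are you"]}
    hyp = {"spk1": ["how are you"], "spk2": ["hello there", "fine thank"]}
    print(f"cpWER = {cpwer(ref, hyp):.3f}")
```

The exhaustive search over permutations is factorial in the number of speakers, which is acceptable for two-party dialogues; for recordings with many speakers, an assignment solver would be the usual replacement. cpCER follows the same scheme with characters in place of words.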
Abstract
The advancement of automatic speech recognition (ASR) has been driven largely by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented because spontaneous and conversational corpora are scarce. To address this gap, we introduce two new datasets, BEA-Large and BEA-Dialogue, constructed from previously unprocessed portions of the Hungarian speech corpus BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue comprises 85 hours of spontaneous conversations, partitioned into speaker-independent subsets to support research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18\% on spontaneous speech and 4.8\% on repeated speech. Diarization experiments yield diarization error rates between 13.05\% and 18.26\%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
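As a reminder of what the reported error rates measure, here is a minimal sketch of word error rate (WER) and character error rate (CER): the edit distance between reference and hypothesis, divided by the reference length. The function names are hypothetical, and the text normalization (whitespace tokenization, spaces counted as characters in CER) is an assumption; scoring toolkits differ on these details, and the paper's exact normalization is not stated in this section.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference contains no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: whitespace runs are collapsed to single spaces, and the
    remaining spaces count as characters.
    """
    ref_chars = list(" ".join(reference.split()))
    hyp_chars = list(" ".join(hypothesis.split()))
    if not ref_chars:
        raise ValueError("reference contains no characters")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because insertions are counted as errors while the denominator is the reference length, both rates can exceed 100\% when the hypothesis is much longer than the reference.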
