A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives
Jan Lehečka, Josef V. Psutka, Luboš Šmídl, Pavel Ircing, Josef Psutka
TL;DR
The paper systematically evaluates whether bilingual or trilingual Wav2Vec 2.0 ASR models can outperform monolingual models on a multilingual oral history archive (MALACH) and on CommonVoice. Working with English, German, and Czech, the authors pre-train monolingual, bilingual, and trilingual Wav2Vec 2.0 models on balanced datasets and compare them with large-scale multilingual baselines (XLS-R and Whisper), fine-tuned under uniform settings. Across datasets and languages, the monolingual Wav2Vec 2.0 models generally achieve a lower word error rate (WER) than the multilingual configurations; only very large multilingual models reach comparable or better performance, and they do so at a high computational cost. The work offers practical guidance for deploying ASR in multilingual historical archives and contributes publicly released pre-trained models to the research community.
Abstract
In this paper, we compare monolingual Wav2Vec 2.0 models with various multilingual models to see whether we can improve speech recognition performance on a unique oral history archive containing many mixed-language sentences. Our main goal is to advance research on this dataset, which is an extremely valuable part of our cultural heritage. Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models, even when processing an oral history archive full of mixed-language sentences from non-native speakers. We also ran the same experiments on the public CommonVoice dataset to verify our results. We contribute to the research community by publicly releasing our pre-trained models.
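
The model comparison is summarized in terms of word error rate (WER). As a reminder of what that metric measures, the following is a minimal sketch, not the authors' evaluation code: it assumes whitespace tokenization and no text normalization, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared verbatim
    (no casing or punctuation normalization is applied here).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences for illustration only: one substitution and
    # one deletion against a four-word reference.
    print(word_error_rate("we moved to prague", "we move to"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when a hypothesis is much longer than its reference.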
