A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives

Jan Lehečka, Josef V. Psutka, Luboš Šmídl, Pavel Ircing, Josef Psutka

TL;DR

The paper systematically evaluates whether bilingual or trilingual Wav2Vec 2.0 ASR models can outperform monolingual models on a multilingual oral history archive (MALACH) and on CommonVoice. Using English, German, and Czech, the authors pre-train mono-, bi-, and tri-lingual Wav2Vec models from balanced datasets and compare them to large-scale multilingual baselines (XLS-R and Whisper), fine-tuned under uniform settings. Across datasets and languages, monolingual Wav2Vec models generally yield lower WER than multilingual configurations, while only very large multilingual models achieve comparable or better performance at a high computational cost. The work provides practical guidance for deploying ASR in multilingual historical archives and contributes publicly released pre-trained models for the research community.

Abstract

In this paper, we compare monolingual Wav2Vec 2.0 models with various multilingual models to determine whether multilingual models can improve speech recognition performance on a unique oral history archive containing many mixed-language sentences. Our main goal is to advance research on this dataset, which is an extremely valuable part of our cultural heritage. Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models, even when processing an oral history archive full of mixed-language sentences from non-native speakers. We also ran the same experiments on the public CommonVoice dataset to verify our results. We contribute to the research community by publicly releasing our pre-trained models.

Paper Structure (13 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Scheme of pre-training mono-, bi- and trilingual Wav2Vec models.
  • Figure 2: WER change after adding more languages to pre-training. CommonVoice is abbreviated as CV. The 95% confidence intervals are also plotted.
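
Figure 2 is expressed in terms of word error rate (WER) with 95% confidence intervals. As a rough illustration of what these quantities mean, the sketch below computes WER from word-level edit distance and attaches a normal-approximation interval. The function names (`word_errors`, `corpus_wer`, `wer_confidence_interval`) are hypothetical, and the interval formula is an assumption chosen for simplicity; the summary does not state how the authors computed their intervals or normalized their transcripts.

```python
import math


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit distance in words, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    return total_errors / total_words


def wer_confidence_interval(wer: float, n_words: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval (assumption: not necessarily the paper's method)."""
    p = min(wer, 1.0)  # WER can exceed 1.0; clamp so the variance term stays valid
    half_width = z * math.sqrt(p * (1.0 - p) / n_words)
    return max(0.0, wer - half_width), wer + half_width


if __name__ == "__main__":
    pairs = [
        ("the cat sat on the mat", "the cat sit on mat"),
        ("we speak czech and german", "we speak czech and german"),
    ]
    wer = corpus_wer(pairs)
    n_words = sum(len(ref.split()) for ref, _ in pairs)
    print(f"WER = {wer:.3f}, 95% CI = {wer_confidence_interval(wer, n_words)}")
```

With an interval of this kind, a WER difference between two models is usually read as meaningful only when the intervals do not overlap, which is presumably why the figure plots them.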