On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models

Jonatas Grosman, Cassio Almeida, Guilherme Schardong, Hélio Lopes

TL;DR

This work evaluates the cross-lingual transferability of wav2vec 2.0-based representations by fine-tuning 15 pre-trained models on 18 languages. It finds that the diversity of the pre-training data matters more than its volume, and that Indo-European languages generally reach better downstream performance than non-Indo-European ones. Positive cross-lingual transfer is observed, especially when the language family of the pre-training data matches that of the target language. Multilingual pre-training often outperforms monolingual pre-training, though English-centered results show nuanced patterns. The findings offer practical guidance for selecting pre-trained models and for designing future multilingual pre-training with attention to language-family diversity and data balance.

Abstract

Using representations provided by a large pre-trained model has become the primary strategy for achieving state-of-the-art results in a wide range of tasks. A recently proposed large pre-trained model, wav2vec 2.0, was seminal for several other works on pre-training large models on speech data. Many models are being pre-trained using the same architecture as wav2vec 2.0 and are achieving state-of-the-art results in various speech-related tasks. Previous work has demonstrated that the data used during the pre-training of these wav2vec2-based models can affect their performance in downstream tasks, and that this should be taken into consideration before using them. However, few works have investigated how knowledge transfer from these pre-trained models behaves across languages, in particular when the target language differs from the one used during pre-training. Our work aims to investigate the cross-lingual transferability of these wav2vec2-based models. We performed several fine-tuning experiments on the speech recognition task in 18 languages using 15 large pre-trained models. Our results show that the amount of data used during pre-training is less important to the final performance than its diversity. We noticed that, for the evaluated models, performance on Indo-European languages is superior to performance on non-Indo-European languages. We observed a positive cross-lingual transfer of knowledge using monolingual models, which was evident in all the languages we used but more pronounced when the pre-training language was more similar to the downstream task language. With these findings, we aim to assist the scientific community in using existing wav2vec2-based pre-trained models and to facilitate the pre-training of new ones.

Paper Structure

This paper contains 13 sections, 1 equation, 40 figures, and 23 tables.

Figures (40)

  • Figure 1: Performance of multilingual pre-trained models over grouped language families on the validation set
  • Figure 2: Performance of multilingual pre-trained models over grouped language families on the test set
  • Figure 3: Overall performance across the pre-trained models for the Arabic language
  • Figure 4: Overall performance across the pre-trained models for the German language
  • Figure 5: Overall performance across the pre-trained models for the English language
  • ...and 35 more figures
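
Figures 1 and 2 report performance over grouped language families. A minimal sketch of that kind of aggregation is shown here, assuming per-language scores are available as (model, language, score) triples. The model names, the scores, and the function `mean_score_by_family` are hypothetical placeholders and not taken from the paper; the score is assumed to be an error rate where lower is better. Only the three languages named in the figure list are used.

```python
from collections import defaultdict
from statistics import mean

# Language families for three languages named in the figure list.
LANGUAGE_FAMILY = {
    "Arabic": "Afro-Asiatic",
    "German": "Indo-European",
    "English": "Indo-European",
}

# HYPOTHETICAL placeholder error rates (lower is better).
# These are made-up numbers for illustration, NOT results from the paper.
PLACEHOLDER_SCORES = [
    ("model_a", "Arabic", 0.30),
    ("model_a", "German", 0.10),
    ("model_a", "English", 0.20),
    ("model_b", "Arabic", 0.40),
    ("model_b", "German", 0.20),
    ("model_b", "English", 0.10),
]


def mean_score_by_family(scores, family_of):
    """Average the per-language scores of each model within each language family."""
    grouped = defaultdict(list)
    for model, language, score in scores:
        grouped[(model, family_of[language])].append(score)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    averages = mean_score_by_family(PLACEHOLDER_SCORES, LANGUAGE_FAMILY)
    for (model, family), avg in sorted(averages.items()):
        print(f"{model:8s} {family:14s} {avg:.3f}")
```

Averaging within a family before comparing families keeps a family with many evaluated languages from dominating the comparison; whether the paper aggregates in exactly this way is not stated in this excerpt.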