mHuBERT-147: A Compact Multilingual HuBERT Model

Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

TL;DR

The findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

Abstract

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
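
To illustrate the kind of up-sampling the abstract alludes to, here is a minimal sketch of two-level sampling, first over languages and then over datasets within a language. It is not the authors' implementation: the exponent-based formula, the `alpha_lang` and `alpha_data` parameters, the function names, and the hour counts are all assumptions chosen for illustration.

```python
def upsample_probs(hours, alpha):
    """Sampling probabilities proportional to (share of hours) ** alpha.

    alpha = 1.0 keeps the natural distribution; alpha < 1.0 shifts
    probability mass toward smaller entries (assumed formula).
    """
    total = sum(hours.values())
    weights = {name: (h / total) ** alpha for name, h in hours.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


def two_level_probs(corpus, alpha_lang=0.5, alpha_data=0.5):
    """Probability of drawing from each (language, dataset) pair.

    Computed as P(language) * P(dataset | language), with up-sampling
    applied independently at both levels.
    """
    lang_hours = {lang: sum(ds.values()) for lang, ds in corpus.items()}
    p_lang = upsample_probs(lang_hours, alpha_lang)
    probs = {}
    for lang, datasets in corpus.items():
        p_data = upsample_probs(datasets, alpha_data)
        for name, p in p_data.items():
            probs[(lang, name)] = p_lang[lang] * p
    return probs


# Hypothetical hours per language and dataset (illustrative values only).
corpus = {
    "high_resource": {"dataset_a": 700.0, "dataset_b": 100.0},
    "low_resource": {"dataset_a": 8.0, "dataset_c": 2.0},
}
probs = two_level_probs(corpus)
```

With exponents below 1, the low-resource language is drawn more often than its raw share of hours would suggest, and the same holds for small datasets within a language.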

Paper Structure (55 sections, 1 equation, 3 figures, 22 tables)

Figures (3)

  • Figure 1: Speech amount per language in a logarithmic scale, with different levels of speech resourcefulness: $\geq$ 800 h (blue), $\geq$ 100 h (green), $\geq$ 50 h (orange), $\geq$ 10 h (red), $\le$ 10 h (purple). Some languages are excluded from the plot. Best seen in color.
  • Figure 2: Loss curves for the 3 iterations of mHuBERT-147 training. For the 3rd iteration, the step counter jumps from 1.8M back to 0 because the optimizer was re-initialized: this model crashed in fp16 and had to be re-initialized in fp32 for the last 200K updates. Best seen in color.
  • Figure 3: Linear interpolation of the average PER of mHuBERT-147 across iterations and for different numbers of updates. The scores in bold were obtained at the following update counts, from left to right: 400K, 1M, 1.5M and 2M.