OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

Yihong Liu, Peiqin Lin, Mingyang Wang, Hinrich Schütze

TL;DR

OFA addresses the high resource cost of expanding vocabularies in multilingual language models by initializing unseen subword embeddings in a factorized, crosslingual embedding space, using external multilingual word vectors. It replaces the full embedding matrix with two smaller matrices, one of which acts as a primitive basis shared across languages, so multilingual continued pretraining needs fewer parameters and converges faster. Empirical results with RoBERTa and XLM-R as source models across five downstream tasks show that OFA matches or exceeds the baselines with a smaller carbon footprint, often performing best at moderate latent dimensions. The work demonstrates strong crosslingual transfer, environmental benefits, and broad applicability to encoder-based architectures; code and models are publicly available.

Abstract

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually initializes the embeddings of new subwords randomly and introduces substantially more embedding parameters to the model, which reduces efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show that OFA accelerates the convergence of continued pretraining, which is environmentally friendly because a much smaller carbon footprint is generated. Through extensive experiments, we demonstrate that OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
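
To make the parameter saving from factorization concrete, here is a minimal sketch of the arithmetic. It assumes a V x D embedding matrix is replaced by a V x D' matrix and a D' x D matrix. The vocabulary size, hidden size, and latent size below are illustrative values chosen for this example, not figures stated in the abstract.

```python
def full_embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a single V x D embedding matrix."""
    return vocab_size * hidden_dim


def factorized_embedding_params(vocab_size: int, latent_dim: int, hidden_dim: int) -> int:
    """Parameters in a V x D' matrix plus a D' x D matrix."""
    return vocab_size * latent_dim + latent_dim * hidden_dim


# Illustrative (assumed) sizes: an extended vocabulary of 400K subwords,
# a 768-dimensional encoder, and a 200-dimensional latent space.
VOCAB_SIZE = 400_000
HIDDEN_DIM = 768
LATENT_DIM = 200

full = full_embedding_params(VOCAB_SIZE, HIDDEN_DIM)
factorized = factorized_embedding_params(VOCAB_SIZE, LATENT_DIM, HIDDEN_DIM)

print(f"full embedding matrix:   {full:,} parameters")
print(f"factorized (two parts):  {factorized:,} parameters")
print(f"fraction of full size:   {factorized / full:.3f}")
```

With a large vocabulary, the V x D' term dominates, so the saving is roughly proportional to D' / D; the small D' x D matrix adds little.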

Paper Structure (42 sections, 3 equations, 4 figures, 30 tables)

This paper contains 42 sections, 3 equations, 4 figures, 30 tables.

Figures (4)

  • Figure 1: Qualitative comparisons between baselines and OFA. OFA consistently achieves competitive or better performance than the baselines with either (a) a monolingual (RoBERTa) or (b) a multilingual (XLM-R) PLM as the source model, and with a smaller carbon footprint (C.F.) during continued pretraining, indicating higher efficiency. The stride of each axis in the chart is different.
  • Figure 2: Summary of OFA. Colors indicate which languages a block is specific to. Green: source languages; blue: target languages; orange: both.
  • Figure 3: The training loss and the performance on five downstream tasks from step 0 (without continued pretraining) to step 100K (the 10th checkpoint). Models initialized by OFA converge faster than the baseline models (RoBERTa-rand and XLM-R-rand), whose new subwords are randomly initialized during continued pretraining. For most of the downstream tasks, models with lower embedding dimensions achieve better performance after only 10K steps than their full-dimensional counterparts (OFA-mono-768 and OFA-multi-768).
  • Figure 4: Information preserved (percentage of variance explained by the selected components) under different dimensions of the semantic space (number of principal components). General trend: multilingual models preserve more information than monolingual ones when embeddings are reduced to the same dimension.
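
The "information preserved" measure in Figure 4 can be sketched as follows. This is a minimal illustration, assuming the singular values of a mean-centered embedding matrix are already available; the variance explained by the top k principal components is then the share of squared singular values they account for. The function name and the toy spectrum are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def variance_preserved(singular_values: Sequence[float], k: int) -> float:
    """Fraction of total variance explained by the top-k principal components.

    Assumes `singular_values` come from a mean-centered matrix, so each
    component's variance is proportional to its squared singular value.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    if not 0 <= k <= len(energies):
        raise ValueError("k must be between 0 and the number of components")
    total = sum(energies)
    if total == 0:
        raise ValueError("all singular values are zero")
    return sum(energies[:k]) / total


# Toy spectrum (made up for illustration only).
toy_spectrum = [4.0, 2.0, 1.0, 1.0]

for k in range(1, len(toy_spectrum) + 1):
    share = variance_preserved(toy_spectrum, k)
    print(f"top {k} component(s): {share:.1%} of variance")
```

A curve that rises quickly with k means a low-dimensional space already retains most of the variance, which is the comparison the figure draws between multilingual and monolingual source models.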