
On Initializing Transformers with Pre-trained Embeddings

Ha Young Kim, Niranjan Balasubramanian, Byungkon Kang

TL;DR

This paper analyzes the surprising finding that pre-trained embeddings can underperform random initialization in transformer models, with BERT and mBERT embeddings as notable exceptions. It identifies two factors that shape these outcomes: the variance of the embedding distribution and the interaction between embeddings and position encodings. Standardizing pre-trained embeddings to the Xavier variance range often improves results for GloVe, T5, and mT5, while BERT and mBERT remain robust because their variance is already closer to that range. The authors provide empirical evidence across translation and classification tasks, showing that the semantic information preserved in the embeddings matters (as shown by shuffle tests) and that position encodings can either assist or overshadow embedding structure depending on variance. In practice, the work suggests distribution-aware preprocessing as a cost-effective route to better training dynamics and performance, rather than adopting pre-trained vectors blindly.

Abstract

It has become common practice to use random initialization schemes, rather than pre-trained embeddings, when training transformer-based models from scratch. Indeed, we find that pre-trained word embeddings from GloVe, and some sub-word embeddings extracted from language models such as T5 and mT5, fare much worse than random initialization. This is counter-intuitive given the well-known representational and transfer-learning advantages of pre-training. Interestingly, we also find that BERT and mBERT embeddings fare better than random initialization, showing the advantages of pre-trained representations. In this work, we posit two potential factors that contribute to these mixed results: the model's sensitivity to parameter distribution and the embeddings' interactions with position encodings. We observe that pre-trained GloVe, T5, and mT5 embeddings have a wider distribution of values. As argued in the initialization studies, such large-value initializations can lead to poor training because of saturated outputs. Further, the larger embedding values can, in effect, absorb the smaller position encoding values when added together, thus losing position information. Standardizing the pre-trained embeddings to a narrow range (e.g., as prescribed by Xavier) leads to substantial gains for GloVe, T5, and mT5 embeddings. On the other hand, BERT pre-trained embeddings, while larger, are still relatively close to the Xavier initialization range, which may allow them to effectively transfer the pre-trained knowledge.
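
The standardization idea in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`xavier_std`, `standardize_to_xavier`, `sinusoidal_positions`, `rms`) are hypothetical, the Xavier fan-in and fan-out are taken to be the two dimensions of the embedding matrix, the "pre-trained" matrix is a random stand-in, and sinusoidal position encodings are assumed purely for the magnitude comparison.

```python
import numpy as np


def xavier_std(fan_in, fan_out):
    # Standard deviation implied by Xavier/Glorot initialization:
    # Var = 2 / (fan_in + fan_out).
    return np.sqrt(2.0 / (fan_in + fan_out))


def standardize_to_xavier(emb):
    # Shift to zero mean, scale to unit std, then rescale so the whole
    # matrix has the Xavier std. Assumption: fans = (vocab_size, dim).
    vocab_size, dim = emb.shape
    centered = emb - emb.mean()
    return centered / centered.std() * xavier_std(vocab_size, dim)


def sinusoidal_positions(n_pos, dim):
    # Standard sinusoidal position encodings (dim assumed even);
    # every value lies in [-1, 1].
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((n_pos, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def rms(x):
    # Root-mean-square magnitude of all entries.
    return np.sqrt(np.mean(np.square(x)))


# Random stand-in for a wide-valued pre-trained embedding matrix.
rng = np.random.default_rng(0)
pretrained = rng.normal(loc=0.1, scale=3.0, size=(1000, 64))
standardized = standardize_to_xavier(pretrained)
positions = sinusoidal_positions(50, 64)

# Relative scale of embeddings vs. position encodings, before and after.
ratio_before = rms(pretrained) / rms(positions)
ratio_after = rms(standardized) / rms(positions)
```

Comparing `ratio_before` with `ratio_after` gives a rough sense of how much the embedding values dominate the position encodings when the two are added, which is the "absorption" effect the abstract describes. Whether any additional scaling (for example by the square root of the model dimension) is applied before the sum depends on the model and is not covered by this sketch.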

Paper Structure (18 sections, 5 equations, 1 figure, 7 tables)


Figures (1)

  • Figure 1: Validation BLEU over training epochs for various embedding initializations: IWSLT2017 with mBERT (left) and mT5 (center) embeddings, and Multi30k with GloVe embeddings (right). Note the variance of the pre-trained embeddings from Table \ref{table:sub_emb_statistics} ($\sigma_{mT5}>\sigma_{GloVe}>\sigma_{mBERT}$) and how it affects their performance gap relative to Xavier.