Knowledge Inheritance for Pre-trained Language Models

Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou

TL;DR

Knowledge Inheritance (KI) introduces a distillation-based framework to transfer knowledge from pre-trained, smaller PLMs to larger models during pre-training, reducing computational cost while enabling multi-teacher and cross-domain transfer. The approach uses a KL-divergence loss between teacher and student logits combined with the student’s self-supervised objective, guided by a decaying inheritance rate to shift from teacher guidance to self-learning. Empirical analyses show KI accelerates convergence, improves downstream performance, and supports generational accumulation, with benefits influenced by teacher architecture, data size, and domain similarity. The framework extends naturally to domain adaptation, demonstrating efficient knowledge transfer from domain-specific teachers and the potential for continual, multi-domain learning of large PLMs.
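The objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the two terms are mixed as a convex combination weighted by the inheritance rate `alpha`, and every function name here (`softmax`, `kl_divergence`, `ki_loss`) is hypothetical. How `alpha` decays over training is left to the caller.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ki_loss(student_logits, teacher_logits, target_index, alpha):
    """Mix the student's self-supervised loss with a distillation loss.

    Assumption (not stated in the summary): the mix is
    (1 - alpha) * L_self + alpha * L_KI, with alpha in [0, 1].
    """
    q_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    l_self = -math.log(q_student[target_index] + 1e-12)  # cross-entropy on the true token
    l_ki = kl_divergence(p_teacher, q_student)           # match the teacher's distribution
    return (1.0 - alpha) * l_self + alpha * l_ki

# Early in training alpha is near 1 (follow the teacher);
# late in training it is near 0 (pure self-learning).
early = ki_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2], target_index=0, alpha=0.9)
late = ki_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2], target_index=0, alpha=0.0)
```

With `alpha=0.0` the function reduces to the ordinary cross-entropy on the target token, which is the "self-learning" end of the schedule.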

Abstract

Recent explorations of large-scale pre-trained language models (PLMs) have revealed the power of PLMs with huge numbers of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous computational resources, which may be practically unaffordable. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring that many well-trained PLMs are available. To this end, we explore how existing PLMs could benefit the training of large-scale PLMs in the future. Specifically, we introduce a pre-training framework named "knowledge inheritance" (KI) and explore how knowledge distillation could serve as auxiliary supervision during pre-training to train larger PLMs efficiently. Experimental results demonstrate the superiority of KI in training efficiency. We also conduct empirical analyses to explore the effects of teacher PLMs' pre-training settings, including model architecture, pre-training data, etc. Finally, we show that KI could be applied to domain adaptation and knowledge transfer.

Paper Structure

This paper contains 40 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: (a) The validation PPL curve for pre-training $\mathcal{M}_L$ under the KI framework ($\texttt{BASE} \rightarrow \texttt{LARGE}$) and the self-learning baseline ($\texttt{LARGE}$). The teacher's ($\texttt{BASE}$) performance is $4.18$. (b) Pre-training $\texttt{BASE}$ under KI with three strategies for the inheritance rate $\alpha_t$: Linear, Heaviside and Constant. The teacher's ($\texttt{MEDIUM}$) performance is $4.95$. (c) Pre-training $\texttt{BASE}$ under KI with top-$K$ logits, varying $K$ over $\{10, 50, 100, 1000\}$.
  • Figure 2: (a) Experiments on GPT. (b) KI over generations. (c) Effects of $\mathcal{M}_S$'s architecture (depth).
  • Figure 3: Effects of $\mathcal{M}_S$'s pre-training (a) data size, (b) data domain and (c) data privacy for KI.
  • Figure 4: Left: effects of $\mathcal{M}_L$'s model size. Middle: effects of $\mathcal{M}_S$'s number of pre-training steps. Right: effects of $\mathcal{M}_L$'s batch size.
  • Figure 5: Left: the PPL curve when choosing teacher PLMs with different hidden sizes. Middle & Right: adapting $\text{RoBERTa}_{\texttt{BASE\_WB}}$ to the CS (middle) / BIO (right) domain with different numbers of training steps on different sizes of domain data. We compare two strategies: self-learning and KI. For example, $\text{RoBERTa}_{\texttt{CS\_3400M}}$ denotes post-training $\text{RoBERTa}_{\texttt{BASE\_WB}}$ with the self-learning strategy on the $3{,}400$M-token CS domain corpus. $\text{RoBERTa}_{\texttt{BASE\_WB} \rightarrow \texttt{CS\_3400M}}$ denotes post-training $\text{RoBERTa}_{\texttt{BASE\_WB}}$ with the KI strategy on the $3{,}400$M-token CS domain corpus.
  • ...and 2 more figures
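
The Figure 1 captions name three inheritance-rate strategies (Linear, Heaviside, Constant) and a top-$K$ variant of the teacher's predictions. The captions do not define them, so the sketch below shows one plausible reading only: the function names, the `switch_step` parameter, the default values, and the renormalization over the kept entries are all assumptions made for illustration.

```python
def linear_rate(step, total_steps, alpha_start=1.0):
    """Assumed 'Linear' strategy: decay from alpha_start to 0 over total_steps."""
    return max(0.0, alpha_start * (1.0 - step / total_steps))

def heaviside_rate(step, switch_step, alpha_start=1.0):
    """Assumed 'Heaviside' strategy: full teacher guidance, then an abrupt switch to 0."""
    return alpha_start if step < switch_step else 0.0

def constant_rate(step, alpha=0.5):
    """Assumed 'Constant' strategy: the same rate at every step."""
    return alpha

def top_k_renormalize(probs, k):
    """Keep the k largest teacher probabilities and rescale them to sum to 1.

    Returns a dict mapping vocabulary index -> renormalized probability.
    Storing only k entries per token is what would make caching teacher
    predictions cheap; whether the paper renormalizes is an assumption here.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

# Compare the three schedules at a few points of a 100-step run.
for step in (0, 50, 100):
    print(step, linear_rate(step, 100), heaviside_rate(step, 50), constant_rate(step))

# Truncate a toy 4-word teacher distribution to its top 2 entries.
kept = top_k_renormalize([0.5, 0.3, 0.1, 0.1], k=2)
```

A smaller `k` discards more of the teacher's tail probabilities, which is the trade-off panel (c) varies.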