Knowledge Inheritance for Pre-trained Language Models
Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
TL;DR
Knowledge Inheritance (KI) is a distillation-based framework that transfers knowledge from smaller, already pre-trained PLMs to larger models during pre-training, reducing computational cost while enabling multi-teacher and cross-domain transfer. The approach combines a KL-divergence loss between teacher and student logits with the student’s self-supervised objective, weighted by a decaying inheritance rate that gradually shifts the student from teacher guidance to self-learning. Empirical analyses show that KI accelerates convergence, improves downstream performance, and supports accumulation of knowledge across generations of models, with the benefits depending on teacher architecture, data size, and domain similarity. The framework extends naturally to domain adaptation, demonstrating efficient knowledge transfer from domain-specific teachers and the potential for continual, multi-domain learning of large PLMs.
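The combined objective summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear decay schedule, the `slope` and `temperature` parameters, the direction of the KL term (teacher relative to student), and all function names are hypothetical choices made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(probs, target_index):
    """Self-supervised loss for one position: negative log-likelihood of the target."""
    return -math.log(probs[target_index])


def inheritance_rate(step, total_steps, slope=1.0):
    """Assumed linear decay from 1 toward 0; the schedule shape is an assumption."""
    return max(1.0 - slope * step / total_steps, 0.0)


def ki_loss(student_logits, teacher_logits, target_index,
            step, total_steps, temperature=1.0):
    """Weighted sum of the student's own loss and the distillation loss."""
    alpha = inheritance_rate(step, total_steps)
    self_loss = cross_entropy(softmax(student_logits), target_index)
    distill_loss = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1.0 - alpha) * self_loss + alpha * distill_loss
```

Under these assumptions, the student is supervised only by the teacher at step 0 (inheritance rate 1) and only by its own self-supervised objective once the rate has decayed to 0. In a real pre-training loop the same weighting would be applied per token over batches, using a tensor library rather than Python lists.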
Abstract
Recent explorations of large-scale pre-trained language models (PLMs) have revealed the power of PLMs with huge numbers of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous computational resources, which may be practically unaffordable. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring that many well-trained PLMs are already available. To this end, we explore how existing PLMs could benefit the training of large-scale PLMs in the future. Specifically, we introduce a pre-training framework named "knowledge inheritance" (KI) and explore how knowledge distillation could serve as auxiliary supervision during pre-training to efficiently learn larger PLMs. Experimental results demonstrate the superiority of KI in training efficiency. We also conduct empirical analyses to explore the effects of teacher PLMs' pre-training settings, such as model architecture and pre-training data. Finally, we show that KI can be applied to domain adaptation and knowledge transfer.
