
Recyclable Tuning for Continual Pre-training

Yujia Qin, Cheng Qian, Xu Han, Yankai Lin, Huadong Wang, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou

TL;DR

This work addresses the resource waste and knowledge loss that arise when continually pre-trained PLMs are upgraded and the weights previously adapted to downstream tasks become outdated. It shows that upgraded PLMs remain compatible with the outdated adapted weights to some extent, and it uses analyses of mode connectivity and attention patterns to reveal close parametric and functional connections between successive models. Building on these insights, the authors propose two recyclable tuning strategies, an initialization-based method and a distillation-based method, which speed up convergence and improve task performance, with larger gains when the source and target tasks are related. They also discuss a training-free weight-recycling alternative and broader implications for sustainable NLP across model evolution.

Abstract

Continual pre-training is the paradigm where pre-trained language models (PLMs) continually acquire fresh knowledge from growing data and gradually get upgraded. Before an upgraded PLM is released, we may have tuned the original PLM for various tasks and stored the adapted weights. However, when tuning the upgraded PLM, these outdated adapted weights will typically be ignored and discarded, causing a potential waste of resources. We bring this issue to the forefront and contend that proper algorithms for recycling outdated adapted weights should be developed. To this end, we formulate the task of recyclable tuning for continual pre-training. In pilot studies, we find that after continual pre-training, the upgraded PLM remains compatible with the outdated adapted weights to some extent. Motivated by this finding, we analyze the connection between continually pre-trained PLMs from two novel aspects: mode connectivity and functional similarity. Based on the corresponding findings, we propose both an initialization-based method and a distillation-based method for our task. We demonstrate their feasibility in improving the convergence and performance of tuning the upgraded PLM. We also show that both methods can be combined to achieve better performance. The source code is publicly available at https://github.com/thunlp/RecyclableTuning.

Paper Structure (70 sections, 6 equations, 13 figures, 12 tables)

This paper contains 70 sections, 6 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Task formulation. The original PLM $\mathcal{M}_i$ is upgraded to $\mathcal{M}_{i+1}$ through continual pre-training on emerging data $\mathcal{D}_{i+1}$. Our goal is to recycle the existing adapted weights $\Delta_i$ of $\mathcal{M}_i$ for tuning $\mathcal{M}_{i+1}$.
  • Figure 2: (a, b): performance variation w.r.t. pre-training steps ($t$) when applying the outdated weights ($\Delta_{0}^\mathcal{T}$) to $\mathcal{M}_1(t)$. (c, d): performance variation when applying the outdated weights ($\Delta_{0}^\mathcal{T}$) to $\{\mathcal{M}_1, \cdots, \mathcal{M}_4\}$.
  • Figure 3: The performance of linear interpolations between two adapted PLMs on ChemProt. $\mu=0$ means $\mathcal{M}_{0}$, and $\mu=1$ means $\mathcal{M}_{1}(t)$ or $\mathcal{M}_\text{IND}$.
  • Figure 4: Linear mode connectivity between the initial $\mathcal{M}_0$ ($\mu=0$) and $4$ sequentially pre-trained PLMs $\{\mathcal{M}_1, \cdots, \mathcal{M}_4\}$ over multiple domains ($\mu=1$).
  • Figure 5: Visualization of attention heads in fine-tuned $\mathcal{M}_0$, $\mathcal{M}_1$, and $\mathcal{M}_2$ given the same input. For instance, "$\mathcal{L}4$$\mathcal{H}10$" refers to the $10$-th head in the $4$-th layer. An attention head of $\mathcal{M}_{i+1}$ is trained from that of $\mathcal{M}_i$ in the same column. In the heatmap, the color of the $i$-th element in the $j$-th row indicates the attention value from the $j$-th token to the $i$-th token. For more visualizations (including $\mathcal{M}_\text{IND}$), please refer to the paper's section with additional attention visualizations.
  • ...and 8 more figures
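
The captions for Figures 2 to 4 rely on two simple weight-space operations: applying outdated adapted weights $\Delta$ to an upgraded PLM, and linearly interpolating between two models with a coefficient $\mu$. The following is a minimal sketch of both, not the authors' implementation. It assumes that weights are stored as a plain mapping from parameter name to a flat list of floats and that $\Delta$ is an additive difference. The type alias `Weights` and the functions `apply_delta` and `interpolate` are hypothetical names introduced here for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Add adapted weights (assumed to be an additive difference) to a base model."""
    return {
        name: [b + d for b, d in zip(values, delta[name])]
        for name, values in base.items()
    }


def interpolate(theta_a: Weights, theta_b: Weights, mu: float) -> Weights:
    """Linear interpolation: mu=0 returns theta_a, mu=1 returns theta_b."""
    return {
        name: [(1.0 - mu) * a + mu * b for a, b in zip(values, theta_b[name])]
        for name, values in theta_a.items()
    }


# Toy values only; these are not results from the paper.
upgraded_plm = {"w": [1.0, 2.0]}
outdated_delta = {"w": [0.5, -1.0]}
recycled = apply_delta(upgraded_plm, outdated_delta)

model_a = {"w": [0.0, 4.0]}
model_b = {"w": [2.0, 0.0]}
midpoint = interpolate(model_a, model_b, mu=0.5)
```

In a mode-connectivity study of this kind, one would evaluate task performance at several values of `mu` between 0 and 1; a path without a large performance drop suggests the two endpoints are linearly connected.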