Amuro and Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Kaiser Sun, Mark Dredze
TL;DR
This work probes how pre-training and fine-tuning interact by fine-tuning multiple intermediate pre-training checkpoints of two LLMs across 18 datasets. It finds that continued pre-training can yield latent improvements that become evident only after fine-tuning, while supervised fine-tuning can cause the model to forget domain knowledge and become more sensitive to evaluation prompts, though more pre-training mitigates some of this sensitivity. The results show a dichotomy: tasks the model already learned during pre-training benefit less from fine-tuning, whereas tasks it did not learn during pre-training can gain significantly, with earlier checkpoints often offering larger gains. The study highlights training dynamics as a valuable lens for model development and advocates broader release of intermediate checkpoints to facilitate future research, while acknowledging resource constraints and the need to validate findings on larger models and more datasets.
Abstract
The development of large language models has led to a pre-train-then-align paradigm, in which a model is typically pre-trained on a large text corpus and then undergoes a tuning stage that aligns it with human preferences or downstream tasks. In this work, we investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints. Our results on 18 datasets suggest that i) continual pre-training improves the model in a latent way that is revealed only after fine-tuning; ii) with extra fine-tuning, datasets on which the model does not demonstrate capability gain much more than those on which the model already performs well during the pre-training stage; iii) although the model benefits significantly from supervised fine-tuning, it may forget previously known domain knowledge and tasks that are not seen during fine-tuning; iv) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated by more pre-training.
