Amuro and Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

Kaiser Sun; Mark Dredze

TL;DR

This work probes how pre-training and fine-tuning interact by fine-tuning multiple intermediate pre-training checkpoints of two LLMs across 18 datasets. It finds that continued pre-training can yield latent improvements that become evident only after fine-tuning. Supervised fine-tuning, in turn, can cause the model to forget domain knowledge and become more sensitive to prompts, though more pre-training mitigates some of this sensitivity. The results show a dichotomy: tasks already learned during pre-training benefit less from fine-tuning, whereas tasks not learned during pre-training can gain significantly, with earlier checkpoints often showing larger gains. The study highlights training dynamics as a valuable lens for model development and advocates broader release of intermediate checkpoints to facilitate future research, while acknowledging resource constraints and the need to validate the findings on larger models and more datasets.

Abstract

The development of large language models has led to a pre-train-then-align paradigm, in which a model is typically pre-trained on a large text corpus and then undergoes a tuning stage that aligns it with human preferences or downstream tasks. In this work, we investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints. Our results on 18 datasets suggest that i) continual pre-training improves the model in a latent way that is revealed only after fine-tuning; ii) with extra fine-tuning, datasets on which the model shows no capability gain much more than those on which the model already performs well during the pre-training stage; iii) although the model benefits significantly from supervised fine-tuning, it may forget previously known domain knowledge and tasks that are not seen during fine-tuning; iv) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated by more pre-training.
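The comparison at the heart of the abstract, scoring each intermediate checkpoint before and after fine-tuning, can be sketched in a few lines. This is only an illustrative sketch: the function name `fine_tuning_gain`, the checkpoint steps, and the scores are hypothetical placeholders and are not taken from the paper, and the gain is assumed here to be a plain score difference.

```python
from typing import Dict, List, Tuple


def fine_tuning_gain(
    base: Dict[int, float], tuned: Dict[int, float]
) -> List[Tuple[int, float]]:
    """Return (pre-training step, tuned score - base score) pairs.

    Only checkpoints present in both mappings are compared, and the
    result is ordered by pre-training step.
    """
    shared_steps = sorted(set(base) & set(tuned))
    return [(step, tuned[step] - base[step]) for step in shared_steps]


# Hypothetical scores keyed by pre-training step (NOT results from the paper).
base_scores = {1000: 0.25, 2000: 0.25, 3000: 0.50}
tuned_scores = {1000: 0.75, 2000: 0.75, 3000: 0.75}

gains = fine_tuning_gain(base_scores, tuned_scores)
for step, gain in gains:
    print(f"step {step}: gain {gain:+.2f}")
```

Under these made-up numbers, the base scores stay flat between the first two checkpoints while the fine-tuned scores are already high, which is the kind of pattern the abstract describes as a latent improvement; whether real checkpoints behave this way is what the paper's experiments examine.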

Paper Structure (33 sections, 1 equation, 15 figures, 8 tables)

Introduction
Background: Model Training
Experimental Setup
Model Choice
Training Procedure
Evaluation
Supervised Fine-Tuning: What does the model learn and forget?
Task Format
Domain Knowledge
Task Transfer
How does the model change across pre-training?
Does more pre-training yield better fine-tuning results?
Discussion
Related Work
Conclusion
...and 18 more sections

Figures (15)

Figure 1: Illustration of the experimental scheme. Intermediate pre-training checkpoints are fine-tuned on different datasets.
Figure 2: Example of model performance with different task formats. Results for all datasets can be found in Figure \ref{fig:app:task_format}.
Figure 3: LLAMA3-8B performance with different task formats. Instruct and Default always lead to the highest evaluation results.
Figure 4: Example of out-of-domain performance for fine-tuned models. The solid blue line represents the fine-tuned checkpoint evaluated on an out-of-domain dataset, and the dashed orange line represents the base checkpoint, where the model is not fine-tuned. Figure \ref{fig:ood:detrimental} shows an example of fine-tuning hurting OOD performance, while Figure \ref{fig:ood:beneficial} shows an example of fine-tuning boosting OOD performance as pre-training proceeds.
Figure 5: Ratio of out-of-domain performance change for each task, averaged across checkpoints.
...and 10 more figures
