Efficient Model Development through Fine-tuning Transfer

Pin-Jie Lin, Rishab Balasubramanian, Fengyuan Liu, Nikhil Kandpal, Tu Vu

TL;DR

This work tackles the inefficiency of updating large language models across version releases by proposing diff-vector transfer: extracting the fine-tuning update $\Delta_s = m'_s - m_s$ from a source version and adding it to a different target base $m_t$ to approximate the target’s fine-tuned state $m'_t$ without retraining. Grounded in linear mode connectivity, the method demonstrates substantial performance gains across multiple open-weight models, languages, and tasks, including strong results on IFEval and Global MMLU, and even enabling step-by-step reasoning. The study further shows that using the merged weights as a starting point accelerates subsequent fine-tuning and that iterative recycling-then-finetuning can improve efficiency in continual deployment scenarios. Multilingual experiments illustrate meaningful language-specific gains, suggesting a practical approach for cost-effective, ongoing LLM development. Overall, fine-tuning transfer emerges as a robust strategy to reduce training costs while maintaining competitive performance, especially when source and target checkpoints are in a linearly connected region of parameter space. $\Delta_s$ transfers, linear connectivity, and recycling strategies together offer a scalable path for continuous model improvement.

Abstract

Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector (representing the weight changes from fine-tuning) from one source model version and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the performance of the target base model. For example, transferring the fine-tuning updates from Llama 3.0 8B improves Llama 3.1 8B by 46.9% on IFEval and 15.7% on LiveCodeBench without additional training, even surpassing Llama 3.1 8B Instruct. Furthermore, we demonstrate performance gains on multilingual tasks, with 4.7% and 15.5% improvements on Global MMLU for Malagasy and Turkish, respectively. We observe that these merged models provide stronger initializations for further fine-tuning. Lastly, our controlled experiments suggest that fine-tuning transfer is most effective when source and target models lie in a linearly connected region of parameter space, and we provide a theoretical analysis of our method. Taken together, fine-tuning transfer offers a cost-efficient and practical strategy for continuous LLM development. Our code is available at github.com/pjlintw/finetuning-transfer.

Paper Structure

This paper contains 43 sections, 6 equations, 4 figures, 16 tables, 1 algorithm.

Figures (4)

  • Figure 1: To transfer fine-tuning (e.g., instruction tuning) from a source model version $s$ (e.g., Llama 3.0) to a target version $t$ (Llama 3.1), we first compute the diff vector $\Delta_s = m'_{s} - m_s$ from version $s$, where $m'_{s}$ is the fine-tuned model (instruction-tuned Llama 3.0) and $m_s$ is the base model (pretrained Llama 3.0). Then, we add $\Delta_s$ to the target base model (pretrained Llama 3.1) to approximate the fine-tuned model in version $t$ (instruction-tuned Llama 3.1).
  • Figure 2: GSM8K performance showing that fine-tuning transfer provides a more computationally efficient starting point (i.e., $\mathcal{M}_i + \Delta_{j}$) for further training. Here, $\mathcal{M}_i$ represents different intermediate pretrained checkpoints of OLMo 2 7B (with smaller values of $i$ indicating earlier checkpoints), and $\Delta_i$ refers to the diff vector resulting from the fine-tuning of version $i$. Additional results for $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_4$ can be found in the appendix.
  • Figure 3: GSM8K performance showing that fine-tuning transfer provides a more computationally efficient starting point (i.e., $\mathcal{M}_i + \Delta_{j}$) for further training. Here, $\mathcal{M}_i$ represents different intermediate pretrained checkpoints of OLMo 2 7B (with smaller values of $i$ indicating earlier checkpoints), and $\Delta_i$ refers to the diff vector resulting from the fine-tuning of version $i$.
  • Figure 4: GSM8K performance showing that both iterative ($\Delta^{iter}$) and direct ($\Delta^{dir}$) recycling-then-finetuning approaches offer faster convergence. At a high level, $\Delta^{iter}$ gradually incorporates fine-tuning updates, i.e., diff vectors, from previous model versions, while $\Delta^{dir}$ directly applies the diff vector from the latest model version to the current model.
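
The transfer step described in Figure 1 amounts to per-parameter arithmetic on checkpoints. The following is a minimal sketch, not the authors' implementation: it assumes each checkpoint is a dictionary mapping parameter names to NumPy arrays, and that the source and target versions share the same architecture (identical names and shapes). The helper names `diff_vector` and `apply_diff` and the toy parameters are hypothetical, chosen only for illustration.

```python
import numpy as np


def diff_vector(finetuned, base):
    """Compute Delta_s = m'_s - m_s for every parameter tensor."""
    return {name: finetuned[name] - base[name] for name in base}


def apply_diff(target_base, delta):
    """Approximate the fine-tuned target m'_t as m_t + Delta_s."""
    if target_base.keys() != delta.keys():
        raise ValueError("source and target must share parameter names")
    merged = {}
    for name, weight in target_base.items():
        if weight.shape != delta[name].shape:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = weight + delta[name]
    return merged


# Toy stand-ins for real checkpoints (assumption: two small tensors).
rng = np.random.default_rng(0)
m_s = {
    "layer.weight": rng.normal(size=(2, 3)),
    "layer.bias": rng.normal(size=(3,)),
}
# Pretend that fine-tuning shifted every source weight by 0.1.
m_s_finetuned = {name: w + 0.1 for name, w in m_s.items()}
# The target base is a slightly different pretrained checkpoint.
m_t = {name: w + 0.01 * rng.normal(size=w.shape) for name, w in m_s.items()}

delta_s = diff_vector(m_s_finetuned, m_s)
m_t_merged = apply_diff(m_t, delta_s)
```

With real models, the same arithmetic would run over the full set of checkpoint tensors. No gradient computation is involved, which is why the transfer itself requires no additional training.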