Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, Pradeep Dasigi

TL;DR

This work investigates the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors), finding that the parallel-train-then-merge procedure is often comparably effective to retraining on updated data mixtures while being significantly cheaper.

Abstract

Adapting general-purpose language models to new skills currently either requires an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving the model's ability to refuse dangerous or harmful prompts.

Paper Structure

This paper contains 36 sections, 7 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Trade-offs managed through $\omega$. We highlight the point along each curve that corresponds to our weighting heuristic, $\omega = \frac{|D|}{|G|}$. This point consistently achieves strong performance in all settings without requiring held-out data. Because the cost of parallel-train-then-merge (PTM) is negligible, we can test different mixture weights: we plot 10 checkpoints from evenly spaced values of $\omega$, as well as the heuristic.
  • Figure 2: WiSE-FT performance on all of SciRIFF vs. all of SciRIFF mixed with a matching amount of Tülu data. A matching amount of general data in the mix leads to an improvement in skill-specific performance and a much smaller degradation in general skills.
  • Figure 3: Plotting three PTM methods for each scenario. Both linear interpolation and WiSE-FT can achieve very strong domain-specific performance, at the cost of general performance and exaggerated refusals. While task arithmetic also improves in skill-specific performance, it preserves much more of the general skills.
  • Figure 4: General skills versus exaggerated refusals, showing a clear relationship between the two skill sets. Additionally, for the same general performance, PTM achieves much higher compliance on the exaggerated refusals evaluation than retraining (RT) and continued finetuning (CFT).
  • Figure 5: We show general skills versus exaggerated refusals, and highlight the point chosen by our heuristic, showing at most a small degradation in exaggerated refusal performance while preserving general skills.
  • ...and 1 more figure
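
The captions above refer to a mixture weight $\omega$, the heuristic $\omega = \frac{|D|}{|G|}$, and merging via task arithmetic or linear interpolation. The following is a minimal sketch of how such a merge could look, not the authors' implementation. It assumes that $|D|$ and $|G|$ are the sizes of the skill-specific and general training sets, that a task vector is the difference between the skill-trained weights and the shared pretrained weights, and that weights are stored as plain dictionaries of float lists. All function and variable names are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def heuristic_weight(num_skill_examples: int, num_general_examples: int) -> float:
    """omega = |D| / |G|, assuming |D| and |G| are dataset sizes (an assumption)."""
    return num_skill_examples / num_general_examples


def task_vector(skill: Weights, base: Weights) -> Weights:
    """Difference between the skill-trained weights and the pretrained weights."""
    return {
        name: [s - b for s, b in zip(skill[name], base[name])]
        for name in base
    }


def task_arithmetic(general: Weights, tau: Weights, omega: float) -> Weights:
    """Add the scaled task vector to the general model: general + omega * tau."""
    return {
        name: [g + omega * t for g, t in zip(general[name], tau[name])]
        for name in general
    }


def linear_interpolation(general: Weights, skill: Weights, omega: float) -> Weights:
    """Weighted average of two models: (1 - omega) * general + omega * skill."""
    return {
        name: [(1 - omega) * g + omega * s for g, s in zip(general[name], skill[name])]
        for name in general
    }


# Toy weights with a single parameter tensor of two values.
base = {"w": [0.0, 1.0]}      # shared pretrained checkpoint
general = {"w": [1.0, 1.0]}   # general model, trained from base
skill = {"w": [0.0, 3.0]}     # skill model, trained from base on new data only

omega = heuristic_weight(num_skill_examples=50, num_general_examples=200)
merged_ta = task_arithmetic(general, task_vector(skill, base), omega)
merged_li = linear_interpolation(general, skill, omega)
```

Because merging is only elementwise arithmetic over existing checkpoints, sweeping several values of `omega` needs no further training, which is consistent with the caption's remark that the cost of testing different mixture weights is negligible.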