Task-Specific Skill Localization in Fine-tuned Language Models

Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora

TL;DR

This paper asks where the task-specific skills learned during fine-tuning reside inside a large pre-trained language model. It introduces model grafting, a post-hoc procedure that identifies a very sparse region of parameters (about 0.01% of the model); copying the fine-tuned values for just that region onto the pre-trained model nearly matches the fine-tuned model's performance without any re-training. Compared with vanilla fine-tuning, grafted models are better calibrated in-distribution and make better out-of-distribution (OOD) predictions. In models trained on multiple tasks, the regions for different tasks are almost disjoint, and their overlap tracks task similarity, which suggests applications to multi-task and continual learning as well as compact storage of fine-tuned models.

Abstract

Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but there has been limited study of where these newly-learnt skills reside inside the massive model. This paper introduces the term skill localization for this problem and proposes a solution. Given the downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters ($\sim 0.01\%$ of model parameters) responsible for $>95\%$ of the model's performance, in the sense that grafting the fine-tuned values for just this tiny subset onto the pre-trained model gives performance almost as good as the fine-tuned model. While reminiscent of recent works on parameter-efficient fine-tuning, the novel aspects here are that: (i) No further re-training is needed on the subset (unlike, say, with lottery tickets). (ii) Notable improvements are seen over vanilla fine-tuning with respect to calibration of predictions in-distribution ($40$-$90\%$ error reduction) as well as the quality of predictions out-of-distribution (OOD). In models trained on multiple tasks, a stronger notion of skill localization is observed, where the sparse regions corresponding to different tasks are almost disjoint, and their overlap (when it happens) is a proxy for task similarity. Experiments suggest that localization via grafting can assist certain forms of continual learning.
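
The grafting step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the parameters are flattened into one vector and that a binary mask is already given; the optimization that learns the mask is not shown. The names `graft` and `region_overlap` are hypothetical, and the overlap measure (shared entries divided by the size of the smaller region) is only one plausible way to compare regions, not necessarily the paper's definition.

```python
import numpy as np

def graft(theta_pre, theta_ft, mask):
    """Take fine-tuned values where mask is True, pre-trained values elsewhere."""
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask, theta_ft, theta_pre)

def region_overlap(mask_a, mask_b):
    """Hypothetical overlap score: shared entries / size of the smaller region."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    smaller = min(a.sum(), b.sum())
    return float((a & b).sum() / smaller) if smaller else 0.0

# Toy example: 10,000 parameters, 1 of them grafted (0.01% of the model).
theta_pre = np.zeros(10_000)
theta_ft = theta_pre + 0.1          # pretend every parameter moved during FT
mask = np.zeros(10_000, dtype=bool)
mask[0] = True                      # placeholder region, not a learned one

grafted = graft(theta_pre, theta_ft, mask)
sparsity = mask.mean()              # fraction of parameters in the region
```

Only the masked entries and their indices need to be stored alongside the shared pre-trained model, which is why a sparse grafting region is cheap to keep per task.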

Paper Structure (38 sections, 2 theorems, 17 equations, 15 figures, 9 tables)

This paper contains 38 sections, 2 theorems, 17 equations, 15 figures, 9 tables.

Key Result

Theorem 4.2

Under the quantization assumption (ass:quantization), we have [...]. Moreover, with probability at least $1-\delta$, the variance term can be bounded by [...].

Figures (15)

  • Figure 1: Grafting learns a binary mask $\bm{\gamma}$ using the fine-tuned ($\bm{\theta}_{\textrm{ft}}$) and pre-trained ($\bm{\theta}_{\textrm{pre}}$) models, and creates a grafted model $\overline{\bm{\theta}_{\textrm{ft}}}(\bm{\gamma})$. For parameters in the region corresponding to $\bm{\gamma}$, $\overline{\bm{\theta}_{\textrm{ft}}}(\bm{\gamma})$ gets its values from $\bm{\theta}_{\textrm{ft}}$, while all other parameters default to $\bm{\theta}_{\textrm{pre}}$.
  • Figure 2: Accuracies of the grafting regions learned using our procedure in \ref{sec:learning_patches} and regions corresponding to the top-$s$ parameters based on magnitude of movement during FT. The learned region performs much better at low sparsity levels.
  • Figure 3: Testing existence of sparse grafting regions for prompt-based FT and standard FT (which uses a linear head on top of the [CLS] token). Skill localization is equally good for both FT approaches.
  • Figure 4: (a) Grafting regions that only contain biases of the model don't give good skill localization. (b) Localizing with lottery ticket pruning (setting remaining parameters to $0$) does not perform well at any sparsity level without re-training.
  • Figure 5: Grafting accuracy for FT with SGD and AdamW. For both SST-2 and QNLI, the AdamW-trained model is much worse at skill localization through grafting. However, a small $\ell_1$ regularization on the parameter movement during FT recovers localization.
  • ...and 10 more figures
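
The top-$s$ movement baseline mentioned in the Figure 2 caption can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function name `top_s_movement_mask` is hypothetical, parameters are treated as a single array, and ties are broken arbitrarily. It selects the $s$ parameters whose values changed most during FT and grafts them as in Figure 1; the learned regions that Figure 2 compares against come from a separate optimization that is not reproduced here.

```python
import numpy as np

def top_s_movement_mask(theta_pre, theta_ft, s):
    """Binary mask over the s parameters with the largest |theta_ft - theta_pre|."""
    theta_pre = np.asarray(theta_pre, dtype=float)
    theta_ft = np.asarray(theta_ft, dtype=float)
    movement = np.abs(theta_ft - theta_pre).ravel()
    mask = np.zeros(movement.shape, dtype=bool)
    if s > 0:
        # argpartition places the s largest movements in the last s slots.
        top_idx = np.argpartition(movement, -s)[-s:]
        mask[top_idx] = True
    return mask.reshape(theta_pre.shape)

# Toy example with six parameters.
theta_pre = np.zeros(6)
theta_ft = np.array([0.0, 0.5, -0.9, 0.1, 0.0, 0.3])

mask = top_s_movement_mask(theta_pre, theta_ft, s=2)
# Grafted model: fine-tuned values inside the region, pre-trained outside.
grafted = np.where(mask, theta_ft, theta_pre)
```

In the toy example the two largest movements are at indices 1 and 2, so only those entries carry fine-tuned values in `grafted`.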

Theorems & Definitions (2)

  • Theorem 4.2
  • Theorem 4.4