Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada, Marco Ciccone, Tatiana Tommasi

TL;DR

The paper addresses the challenge of efficiently editing large pre-trained models without incurring heavy linearization costs or risking interference between tasks. It introduces TaLoS, a sparse fine-tuning method that enforces function localization and exploits weight disentanglement by updating only the least-sensitive parameters identified via the diagonal Fisher Information $F_{[j,j]}$. This approach yields a near-linearized training regime and scalable task arithmetic, demonstrated by superior results in Task Addition and Task Negation across vision and language domains, along with structured, hardware-friendly sparsity patterns. The findings suggest practical benefits for deploying adaptable foundation models with modular, conflict-free task vectors, while providing insights into the localization and sparsity structure of transformer parameters, particularly in attention projections.
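
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy logistic-regression model in place of a transformer, estimates the diagonal Fisher Information by sampling labels from the model's own predictions, and keeps only the lowest-scoring fraction of parameters trainable. The function names `diagonal_fisher`, `low_sensitivity_mask`, and `sparse_step` are hypothetical.

```python
import numpy as np

def diagonal_fisher(theta, X, rng, n_samples=1):
    """Monte Carlo estimate of the diagonal Fisher for a toy logistic-regression model.

    Assumption: labels are sampled from the model's own predictive distribution.
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))              # p(y=1 | x), shape (n,)
    fisher = np.zeros_like(theta)
    for _ in range(n_samples):
        y = (rng.random(p.shape) < p).astype(float)   # y ~ p_theta(y | x)
        grads = (y - p)[:, None] * X                  # per-example gradient of log p(y | x)
        fisher += (grads ** 2).mean(axis=0)           # average squared gradient per parameter
    return fisher / n_samples

def low_sensitivity_mask(fisher, sparsity):
    """Boolean mask, True for the (1 - sparsity) fraction of entries with the lowest score."""
    k = int(round((1.0 - sparsity) * fisher.size))
    mask = np.zeros(fisher.size, dtype=bool)
    mask[np.argsort(fisher, axis=None)[:k]] = True
    return mask.reshape(fisher.shape)

def sparse_step(theta, grad, mask, lr):
    """Gradient step that updates only the masked (low-sensitivity) parameters."""
    return theta - lr * np.where(mask, grad, 0.0)

# Toy usage: 10 parameters, 90% sparsity, so a single parameter stays trainable.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10)) * np.linspace(0.1, 2.0, 10)
theta0 = rng.normal(size=10)
fisher = diagonal_fisher(theta0, X, rng, n_samples=4)
mask = low_sensitivity_mask(fisher, sparsity=0.9)
theta1 = sparse_step(theta0, grad=np.ones(10), mask=mask, lr=0.1)
```

In a real model the per-example gradients would come from automatic differentiation, and the mask would be calibrated once on the pre-trained weights before fine-tuning.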

Abstract

Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
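
For readers unfamiliar with task arithmetic, the following sketch shows the two operations the abstract refers to, using small NumPy arrays as stand-ins for model weights. It assumes the usual convention that a task vector is the difference between fine-tuned and pre-trained weights; the helper names, the scaling coefficients, and the toy weight values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def task_vector(theta_ft, theta_0):
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return theta_ft - theta_0

def apply_task_vectors(theta_0, taus, alphas):
    """Edit the pre-trained weights with a scaled sum of task vectors."""
    theta = theta_0.copy()
    for tau, alpha in zip(taus, alphas):
        theta = theta + alpha * tau
    return theta

# Hypothetical pre-trained weights and two sparsely fine-tuned copies.
theta_0 = np.ones(6)
theta_a = theta_0.copy()
theta_a[[0, 1]] += [0.5, -0.25]      # task A touched only entries 0 and 1
theta_b = theta_0.copy()
theta_b[[1, 4]] += [0.25, 0.75]      # task B touched only entries 1 and 4

tau_a = task_vector(theta_a, theta_0)
tau_b = task_vector(theta_b, theta_0)

# Task addition: compose both tasks into one model.
added = apply_task_vectors(theta_0, [tau_a, tau_b], [1.0, 1.0])

# Task negation: subtract a task vector to suppress task A.
negated = apply_task_vectors(theta_0, [tau_a], [-1.0])
```

Because sparse fine-tuning leaves most entries of each task vector at exactly zero, the vectors can be stored and combined cheaply. Whether the combined model behaves well on each task depends on weight disentanglement, which this toy example does not measure.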

Paper Structure

This paper contains 21 sections, 14 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Relative performance when pruning parameters with low sensitivity. The heatmaps illustrate the effect of pruning the parameters with the lowest sensitivity (measured by $[F_{[j,j]}(\bm{\theta}_0, \mathcal{D}_t)]_{j=1}^m$) on different tasks across various pre-trained models using data from different tasks. Each grid compares the accuracy ratios for models after pruning, where the rows represent the task dataset $\mathcal{D}_t$ used to identify the parameters with the lowest sensitivity, and the columns show the model's zero-shot performance on each task after pruning those parameters. The accuracy ratios are normalized by the model's performance before pruning. The sparsity ratio (10%) was found as the maximal sparsity that minimally influenced the model's output on the mask calibration dataset.
  • Figure 2: Visualizing weight disentanglement error. The heatmaps illustrate the disentanglement error $\xi(\alpha_1, \alpha_2)$ of each fine-tuning strategy on both a CLIP ViT-B/32 model (top) and a T5-Small model (bottom) across two task pairs. Lighter areas highlight regions of the weight space where disentanglement is more pronounced. The red box indicates the search space within which the optimal $\alpha$ values were searched (see the implementation details in the Appendix). We chose the task pairs to visualize by following [Tangent_task_arith_2023] for vision and a criterion akin to the one used in [tang2023parameter] for language.
  • Figure 3: Function localization. The heatmaps present the accuracy ratios for fine-tuned models across tasks for CLIP ViT-B/32 (top) and T5-Small (bottom) models. Each row indicates a model fine-tuned on a specific task, with columns representing its performance on different test datasets. Accuracy ratios are normalized by the pre-trained model's performance. Lighter colors indicate better performance, suggesting minimal interference between the fine-tuned model and other tasks' input spaces. The red diagonal highlights each model's test performance on its specific fine-tuning task.
  • Figure 4: Visualization of mask calibration. Percentage of parameters selected for sparse fine-tuning in a transformer block of a ViT-B/32 model (left) and a T5-Small model (right), after our method's mask calibration vs. LoTA's mask calibration, at 90% sparsity. On ViT-B/32, we calibrate the masks on the Cars dataset [krause2013cars], while on T5-Small we use QASC [khot2020qasc]. Full visualizations of all masked layers are reported in the Appendix.
  • Figure 5: Visualization of mask calibration. Percentage of parameters selected for sparse fine-tuning in a ViT-B/32 model (top) and a T5-Small model (bottom), after our method's mask calibration vs. LoTA's mask calibration, at 90% sparsity. On ViT-B/32, we calibrate the masks on the Cars dataset [krause2013cars], while on T5-Small we use QASC [khot2020qasc].
  • ...and 6 more figures
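
The Figure 2 caption uses the disentanglement error $\xi(\alpha_1, \alpha_2)$ without restating it. A definition commonly used in the task arithmetic literature, assumed here to be the one intended, compares the model edited with a single task vector against the model edited with both. In this sketch, $f$ is the network, $\bm{\theta}_0$ the pre-trained weights, $\bm{\tau}_t$ the task vector of task $t$, $\mu_t$ the input distribution of task $t$, and $\operatorname{dist}$ a discrepancy between predictions (for example, the prediction error); none of these symbols except $\xi$, $\alpha$, and $\bm{\theta}_0$ appears in the captions above.

$$
\xi(\alpha_1, \alpha_2) = \sum_{t=1}^{2} \mathbb{E}_{x \sim \mu_t}\Big[\operatorname{dist}\big(f(x;\, \bm{\theta}_0 + \alpha_t \bm{\tau}_t),\; f(x;\, \bm{\theta}_0 + \alpha_1 \bm{\tau}_1 + \alpha_2 \bm{\tau}_2)\big)\Big]
$$

Under this reading, a low $\xi$ over a wide range of $(\alpha_1, \alpha_2)$ means that adding the second task vector barely changes the model's predictions on the first task's inputs, and vice versa.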