
Task Addition and Weight Disentanglement in Closed-Vocabulary Models

Adam Hazimeh, Alessandro Favero, Pascal Frossard

TL;DR

The paper investigates editing closed-vocabulary pre-trained image classifiers with task arithmetic, which had previously been demonstrated mainly on open-vocabulary models. It defines task vectors $\tau_t = \theta_{\rm ft}^t - \theta_{\rm pre}$ and fuses tasks via $\theta_{\rm new} = \theta_{\rm pre} + \sum_t \lambda_t \tau_t$; in the closed-vocabulary setting, the task heads are first aligned via linear probing. The main findings are that weight disentanglement is a general consequence of pre-training and enables effective task addition across supervised, self-supervised, and CLIP-like pre-training, and that larger data and model scales improve performance. Furthermore, linear probing is often competitive with task addition at lower cost, making it a practical alternative for multi-task editing in non-language-supervised models. Overall, the work extends task arithmetic to a wider class of pre-trained models and highlights the trade-off between modularity and computational efficiency in multi-task deployment.
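The two formulas in the summary above can be illustrated with a minimal sketch. It assumes model weights are stored as a dictionary from parameter names to flat lists of floats; the names `task_vector`, `task_addition`, and `encoder.w` are hypothetical and not taken from the paper. The sketch covers only the weight arithmetic and leaves out the linear-probing step used to align the heads.

```python
from typing import Dict, List

# Assumption: weights are a dict of parameter name -> flattened list of floats.
Params = Dict[str, List[float]]


def task_vector(theta_pre: Params, theta_ft: Params) -> Params:
    """tau_t = theta_ft^t - theta_pre, computed parameter by parameter."""
    return {
        name: [ft - pre for ft, pre in zip(theta_ft[name], theta_pre[name])]
        for name in theta_pre
    }


def task_addition(
    theta_pre: Params, task_vectors: List[Params], lambdas: List[float]
) -> Params:
    """theta_new = theta_pre + sum_t lambda_t * tau_t."""
    theta_new = {name: list(values) for name, values in theta_pre.items()}
    for tau, lam in zip(task_vectors, lambdas):
        for name in theta_new:
            theta_new[name] = [
                w + lam * delta for w, delta in zip(theta_new[name], tau[name])
            ]
    return theta_new


# Toy weights (hypothetical values) for one pre-trained and two fine-tuned models.
theta_pre = {"encoder.w": [1.0, 2.0]}
theta_ft_a = {"encoder.w": [1.5, 2.0]}
theta_ft_b = {"encoder.w": [1.0, 1.0]}

taus = [task_vector(theta_pre, theta_ft_a), task_vector(theta_pre, theta_ft_b)]
merged = task_addition(theta_pre, taus, [1.0, 1.0])
unchanged = task_addition(theta_pre, taus, [0.0, 0.0])
```

Setting every coefficient to zero returns the pre-trained weights unchanged, which matches the $\lambda = 0$ case that the caption of Figure 2 identifies with plain linear probing.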

Abstract

Task arithmetic has recently emerged as a promising method for editing pre-trained open-vocabulary models, offering a cost-effective alternative to standard multi-task fine-tuning. However, despite the abundance of closed-vocabulary models that are not pre-trained with language supervision, applying task arithmetic to these models remains unexplored. In this paper, we deploy and study task addition in closed-vocabulary image classification models. We consider different pre-training schemes and find that weight disentanglement, the property enabling task arithmetic, is a general consequence of pre-training, as it appears in different pre-trained closed-vocabulary models. In fact, we find that pre-trained closed-vocabulary vision transformers can also be edited with task arithmetic, achieving high task addition performance and enabling the efficient deployment of multi-task models. Finally, we demonstrate that simple linear probing is a competitive baseline to task addition. Overall, our findings expand the applicability of task arithmetic to a broader class of pre-trained models and open the way for more efficient use of pre-trained models in diverse settings.

Paper Structure

This paper contains 26 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Weight disentanglement error heatmaps for the different pre-training algorithms. The heatmaps show the pairwise weight disentanglement error $\xi_{\tau_1,\tau_2}(\lambda_1, \lambda_2)$ of ViT-B-16 models pre-trained with different schemes. Light areas denote regions of weight space enjoying stronger weight disentanglement. The red box delimits the search space used to compute the best scaling coefficient $\lambda \in [-1, 1]$.
  • Figure 2: Average normalized task addition accuracy for different pre-training algorithms as a function of the scaling coefficient $\lambda$. The value $\lambda=0$ corresponds to simple linear probing with no task vector added to the visual encoder.
  • Figure 3: Weight disentanglement error heatmaps for Supervised pre-training on ImageNet1k vs. ImageNet21k. Both models are based on a ViT-B-16 architecture. The red box delimits the search space used to compute the best scaling coefficient $\lambda$.
  • Figure 4: Weight disentanglement error heatmaps highlighting the effect of model scale. The two models are ViT-B and ViT-L, both pre-trained on ImageNet21k in a supervised manner. The red box delimits the search space used to compute the best scaling coefficient $\lambda$.
  • Figure 5: Full Fine-tuning: Weight disentanglement error heatmaps for the different pre-training algorithms. All models are based on a ViT-B-16 architecture following the full fine-tuning regime. The red box delimits the search space used to compute the best scaling coefficient $\lambda$.
  • ...and 4 more figures