Escaping Optimization Stagnation: Taking Steps Beyond Task Arithmetic via Difference Vectors

Jinping Wang, Zhiqiang Gao, Dinggen Zhang, Zhiwu Xie

TL;DR

DV-BASI extends task arithmetic by introducing difference vectors derived from optimization history to enable continuous, multi-step exploration of parameter space. It uses a learnable block-diagonal anisotropic scaling matrix to perturb the current merged weights along the direction of accumulated improvements, enabling escape from local optima without extra modules. The method integrates with existing task-arithmetic methods and supports both supervised and unsupervised settings, achieving state-of-the-art results on benchmarks including task negation, task addition, and test-time adaptation. It also demonstrates that multi-task merged models can outperform individually fine-tuned models and that DV-BASI can enhance single-task performance after fine-tuning, all with modest computational overhead.

Abstract

Current methods for editing pre-trained models face significant challenges, primarily high computational costs and limited scalability. Task arithmetic has recently emerged as a promising solution: it applies simple arithmetic operations (addition and negation) to task vectors, which are the differences between fine-tuned and pre-trained model weights, to efficiently modify model behavior. However, the full potential of task arithmetic remains underexplored, primarily due to limited mechanisms for overcoming optimization stagnation. To address this challenge, we introduce the notion of a difference vector, a generalized form of task vector derived from the historical movements during optimization. Using difference vectors as directed perturbations, we propose the Difference Vector-based Anisotropic Scaling Iterative algorithm (DV-BASI) to enable a continuous optimization process for task arithmetic methods without relying on any additional modules or components. Notably, by leveraging the escapability and directional advantages of difference vectors, the multi-task model merged by DV-BASI may even outperform individually fine-tuned models in average performance across tasks. Based on this observation, we extend the application of difference vectors to a feasible fine-tuning method for single-task models. On the practical side, DV-BASI allows expressive search directions with few learnable parameters and forms a scalable framework. We also integrate DV-BASI with task arithmetic methods and advanced optimization techniques to achieve state-of-the-art performance on both supervised and unsupervised evaluation protocols.
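The task-vector operations described in the abstract can be illustrated with a minimal sketch. It assumes model weights are stored as a dictionary from parameter-block name to a flat list of floats; the helper names `task_vector` and `apply_task_vectors` and the scalar coefficients are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List

# Assumption: weights are a mapping from block name to a flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, pretrained: Weights) -> Weights:
    """Task vector = fine-tuned weights minus pre-trained weights, per block."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_task_vectors(
    pretrained: Weights, vectors: List[Weights], coeffs: List[float]
) -> Weights:
    """Add scaled task vectors to the pre-trained weights.

    A positive coefficient corresponds to task addition,
    a negative coefficient to task negation.
    """
    merged = {name: list(values) for name, values in pretrained.items()}
    for vec, coeff in zip(vectors, coeffs):
        for name in merged:
            merged[name] = [m + coeff * v for m, v in zip(merged[name], vec[name])]
    return merged


# Toy usage with made-up numbers.
theta_pre = {"w": [1.0, 2.0]}
theta_a = {"w": [2.0, 2.0]}  # hypothetical model fine-tuned on task A
theta_b = {"w": [1.0, 4.0]}  # hypothetical model fine-tuned on task B

tau_a = task_vector(theta_a, theta_pre)
tau_b = task_vector(theta_b, theta_pre)

added = apply_task_vectors(theta_pre, [tau_a, tau_b], [1.0, 1.0])  # task addition
negated = apply_task_vectors(theta_pre, [tau_a], [-1.0])           # task negation
```

In practice the coefficients are tuned or learned rather than fixed at 1 or -1; the sketch only shows the arithmetic itself.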

Paper Structure

This paper contains 16 sections, 7 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overall iterative procedure of DV-BASI is illustrated in (a). Starting from pre-trained weights $\theta_{\text{pre}}$, the model initially reaches a local optimum $\theta_0$ during the first major optimization step. At each local optimum $\theta_j$, DV-BASI computes difference vectors $\delta_j$ (indicated by purple arrows), which provide directional guidance for further optimization ($\hat{\delta}_j$ denotes the directional vector of $\delta_j$). Based on these difference vectors, we apply anisotropic scaling matrices $\Lambda_j$ to create more flexible exploration directions, aiming to find a potentially better global solution. (b) provides a detailed illustration of the anisotropic scaling process for a difference vector. Assume each difference vector has two parameter blocks $\delta_j = (\delta_j^{(1)}, \delta_j^{(2)})$. Each block is independently scaled by the anisotropic matrix $\Lambda_j$ (where $\Lambda_j = (\Lambda_j^{(1)}, \Lambda_j^{(2)})$), which offers more expressive search directions than a scalar scaling coefficient $\alpha$ (aTLAS). (c) visualizes the iterative optimization path of DV-BASI in a loss landscape. It demonstrates how difference vectors function as directed perturbations, effectively helping model weights escape from the current local optima (red circles) to continue searching anisotropically (the purple line represents the anisotropic scaling trajectory of DV-BASI, based on gradient descent) for a potentially better solution in the parameter space.
  • Figure 2: Panels (a) and (b) show the stepwise relative accuracy of supervised model merging (using ViT-B/32 as the pre-trained backbone) and its growth over 4 DV-BASI iterations. Panel (c) compares the unsupervised model merging performance of 10 different initial scaling coefficients (0.1 to 1.0) across 3 pre-trained backbones (ViT-B/32, ViT-B/16, and ViT-L/14).
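The block-wise scaling described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the difference vector is assumed to be $\delta_j = \theta_j - \theta_{j-1}$, the matrix $\Lambda_j$ is reduced to one learnable scalar per parameter block, the update is assumed to be $\theta_j + \Lambda_j \delta_j$, and the scalars are fitted by plain gradient descent with finite-difference gradients on a made-up quadratic loss. All function names are hypothetical.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]  # assumption: block name -> flat list of values


def difference_vector(current: Weights, previous: Weights) -> Weights:
    """Assumed definition: movement between two consecutive optimization states."""
    return {k: [c - p for c, p in zip(current[k], previous[k])] for k in current}


def scale_and_step(theta: Weights, delta: Weights, lam: Dict[str, float]) -> Weights:
    """Perturb theta along delta, scaling each block by its own coefficient."""
    return {k: [t + lam[k] * d for t, d in zip(theta[k], delta[k])] for k in theta}


def learn_block_scales(
    theta: Weights,
    delta: Weights,
    loss: Callable[[Weights], float],
    steps: int = 200,
    lr: float = 0.1,
    eps: float = 1e-4,
) -> Dict[str, float]:
    """Fit one scale per block by gradient descent (central finite differences)."""
    lam = {k: 0.0 for k in theta}
    for _ in range(steps):
        grads = {}
        for k in lam:
            up, down = dict(lam), dict(lam)
            up[k] += eps
            down[k] -= eps
            grads[k] = (
                loss(scale_and_step(theta, delta, up))
                - loss(scale_and_step(theta, delta, down))
            ) / (2 * eps)
        lam = {k: lam[k] - lr * grads[k] for k in lam}
    return lam


# Toy setup with two parameter blocks, mirroring panel (b) of Figure 1.
theta_prev = {"block1": [0.0], "block2": [0.0]}
theta_j = {"block1": [1.0], "block2": [1.0]}
target = {"block1": [3.0], "block2": [1.5]}  # made-up optimum of the toy loss


def toy_loss(w: Weights) -> float:
    return sum((x - t) ** 2 for k in w for x, t in zip(w[k], target[k]))


delta_j = difference_vector(theta_j, theta_prev)
lam_j = learn_block_scales(theta_j, delta_j, toy_loss)
theta_next = scale_and_step(theta_j, delta_j, lam_j)
```

In this toy problem the two blocks need different step sizes along the same difference vector (about 2.0 for `block1` and 0.5 for `block2`), so no single scalar coefficient reaches the optimum, while per-block scales do. That is the sense in which block-wise scaling gives more expressive search directions.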