On Task Vectors and Gradients

Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Fabrizio Silvestri, Emanuele Rodolà

TL;DR

The paper establishes a rigorous link between task vectors and gradients, showing that a task vector from one epoch of finetuning equals the scaled negative gradient under full-batch gradient descent. It then proves that multi-epoch finetuning preserves this equivalence only up to a second-order error term $O(\eta^2)$, with explicit bounds for feed-forward networks. Empirically, the first-epoch gradient dominates the finetuning trajectory across seven vision benchmarks, explaining why merging one-epoch finetuned models can match merging fully converged models. Overall, task arithmetic is reframed as approximate multitask learning driven by early training dynamics, offering a principled basis for efficient model merging.

Abstract

Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
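The one-epoch identity and the second-order error can be checked numerically on a toy problem. The sketch below is not the authors' code: the least-squares loss, the data `X`, `y`, and the helpers `grad`, `task_vector`, and `drift` are all assumptions made for illustration, and one full-batch GD step is treated as one epoch. It also assumes the multi-epoch approximation is measured against $-k\eta$ times the gradient at the base model, which is one plausible reading of the statement above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Hypothetical task: least-squares loss L(theta) = 0.5 * mean((X @ theta - y) ** 2).
X = rng.normal(size=(20, dim))
y = rng.normal(size=20)
theta_base = rng.normal(size=dim)


def grad(theta):
    """Full-batch gradient of the toy loss at theta."""
    return X.T @ (X @ theta - y) / len(y)


def task_vector(eta, epochs):
    """Finetune from theta_base with full-batch GD and return theta - theta_base."""
    theta = theta_base.copy()
    for _ in range(epochs):  # one full-batch step per epoch
        theta = theta - eta * grad(theta)
    return theta - theta_base


eta = 0.01
g0 = grad(theta_base)

# One epoch: the task vector should equal -eta * gradient up to rounding.
one_epoch_gap = np.linalg.norm(task_vector(eta, 1) + eta * g0)


def drift(eta, epochs=2):
    """Distance between the k-epoch task vector and -k * eta * g0."""
    return np.linalg.norm(task_vector(eta, epochs) + epochs * eta * g0)


# If the error is second order in eta, halving eta should divide it by about 4.
ratio = drift(eta) / drift(eta / 2)

print("one-epoch gap:", one_epoch_gap)
print("drift ratio when eta is halved:", ratio)
```

On this quadratic toy loss the two-epoch drift is exactly proportional to $\eta^2$, so the printed ratio sits near 4; for real networks the scaling would only be expected to hold approximately and for small step sizes.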

Paper Structure

This paper contains 23 sections, 8 theorems, 59 equations, 4 figures.

Key Result

Theorem 1

Let $\theta_{\text{TA}}^{(k)} = \theta_{\text{base}} + \alpha \sum_{t \in T} \tau_t^{(k)}$ be the model obtained using vanilla task arithmetic with parameter $\alpha$. Let $\{\theta_t^{(k)}\}_{t\in T}$ be produced by running $k$ full-batch GD epochs with step size $\eta$ on each task, and let …
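
As a worked special case, take $k = 1$ and combine the definition of $\theta_{\text{TA}}^{(k)}$ with the one-epoch statement from the abstract, $\tau_t^{(1)} = -\eta \nabla L_t(\theta_{\text{base}})$. The symbol $L_t$ for the loss of task $t$ is an assumption of this sketch, not notation confirmed by the excerpt. Under that assumption:

$$
\theta_{\text{TA}}^{(1)}
= \theta_{\text{base}} + \alpha \sum_{t \in T} \tau_t^{(1)}
= \theta_{\text{base}} - \alpha \eta \sum_{t \in T} \nabla L_t(\theta_{\text{base}})
= \theta_{\text{base}} - \alpha \eta \, \nabla \Bigl( \sum_{t \in T} L_t \Bigr)(\theta_{\text{base}})
$$

Read this way, task arithmetic with one-epoch task vectors coincides with a single full-batch GD step of size $\alpha \eta$ on the summed task loss, which is the sense in which it approximates multitask learning.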

Figures (4)

  • Figure 1: Left: endpoint models are finetuned with SGD for more than one epoch. Right: endpoint models are finetuned with GD for a single epoch. In this case, task vectors are equivalent to negative gradients.
  • Figure 2: Task arithmetic accuracy: $1$ epoch vs. converged.
  • Figure 3: Analysis of first-epoch gradients.
  • Figure 4: Checkpoint projection of different merging strategies.

Theorems & Definitions (12)

  • Theorem 1
  • Proposition 1
  • Remark 1
  • Lemma 1
  • Theorem 2: Uniform bound on the coefficient vector $C\bigl(\{\theta_{\mathrm{MT}}^{(j)}\}\bigr)$
  • Proposition
  • Theorem
  • Lemma
  • proof
  • proof : Proof Proposition and Theorem
  • ...and 2 more