DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma

TL;DR

This work first investigates the vulnerabilities of model-merging methods and pinpoints the source-model characteristics that critically underlie them, then proposes DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge.

Abstract

Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models exhibiting the harmful traits -- where they would otherwise fail -- achieving significant performance gains.
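
The two harmful traits named in the abstract can be measured directly. The following is a minimal sketch, not the authors' code: it assumes parameters are flattened into plain Python lists, and the helper names (`task_vector`, `l2_norm`, `entropy`) and the toy numbers are hypothetical. It computes the norm ratio between two task vectors and compares the predictive entropy of a confident and a less confident output.

```python
import math

def task_vector(theta_ft, theta_pre):
    """Task vector: fine-tuned parameters minus pre-trained parameters."""
    return [f - p for f, p in zip(theta_ft, theta_pre)]

def l2_norm(vec):
    """Euclidean norm of a flattened parameter vector."""
    return math.sqrt(sum(x * x for x in vec))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy parameters (illustrative values only, not from the paper).
theta_pre = [0.0, 0.0, 0.0]
theta_a = [0.1, 0.0, 0.0]   # stands in for a model tuned with a small learning rate
theta_b = [0.0, 0.6, 0.0]   # stands in for a model tuned with a larger learning rate

tau_a = task_vector(theta_a, theta_pre)
tau_b = task_vector(theta_b, theta_pre)

# Factor (1): disparity in task vector norms.
norm_ratio = l2_norm(tau_b) / l2_norm(tau_a)

# Factor (2): low confidence shows up as high predictive entropy.
confident_entropy = entropy(softmax([8.0, 0.0, 0.0]))
hesitant_entropy = entropy(softmax([1.0, 0.0, 0.0]))

print(f"norm ratio: {norm_ratio:.2f}")
print(f"entropy (confident): {confident_entropy:.4f}")
print(f"entropy (hesitant):  {hesitant_entropy:.4f}")
```

In practice the entropy would be averaged over a held-out set, and the norms would be computed over the full set of model weights.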

Paper Structure

This paper contains 53 sections, 2 theorems, 41 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Let $\boldsymbol{\tau}_1,\boldsymbol{\tau}_2\in\mathbb{R}^d$ with $\|\boldsymbol{\tau}_2\|>0$, and define $\delta\coloneqq\|\boldsymbol{\tau}_1\|/\|\boldsymbol{\tau}_2\|$. Assume $\boldsymbol{\tau}_1\!\perp\!\boldsymbol{\tau}_2$. For $\boldsymbol{\tau}_{\mathrm{merge}}=\boldsymbol{\tau}_1+\cdots$ […] Hence, when $\delta\ll 1$, the merge is nearly perfectly aligned with $\boldsymbol{\tau}_2$ while i […]
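
The statement above is cut off in this listing, so the following is only a worked sketch of why a small $\delta$ leads to that conclusion. It assumes the merge is the plain sum $\boldsymbol{\tau}_{\mathrm{merge}}=\boldsymbol{\tau}_1+\boldsymbol{\tau}_2$ and that $\|\boldsymbol{\tau}_1\|>0$; both are assumptions, not text recovered from the source. Under orthogonality the cross term vanishes, which gives:

$$
\begin{aligned}
\|\boldsymbol{\tau}_{\mathrm{merge}}\|^2 &= \|\boldsymbol{\tau}_1\|^2+\|\boldsymbol{\tau}_2\|^2 = (1+\delta^2)\,\|\boldsymbol{\tau}_2\|^2,\\
\cos(\boldsymbol{\tau}_{\mathrm{merge}},\boldsymbol{\tau}_2) &= \frac{\langle \boldsymbol{\tau}_1+\boldsymbol{\tau}_2,\ \boldsymbol{\tau}_2\rangle}{\|\boldsymbol{\tau}_{\mathrm{merge}}\|\,\|\boldsymbol{\tau}_2\|} = \frac{1}{\sqrt{1+\delta^2}},\\
\cos(\boldsymbol{\tau}_{\mathrm{merge}},\boldsymbol{\tau}_1) &= \frac{\langle \boldsymbol{\tau}_1+\boldsymbol{\tau}_2,\ \boldsymbol{\tau}_1\rangle}{\|\boldsymbol{\tau}_{\mathrm{merge}}\|\,\|\boldsymbol{\tau}_1\|} = \frac{\delta}{\sqrt{1+\delta^2}}.
\end{aligned}
$$

For example, with $\delta=0.1$ these evaluate to roughly $0.995$ and $0.0995$: the merged vector points almost entirely along $\boldsymbol{\tau}_2$.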

Figures (9)

  • Figure 1: Failure Cases of Multi-Task Model Merging. All results were obtained using CLIP with a ViT-B-32 backbone on the eight vision tasks. (a) Comparison of normalized accuracy after merging models from different fine-tuning configurations averaged over eight vision tasks. The gray bar represents the conventional setting (a uniform learning rate of $10^{-5}$ with hard labels). The blue bar indicates the result of merging after training just one task with a learning rate (LR) of $10^{-4}$. The yellow bar shows the result when all tasks were trained with label smoothing (LS). Both the blue and yellow configurations show a significant performance degradation compared to the conventional setting. (b) Change in the task vector norm after fine-tuning with different learning rates for the same number of steps across eight vision tasks. The gray bar uses a learning rate of $10^{-5}$, matching the conventional benchmark, while the blue bar uses $10^{-4}$. We observe a 5 to 7-fold difference in the resulting task vector norms. (c) Change in the entropy of the model's predictive probabilities after fine-tuning with or without label smoothing across eight vision tasks. The vertical axis is on a logarithmic scale. Training with label smoothing increases the entropy by three orders of magnitude.
  • Figure 2: Evolution of DisTaC over steps. Results are averaged over the eight vision tasks with ViT-B-32; the error band shows one standard deviation around the mean. (a) Norm Mismatch: the blue curve plots normalized test accuracy relative to the teacher, and the green curve shows the percentage change in the task vector norm from the DisTaC initialization. Within roughly 100 steps, accuracy recovers to (or exceeds) the teacher's level while the task vector norm remains virtually unchanged from its $\kappa_t$-adjusted target. (b) Low Confidence: the blue curve again reports normalized test accuracy, whereas the orange curve tracks the test prediction entropy. About 100 steps suffice to drive the entropy substantially lower, yet the teacher-level accuracy is fully preserved.
  • Figure 3: Effect of scaling task vectors on test accuracy. For each of the eight vision tasks (ViT-B-32), we evaluate the model $\boldsymbol{\theta}_{\text{pre}} + \kappa_t \boldsymbol{\tau}$ as the scaling factor $\kappa_t$ varies from $0.0$ to $3.0$. Model performance is more robust to shrinking the task vector than to stretching it, suggesting that when harmonizing task vector norms, longer vectors should be shrunk to match shorter ones.
  • Figure 4: Impact of label smoothing on confidence calibration and merge performance. (a) Average reliability diagram for ViT-B-32 across eight vision tasks under different label-smoothing strengths $\alpha$. Without label smoothing ($\alpha=0$, dark purple) the model is strongly overconfident; as $\alpha$ increases to $0.01$ the model becomes well-calibrated, and at $\alpha=0.1$ it turns underconfident. (b) Normalized test accuracy obtained when the corresponding source models are merged. Merge performance decreases monotonically with larger $\alpha$, revealing a clear trade-off: lower confidence comes at the cost of lower accuracy after merging.
  • Figure 5: Layer-wise average task-vector norms for weight parameters in ViT-B-32, averaged over eight vision tasks. Gray bars correspond to a fine-tuning learning rate of $10^{-5}$, blue bars to $10^{-4}$.
  • ...and 4 more figures
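
The captions for Figures 2 and 3 describe scaling a task vector by a factor $\kappa_t$ and suggest shrinking longer vectors to match shorter ones. The following is a minimal sketch of that scaling step only, assuming flattened parameters stored as Python lists. The function names, the choice of the shortest norm as the target, and the toy values are illustrative assumptions, and the distillation stage that DisTaC runs after scaling is not shown.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flattened task vector."""
    return math.sqrt(sum(x * x for x in vec))

def shrink_to_shortest(task_vectors):
    """Scale every task vector down to the smallest norm in the set.

    Returns the scaled vectors and the per-task factors kappa_t (each <= 1).
    Using the minimum norm as the target is an assumption for illustration.
    """
    norms = {name: l2_norm(tau) for name, tau in task_vectors.items()}
    if any(n == 0.0 for n in norms.values()):
        raise ValueError("task vectors must have non-zero norm")
    target = min(norms.values())
    kappas = {name: target / n for name, n in norms.items()}
    scaled = {
        name: [kappas[name] * x for x in tau]
        for name, tau in task_vectors.items()
    }
    return scaled, kappas

def apply_task_vector(theta_pre, tau, kappa=1.0):
    """Model parameters theta_pre + kappa * tau, as in the Figure 3 caption."""
    return [p + kappa * t for p, t in zip(theta_pre, tau)]

# Toy task vectors with a 5x norm gap (illustrative values only).
task_vectors = {"task_a": [3.0, 4.0], "task_b": [0.6, 0.8]}
scaled, kappas = shrink_to_shortest(task_vectors)

theta_pre = [1.0, 1.0]
theta_a = apply_task_vector(theta_pre, task_vectors["task_a"], kappas["task_a"])

print(kappas)
print(theta_a)
```

After this step all task vectors share the same norm, which is the starting point from which, per Figure 2, accuracy is recovered within roughly 100 steps.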

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof