Localizing Task Information for Improved Model Merging and Compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard

TL;DR

Multi-task merging of fine-tuned checkpoints often suffers performance drops not from information erasure but from task interference. The authors show that task-specific information remains in the merged vector and introduce TALL-masks to localize it, enabling almost lossless reconstruction of single-task performance. They also propose Consensus Merging to remove selfish and catastrophic weights, improving merging across vision and NLP benchmarks. Additionally, the approach enables aggressive compression, reducing storage from 57Gb to 8.2Gb while preserving ~99.7% of original performance.

Abstract

Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints into one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required to solve each task is still preserved after merging, as different tasks mostly use non-overlapping sets of weights. We propose TALL-masks, a method to identify these task supports given a collection of task vectors, and show that one can retrieve >99% of the single-task accuracy by applying our masks to the multi-task vector, effectively compressing the individual checkpoints. We study the statistics of intersections among the constructed masks and reveal the existence of selfish and catastrophic weights, i.e., parameters that are, respectively, important exclusively to one task, and irrelevant to all tasks yet detrimental to multi-task fusion. For this reason, we propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches. Our experiments on vision and NLP benchmarks with up to 20 tasks show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of original performance.
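
The pipeline described in the abstract can be illustrated on toy vectors. The following is a minimal sketch, not the authors' implementation. It assumes that a task vector is the difference between fine-tuned and pre-trained weights, that the multi-task vector is the plain sum of task vectors, and that a weight is kept for task t when |tau_t| >= lam * |tau_mtl - tau_t|. The thresholding rule, the hyperparameter `lam`, and all function names are assumptions introduced for illustration; none of them is stated in this section.

```python
# Toy sketch of per-task mask construction on flat weight lists.
# Assumptions (not from this section): the merged vector is the plain sum of
# task vectors, and task t keeps a weight when
# |tau_t| >= lam * |tau_mtl - tau_t|. All names here are hypothetical.

def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def merge_sum(task_vectors):
    """Multi-task vector as the element-wise sum of task vectors."""
    return [sum(column) for column in zip(*task_vectors)]

def build_mask(tau_t, tau_mtl, lam=1.0):
    """Binary mask marking the weights where task t dominates the merge."""
    return [1 if abs(t) >= lam * abs(m - t) else 0
            for t, m in zip(tau_t, tau_mtl)]

def reconstruct(pretrained, tau_mtl, mask):
    """Approximate single-task weights from the merged vector and a mask."""
    return [p + (m if keep else 0.0)
            for p, m, keep in zip(pretrained, tau_mtl, mask)]

pretrained = [0.0, 0.0, 0.0]
theta_a = [1.0, 0.0, 0.25]    # toy checkpoint fine-tuned on task A
theta_b = [0.0, -2.0, 0.25]   # toy checkpoint fine-tuned on task B

tau_a = task_vector(theta_a, pretrained)
tau_b = task_vector(theta_b, pretrained)
tau_mtl = merge_sum([tau_a, tau_b])

mask_a = build_mask(tau_a, tau_mtl)
mask_b = build_mask(tau_b, tau_mtl)
theta_a_hat = reconstruct(pretrained, tau_mtl, mask_a)
print(mask_a, mask_b, theta_a_hat)
```

In this toy case the mask for task A drops the weight that only task B changed, while the weight both tasks changed is read back from the merged vector. Only the pre-trained weights, the merged vector, and one binary mask per task are needed for the reconstruction, which is the sense in which masking compresses the individual checkpoints.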

Paper Structure

This paper contains 34 sections, 10 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Illustration of our mask construction algorithm (left) along with the applications (right) on model compression and model merging. Each block corresponds to the same weight matrix, and color intensity reflects the value of each parameter; empty means zero value. Given single-task vectors $\{\bm{\tau}_{t}\}_{t=1}^4$ and the merged vector $\bm{\tau}_\textrm{MTL}$, our method constructs per-task masks $\{\bm{m}_t\}_{t=1}^4$, pinpointing the important parameters for each original task vector. For model merging, we keep only the 'general' weights selected by more than one mask and produce the consensus mask $\bm{m}_{\textrm{consensus}}$ and the final merged vector. For compression, we evaluate on each task with reconstructed task vectors by masking out the irrelevant weights, retaining almost full performance without saving the individual task vectors.
  • Figure 2: TALL-masks localizes task-specific information. The bar plot shows the percentage of parameters selected by TALL-masks, while the blue line shows the normalized validation accuracy achieved by the reconstructed $\hat{\bm{\theta}}_t$ with the selected masks using Eq. \ref{eq:construct_theta_hat}. The light-blue dashed line shows the task arithmetic baseline, where the information is not localized. Our task-specific masks allow the restoration of full performance, showing that all knowledge embedded in the initial fine-tuned checkpoints is preserved after merging.
  • Figure 3: The distribution of mask agreements in the merged vector produced by two model merging methods, Task Arithmetic and TIES. A non-negligible fraction of weights is important exclusively to one task (selfish), while another fraction is irrelevant to all tasks (catastrophic). Our method eliminates both categories to improve model merging.
  • Figure 4: Comparison of absolute accuracy (%) of individual tasks for the computer vision benchmarks and ViT-B/32. Results for ViT-B/16 and ViT-L/14 are provided in the appendix. Our Consensus Merging shows higher performance compared to model merging baselines, especially for the settings with more tasks. Our compression algorithm consistently matches the performance of the individual fine-tuned models at a fraction of the memory, while model merging techniques are not robust to the increase of tasks.
  • Figure 5: Averaged normalized accuracy vs. number of tasks for computer vision benchmarks. Our proposed specialist algorithm maintains initial performance regardless of task combination and heavily compresses the fine-tuned checkpoints.
  • ...and 6 more figures
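
The consensus step described in the captions for Figures 1 and 3 can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: the per-task masks are hard-coded rather than constructed, the agreement threshold `k` is a hypothetical parameter set to 2 to match "selected by more than one mask", and all function names are invented for this example.

```python
from collections import Counter

# Toy sketch of mask agreement and consensus masking.
# Assumptions: masks are given as 0/1 lists, and k (hypothetical name) is the
# minimum number of tasks that must select a weight for it to be kept.

def agreement_counts(masks):
    """For each weight, the number of task masks that select it."""
    return [sum(column) for column in zip(*masks)]

def consensus_mask(masks, k=2):
    """Keep a weight only if at least k task masks select it."""
    return [1 if c >= k else 0 for c in agreement_counts(masks)]

def apply_mask(vector, mask):
    """Zero out the entries of the vector that the mask rejects."""
    return [v if keep else 0.0 for v, keep in zip(vector, mask)]

# Four toy task masks over four weights (one row per task).
masks = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]
tau_mtl = [0.5, -2.0, 1.0, 0.25]  # toy merged vector

counts = agreement_counts(masks)
histogram = Counter(counts)        # distribution of mask agreements
m_consensus = consensus_mask(masks, k=2)
tau_consensus = apply_mask(tau_mtl, m_consensus)
print(counts, dict(histogram), m_consensus, tau_consensus)
```

Here the second weight is selected by exactly one mask (selfish in the captions' terminology) and the fourth by none (catastrophic), so both are removed from the merged vector, while the weights selected by two or more masks are retained.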