Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks

MohammadReza Davari, Eugene Belilovsky

TL;DR

This paper tackles the challenge of merging multiple fine-tunings of a single foundation model to create a scalable multi-task model without additional training. It introduces Model Breadcrumbs, a sparse, layerwise masking method that computes per-task weight differences, applies masks to prune outliers and negligible perturbations, and aggregates the masked directions with a small growth factor to form a unified model. Across vision (CLIP ViT variants on eight datasets) and NLP (T5-base on GLUE tasks) settings, the approach demonstrates robust hyperparameter generalization, strong multi-task performance, and favorable efficiency gains over prior methods like Task Arithmetic and Fisher merging. The work highlights practical benefits for updatable ML and cross-domain applicability, while noting dependencies on the quality of initial fine-tuned models and privacy considerations for data access.

Abstract

The rapid development of AI systems has been greatly influenced by the emergence of foundation models. A common approach for targeted problems involves fine-tuning these pre-trained foundation models for specific target tasks, resulting in a rapid spread of models fine-tuned across a diverse array of tasks. This work focuses on the problem of merging multiple fine-tunings of the same foundation model derived from a spectrum of auxiliary tasks. We introduce a simple new method, Model Breadcrumbs, which consists of a sparsely defined weight set that guides model adaptation within the weight space of a pre-trained model. These breadcrumbs are constructed by subtracting the pre-trained model's weights from the weights obtained after fine-tuning, followed by a sparsification process that eliminates weight outliers and negligible perturbations. Our experiments demonstrate the effectiveness of Model Breadcrumbs in simultaneously improving performance across multiple tasks. This contribution aligns with the evolving paradigm of updatable machine learning, reminiscent of the collaborative principles underlying open-source software development, fostering a community-driven effort to reliably update machine learning models. Our method is shown to be more efficient and, unlike previous proposals, does not require hyperparameter tuning for each new task added. Through extensive experimentation involving various models, tasks, and modalities, we establish that integrating Model Breadcrumbs offers a simple, efficient, and highly effective approach for constructing multi-task models and facilitating updates to foundation models.
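
The construction described in the abstract (weight difference, layerwise sparsification, aggregation onto the pre-trained weights) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `breadcrumb` and `merge`, the parameter names `drop_small`, `drop_large`, and `alpha`, and their default values are all hypothetical, and models are assumed to be dictionaries mapping layer names to arrays.

```python
import numpy as np


def breadcrumb(pretrained, finetuned, drop_small=0.85, drop_large=0.05):
    """Return a sparsified task vector for a single layer.

    drop_small and drop_large are assumed, illustrative fractions of the
    entries (ranked by absolute value) to discard as negligible
    perturbations and as outliers, respectively.
    """
    delta = finetuned - pretrained                    # task vector for this layer
    magnitude = np.abs(delta)
    low = np.quantile(magnitude, drop_small)          # below this: negligible
    high = np.quantile(magnitude, 1.0 - drop_large)   # above this: outlier
    mask = (magnitude >= low) & (magnitude <= high)
    return delta * mask


def merge(pretrained, finetuned_models, alpha=0.3, **mask_kwargs):
    """Add the scaled sum of masked task vectors to the pre-trained weights.

    pretrained and each entry of finetuned_models are dicts that map a
    layer name to a NumPy array; alpha is an assumed scaling factor.
    """
    merged = {}
    for name, base in pretrained.items():
        total = np.zeros_like(base)
        for model in finetuned_models:
            # Masking is applied independently per layer and per task.
            total += breadcrumb(base, model[name], **mask_kwargs)
        merged[name] = base + alpha * total
    return merged
```

With the illustrative defaults above, each layer keeps roughly 10% of its task-vector entries, which corresponds to the 90% sparsity level quoted in the figure captions. The thresholds are computed per layer rather than globally, so layers with very different update magnitudes are each sparsified to the same proportion.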

Paper Structure (14 sections, 4 equations, 10 figures, 3 tables, 1 algorithm)

Figures (10)

  • Figure 1: Method overview. We start with a foundation model that has been fine-tuned on various tasks. Next, we build a fine-tuning trajectory for each fine-tuned model by subtracting the pre-trained model weights from the fine-tuned weights (task vectors). Then, at each layer, we apply a masking operation over the absolute values of the resulting trajectory, eliminating both outliers and small values. Finally, these masked task vectors are aggregated and combined with the reference pre-trained model to create a unified multi-task model.
  • Figure 2: The solid line is the normalized accuracy averaged across all evaluation points. Each data point corresponds to an experiment involving a subset of the 8 tasks under study. Model Breadcrumbs (with 90% sparsity) consistently outperforms Task Arithmetic (Ilharco et al., 2022). In the experiment involving all eight tasks, Model Breadcrumbs outperforms Task Arithmetic by a substantial margin of 5.7%.
  • Figure 3: Validation-free setting. For the ViT-B-32 model, we tune the hyperparameters of each method (Breadcrumbs and Task Arithmetic) on the first 1, 2, or 3 tasks and then add further tasks using those hyperparameters, without a validation set. For the ViT-L-14 model, Breadcrumbs was tuned only in the 1-task scenario and evaluated on the additional tasks using those hyperparameters, whereas Task Arithmetic was given more chances to adjust its hyperparameters (after tasks 1, 2, and 3). We observe that Breadcrumbs substantially outperforms Task Arithmetic in this setting.
  • Figure 4: The 200-task sequence originates from the ImageNet dataset (Deng et al., 2009), created by dividing the data into 200 5-class classification tasks. After encountering 10 tasks using the ViT-L-14 model, the best hyperparameters for each method (Breadcrumbs with 85% sparsity and Task Arithmetic (Ilharco et al., 2022)) are selected and fixed. Each point on the plot represents the evaluation of the method over all tasks observed up to that point. With an increasing number of tasks, Model Breadcrumbs consistently outperforms Task Arithmetic by a substantial margin, highlighting the robustness of hyperparameters in the Model Breadcrumbs approach.
  • Figure 5: Comparative performance analysis of Model Breadcrumbs and Task Arithmetic (Ilharco et al., 2022) across varying CLIP model scales (ViT-B-32, ViT-B-16, and ViT-L-14) as the number of tasks increases. The solid line represents the averaged normalized accuracy across all evaluation points. Each data point corresponds to an experiment involving a subset of the 8 tasks under study. Our findings highlight the potential of larger-scale models to mitigate performance degradation and, as seen in Figure \ref{fig:scale-partial}, the capability of Model Breadcrumbs to produce multi-task models that surpass individual fine-tuned models for specific tasks.
  • ...and 5 more figures