ZipIt! Merging Models from Different Tasks without Training

George Stoica, Daniel Bolya, Jakob Bjorner, Pratik Ramesh, Taylor Hearn, Judy Hoffman

TL;DR

ZipIt! addresses the challenge of merging neural networks trained on completely different tasks without additional training. It introduces a general, graph-based merging framework that can fuse features within each model (not just across models) and supports partial zipping to create multi-head architectures, enabling efficient cross-task integration. The approach employs a merge matrix $M_i$ and an unmerge matrix $U_i$ to pair and average correlated features, then propagates these operations through the network to align subsequent layers. Empirical results on CIFAR-10/100 and ImageNet-1k, including multi-dataset and multimodal setups, show significant improvements over permutation-based baselines and often approach ensemble performance, especially with partial zipping and increased model width. Theoretical analysis provides tighter bounds on the merging barrier when within-model merges are allowed, and ZipIt! demonstrates practical feasibility for building multi-task systems without retraining.
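The pairing step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `match_features`, the greedy highest-correlation-first matching rule, and the convention that $M_i$ averages each pair while $U_i$ copies the merged feature back to both original slots are assumptions made for this example.

```python
import numpy as np

def match_features(acts_a, acts_b):
    """Greedily pair the most correlated features of the concatenated space.

    acts_a, acts_b: arrays of shape (num_samples, n) holding activations of
    the same layer in model A and model B on the same inputs.
    Returns (M, U): M has shape (n, 2n), U has shape (2n, n).
    """
    n = acts_a.shape[1]
    feats = np.concatenate([acts_a, acts_b], axis=1)  # (num_samples, 2n)
    corr = np.corrcoef(feats, rowvar=False)           # (2n, 2n) correlations
    np.fill_diagonal(corr, -np.inf)                   # no feature pairs with itself

    M = np.zeros((n, 2 * n))
    U = np.zeros((2 * n, n))
    for k in range(n):
        # Pick the most correlated pair among features not yet matched.
        # The pair may come from the same model or from different models.
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        M[k, [i, j]] = 0.5   # merged feature k is the average of i and j
        U[[i, j], k] = 1.0   # unmerge copies feature k back to slots i and j
        corr[[i, j], :] = -np.inf
        corr[:, [i, j]] = -np.inf
    return M, U

# Hypothetical usage with random activations standing in for real ones.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 4))
acts_b = rng.normal(size=(200, 4))
M, U = match_features(acts_a, acts_b)
merged = np.concatenate([acts_a, acts_b], axis=1) @ M.T  # (200, 4)
```

Because the rows of `M` cover disjoint pairs, `M @ U` is the identity, while `U @ M` only approximates the identity on the concatenated space. The approximation is good exactly when paired features are highly correlated, which is why the pairing is driven by activation similarity.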

Abstract

Typical deep visual recognition models are capable of performing the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other then averages them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for 20-60% improvement over prior work, making it more feasible to merge models trained on disjoint tasks without retraining.

Paper Structure (66 sections, 23 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Setting and ZipIt! (a) Prior work merges differently initialized models from the same dataset with the same label sets: e.g., merging two models both trained to classify dog breeds. (b) Our setting expands this to merging models from different datasets with different label sets: e.g., merging a model that classifies dog breeds with one that classifies bird species. (c) ZipIt! merges these models without retraining by identifying shared features.
  • Figure 2: Task Loss Landscapes for models in Tab. \ref{tab:cifar50+50}. Model A and Model B lie in low loss basins for their own tasks, but not for the other task. Thus, any interpolation between Model A and a permuted Model B (e.g., Git Re-basin) lies outside the minima for both tasks and thus performs poorly. In contrast, ZipIt! improves the merge by finding a model that lies in a low loss basin for both.
  • Figure 2: ImageNet-1k (200+200) Results. Merging ResNet-50 models trained from scratch on disjoint 200 category subsets (Task A and B) of ImageNet-1k. Prior work performs poorly, but ZipIt! makes this task feasible. $^\ddag$Ainsworth et al. (2022).
  • Figure 3: ZipIt! merges models layer-wise by exploiting redundancy in their features. (a) Output features $f^{A}$ and $f^{B}$ from two disjoint layers are (b) paired with other features based on the similarity of their activations. (c) We produce a merge matrix $M$ to combine the pairs into a single shared output feature space, and a corresponding unmerge matrix $U$ that undoes this operation. (d) We then propagate $U$ up the network to align the next layer's input space, and simultaneously receive the previous layer's $U$ to align our input space. (e) We apply Eq. \ref{eq:zip} to "zip" the layers together using the $M$ for the output and $U$ for the input, producing a single layer (f). We then repeat (a) on the next layer.
  • Figure 3: Multi-Dataset Results. Merging ResNet-50 models trained on completely different datasets: Stanford Dogs (SD), Oxford Pets (OP), CUB200 (CUB), and NABirds (NAB). We report average per-task accuracy over merging model pairs, and all four.
  • ...and 10 more figures
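
Steps (c) through (f) of the layer-wise zip in Figure 3 can be sketched for plain linear layers. The zip equation itself is not reproduced in this excerpt, so the form used below, $W^* = M^A W^A U_{prev}^A + M^B W^B U_{prev}^B$, is an assumption based on the caption's description (merge on the output side, previous layer's unmerge on the input side). The function name `zip_linear`, the toy sizes, and the hand-built permutation are all hypothetical.

```python
import numpy as np

def zip_linear(w_a, w_b, merge, unmerge_prev):
    """Zip the same linear layer of models A and B into one layer.

    w_a, w_b:      (n_out, n_in) weights of the layer in each model.
    merge:         (k, 2 * n_out) merge matrix M for this layer's outputs.
    unmerge_prev:  (2 * n_in, n_in) unmerge matrix U from the previous
                   layer, which aligns this layer's input space.
    """
    n_out, n_in = w_a.shape
    m_a, m_b = merge[:, :n_out], merge[:, n_out:]
    u_a, u_b = unmerge_prev[:n_in], unmerge_prev[n_in:]
    return m_a @ w_a @ u_a + m_b @ w_b @ u_b

# Toy setup: B's hidden features are permuted copies of A's, so the
# redundancy is exact. Real models are only approximately redundant.
rng = np.random.default_rng(0)
d, n, c = 5, 4, 3                      # input, hidden, per-task output sizes
perm = np.array([2, 0, 3, 1])
inv = np.argsort(perm)                 # B's feature inv[k] equals A's feature k
w1_a = rng.normal(size=(n, d))
w1_b = w1_a[perm]
w2_a = rng.normal(size=(c, n))
w2_b = rng.normal(size=(c, n))         # B solves a different task

# Merge matrix for the hidden layer: average A's feature k with its copy in B.
M = np.zeros((n, 2 * n))
M[np.arange(n), np.arange(n)] = 0.5
M[np.arange(n), n + inv] = 0.5
U = 2.0 * M.T                          # unmerge: copy merged feature to both slots

# Layer 1: both models read the same raw input, so the incoming "unmerge"
# is two stacked identities. Layer 2: outputs are left unmerged (identity
# merge), which yields one head per task.
shared_input = np.vstack([np.eye(d), np.eye(d)])
w1_zip = zip_linear(w1_a, w1_b, M, shared_input)
w2_zip = zip_linear(w2_a, w2_b, np.eye(2 * c), U)

relu = lambda t: np.maximum(t, 0.0)
x = rng.normal(size=(8, d))
out = relu(x @ w1_zip.T) @ w2_zip.T
out_a, out_b = out[:, :c], out[:, c:]  # task A head, task B head
```

In this constructed case the zipped network reproduces both original models' outputs, because every merged pair is an exact duplicate. With independently trained models the pairs are only correlated, so the unmerge is approximate and the zipped outputs deviate from the originals accordingly.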