ZipIt! Merging Models from Different Tasks without Training
George Stoica, Daniel Bolya, Jakob Bjorner, Pratik Ramesh, Taylor Hearn, Judy Hoffman
TL;DR
ZipIt! addresses the challenge of merging neural networks trained on completely different tasks without additional training. It introduces a general, graph-based merging framework that can fuse features within each model (not just across models) and supports partial zipping to create multi-head architectures, enabling efficient cross-task integration. At each merged layer $i$, a merge matrix $M_i$ pairs and averages highly correlated features, and an unmerge matrix $U_i$ maps the merged features back so that the next layer's weights stay aligned; these operations are propagated through the network layer by layer. Empirical results on CIFAR-10/100 and ImageNet-1k, including multi-dataset and multimodal setups, show significant improvements over permutation-based baselines and often approach ensemble performance, especially with partial zipping and increased model width. Theoretical analysis provides tighter bounds on the merging barrier when within-model merges are allowed, and ZipIt! demonstrates practical feasibility for building multi-task systems without retraining.
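To make the roles of $M_i$ and $U_i$ concrete, here is a minimal NumPy sketch of one way the pairing step could look for a single fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_merge_unmerge` and `zip_linear_layer` are hypothetical, pairs are chosen by a simple greedy search over the correlation matrix of the concatenated activations, and both models are assumed to have the same layer width $n$.

```python
import numpy as np


def build_merge_unmerge(feats_a, feats_b):
    """Greedily pair the 2n concatenated features by correlation.

    feats_a, feats_b: (num_samples, n) activations of the same layer in
    models A and B. Returns M of shape (n, 2n), which averages each pair,
    and U of shape (2n, n), which copies a merged feature back to both
    members of its pair. Hypothetical helper for illustration only.
    """
    feats = np.concatenate([feats_a, feats_b], axis=1)  # (N, 2n)
    two_n = feats.shape[1]
    n = two_n // 2

    corr = np.corrcoef(feats, rowvar=False)  # (2n, 2n)
    corr = np.nan_to_num(corr, nan=-1.0)     # constant features give NaN
    np.fill_diagonal(corr, -np.inf)          # a feature cannot pair with itself

    merge = np.zeros((n, two_n))
    unmerge = np.zeros((two_n, n))
    for k in range(n):
        # Best remaining pair; it may lie across models or within one model.
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        merge[k, i] = merge[k, j] = 0.5
        unmerge[i, k] = unmerge[j, k] = 1.0
        # Remove both features from further matching.
        corr[[i, j], :] = -np.inf
        corr[:, [i, j]] = -np.inf
    return merge, unmerge


def zip_linear_layer(w_a, w_b, merge, unmerge_prev):
    """Merged weight M_i @ blockdiag(W_A, W_B) @ U_{i-1} for one linear layer.

    w_a, w_b: (n_out, n_in) weights; merge: (n_out, 2*n_out);
    unmerge_prev: (2*n_in, n_in) from the previously zipped layer.
    """
    n_out, n_in = w_a.shape
    zeros = np.zeros((n_out, n_in))
    block = np.block([[w_a, zeros], [zeros, w_b]])  # (2*n_out, 2*n_in)
    return merge @ block @ unmerge_prev
```

In this sketch every feature belongs to exactly one pair, so `merge @ unmerge` is the identity on the merged space. When the two models have near-duplicate features, the greedy search pairs them across models; otherwise it is free to pair two features from the same model, which is the within-model merging described above.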
Abstract
Typical deep visual recognition models can perform only the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model into the space of the other and then averages the two. While this works for models trained on the same task, we find that it fails to account for the differences between models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up to a specified layer, naturally creating a multi-head model. We find that these two changes combined account for a 20-60% improvement over prior work, making it more feasible to merge models trained on disjoint tasks without retraining.
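The multi-head structure produced by partial zipping can be sketched as a forward pass: a shared, zipped trunk up to the chosen layer, an unmerge step at the boundary, and then each model's original remaining layers as separate heads. The following is a minimal sketch assuming a bias-free MLP with ReLU activations; `partial_zip_forward` and `run_head` are hypothetical names, and each head is assumed to contain at least one layer.

```python
import numpy as np


def relu(h):
    return np.maximum(h, 0.0)


def run_head(h, weights):
    """Apply an unmerged head: ReLU between layers, none after the last."""
    for w in weights[:-1]:
        h = relu(h @ w.T)
    return h @ weights[-1].T


def partial_zip_forward(x, shared_weights, unmerge, head_a, head_b):
    """Forward pass of a partially zipped, two-headed MLP (illustrative).

    shared_weights: list of zipped (n_out, n_in) weights for the trunk.
    unmerge: (2n, n) matrix U from the last zipped layer.
    head_a, head_b: lists of the original, unmerged weights of models A
    and B for the layers after the zip boundary.
    """
    h = x
    for w in shared_weights:
        h = relu(h @ w.T)
    # Unmerge at the boundary: rows 0..n-1 of U rebuild A's features,
    # rows n..2n-1 rebuild B's features.
    split = h @ unmerge.T                 # (batch, 2n)
    n = unmerge.shape[0] // 2
    out_a = run_head(split[:, :n], head_a)
    out_b = run_head(split[:, n:], head_b)
    return out_a, out_b
```

Only the trunk is computed once per input, so the cost saving relative to running both models grows with the depth of the zipped portion, while the separate heads keep each task's final layers untouched.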
