Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy
Tao Li, Weisen Jiang, Fanghui Liu, Xiaolin Huang, James T. Kwok
TL;DR
The paper tackles the high memory and computational costs of Learned-Soup when forming a model soup from many fine-tuned models. It introduces MEHL-Soup, which recasts the mixing of models as a hyperplane optimization problem and uses block coordinate gradient descent to update the mixing coefficients while loading only a mini-batch of models at a time, so that the procedure scales on a single GPU. The authors prove convergence and extend the method to MEHL-Soup+, which learns layer-wise coefficients. The approach reduces memory usage by over 13× and soup-construction time by over 9× while improving accuracy over Greedy-Soup and Learned-Soup on ViT-B/32 and ViT-L/14 tasks. Because the learned coefficients are allowed to extrapolate, the resulting soup can generalize better, and the low memory footprint makes the method applicable to large architectures.
Abstract
Pre-training followed by fine-tuning is widely adopted among practitioners. Performance can be improved by "model soups"~\cite{wortsman2022model}, which combine models fine-tuned under various hyperparameter configurations. Learned-Soup, a variant of model soups, significantly improves performance but incurs substantial memory and time costs because it requires (i) loading all fine-tuned models simultaneously and (ii) building a large computational graph that encompasses all of them. In this paper, we propose Memory Efficient Hyperplane Learned Soup (MEHL-Soup) to tackle this issue by formulating the learned soup as a hyperplane optimization problem and introducing block coordinate gradient descent to learn the mixing coefficients. At each iteration, MEHL-Soup only needs to load a few fine-tuned models and build a computational graph with one combined model. We further extend MEHL-Soup to MEHL-Soup+, which learns the coefficients in a layer-wise manner. Experimental results on various ViT models and data sets show that MEHL-Soup(+) outperforms Learned-Soup(+) in terms of test accuracy and reduces memory usage by more than $13\times$. Moreover, MEHL-Soup(+) can be run on a single GPU and achieves a $9\times$ speedup in soup construction compared with Learned-Soup. The code is released at https://github.com/nblt/MEHL-Soup.
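To make the block coordinate idea concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes a toy linear-regression model in place of a ViT, a combined model of the form $\theta(\alpha) = \sum_k \alpha_k \theta_k$ with unconstrained coefficients, and a plain gradient step on each block; the names `load_model`, `loss_and_grad`, and `learn_soup`, as well as all sizes and the step size, are hypothetical. The paper's exact parameterization and update rule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all hypothetical): a linear model and K "fine-tuned" weight vectors.
d, n, K = 5, 200, 6
X = rng.normal(size=(n, d))            # held-out inputs used to learn the coefficients
w_true = np.ones(d)
y = X @ w_true
checkpoints = [w_true + 0.3 * rng.normal(size=d) for _ in range(K)]


def load_model(k):
    """Stand-in for reading one fine-tuned checkpoint from disk."""
    return checkpoints[k]


def loss_and_grad(theta):
    """Halved mean squared error of the single combined model, and its gradient."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2), X.T @ residual / n


def learn_soup(block_size=2, steps=300, lr=0.02):
    alpha = np.full(K, 1.0 / K)        # start from the uniform soup
    # Build the initial combined model by streaming checkpoints one at a time.
    theta = sum(alpha[k] * load_model(k) for k in range(K))
    for _ in range(steps):
        # Pick a random block of coefficients; only these models are loaded.
        block = rng.choice(K, size=block_size, replace=False)
        models = {k: load_model(k) for k in block}
        # One gradient computation through the single combined model.
        _, grad_theta = loss_and_grad(theta)
        for k, theta_k in models.items():
            # Chain rule: dL/d(alpha_k) = <dL/d(theta), theta_k>.
            step = -lr * (grad_theta @ theta_k)
            alpha[k] += step
            theta = theta + step * theta_k   # keep the combined model in sync
    return alpha, theta


uniform_theta = np.mean(checkpoints, axis=0)
alpha, theta = learn_soup()
print("uniform soup loss:", loss_and_grad(uniform_theta)[0])
print("learned soup loss:", loss_and_grad(theta)[0])
```

In this sketch, peak memory holds only `block_size` checkpoints plus the running combined model, and each step differentiates through one model rather than all `K`. The coefficients are deliberately left unconstrained, so they may fall outside $[0, 1]$.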
