Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy

Tao Li, Weisen Jiang, Fanghui Liu, Xiaolin Huang, James T. Kwok

TL;DR

The paper tackles the high memory and computational costs that Learned-Soup incurs when forming a model soup from many fine-tuned models. It introduces MEHL-Soup, which recasts the mixing problem as hyperplane (subspace) optimization and uses block coordinate gradient descent to update the mixing coefficients while loading only a mini-batch of models per iteration, so the method can run on a single GPU. The authors prove convergence and extend the method to MEHL-Soup+ with layer-wise coefficients. Compared with Learned-Soup, memory usage drops by more than 13× and soup construction is 9× faster, while test accuracy improves over both Greedy-Soup and Learned-Soup on ViT-B/32 and ViT-L/14 tasks. Because the coefficients are allowed to extrapolate, the learned soup generalizes better, and the reduced memory footprint makes the approach applicable to large architectures.

Abstract

Pre-training followed by fine-tuning is widely adopted among practitioners. Performance can be further improved by "model soups" (Wortsman et al., 2022), which average the weights of models fine-tuned with various hyperparameter configurations. Learned-Soup, a variant of model soups, significantly improves performance but suffers from substantial memory and time costs because it requires (i) loading all fine-tuned models simultaneously and (ii) building a large computational graph that encompasses all fine-tuned models. In this paper, we propose Memory Efficient Hyperplane Learned Soup (MEHL-Soup) to tackle this issue by formulating the learned soup as a hyperplane optimization problem and introducing block coordinate gradient descent to learn the mixing coefficients. At each iteration, MEHL-Soup only needs to load a few fine-tuned models and build a computational graph with one combined model. We further extend MEHL-Soup to MEHL-Soup+ in a layer-wise manner. Experimental results on various ViT models and data sets show that MEHL-Soup(+) outperforms Learned-Soup(+) in terms of test accuracy and reduces memory usage by more than $13\times$. Moreover, MEHL-Soup(+) can be run on a single GPU and achieves a $9\times$ speedup in soup construction compared with Learned-Soup. The code is released at https://github.com/nblt/MEHL-Soup.
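
The block coordinate idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the combined model is a plain unconstrained linear combination of flattened weight vectors, swaps the real network loss for a toy quadratic loss, and uses hypothetical names (`block_coordinate_soup`, `loss_grad`, `block_size`). It only shows why each iteration needs the sampled block of models plus one combined model.

```python
import numpy as np


def block_coordinate_soup(models, loss_grad, n_iters=300, block_size=2, lr=0.02, seed=0):
    """Toy block coordinate gradient descent over mixing coefficients.

    models:    array of shape (K, d), one flattened weight vector per model.
    loss_grad: function mapping theta (d,) to the loss gradient w.r.t. theta.
    Assumption: the soup is theta = sum_k alpha[k] * models[k], with alpha
    unconstrained so that coefficients may extrapolate.
    """
    rng = np.random.default_rng(seed)
    num_models = models.shape[0]
    alpha = np.full(num_models, 1.0 / num_models)  # start from the uniform soup
    theta = alpha @ models                         # the single combined model
    for _ in range(n_iters):
        # Sample a small block of models; only these rows are touched this step.
        block = rng.choice(num_models, size=block_size, replace=False)
        loaded = models[block]
        # One gradient evaluation through the combined model only.
        grad_theta = loss_grad(theta)
        # Chain rule: dL/dalpha_k = <grad_theta, theta_k> for each k in the block.
        grad_alpha = loaded @ grad_theta
        step = -lr * grad_alpha
        alpha[block] += step
        # Update the combined model incrementally using the loaded block.
        theta = theta + step @ loaded
    return alpha, theta


# Toy problem: pull the soup toward a target vector under a quadratic loss.
data_rng = np.random.default_rng(1)
models = data_rng.normal(size=(6, 8))
target = data_rng.normal(size=8)


def toy_loss(theta):
    return 0.5 * float(np.sum((theta - target) ** 2))


alpha, theta = block_coordinate_soup(models, lambda th: th - target)
print("uniform soup loss:", toy_loss(models.mean(axis=0)))
print("learned soup loss:", toy_loss(theta))
```

In this sketch the per-iteration working set is `block_size` weight vectors plus `theta`, rather than all `K` models, which mirrors the memory argument made above. A real setting would replace `loss_grad` with a backward pass on a mini-batch of data.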

Paper Structure

This paper contains 17 sections, 1 theorem, 21 equations, 3 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 1.3

If the learning rate $\eta \leq \min \{\frac{1}{\beta}, \frac{1}{\sqrt{T}}\}$, Algorithm 1 satisfies [convergence bound missing from the source], where the expectation is taken over the random mini-batches of samples and models.
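
As a small worked example of the step-size condition, the helper below computes the largest learning rate allowed by $\eta \leq \min \{\frac{1}{\beta}, \frac{1}{\sqrt{T}}\}$. The function name is hypothetical, and the readings of $\beta$ as a smoothness constant and $T$ as the number of iterations are assumptions, since this summary does not define them.

```python
import math


def max_learning_rate(beta: float, num_iters: int) -> float:
    """Largest eta satisfying eta <= min(1/beta, 1/sqrt(T)).

    beta is assumed to be a positive smoothness constant and num_iters (T)
    a positive iteration count; neither is defined in the summary itself.
    """
    if beta <= 0 or num_iters <= 0:
        raise ValueError("beta and num_iters must be positive")
    return min(1.0 / beta, 1.0 / math.sqrt(num_iters))


# With beta = 4 and T = 100: 1/beta = 0.25 and 1/sqrt(T) = 0.1, so the
# iteration count is the binding constraint.
print(max_learning_rate(4.0, 100))
# With beta = 20 and T = 100: 1/beta = 0.05 is the binding constraint.
print(max_learning_rate(20.0, 100))
```

For long runs the $\frac{1}{\sqrt{T}}$ term usually dominates, so the admissible step size shrinks as the iteration budget grows.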

Figures (3)

  • Figure 1: Distributions of the mixing coefficients over all layers and models learned by MEHL-Soup+ and Learned-Soup+ on ImageNet with CLIP ViT-B/32.
  • Figure 2: Test accuracy of Greedy-Soup and MEHL-Soup+ as a function of fine-tuning time cost. The experiment is performed on ImageNet with CLIP ViT-B/32. We use different numbers of fine-tuned models (displayed near the points) and measure their corresponding fine-tuning time costs. The model sequence follows the original random search order provided in Wortsman et al. (2022).
  • Figure 3: Sensitivity analysis of the hyperparameters in MEHL-Soup+. The experiments are conducted on ImageNet with CLIP ViT-B/32.

Theorems & Definitions (2)

  • theorem 1.3
  • proof : Proof of Theorem 3