SoupLM: Model Integration in Large Language and Multi-Modal Models

Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu

TL;DR

SoupLM tackles the high cost of training large language and multimodal models by proposing a model soup framework that merges isomorphic base models into a single generalist multimodal model with minimal extra cost. It introduces Vanilla Soup, Learnable Soup, and Regularized Soup to interpolate between Vicuna and LLaVA, formalized as $f(\theta^s) = \sum_{i=1}^n \alpha^i \theta^i$ with $\sum \alpha^i = 1$, and explores fine-grained per-mapping interpolation. Through extensive ablations across five meta-sets and multiple rounds, Learnable Soup shows performance gains and reveals interpretable, dataset-dependent interpolation patterns, while Regularized Soup provides stability insights. The work demonstrates a practical path to rapidly integrate domain-specialized variants for LLMs and LMMs with negligible inference cost, potentially mitigating data drift and enabling scalable, cost-efficient multimodal AI deployment.
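The interpolation formula above can be illustrated with a minimal sketch of the vanilla (fixed-ratio) case. This is not the authors' implementation: the `vanilla_soup` function, the `StateDict` type, and the toy parameter names are all hypothetical, and real checkpoints would hold tensors rather than short lists of floats. The sketch only assumes what the summary states, namely that the base models are isomorphic and that the coefficients sum to 1.

```python
from typing import Dict, List, Sequence

# Hypothetical flat parameter store: parameter name -> list of weights.
StateDict = Dict[str, List[float]]


def vanilla_soup(models: Sequence[StateDict], alphas: Sequence[float]) -> StateDict:
    """Return the parameter-wise weighted average sum_i alpha^i * theta^i."""
    if len(models) != len(alphas):
        raise ValueError("need exactly one coefficient per model")
    if abs(sum(alphas) - 1.0) > 1e-8:
        raise ValueError("coefficients must sum to 1")
    names = models[0].keys()
    for model in models[1:]:
        if model.keys() != names:
            raise ValueError("models must be isomorphic (same parameter names)")

    soup: StateDict = {}
    for name in names:
        # Group the i-th weight of every model together, then mix them.
        per_weight = zip(*(model[name] for model in models))
        soup[name] = [
            sum(a * w for a, w in zip(alphas, weights)) for weights in per_weight
        ]
    return soup


# Toy stand-ins for two isomorphic checkpoints (values are made up).
llava = {"layer0.k_proj": [1.0, 2.0], "layer0.v_proj": [0.0, 4.0]}
vicuna = {"layer0.k_proj": [3.0, 0.0], "layer0.v_proj": [2.0, 0.0]}

# Direct average: alpha^1 = 0.5 for LLaVA, alpha^2 = 0.5 for Vicuna.
soup = vanilla_soup([llava, vicuna], [0.5, 0.5])
```

Because the result has the same shape as either base model, the merged model costs no more at inference than one of its ingredients. A per-mapping variant would presumably keep a separate coefficient for each parameter name instead of one shared ratio; that extension is not shown here.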

Abstract

Training large language models (LLMs) and multimodal LLMs requires significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For instance, LLaMA, Vicuna, and LLaVA are three LLM variants trained with LLaMA base models using very different training recipes, tasks, and data modalities. The training cost and complexity for such LLM variants grow rapidly. In this study, we propose to use a soup strategy to assemble these LLM variants into a single well-generalized multimodal LLM (SoupLM) in a cost-efficient manner. Assembling these LLM variants efficiently brings the knowledge and specialities learned from different domains and data modalities into one integrated model (e.g., chatbot speciality from user-shared conversations for Vicuna, and visual capacity from vision-language data for LLaVA), thereby avoiding the computing cost of repeated training on several different domains. We propose a series of soup strategies to systematically benchmark performance gains across various configurations, and probe the soup behavior across base models in the interpolation space.

Paper Structure (22 sections, 4 equations, 17 figures, 4 tables)


Figures (17)

  • Figure 1: Vanilla soup evaluations on five meta sets: MMMU and LLaVA-Bench for multi-modality, and MMLU, GSM8K, and Hellaswag for language. The x-axis shows the soup ratio $\alpha^1$ of LLaVA, increasing from 0.1 to 0.9. The y-axis shows evaluation performance. Green dots mark soup performance; the two base models are shown as blue and red lines. We find that vanilla soup generally outperforms the baselines, and a direct average with $\alpha^1 = 0.5$ often obtains better results, except on the MMMU dataset.
  • Figure 2: Representative MMMU single-set evaluation. MM, L, ML, G, and H denote MMMU, LLaVA-Bench, MMLU, GSM8K, and Hellaswag, respectively. For each heatmap, the x- and y-axes show the ablated learning rates and epochs. Colors show the performance variation on the evaluation sets.
  • Figure 3: Second-round ablation over epochs, sample number, learning rate, and activation. MM, ML, G, and H denote MMMU, MMLU, GSM8K, and Hellaswag. Colors show performance changes. The x-axis is the learning rate; the y-axis is the number of epochs and the activation function. Here, we use 50 samples for LLaVA665K and 50 samples for MMLU.
  • Figure 4: Ratio ablation on MMMU, MMLU, GSM8K, and Hellaswag on LLaVA665K and MMLU meta sets.
  • Figure 5: Learned $\alpha$ distribution in the LLaVA-Vicuna model space for the key mapping, across different meta sets and Transformer layers. This set of $\alpha$ is tuned with 9 epochs, a learning rate of 0.3, and 1000 samples. Certain layers remain consistent across different meta sets.
  • ...and 12 more figures