SoupLM: Model Integration in Large Language and Multi-Modal Models
Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu
TL;DR
SoupLM tackles the high cost of training large language and multimodal models by proposing a model soup framework that merges isomorphic base models into a single generalist multimodal model with minimal extra cost. It introduces Vanilla Soup, Learnable Soup, and Regularized Soup to interpolate between Vicuna and LLaVA, formalized as a souped model $f(\theta^s)$ with $\theta^s = \sum_{i=1}^n \alpha^i \theta^i$ and $\sum_{i=1}^n \alpha^i = 1$, and explores fine-grained per-mapping interpolation. In ablations across five meta-sets and multiple rounds, Learnable Soup shows performance gains and reveals interpretable, dataset-dependent interpolation patterns, while Regularized Soup provides insights into stability. The work demonstrates a practical path to rapidly integrating domain-specialized LLM and LMM variants at negligible inference cost, potentially mitigating data drift and enabling scalable, cost-efficient multimodal AI deployment.
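To make the weighted-sum formulation concrete, the following is a minimal sketch of a fixed-coefficient soup over isomorphic parameter sets. It is an illustration under stated assumptions, not the authors' implementation: the function name `vanilla_soup`, the `StateDict` alias, and the toy parameter lists are all hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def vanilla_soup(models: Sequence[StateDict], alphas: Sequence[float]) -> StateDict:
    """Interpolate isomorphic models as sum_i alpha^i * theta^i, with sum_i alpha^i = 1."""
    if not models or len(models) != len(alphas):
        raise ValueError("need exactly one coefficient per model")
    if abs(sum(alphas) - 1.0) > 1e-8:
        raise ValueError("coefficients must sum to 1")

    # Isomorphic models share the same parameter names and shapes.
    names = models[0].keys()
    for model in models[1:]:
        if model.keys() != names:
            raise ValueError("models are not isomorphic (parameter names differ)")

    souped: StateDict = {}
    for name in names:
        size = len(models[0][name])
        if any(len(model[name]) != size for model in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum of the corresponding parameters.
        souped[name] = [
            sum(alpha * model[name][j] for alpha, model in zip(alphas, models))
            for j in range(size)
        ]
    return souped


# Toy example: two "models" with a single two-element parameter each.
vicuna_like = {"proj.weight": [1.0, 2.0]}
llava_like = {"proj.weight": [3.0, 6.0]}
soup = vanilla_soup([vicuna_like, llava_like], alphas=[0.5, 0.5])
```

With equal coefficients the result is the element-wise average of the two parameter sets. A learnable variant would treat the coefficients as trainable values rather than constants; the details of that procedure are not given in this summary, so they are left out of the sketch.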
Abstract
Training large language models (LLMs) and multimodal LLMs requires significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For instance, LLaMA, Vicuna, and LLaVA are three LLM variants trained with LLaMA base models using very different training recipes, tasks, and data modalities. The training cost and complexity for such LLM variants grow rapidly. In this study, we propose a soup strategy that assembles these LLM variants into a single well-generalized multimodal LLM (SoupLM) in a cost-efficient manner. Assembling these variants brings the knowledge and specialities learned from different domains and data modalities into one integrated model (e.g., the chatbot speciality Vicuna gains from user-shared conversations, and the visual capacity LLaVA gains from vision-language data), thereby avoiding the computing cost of repeated training on several different domains. We propose a series of soup strategies to systematically benchmark performance gains across various configurations, and we probe the soup behavior across base models in the interpolation space.
