ProFuser: Progressive Fusion of Large Language Models
Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, Ji Zhang, Minhua Huang, Wu Kai
TL;DR
The paper addresses the challenge of efficiently fusing the capabilities of multiple heterogeneous LLMs without excessive inference-time resource use. It introduces ProFuser, a dual-mode advantage framework that evaluates source-model strength in both training mode (Min-CE) and inference mode (reward-based), together with a progressive fusion strategy that first leverages inference-mode signals and then incorporates training-mode ground-truth data via a two-stage objective. Empirical results show that fusing Vicuna-7B-v1.5 with Llama-2-7B-Chat and MPT-7B-8K-Chat using ProFuser improves knowledge, reasoning, and safety across six benchmarks, outperforming FuseLLM and other baselines while providing greater training stability. The method's easy-to-hard curriculum, mode-aware evaluation, and robustness across homogeneous and heterogeneous model mixes offer a practical path to more capable, resource-efficient fused LLMs.
Abstract
While fusing the capacities and advantages of various large language models (LLMs) offers a pathway to constructing more powerful and versatile models, a fundamental challenge is properly selecting the advantageous model during training. Existing fusion methods primarily focus on the training mode, which uses cross entropy on the ground truth in a teacher-forcing setup to measure a model's advantage, and this may provide limited insight into model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser, which progressively transitions from inference mode to training mode. To validate ProFuser's effectiveness, we fused three models, Vicuna-7B-v1.5, Llama-2-7B-Chat, and MPT-7B-8K-Chat, and demonstrated improved performance in knowledge, reasoning, and safety compared to baseline methods.
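As an illustration of the two ways of measuring model advantage described above, the following is a minimal sketch and not the authors' implementation. It assumes that each source model has already produced (a) per-token probabilities for the ground-truth response under teacher forcing and (b) a scalar reward for its own generated response. The function names, the input layout, and the toy numbers are hypothetical; the reward model and the way the selected outputs enter the fusion objective are not specified in this section and are left out.

```python
import math
from typing import Dict, List


def cross_entropy(gt_token_probs: List[float]) -> float:
    """Mean negative log-likelihood a model assigns to the ground-truth tokens."""
    return -sum(math.log(p) for p in gt_token_probs) / len(gt_token_probs)


def select_training_mode(gt_probs_by_model: Dict[str, List[float]]) -> str:
    """Training mode: pick the model with the lowest cross entropy on the ground truth."""
    return min(gt_probs_by_model, key=lambda m: cross_entropy(gt_probs_by_model[m]))


def select_inference_mode(reward_by_model: Dict[str, float]) -> str:
    """Inference mode: pick the model whose own generated response has the highest reward."""
    return max(reward_by_model, key=lambda m: reward_by_model[m])


# Toy inputs (hypothetical values, for illustration only).
gt_probs = {
    "model_a": [0.9, 0.8, 0.7],  # probabilities assigned to each ground-truth token
    "model_b": [0.6, 0.5, 0.9],
}
rewards = {"model_a": 0.2, "model_b": 0.7}  # scores of each model's own output

print("training-mode choice:", select_training_mode(gt_probs))
print("inference-mode choice:", select_inference_mode(rewards))
```

In this toy case the two modes disagree about which model is advantageous, which is the kind of situation that motivates considering both rather than cross entropy alone.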
