ProFuser: Progressive Fusion of Large Language Models

Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, Ji Zhang, Minhua Huang, Wu Kai

TL;DR

The paper addresses the challenge of efficiently fusing capabilities from multiple heterogeneous LLMs without excessive inference-time resource use. It introduces ProFuser, a dual-mode advantage framework that evaluates source-model strength in both training mode (Min-CE) and inference mode (reward-based), together with a progressive fusion strategy that first leverages inference-mode signals and then incorporates training-mode ground-truth (GT) data via a two-stage objective. Empirical results show that ProFuser improves knowledge, reasoning, and safety across six benchmarks when fusing Vicuna-7B-v1.5 with Llama-2-7B-Chat and MPT-7B-8K-Chat, outperforming FuseLLM and other baselines while providing greater training stability. The method combines an easy-to-hard curriculum with mode-aware evaluation and remains robust across homogeneous and heterogeneous model mixes, offering a practical path to more capable, resource-efficient fused LLMs.

Abstract

While fusing the capacities and advantages of various large language models offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select the advantageous model during training. Existing fusion methods primarily focus on the training mode, which uses cross entropy on ground truth in a teacher-forcing setup to measure a model's advantage and may therefore provide limited insight into that advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser's effectiveness, we fused three models, Vicuna-7B-v1.5, Llama-2-7B-Chat, and MPT-7B-8K-Chat, and demonstrated improved performance in knowledge, reasoning, and safety compared to baseline methods.
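
The two ways of judging model advantage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy probabilities, and the reward values are all hypothetical, and the reward scores are assumed to come from some external reward model that is not shown here. Training mode picks the source model with the lowest cross entropy on the ground-truth tokens (Min-CE); inference mode picks the model whose own generated response receives the highest reward.

```python
import math
from typing import Dict, List


def cross_entropy(token_probs: List[float]) -> float:
    """Mean negative log-likelihood a model assigns to the ground-truth tokens
    under teacher forcing."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def select_training_mode(gt_token_probs: Dict[str, List[float]]) -> str:
    """Min-CE: choose the source model with the lowest cross entropy on the ground truth."""
    return min(gt_token_probs, key=lambda name: cross_entropy(gt_token_probs[name]))


def select_inference_mode(reward_scores: Dict[str, float]) -> str:
    """Choose the source model whose generated response scores highest
    under a reward model (scores are assumed to be precomputed)."""
    return max(reward_scores, key=lambda name: reward_scores[name])


# Hypothetical toy values for a single training sample (not taken from the paper).
gt_token_probs = {
    "vicuna": [0.9, 0.8, 0.7],       # probability assigned to each ground-truth token
    "llama2_chat": [0.6, 0.5, 0.9],
}
reward_scores = {"vicuna": 0.2, "llama2_chat": 1.3}

training_winner = select_training_mode(gt_token_probs)
inference_winner = select_inference_mode(reward_scores)
print(training_winner, inference_winner)
```

In this toy sample the two modes disagree about which source model is advantageous, which is the kind of gap that motivates evaluating both modes rather than cross entropy alone.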

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Left: Performance comparison between Vicuna-7B-v1.5 (Vicuna) and other source models across training and inference modes. In training mode, Vicuna outperforms Llama-2-7B-Chat on 68% of evaluation samples (measured by Min-CE), highlighting its superior token prediction capability. However, this advantage diminishes in inference mode, where Vicuna's success rate drops to 45% (evaluated using reward models). This highlights a gap between token-level prediction and response generation quality. Right: Response length comparison between GPT-4 and Vicuna-7B-v1.5 for the five most frequently occurring system messages in the training set (x-axis IDs correspond to specific system prompts). GPT-4 consistently produces longer and more detailed responses across all prompt types.
  • Figure 2: Overview of the Progressive Model Fusion Method (ProFuser). The framework operates in two sequential stages: inference mode and training mode. In inference mode, reward models (RM) evaluate response quality to identify advantageous outputs, while in training mode, minimum cross-entropy (Min-CE) determines optimal token distributions. Heterogeneous source LLMs (represented by distinct colors) contribute their respective advantages, which are progressively integrated into the target model through an easy-to-hard learning paradigm. This dual-mode approach ensures comprehensive capability transfer from source models to the target model.
  • Figure 3: Results of different model advantage evaluation methods for the inference mode.
  • Figure 4: Training progress comparison between ProFuser and Simultaneous fusion approaches over three epochs. ProFuser (blue) achieves higher final reward model scores despite slower initial progress, converging at epoch 2. Simultaneous fusion (red) shows faster early improvement but converges earlier at epoch 1.25 with lower final performance. Mean and standard deviation of reward model scores are shown for each epoch.
  • Figure 5: Comparison of ProFuser with three popular model merging methods (SLERP, TIES, and DARE) across six benchmarks.
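
The per-mode percentages quoted in the Figure 1 caption are pairwise win rates over evaluation samples. A minimal sketch of that computation follows, under stated assumptions: the helper name `win_rate` and all numbers are hypothetical toy values, not the paper's data, and ties are counted as non-wins, which the source does not specify.

```python
from typing import List


def win_rate(scores_a: List[float], scores_b: List[float], lower_is_better: bool) -> float:
    """Fraction of samples on which model A beats model B.

    Use lower_is_better=True for cross entropy (training mode) and
    lower_is_better=False for reward-model scores (inference mode).
    Ties count as non-wins (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(
        (a < b) if lower_is_better else (a > b)
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)


# Hypothetical per-sample scores for two source models.
ce_a, ce_b = [0.2, 0.5, 0.3, 0.9], [0.4, 0.6, 0.2, 1.0]   # cross entropy on ground truth
rm_a, rm_b = [0.1, 0.7, 0.3, 0.2], [0.5, 0.6, 0.9, 0.4]   # reward-model scores

train_rate = win_rate(ce_a, ce_b, lower_is_better=True)
infer_rate = win_rate(rm_a, rm_b, lower_is_better=False)
print(train_rate, infer_rate)
```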