FuseChat: Knowledge Fusion of Chat Models

Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan

TL;DR

FuseChat presents a two-stage fuse-and-merge framework that enables knowledge fusion across chat LLMs with diverse architectures and scales. The fuse stage conducts pairwise knowledge fusion using token-aligned distribution matrices to produce target LLMs of identical structure, while the merge stage uses the proposed SCE scheme to automatically compute per-parameter merging coefficients from weight-update magnitudes. Evaluations on AlpacaEval 2.0 and MT-Bench with six source LLMs show FuseChat-7B outperforming baselines of various sizes, performing comparably to the larger Mixtral-8x7B-Instruct and approaching GPT-3.5-Turbo-1106 on MT-Bench, while keeping a 7B footprint. The approach reduces training and deployment costs by supporting plug-and-play integration of new source LLMs and by requiring only a single model at inference time, making it practical for scalable, cross-architecture knowledge fusion.

Abstract

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are public at https://github.com/fanqiwan/FuseAI.
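
To make the merge stage more concrete, the following is a minimal sketch of merging several fine-tuned target models in parameter space, with coefficients derived from the magnitude of each target's update relative to the shared pivot. It is an illustrative simplification and not the paper's SCE procedure: the function name `merge_by_update_magnitude`, the flattened-list weight format, and the choice of one coefficient per parameter tensor (normalized sum of squared updates) are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def merge_by_update_magnitude(pivot: Weights, targets: List[Weights]) -> Weights:
    """Merge targets that were all fine-tuned from the same pivot weights.

    For each parameter tensor, a target's coefficient is its share of the
    total squared update magnitude (an assumed, simplified weighting rule).
    """
    merged: Weights = {}
    for name, base in pivot.items():
        # Update of each target relative to the pivot (after minus before).
        deltas = [[t - b for t, b in zip(target[name], base)] for target in targets]
        # Squared magnitude of each target's update for this tensor.
        energy = [sum(d * d for d in delta) for delta in deltas]
        total = sum(energy)
        if total == 0.0:
            # No target changed this tensor, so keep the pivot values.
            merged[name] = list(base)
            continue
        coeffs = [e / total for e in energy]
        # Pivot plus the coefficient-weighted sum of the updates.
        merged[name] = [
            b + sum(c * delta[i] for c, delta in zip(coeffs, deltas))
            for i, b in enumerate(base)
        ]
    return merged


if __name__ == "__main__":
    pivot = {"w": [0.0, 0.0]}
    targets = [{"w": [2.0, 0.0]}, {"w": [0.0, 1.0]}]
    print(merge_by_update_magnitude(pivot, targets))
```

In this toy case the first target moved its tensor further from the pivot than the second, so it receives the larger coefficient. Because the coefficients are computed from the weights themselves, no manual tuning of merge ratios is needed, which is the property the abstract highlights.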

Paper Structure (32 sections, 9 equations, 12 figures, 9 tables, 1 algorithm)


Figures (12)

  • Figure 1: Demonstration (left) of distinct strengths of existing chat LLMs and comparison (right) between FuseChat-7B and baseline LLMs. The left figure plots the percentage of first-ranked responses of each LLM as measured by PairRM (Jiang et al., 2023) on AlpacaEval 2.0 and MT-Bench; the right shows that FuseChat-7B achieves comparable performance to Mixtral-8x7B and approaches GPT-3.5 on MT-Bench. The red dashed line is linearly fitted from data points of all chat LLMs except FuseChat-7B.
  • Figure 2: Overview of FuseChat in comparison with FuseLLM (Wan et al., 2024). Distinct animal icons symbolize different LLMs, where each species and size indicate a unique architecture and scale, respectively.
  • Figure 3: The effect of pairwise knowledge fusion for source LLMs across various domains on MT-Bench. It combines the strengths of each source LLM and the pivot (OpenChat-3.5-7B) into a more potent target LLM.
  • Figure 4: The effect of merging target LLMs into FuseChat-7B to combine their strengths across domains on MT-Bench.
  • Figure 5: Results on AlpacaEval 2.0 and MT-Bench with Starling-LM-7B-alpha as the pivot LLM.
  • ...and 7 more figures