Channel Merging: Preserving Specialization for Merged Experts

Mingyang Zhang, Jing Liu, Ganggui Ding, Xinyi Yu, Linlin Ou, Bohan Zhuang

TL;DR

Channel Merging introduces channel-level clustering to merge multiple task-tuned LLMs while preserving specialization. It clusters delta parameters at the channel level into K groups and uses an offline lookup to reconstruct active experts, enabling storage-efficient inference with $\theta^* = P + \lambda \sum_{n} \delta^{t_n}$. Coupled with a task-specific router, it achieves near-unmerged performance on reasoning and coding tasks and competitive results against traditional ensembles at about 53% of the parameter count. The approach demonstrates robust gains across English/Chinese reasoning, mathematics, and code, though it relies on a shared pretrained backbone and may incur parameter increases relative to a single model; future work could push further compression via per-layer grouping.

Abstract

Recently, task-specific fine-tuning has been used to improve the performance of large language models (LLMs) on downstream tasks. Integrating diverse LLMs significantly boosts their overall competency. Nevertheless, traditional ensemble methods are memory-intensive, requiring all specialized models to be loaded into GPU memory simultaneously. To address this inefficiency, model merging strategies have emerged, which merge all LLMs into one model to reduce the memory footprint during inference. Despite these advances, model merging often leads to parameter conflicts and performance decline as the number of experts increases. Previous methods to mitigate these conflicts include post-pruning and partial merging. However, both approaches have limitations, particularly in performance and storage efficiency as the number of merged experts increases. To address these challenges, we introduce Channel Merging, a novel strategy designed to minimize parameter conflicts while enhancing storage efficiency. This method clusters and merges channel parameters based on their similarity to form several groups offline. By ensuring that only highly similar parameters are merged within each group, it significantly reduces parameter conflicts. During inference, we can instantly look up the expert parameters from the merged groups, preserving specialized knowledge. Our experiments demonstrate that Channel Merging consistently delivers high performance, matching unmerged models in tasks like English and Chinese reasoning, mathematical reasoning, and code generation. Moreover, when used with a task-specific router, it obtains results comparable to a model ensemble with just 53% of the parameters.

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: This diagram contrasts various methods of handling multiple experts in LLMs. Panel (a) illustrates the conventional model ensemble approach, which requires loading all expert models into memory, leading to significant storage inefficiency. Panel (b) depicts the model merging strategy that simplifies the memory load but results in performance degradation due to parameter conflicts. Panel (c) presents our proposed Channel Merging method, which clusters and merges channel parameters, retaining each expert's unique features and ensuring efficient and effective performance.
  • Figure 2: Layer-wise proportion of channel similarity between the Instruction expert and the other experts in the (a) Mistral-7B and (b) CodeLLaMA-7B model families. The blue portions represent the proportion of channels in the Instruction expert that are more similar to the Code expert, while the red portions indicate the proportion of channels more similar to the Math expert.
  • Figure 3: An illustration of our Channel Merging method. The process involves two core parts: Channel-wise Merging and Instant Lookup. In the Channel-wise Merging stage, for each channel, delta parameters from each expert $\mathbf{\delta}_{i}^{t_n}$ are clustered into different groups, and parameters within the same group are merged into $\mathbf{\Theta}_{i}^{k}$. $S^{t_n}$ records the group index that each expert’s parameters have been merged into (shown in green). During inference, the activated expert can instantly look up its parameters from the groups based on $S^{t_n}$.
  • Figure 4: Experimental results on four different task categories as the number of experts varies from one to six. 'Channel' and 'Model' represent the accuracy achieved with channel-level and model-level merging, respectively.
  • Figure 5: Heatmap of channel similarities between different expert models.
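
The two stages named in the Figure 3 caption, Channel-wise Merging and Instant Lookup, can be illustrated with a minimal NumPy sketch. Several details here are assumptions made for illustration and are not taken from the paper: channels are treated as rows of a weight matrix, similarity is cosine similarity, clustering is a small k-means-style loop with farthest-point initialization, and parameters within a group are merged by simple averaging. The function names `channel_merge` and `lookup` are hypothetical.

```python
import numpy as np

def channel_merge(deltas, k, iters=10):
    """Cluster each channel's per-expert deltas into k groups and merge them.

    deltas: array of shape (N, C, D), one (C, D) delta matrix per expert.
    Returns:
      groups: (C, k, D) merged parameters, one row per channel and group
      index:  (N, C) group index of each expert for each channel
    Assumptions (not from the paper): cosine similarity, mean merging.
    """
    n, c, d = deltas.shape
    groups = np.zeros((c, k, d))
    index = np.zeros((n, c), dtype=int)
    for i in range(c):
        x = deltas[:, i, :]                                   # (N, D)
        unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
        # Farthest-point initialization: start from expert 0, then add the
        # expert least similar to every center chosen so far.
        centers = [unit[0]]
        for _ in range(1, k):
            sims = unit @ np.stack(centers).T                 # (N, len(centers))
            centers.append(unit[np.argmin(sims.max(axis=1))])
        centers = np.stack(centers)                           # (k, D)
        for _ in range(iters):
            assign = np.argmax(unit @ centers.T, axis=1)
            for g in range(k):
                members = unit[assign == g]
                if len(members):
                    m = members.mean(axis=0)
                    centers[g] = m / (np.linalg.norm(m) + 1e-12)
        assign = np.argmax(unit @ centers.T, axis=1)
        for g in range(k):
            if np.any(assign == g):
                groups[i, g] = x[assign == g].mean(axis=0)    # merge by averaging
        index[:, i] = assign
    return groups, index

def lookup(groups, expert_index):
    """Rebuild one expert's (C, D) delta from the merged groups."""
    channels = np.arange(groups.shape[0])
    return groups[channels, expert_index]

# Toy data: 3 experts, 2 channels, 4 weights per channel.
deltas = np.array([
    [[1.0, 0, 0, 0], [0, 1.0, 0, 0]],    # expert 0
    [[3.0, 0, 0, 0], [0, -1.0, 0, 0]],   # expert 1
    [[-1.0, 0, 0, 0], [0, 1.0, 0, 0]],   # expert 2
])
groups, index = channel_merge(deltas, k=2)
expert2_delta = lookup(groups, index[2])
```

In this toy setup, storage is `C * k * D` merged values plus an `N * C` integer index instead of `N * C * D` values, which is where the saving would come from when `k < N`. Experts that share a group for a channel get the averaged parameters for that channel, so reconstruction is only exact for channels where an expert sits alone in its group.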