Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu
TL;DR
The paper tackles capacity limits in instruction tuning of large language models by introducing Parameter-Efficient Sparsity Crafting (PESC), a method that upcycles dense models into sparse Mixture-of-Experts (MoE) models by inserting adapters. PESC initializes the MoE experts from the original dense FFN, differentiates them through lightweight adapters rather than separate copies of the FFN weights, and updates the remaining weights with PEFT techniques such as QLoRA, so capacity grows with minimal parameter overhead. The Camelidae family, trained with PESC, demonstrates strong general-task performance across coding, math, and reasoning benchmarks, outperforming several open-source sparse models and dense baselines and approaching GPT-3.5 in some settings. The work highlights the practical potential of adapter-based MoE design for efficient, high-capacity instruction tuning, and compares its parameter count and GPU memory use against full sparse upcycling.
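To make the adapter-based expert design concrete, here is a minimal pure-Python sketch of one MoE layer in which every expert reuses the same FFN weights and differs only in its adapter. It is an illustration under stated assumptions, not the authors' implementation: the adapter placement (after the shared FFN, with a residual connection), the ReLU activation, top-2 routing, the zero-initialized up-projection, and all names and dimensions are hypothetical choices made for this example.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for pretrained or freshly initialized weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def relu(v):
    return [max(0.0, a) for a in v]

def shared_ffn(x, W_in, W_out):
    """Dense FFN whose weights are shared by every expert."""
    return matvec(W_out, relu(matvec(W_in, x)))

def adapter(h, A_down, A_up):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    delta = matvec(A_up, relu(matvec(A_down, h)))
    return [a + d for a, d in zip(h, delta)]

def moe_layer(x, W_in, W_out, adapters, W_gate, top_k=2):
    """Route x to the top_k experts; expert i is adapter_i applied to the shared FFN output."""
    logits = matvec(W_gate, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax restricted to the selected experts (assumed routing rule).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    h = shared_ffn(x, W_in, W_out)  # computed once because the FFN weights are shared
    out = [0.0] * len(h)
    for i in top:
        expert_out = adapter(h, *adapters[i])
        out = [o + (exps[i] / z) * v for o, v in zip(out, expert_out)]
    return out, top

# Hypothetical toy dimensions, chosen only to keep the example small.
rng = random.Random(0)
d_model, d_ff, d_bottleneck, n_experts = 4, 8, 2, 4
W_in = rand_matrix(d_ff, d_model, rng)
W_out = rand_matrix(d_model, d_ff, rng)
W_gate = rand_matrix(n_experts, d_model, rng)
# Zero up-projection (an assumption): each adapter starts as the identity map.
adapters = [
    (rand_matrix(d_bottleneck, d_model, rng),
     [[0.0] * d_bottleneck for _ in range(d_model)])
    for _ in range(n_experts)
]
x = [rng.uniform(-1, 1) for _ in range(d_model)]
y, chosen = moe_layer(x, W_in, W_out, adapters, W_gate)
```

With the up-projections set to zero, every adapter is the identity, so the sparse layer initially reproduces the dense FFN output; the experts only diverge once their adapters are trained.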
Abstract
Large language models (LLMs) have demonstrated considerable proficiency in general natural language processing (NLP) tasks. Instruction tuning, a successful paradigm, enhances the ability of LLMs to follow natural language instructions and exhibit robust generalization across general tasks. However, these models often encounter performance limitations across multiple tasks due to constrained model capacity. Expanding this capacity during the instruction tuning phase poses significant challenges. To address this issue, we introduce parameter-efficient sparsity crafting (PESC), which crafts dense models into sparse models using the mixture-of-experts (MoE) architecture. PESC integrates adapters into the MoE layers of sparse models, differentiating experts without altering the individual weights within these layers. This method significantly reduces computational costs and GPU memory requirements, facilitating model capacity expansion through a minimal parameter increase while guaranteeing the quality of approximation in function space relative to the original sparse upcycling. Our empirical evaluation demonstrates the effectiveness of the PESC method. Using PESC during instruction tuning, our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5. Our code is available at https://github.com/wuhy68/Parameter-Efficient-MoE.
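The "minimal parameter increase" can be illustrated with back-of-the-envelope arithmetic for a single layer. The sketch below compares sparse upcycling, where each expert is a full copy of the dense FFN, with an adapter-based layer that keeps one shared FFN plus one small adapter per expert. All dimensions (hidden size 4096, FFN size 11008, three FFN matrices, 8 experts, adapter bottleneck 64) are hypothetical values picked for illustration and are not taken from the paper; biases are ignored.

```python
def ffn_params(d_model, d_ff, n_matrices=3):
    """Weights in one FFN block; n_matrices=3 models a gated FFN (assumption)."""
    return n_matrices * d_model * d_ff

def adapter_params(d_model, d_bottleneck):
    """Down- and up-projection weights of one bottleneck adapter."""
    return 2 * d_model * d_bottleneck

def router_params(d_model, n_experts):
    """Weights of a linear router that scores each expert."""
    return d_model * n_experts

# Hypothetical per-layer configuration.
d_model, d_ff = 4096, 11008
n_experts, d_bottleneck = 8, 64

dense = ffn_params(d_model, d_ff)
router = router_params(d_model, n_experts)

# Sparse upcycling: every expert holds its own copy of the FFN weights.
upcycled = n_experts * dense + router

# Adapter-based experts: one shared FFN plus one adapter per expert.
adapter_based = dense + n_experts * adapter_params(d_model, d_bottleneck) + router

added_by_upcycling = upcycled - dense
added_by_adapters = adapter_based - dense

print(f"dense FFN:            {dense:,}")
print(f"sparse upcycling:     {upcycled:,} (+{added_by_upcycling:,})")
print(f"adapter-based MoE:    {adapter_based:,} (+{added_by_adapters:,})")
print(f"overhead vs dense:    {added_by_adapters / dense:.1%}")
```

Under these assumed sizes, the adapter-based layer adds only a few percent on top of the dense FFN, whereas upcycling multiplies the FFN weights by the number of experts. The exact ratio depends on the bottleneck width and the number of experts.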
