FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models

Zhongyu Zhao, Menghang Dong, Rongyu Zhang, Wenzhao Zheng, Yunpeng Zhang, Huanrui Yang, Dalong Du, Kurt Keutzer, Shanghang Zhang

TL;DR

FactorLLM targets the redundancy and inference cost of dense feed-forward networks (FFNs) in large language models by factorizing each FFN into equally sized subnetworks that act as experts in a sparse Mixture-of-Experts (MoE). A Prior-Approximate Router, learned under a teacher-student framework, guides expert activation with minimal data, enabling fast adaptation to new tasks. The approach retains up to 85% of the original model’s performance while increasing inference speed by more than 30%, and it is data-efficient across multiple backbones and benchmarks. This work offers a practical path to deploying task-specific knowledge in LLMs with reduced compute and training requirements.

Abstract

Recent research has demonstrated that Feed-Forward Networks (FFNs) in Large Language Models (LLMs) play a pivotal role in storing diverse linguistic and factual knowledge. Conventional methods frequently suffer from knowledge confusion stemming from their monolithic and redundant architectures, which calls for more efficient solutions with minimal computational overhead, particularly for LLMs. In this paper, we explore the FFN computation paradigm in LLMs and introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications, while maintaining the same level of performance. Furthermore, we embed a router from the Mixture-of-Experts (MoE), combined with our devised Prior-Approximate (PA) loss term, which facilitates the dynamic activation of experts and knowledge adaptation, thereby accelerating computation and enhancing performance using minimal training data and fine-tuning steps. FactorLLM thus enables efficient knowledge factorization and activates select groups of experts specifically tailored to designated tasks, emulating the interactive functional segmentation of the human brain. Extensive experiments across various benchmarks demonstrate the effectiveness of FactorLLM, which retains up to 85% of the source model's performance while achieving over a 30% increase in inference speed. Code: https://github.com/zhenwuweihe/FactorLLM.
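
To make the factorization idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a bias-free two-layer FFN with a ReLU activation (the actual backbones may use gated activations), and the names `dense_ffn`, `split_into_experts`, and `moe_ffn` are hypothetical. Because the activation is applied elementwise, partitioning the hidden neurons into N equal groups yields N smaller FFNs whose outputs sum exactly to the dense output, and activating only a subset gives a cheaper approximation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def dense_ffn(W1, W2, x):
    """y = W2 @ relu(W1 @ x), with W1 of shape (d_h, d) and W2 of shape (d, d_h)."""
    return matvec(W2, relu(matvec(W1, x)))


def split_into_experts(W1, W2, num_experts, perm=None):
    """Partition the d_h hidden neurons into equally sized experts.

    `perm` is an optional reordering of the hidden neurons applied before
    the split (identity by default). Expert k keeps the selected rows of W1
    and the matching columns of W2.
    """
    d_h = len(W1)
    assert d_h % num_experts == 0, "hidden size must divide evenly"
    perm = list(range(d_h)) if perm is None else list(perm)
    size = d_h // num_experts
    experts = []
    for k in range(num_experts):
        idx = perm[k * size:(k + 1) * size]
        W1_k = [W1[i] for i in idx]
        W2_k = [[row[i] for i in idx] for row in W2]
        experts.append((W1_k, W2_k))
    return experts


def moe_ffn(experts, x, active):
    """Sum the outputs of the experts whose indices are listed in `active`."""
    d = len(experts[0][1])
    y = [0.0] * d
    for k in active:
        W1_k, W2_k = experts[k]
        y = [a + b for a, b in zip(y, dense_ffn(W1_k, W2_k, x))]
    return y


# Tiny illustrative example: d = 2, d_h = 4, N = 2 experts.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5], [2.0, 1.0]]
W2 = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 2.0, -1.0]]
x = [1.0, 2.0]

experts = split_into_experts(W1, W2, num_experts=2)
dense_out = dense_ffn(W1, W2, x)
all_experts_out = moe_ffn(experts, x, active=[0, 1])  # matches dense_out
one_expert_out = moe_ffn(experts, x, active=[1])      # sparse approximation
```

With every expert active the sketch reproduces the dense FFN; the approximation error appears only once a router restricts computation to a subset of experts.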

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overall framework of FactorLLM. Teacher Model: Original transformer blocks with multi-head attention (MHA) and feed-forward layers. Student Model: Modified blocks composed of the same MHA layers and a factorized FFN, with a linear router deciding which expert(s) each token passes through. Training Process: Input tokens are fed to both the original transformer layers and FactorLLM, producing the ground truth (GT) and the predictions respectively. The original transformer is frozen while FactorLLM is distilled with a composite loss, comprising the mean squared error (MSE) between per-layer representations, the cross-entropy (CE) loss between per-layer optimal masks and routing masks, and the final CE loss between GT and predictions.
  • Figure 2: We construct $N$ experts, $E^k_{\hat{\theta}}$ (with $\text{dim} = d_h/N$), by applying a permutation $P_\delta$ to the pretrained FFN $F_\theta$ (with $\text{dim} = d_h$) and then dividing it equally. Prior-Approximate Routers (PAR), randomly initialized, are placed between the MHA layer and the experts. An MSE loss is computed between the outputs from the FFN block $f_\theta$ and the best $K$ experts $f^k_{\hat{\theta}}$ over a dataset $\mathcal{D}$. Then, the top $K$ selections $\mathcal{A}$ are determined using the TopK algorithm. $\mathcal{A}$ and the output of the router $\mathcal{R}$ are combined to compute the cross-entropy (CE) loss.
  • Figure 3: Comparison of FLOPs and performance across different model configurations. The left y-axis represents GFLOPs for both attention and FFN layers, while the right y-axis shows the relative performance percentage.
  • Figure 4: Performance comparison between models with and without the router mechanism. The radar chart highlights differences across multiple performance metrics, demonstrating the impact of router integration.
  • Figure 5: Routing patterns vs. training steps.
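
One possible reading of the routing supervision described in the Figure 2 caption can be sketched as follows. This is an illustration under assumptions, not the paper's code: it assumes that experts are ranked by their individual MSE against the dense FFN output, that the $K$ closest experts form the target mask $\mathcal{A}$, and that the CE loss compares the softmax of the router logits $\mathcal{R}$ with the normalized mask. All function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def prior_mask(dense_out, expert_outs, k):
    """Assumed prior: a 0/1 mask marking the k experts closest to the dense output."""
    errors = [mse(out, dense_out) for out in expert_outs]
    best = sorted(range(len(errors)), key=lambda i: errors[i])[:k]
    return [1.0 if i in best else 0.0 for i in range(len(errors))]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_ce(router_logits, mask):
    """Cross-entropy between the normalized mask and the router distribution."""
    probs = softmax(router_logits)
    k = sum(mask)
    return -sum((m / k) * math.log(p) for m, p in zip(mask, probs) if m > 0)


# Illustrative values only: four experts, two selected (K = 2).
dense_out = [1.0, 2.0]
expert_outs = [[1.0, 2.1], [0.0, 0.0], [0.9, 2.0], [5.0, 5.0]]
mask = prior_mask(dense_out, expert_outs, k=2)

aligned_loss = routing_ce([2.0, 0.0, 2.0, 0.0], mask)     # router agrees with mask
misaligned_loss = routing_ce([0.0, 2.0, 0.0, 2.0], mask)  # router disagrees
```

Under these assumptions, a router whose logits favor the experts in the mask incurs a lower loss than one that favors the others, which is the signal that would pull a randomly initialized router toward the prior selections.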