S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning

Hanqing Zeng, Yinglong Xia, Zhuokai Zhao, Chuan Jiang, Qiang Zhang, Jiayi Liu, Qunshu Zhang, Lizhu Zhang, Xiangjun Fan, Benyu Zhang

TL;DR

Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation.

Abstract

Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptation (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) methods enhance model capacity at the cost of more and under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Conceptually, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S'MoRE emulates the capacity of numerous experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S'MoRE's residuals as a special type of Graph Neural Network (GNN), and prove that under a similar parameter budget, S'MoRE improves the structural flexibility of traditional MoE (or Mixture-of-LoRA) by an exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation. Our implementation is available at: https://github.com/ZimpleX/SMoRE-LLM.

Paper Structure

This paper contains 79 sections, 13 theorems, 35 equations, 5 figures, 7 tables.

Key Result

Proposition 3.1

S'MoRE can express MoLRE, when $L=1$ and $\sigma\left( \cdot \right)$ is the identity mapping.
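
As a rough illustration of the single-layer setting that Proposition 3.1 refers to, the following is a minimal NumPy sketch of a mixture of low-rank experts with top-$k$ routing. It assumes that MoLRE denotes the usual mixture of LoRA-style experts, where each expert contributes a rank-$r$ update $B_i A_i x$ weighted by a gate. The function names, tensor shapes, and the softmax-over-top-$k$ gate are illustrative assumptions and are not taken from the paper or its released implementation.

```python
import numpy as np

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0.
    (Hypothetical gate, chosen for illustration only.)"""
    idx = np.argsort(logits)[-k:]            # indices of the k largest logits
    weights = np.zeros_like(logits)
    z = np.exp(logits[idx] - logits[idx].max())
    weights[idx] = z / z.sum()
    return weights

def mixture_of_low_rank_experts(x, W_router, A, B, k):
    """Single-layer mixture of low-rank experts (assumed baseline form).

    x:        (d,)            input token embedding
    W_router: (s, d)          router weights, one row per expert
    A:        (s, r, d)       down-projections of the s experts
    B:        (s, d_out, r)   up-projections of the s experts
    Returns the low-rank update sum_i g_i * B_i @ A_i @ x and the gate g.
    """
    gate = top_k_gate(W_router @ x, k)
    out = np.zeros(B.shape[1])
    for i in np.nonzero(gate)[0]:            # only activated experts are evaluated
        out += gate[i] * (B[i] @ (A[i] @ x))
    return out, gate

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
d, d_out, r, s, k = 8, 6, 2, 4, 2
x = rng.normal(size=d)
W_router = rng.normal(size=(s, d))
A = rng.normal(size=(s, r, d))
B = rng.normal(size=(s, d_out, r))

delta, gate = mixture_of_low_rank_experts(x, W_router, A, B, k)
```

In this sketch the experts hold $s \cdot r \cdot (d + d_\text{out})$ trainable parameters, and each token activates exactly $k$ of the $s$ experts. The multi-layer residual structure that distinguishes S'MoRE is not modeled here.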

Figures (5)

  • Figure 1: Illustration of the layer propagation and routing process of S'MoRE.
  • Figure 2: $\Gamma_{\text{S'MoRE}}$ and $\Gamma_{\text{MoMOR}}$ w.r.t. $L$ (with $s_\ell=4$, $f_\ell=2$).
  • Figure 3: Examples where the same set of activated experts interconnect differently. MoMOR always generates the same output for (a), (b) and (c), while S'MoRE can distinguish all three cases. A variant of S'MoRE that performs activation $\sigma$ differently (§\ref{sec: smore variant}) can differentiate (a) from (b) or (c), but cannot differentiate (b) from (c). Note that (b) and (c) differ by swapped "1,1" and "1,2".
  • Figure 4: Change of accuracy w.r.t. trainable parameters, corresponding to models in Table \ref{tab: math code}.
  • Figure 5: Cost of router (Eq. \ref{eq: router}) relative to expert propagation (Eq. \ref{eq: layer aggr l}), measured by their number of arithmetic operations. Here we consider S'MoRE with noisy top-$k$ gate on LLaMA 3.2-1B.

Theorems & Definitions (22)

  • Proposition 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Corollary 3.6
  • Proposition C.1
  • proof
  • Proposition C.2
  • proof
  • ...and 12 more