T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning

Rongyu Zhang, Yijiang Liu, Huanrui Yang, Shenli Zheng, Dan Wang, Yuan Du, Li Du, Shanghang Zhang

TL;DR

A novel framework, T-REX, which leverages the combination of ultra-low rank experts to construct LoRA weights on pretrained LLMs, achieving superior efficiency and generalizability across diverse tasks.

Abstract

Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. Mixture-of-experts (MoE) provides a promising solution with a dynamic architecture, enabling effective task decoupling. However, scaling up the number of MoE experts incurs substantial parameter and computational overheads and yields limited performance gains because of naive routing mechanisms. In this paper, we design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX), which combines ultra-low-rank experts to construct LoRA weights on pretrained LLMs. The rank-1 experts enable a mix-and-match mechanism that quadratically expands the vector subspace of experts with linear parameter overhead, achieving approximate error reduction with optimal efficiency. In addition, T-REX offers implicit guidance to the router, leveraging the inherent semantic clustering of training embeddings as prior knowledge, which enables optimized feature allocation across experts for smoother convergence. Extensive theoretical and empirical results demonstrate that T-REX achieves superior efficiency and generalizability across diverse tasks. Compared with other LoRA-based methods, T-REX achieves up to 1.78% mean accuracy improvement with around 30% to 40% fewer trainable parameters across 14 public datasets. Code is available at https://github.com/RoyZry98/T-REX-Pytorch.
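
The mix-and-match idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the sizes `m`, `n`, `I`, `J` and the random gate matrix `G` are assumptions chosen for illustration, and a real router would produce the gates per input. The sketch shows that pairing `I` column vectors with `J` row vectors gives `I * J` rank-1 experts, while the number of stored expert parameters grows only linearly in `I` and `J`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: W is m x n; I and J count the rank-1 factors on each side.
m, n = 64, 48
I, J = 4, 3

A = rng.standard_normal((m, I))  # columns a_i (left factors)
B = rng.standard_normal((n, J))  # columns b_j (right factors)
G = rng.random((I, J))           # gate weights g_ij (random stand-in for a router)

# Mix-and-match: every pair (a_i, b_j) acts as one rank-1 expert a_i b_j^T.
delta_w_pairs = sum(
    G[i, j] * np.outer(A[:, i], B[:, j])
    for i in range(I)
    for j in range(J)
)

# The same update in matrix form.
delta_w = A @ G @ B.T
assert np.allclose(delta_w_pairs, delta_w)

# Parameter counts for the expert factors (gates and router excluded).
num_experts = I * J                   # 12 distinct rank-1 experts
shared_params = I * m + J * n         # 400: linear in I and J
independent_params = I * J * (m + n)  # 1344: if each expert owned its own vectors
print(num_experts, shared_params, independent_params)
```

With these assumed sizes, 12 experts are represented with 400 factor parameters instead of the 1344 that 12 independent rank-1 experts would need.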

Paper Structure

This paper contains 33 sections, 3 theorems, 29 equations, 4 figures, 11 tables, 1 algorithm.

Key Result

Lemma 4.1

The adaptation matrix $\Delta W = \mathbf{A}\mathbf{G}\mathbf{B}^\top$ generated by the Mix-and-Match mechanism spans a subspace whose dimensionality grows as $\mathcal{O}(IJ)$ as the LoRA weight ranks $I$ and $J$ increase. Specifically, the vectorized form of $\Delta W$, $\mathrm{vec}(\Delta W)$, lies in a subspace of dimension at most $IJ$, while the rank of the matrix $\Delta W$ itself is bounded by $\mathrm{rank}(\Delta W) \leq \min(I, J)$.
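
The two quantities in the lemma can be checked numerically. The following is a minimal sketch under assumptions not stated in the source: $\mathbf{A}$ is $m \times I$, $\mathbf{B}$ is $n \times J$, $\mathbf{G}$ is $I \times J$, all entries are random, and `vec` stacks columns. It relies on the standard identity $\mathrm{vec}(\mathbf{A}\mathbf{G}\mathbf{B}^\top) = (\mathbf{B} \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{G})$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: A is m x I, B is n x J, G is I x J.
m, n, I, J = 10, 8, 3, 2
A = rng.standard_normal((m, I))
B = rng.standard_normal((n, J))
G = rng.standard_normal((I, J))

delta_w = A @ G @ B.T


def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")


# vec(A G B^T) = (B kron A) vec(G), so vec(delta_w) lies in the column space of K.
K = np.kron(B, A)  # shape (m * n, I * J)
assert np.allclose(vec(delta_w), K @ vec(G))

# Rank of the update itself: bounded by min(I, J).
rank_dw = np.linalg.matrix_rank(delta_w)

# Dimension of the subspace reachable by varying G: rank(K) = rank(A) * rank(B).
span_dim = np.linalg.matrix_rank(K)

assert rank_dw <= min(I, J)
assert span_dim <= I * J
```

For generic full-rank factors, `span_dim` equals `I * J` while `rank_dw` stays at `min(I, J)`, which is the distinction the lemma draws between the subspace of reachable updates and the rank of any single update.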

Figures (4)

  • Figure 1: Illustration of the trainable parameters in (a) Vanilla LoRA, (b) LoRA-MoE, and (c) our proposed T-REX. $r_{1}, r_{2} \ll \min\{m,n\}$. Six experts are shown for both LoRA-MoE and T-REX.
  • Figure 2: 3D trajectory of router weights for (a) Traditional LoRA, (b) Rank-1 experts with Mix-and-Match, and (c) T-REX with intuition in the MoE training process. Intuition helps to enable a smoother convergence.
  • Figure 3: Out-of-distribution generalization capabilities of T-REX compared with baselines, including LoRA, MoLoRA, and SiRA, on the BBH dataset based on model backbone (a) Gemma 2B and (b) Yi 6B.
  • Figure 4: Embedding visualization of 20 datasets. Semantic clusters do not align with predefined task groupings.

Theorems & Definitions (6)

  • Lemma 4.1: Subspace Expansion with Mix-and-Match
  • proof
  • Theorem 4.2: Approximation Error Bound with Rank-1 Experts
  • proof
  • Theorem 4.3: Expert Combination with Intuition Guidance
  • proof