AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality

Peijun Qing, Chongyang Gao, Yefan Zhou, Xingjian Diao, Yaoqing Yang, Soroush Vosoughi

TL;DR

This paper shows that the number of experts per layer correlates with layer training quality, which varies significantly across layers, and introduces AlphaLoRA, a theoretically principled, training-free method for allocating LoRA experts that further reduces redundancy.

Abstract

Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), are known to enhance training efficiency in Large Language Models (LLMs). Due to the limited parameters of LoRA, recent studies seek to combine LoRA with Mixture-of-Experts (MoE) to boost performance across various tasks. However, inspired by the observed redundancy in traditional MoE structures, previous studies identify similar redundancy among LoRA experts within the MoE architecture, highlighting the necessity for non-uniform allocation of LoRA experts across different layers. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory to design a fine-grained allocation strategy. Our analysis reveals that the number of experts per layer correlates with layer training quality, which exhibits significant variability across layers. Based on this, we introduce AlphaLoRA, a theoretically principled and training-free method for allocating LoRA experts to further mitigate redundancy. Experiments on three models across ten language processing and reasoning benchmarks demonstrate that AlphaLoRA achieves comparable or superior performance over all baselines. Our code is available at https://github.com/morelife2017/alphalora.

Paper Structure

This paper contains 39 sections, 9 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of AlphaLoRA. For a transformer-based model with $m$ layers, AlphaLoRA involves two steps: (Step 1) Conducting empirical spectral density (ESD) analysis on each layer and applying power-law (PL) fitting to obtain the layer-wise PL_Alpha_Hill value. (Step 2) Converting the layer-wise PL_Alpha_Hill value into the number of experts using a mapping function, followed by initializing the experts for each layer. For instance, Weights 1 represents all the weight matrices (such as the attention and projection weight matrices) in layer 1.
  • Figure 2: Illustration of the PL_Alpha_Hill values for each layer across three different models.
  • Figure 3: Comparison of the number of experts per layer assigned by AlphaLoRA and MoLA. MoLA(2468) allocates 2 experts to each of the first 8 layers, 4 experts to each of layers 9-16, 6 experts to each of layers 17-24, and 8 experts to each of the last 8 layers, hence the name 2468. MoLA(5555) assigns a uniform 5 experts to each layer. The total number of experts is set at 160.
  • Figure 4: Comparison of different shape metrics from HT-SR theory in both the direct fine-tuning and zero-shot settings.
  • Figure 5: Comparison between AlphaLoRA and MoLA-$\triangledown$ across three configurations with varying total number of experts $T$, specifically 80, 160, and 224 experts.
  • ...and 1 more figure
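
The two steps in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (see the repository linked in the abstract for that). The function names `pl_alpha_hill` and `allocate_experts`, the choice of the top half of the spectrum for the Hill fit (`k_frac=0.5`), and the proportional mapping with a per-layer minimum are all assumptions made here for illustration; the caption only says that "a mapping function" is used. The direction of the mapping (a larger PL_Alpha_Hill value receives more experts) is likewise an assumption. The sketch also checks the expert totals stated in the Figure 3 caption.

```python
import numpy as np

# MoLA configurations as described in the Figure 3 caption (32 layers).
MOLA_2468 = [2] * 8 + [4] * 8 + [6] * 8 + [8] * 8
MOLA_5555 = [5] * 32


def pl_alpha_hill(weight, k_frac=0.5):
    """Hill estimate of the power-law exponent of the ESD of weight^T weight.

    k_frac (fraction of the largest eigenvalues used for the fit) is an
    assumed default, not a value taken from the source.
    """
    # Eigenvalues of W^T W are the squared singular values of W.
    singular_values = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    eigs = np.sort(singular_values ** 2)  # ascending order
    n = len(eigs)
    if n < 2:
        raise ValueError("need at least two eigenvalues")
    k = min(max(1, int(n * k_frac)), n - 1)
    tail = eigs[n - k:]        # the k largest eigenvalues
    ref = eigs[n - k - 1]      # the eigenvalue just below the tail
    return 1.0 + k / np.sum(np.log(tail / ref))


def allocate_experts(alphas, total, min_per_layer=1):
    """Hypothetical mapping: split `total` experts in proportion to alpha.

    Every layer gets `min_per_layer` experts; the rest are shared out
    proportionally and rounded with the largest-remainder rule so the
    counts sum to `total` exactly.
    """
    alphas = np.asarray(alphas, dtype=float)
    spare = total - min_per_layer * len(alphas)
    if spare < 0:
        raise ValueError("total is too small for the per-layer minimum")
    shares = spare * alphas / alphas.sum()
    counts = np.floor(shares).astype(int)
    leftover = int(spare - counts.sum())
    # Hand the remaining experts to the layers with the largest fractional parts.
    order = np.argsort(-(shares - counts))
    counts[order[:leftover]] += 1
    return counts + min_per_layer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random matrices stand in for the per-layer weights of a real model.
    layer_weights = [rng.standard_normal((64, 64)) for _ in range(32)]
    alphas = [pl_alpha_hill(w) for w in layer_weights]
    counts = allocate_experts(alphas, total=160)
    print(counts.sum(), sum(MOLA_2468), sum(MOLA_5555))
```

With a real model, each layer's value would be computed from its actual weight matrices (for example, by averaging the estimate over the matrices in that layer, which is again an assumption) rather than from random stand-ins.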