Rank Also Matters: Hierarchical Configuration for Mixture of Adapter Experts in LLM Fine-Tuning

Peizhuang Cong, Wenpu Liu, Wenhan Yu, Haochen Zhao, Tong Yang

TL;DR

This work addresses the inefficiency of existing mixture-of-adapter-experts approaches that fix adapter rank across layers. It introduces HiLo, a hierarchical configuration that jointly sets the number and rank of adapter experts per layer, guided by a simple rank-setting rule and dynamic allocation/activation strategies. Experiments on Llama 2-7B across diverse tasks show that HiLo achieves higher accuracy while reducing both trainable and active parameters compared with strong baselines such as MoLA, AlphaLoRA, and AdaMoE, with rank configurations such as 2468 across layers offering the best trade-off. This makes HiLo a practical option for scalable, parameter-efficient fine-tuning of large language models.

Abstract

Large language models (LLMs) have demonstrated remarkable success across various tasks, accompanied by a continuous increase in their parameter size. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address the challenges of fine-tuning LLMs by significantly reducing the number of trainable parameters. Recent studies have integrated LoRA with Mixture of Experts (MoE) architectures, leveraging multiple adapter experts and gating mechanisms to further improve fine-tuning performance. However, existing approaches primarily focus on adjusting the allocation of adapter experts per layer to optimize the number of introduced trainable parameters, while neglecting a critical factor: the rank of the adapters. To this end, we propose HiLo, a hierarchical scheme for expert allocation and rank configuration that dynamically adjusts the number and rank of adapter experts across layers, matching the varying representational complexity of model layers at adapter granularity. Extensive experiments on multiple benchmark tasks demonstrate that HiLo outperforms existing methods in accuracy while introducing fewer trainable parameters, providing an efficient and practical solution for fine-tuning LLMs.
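
To make the parameter trade-off concrete, the following minimal sketch counts trainable adapter parameters when both the number of experts and their rank vary by layer group. It relies only on the standard LoRA parameter count, rank × (d_in + d_out) per adapter. The names `LayerGroup`, `lora_params`, and `trainable_params`, the per-group values, and the choice of one adapted 4096 × 4096 matrix per layer are all illustrative assumptions, not the configuration or code used in the paper; router (gating) weights are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerGroup:
    """Hypothetical description of a block of consecutive layers."""
    num_layers: int   # transformer layers in this group
    num_experts: int  # adapter experts attached to each adapted matrix
    rank: int         # LoRA rank shared by the experts in this group


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """One LoRA adapter: A has shape (rank, d_in), B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


def trainable_params(groups, d_in=4096, d_out=4096, matrices_per_layer=1):
    """Total adapter parameters across all groups (gating weights excluded)."""
    total = 0
    for g in groups:
        per_matrix = g.num_experts * lora_params(d_in, d_out, g.rank)
        total += g.num_layers * matrices_per_layer * per_matrix
    return total


# Assumed setups for a 32-layer model split into four groups of 8 layers.
# These numbers are made up for illustration only.
uniform = [LayerGroup(8, 5, 8)] * 4          # same expert count and rank everywhere
hierarchical = [                             # experts and rank both grow with depth
    LayerGroup(8, 2, 2),
    LayerGroup(8, 4, 4),
    LayerGroup(8, 6, 6),
    LayerGroup(8, 8, 8),
]

if __name__ == "__main__":
    print("uniform:     ", trainable_params(uniform))
    print("hierarchical:", trainable_params(hierarchical))
```

Under these assumed numbers, the hierarchical setup uses the same total number of experts (160) as the uniform one but fewer trainable parameters, because lower layers carry low-rank experts. Whether such a setup also improves accuracy is an empirical question that the sketch does not address.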

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Optimization focus of adapter experts. Existing works primarily optimize the number of adapters, whereas this paper explores both the number and rank of adapters to improve fine-tuning performance.
  • Figure 2: Analysis of output values between original model weight matrices and adapters across layers.
  • Figure 3: Proportions of values (<$10^{-3}$) for the vanilla method, MoLA, and HiLo. (1) The proportions of the three methods in the 1st layer are very close; (2) excluding the 1st layer, MoLA has lower proportions than Vanilla and HiLo in 4 and 2 layers, respectively; (3) HiLo exhibits lower proportions than Vanilla and MoLA in 5 layers.
  • Figure 4: HiLo Design
  • Figure 5: Accuracy comparison between HiLo and MoLA. HiLo assigns experts and rank as $\triangledown$ across layers.
  • ...and 1 more figure