Rank Also Matters: Hierarchical Configuration for Mixture of Adapter Experts in LLM Fine-Tuning
Peizhuang Cong, Wenpu Liu, Wenhan Yu, Haochen Zhao, Tong Yang
TL;DR
Existing mixture-of-adapter-experts approaches fix the adapter rank across layers, which is inefficient. This work introduces HiLo, a hierarchical configuration that jointly sets the number and rank of adapter experts per layer, guided by a simple rank-setting rule and dynamic allocation/activation strategies. On Llama 2-7B across diverse tasks, HiLo achieves higher accuracy with fewer trainable and active parameters than strong baselines such as MoLA, AlphaLoRA, and AdaMoE, and a rank configuration such as 2468 across layers gives the best trade-off between accuracy and parameter count. This makes HiLo a practical option for scalable, parameter-efficient fine-tuning of large language models.
Abstract
Large language models (LLMs) have demonstrated remarkable success across various tasks, accompanied by a continuous increase in their parameter counts. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address the challenges of fine-tuning LLMs by significantly reducing the number of trainable parameters. Recent studies have integrated LoRA with Mixture of Experts (MoE) architectures, using multiple adapter experts and gating mechanisms to further improve fine-tuning performance. However, existing approaches focus primarily on adjusting how many adapter experts are allocated to each layer in order to reduce the number of added trainable parameters, while neglecting another critical factor: the rank of each adapter. To this end, we propose HiLo, a hierarchical scheme for expert allocation and rank configuration that dynamically adjusts both the number and the rank of adapter experts across layers, matching the varying representational complexity of model layers at adapter granularity. Extensive experiments on multiple benchmark tasks demonstrate that HiLo outperforms existing methods in accuracy while introducing fewer trainable parameters, providing an efficient and practical solution for fine-tuning LLMs.
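To make the parameter argument concrete, the following minimal sketch counts trainable parameters for a mixture of LoRA experts when the number of experts and the rank are chosen per layer. It relies only on the standard LoRA fact that a rank-r adapter on a d_in x d_out weight adds r * (d_in + d_out) parameters. Everything else is an assumption made for illustration and is not taken from the paper: the function names, the single adapted projection per layer, the linear router, the hidden size of 4096 with 32 layers, and the specific per-layer (experts, rank) pairs.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def moe_adapter_params(layer_configs, d_in, d_out, include_router=True):
    """Total trainable parameters for a stack of layers.

    layer_configs: list of (num_experts, rank) pairs, one per layer.
    Assumes one adapted projection per layer and, optionally, a linear
    router of shape (d_in x num_experts). Both are simplifications.
    """
    total = 0
    for num_experts, rank in layer_configs:
        total += num_experts * lora_params(d_in, d_out, rank)
        if include_router:
            total += d_in * num_experts
    return total


# Hypothetical model dimensions (hidden size 4096, 32 layers).
D = 4096
NUM_LAYERS = 32

# Uniform baseline: same expert count and rank in every layer (illustrative values).
uniform = [(5, 8)] * NUM_LAYERS

# Hierarchical example: four groups of 8 layers, with expert count and rank
# both growing with depth. These pairs are illustrative, not the paper's setting.
groups = [(2, 2), (4, 4), (6, 6), (8, 8)]
hierarchical = [cfg for cfg in groups for _ in range(NUM_LAYERS // len(groups))]

uniform_total = moe_adapter_params(uniform, D, D)
hierarchical_total = moe_adapter_params(hierarchical, D, D)

print(f"uniform:      {uniform_total:,} trainable parameters")
print(f"hierarchical: {hierarchical_total:,} trainable parameters")
```

Because the count scales with the product of expert count and rank in each layer, lowering the rank in some layers frees budget that can be spent on more or larger experts elsewhere. This is the trade-off a per-layer rank configuration exposes and a fixed-rank scheme does not.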
