MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning

Xujia Wang, Haiyan Zhao, Shuo Wang, Hanqing Wang, Zhiyuan Liu

TL;DR

This paper proposes Mixture of Asymmetric Low-Rank Adaptation (MALoRA), a flexible fine-tuning framework that leverages asymmetric optimization across LoRA experts and consistently outperforms all baseline methods in both inter-domain and intra-domain tasks.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have significantly improved the adaptation of LLMs to downstream tasks in a resource-efficient manner. However, in multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge. Mixture-of-LoRA (MoLoRA), which combines LoRA with sparse Mixture-of-Experts, mitigates some of these issues by promoting task-specific learning across experts. Despite this, MoLoRA remains inefficient in terms of training speed, parameter utilization, and overall multi-task performance. In this paper, we propose Mixture of Asymmetric Low-Rank Adaptation (MALoRA), a flexible fine-tuning framework that leverages asymmetric optimization across LoRA experts. MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models. Additionally, MALoRA addresses overfitting issues commonly seen in high-rank configurations, enhancing performance stability. Extensive experiments across diverse multi-task learning scenarios demonstrate that MALoRA consistently outperforms all baseline methods in both inter-domain and intra-domain tasks.

Paper Structure

This paper contains 38 sections, 8 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Architectures Overview: (a) LoRA, (b) MoLoRA, and (c) the proposed MALoRA. MALoRA optimizes the model in two key ways: (I) it increases the rank of the up-projection matrices ($B_t$) to enhance expert generalization capabilities, and (II) it introduces a shared low-rank subspace ($S_A$) in the down-projection matrices ($A_t$), while assigning each expert a unique coefficient matrix ($P_t$), effectively reducing parameter redundancy and computation.
  • Figure 2: Spatial Similarity Analysis. Spatial similarity between LoRA experts within the same MoLoRA layer, evaluated using CCA. The down-projection matrix ($A$) demonstrates significantly higher similarity across all learning scenarios (ST(S), ST(D), MT), suggesting it captures generalized features. In contrast, the up-projection matrix ($B$) shows much lower similarity, indicating its role in task-specific fine-tuning.
  • Figure 3: Singular Values of the Concatenated Homologous Matrices in descending order. Matrix $B$ shows a concentration of larger singular values, indicating that many singular vectors are important for task-specific fine-tuning. In contrast, $A$ has many small singular values and only a few large ones, suggesting that only a subset of singular vectors plays a critical role. This reflects that $B$ distributes importance across more components, while $A$ relies on a smaller, more focused set of key features for generalization.
  • Figure 4: (a) Multi-domain learning performance across different ranks, with methods maintaining a comparable number of trainable parameters on the same x-axis. (b) Ablation study of hyperparameters $\beta$ and $d$ in common-sense multi-task learning. (c) Comparison of training latency for various PEFT methods with FastMoE.
  • Figure 5: Performance Variations of LoRA and Asymmetry LoRA with Respect to Rank on the Math Reasoning Task MGSM (Shi et al., 2022). The ticks on the x-axis represent the number of trainable parameters in Asymmetry LoRA and LoRA.
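
The architecture described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each expert's down-projection factors as $A_t = P_t S_A$, that routing is a linear gate with top-k selection and a softmax over the selected experts, and that $B_t$ is zero-initialized as in standard LoRA. All sizes and the names `malora_delta`, `W_gate`, `d_shared`, and `top_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only for illustration.
n_in, n_out = 64, 64   # layer input / output width
num_experts = 4        # number of LoRA experts
top_k = 2              # experts activated per input (assumed sparse routing)
r = 8                  # per-expert rank
d_shared = 16          # width of the shared subspace S_A (assumed)

# Shared low-rank subspace S_A, used by every expert.
S_A = rng.normal(scale=0.02, size=(d_shared, n_in))
# Per-expert coefficient matrices P_t, so that A_t = P_t @ S_A (assumed form).
P = rng.normal(scale=0.02, size=(num_experts, r, d_shared))
# Per-expert up-projections B_t, zero-initialized as in standard LoRA (assumption).
B = np.zeros((num_experts, n_out, r))
# Router weights for an assumed linear gate.
W_gate = rng.normal(scale=0.02, size=(num_experts, n_in))


def malora_delta(x):
    """Low-rank update for one input vector x of shape (n_in,)."""
    logits = W_gate @ x
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over selected experts
    shared = S_A @ x                              # computed once, reused by all experts
    out = np.zeros(n_out)
    for w, t in zip(weights, top):
        out += w * (B[t] @ (P[t] @ shared))       # B_t P_t S_A x
    return out


# Down-projection parameter counts under these hypothetical sizes.
molora_down = num_experts * r * n_in                      # independent A_t per expert
malora_down = d_shared * n_in + num_experts * r * d_shared  # shared S_A plus all P_t

print(malora_delta(np.ones(n_in)).shape, molora_down, malora_down)
```

Under these illustrative sizes the shared-subspace factorization stores fewer down-projection parameters than independent per-expert matrices, and the product `S_A @ x` is computed once per input instead of once per expert. The numbers are not meant to reproduce the 30% to 48% reduction reported in the abstract, which depends on the paper's actual configuration.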