ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs

Haseeb Ullah Khan Shinwari, Muhammad Usama

TL;DR

ARD-LoRA tackles the inefficiency of fixed-rank PEFT by learning per-head, per-layer rank allocations through differentiable scaling factors optimized under a meta-objective that enforces sparsity and stable rank transitions. On LLAMA-3.1-70B and PaliGemma-2, the method reaches up to 99.3% of full fine-tuning performance with only 0.32% trainable parameters and reduces multimodal adaptation memory by 41%. Theoretical analyses establish convergence, generalization, and stability guarantees, and extensive experiments demonstrate strong empirical gains, including superior cross-domain generalization and meaningful reductions in adaptation overhead. Overall, dynamic, fine-grained rank allocation emerges as a powerful paradigm for efficient, scalable foundation-model adaptation across modalities and tasks.

Abstract

Conventional Low-Rank Adaptation (LoRA) methods employ a fixed rank, imposing uniform adaptation across transformer layers and attention heads despite their heterogeneous learning dynamics. This paper introduces Adaptive Rank Dynamic LoRA (ARD-LoRA), a novel framework that automates rank allocation through learnable scaling factors. These factors are optimized via a meta-objective balancing task performance and parameter efficiency, incorporating $\ell_1$ sparsity for minimal rank and Total Variation regularization for stable rank transitions. ARD-LoRA enables continuous, differentiable, per-head rank adaptation. Experiments on LLAMA-3.1-70B and PaliGemma-2 demonstrate ARD-LoRA's efficacy, achieving up to 99.3% of full fine-tuning performance with only 0.32% trainable parameters, outperforming strong baselines like DoRA and AdaLoRA. Furthermore, it reduces multimodal adaptation memory by 41%. These results establish dynamic, fine-grained rank allocation as a critical paradigm for efficient foundation model adaptation.
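
The meta-objective described in the abstract can be illustrated with a minimal sketch. The abstract only states that a task loss is balanced against an $\ell_1$ sparsity term and a Total Variation term on the scaling factors; the exact combination below, the function names, and the weights `lam` and `beta` are assumptions made for illustration, as is the choice to take the Total Variation over consecutive entries of a sequence (which could be adjacent layers or consecutive training steps).

```python
from typing import Sequence


def l1_penalty(alphas: Sequence[float]) -> float:
    """Sum of absolute scaling factors; a larger value means more rank in use."""
    return sum(abs(a) for a in alphas)


def total_variation(alphas: Sequence[float]) -> float:
    """Sum of absolute differences between consecutive entries.

    Assumption: the sequence axis (layers or training steps) is chosen by
    the caller; the source does not specify it in this excerpt.
    """
    return sum(abs(b - a) for a, b in zip(alphas, alphas[1:]))


def meta_objective(task_loss: float, alphas: Sequence[float],
                   lam: float = 0.01, beta: float = 0.1) -> float:
    """Assumed form: task loss + lam * (l1 + beta * TV)."""
    return task_loss + lam * (l1_penalty(alphas) + beta * total_variation(alphas))


if __name__ == "__main__":
    # Hypothetical scaling factors for four heads.
    alphas = [1.0, 0.5, 0.5, 2.0]
    print(meta_objective(task_loss=1.0, alphas=alphas, lam=0.1, beta=0.5))
```

In this sketch a smooth, small set of scaling factors is penalized less than a large or abruptly changing one, which is the qualitative behavior the abstract attributes to the two regularizers.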

Paper Structure

This paper contains 36 sections, 1 theorem, 28 equations, 7 figures, 11 tables, 1 algorithm.

Key Result

Theorem III.1

Suppose the learning rates for updating the LoRA parameters and scaling factors satisfy appropriate conditions (e.g., $\eta_\Theta \leq 1/L_T$ and $\eta_\alpha \leq 1/(L_T+\lambda\beta)$) and that Assumption ass:lip holds. Then the sequence $\{(\mathbf{A}(t),\mathbf{B}(t),\alpha(t))\}$ produced by t where $C>0$ is a constant and $\Theta(t)$ denotes the collection of LoRA parameters. Thus, the algo
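
The step-size conditions quoted in the theorem statement, $\eta_\Theta \leq 1/L_T$ and $\eta_\alpha \leq 1/(L_T+\lambda\beta)$, can be made concrete with a small worked calculation. This is only a sketch: the function name and the numeric values are hypothetical, and the interpretation of $L_T$ as a smoothness constant and of $\lambda$, $\beta$ as regularization weights is inferred from the notation rather than stated in this excerpt.

```python
from typing import Tuple


def max_step_sizes(L_T: float, lam: float, beta: float) -> Tuple[float, float]:
    """Return the largest step sizes allowed by the two quoted conditions.

    eta_theta <= 1 / L_T
    eta_alpha <= 1 / (L_T + lam * beta)
    """
    if L_T <= 0:
        raise ValueError("L_T must be positive")
    if lam < 0 or beta < 0:
        raise ValueError("lam and beta must be non-negative")
    eta_theta = 1.0 / L_T
    eta_alpha = 1.0 / (L_T + lam * beta)
    return eta_theta, eta_alpha


if __name__ == "__main__":
    # Hypothetical constants, chosen only to make the arithmetic easy to follow.
    eta_theta, eta_alpha = max_step_sizes(L_T=4.0, lam=2.0, beta=2.0)
    print(eta_theta, eta_alpha)
```

Because $\lambda\beta \geq 0$, the bound on $\eta_\alpha$ is never looser than the bound on $\eta_\Theta$, so under these conditions the scaling factors are updated at least as cautiously as the LoRA parameters.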

Figures (7)

  • Figure 1: Architecture of the ARD-LoRA method. The frozen pre-trained weights $\mathbf{W}_{l,h} \in \mathbb{R}^{d \times k}$ are augmented with low-rank updates $\Delta \mathbf{W}_{l,h} = \mathbf{B}_{l,h} \mathbf{A}_{l,h}$, where $\mathbf{B}_{l,h} \in \mathbb{R}^{d \times r_{l,h}}$ and $\mathbf{A}_{l,h} \in \mathbb{R}^{r_{l,h} \times k}$. The effective rank $r_{l,h}$ is dynamically computed as $r_{l,h} = \lfloor r_0 \cdot \alpha_{l,h} \rceil$, where $r_0$ is a base rank and $\alpha_{l,h}$ is a learnable scaling factor. This dynamic rescaling allows for adaptive rank allocation across layers and attention heads, optimizing parameter efficiency and task performance.
  • Figure 2: Evolution of effective ranks across layers during training on MMLU. Higher layers (closer to output) naturally develop higher ranks, indicating greater adaptation needs. Shaded regions show standard deviation across attention heads.
  • Figure 3: Peak memory usage vs. model size for different PEFT methods. ARD-LoRA shows superior memory efficiency, especially for larger models.
  • Figure 4: Effective rank distribution across attention heads and layers. Cross-attention heads (1-8) consistently develop higher ranks, particularly in upper layers, indicating their crucial role in model adaptation.
  • Figure 5: Validation loss trajectories for different PEFT methods. ARD-LoRA exhibits faster convergence and lower final loss, attributed to its dynamic rank allocation mechanism.
  • ...and 2 more figures
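
The rank rule in the Figure 1 caption, $r_{l,h} = \lfloor r_0 \cdot \alpha_{l,h} \rceil$, can be sketched as follows. The caption gives only the rounding rule and the shapes of $\mathbf{B}_{l,h}$ and $\mathbf{A}_{l,h}$; the tie-breaking rule (halves round up), the clamp at zero, the function names, and the example dimensions are assumptions made for illustration.

```python
import math


def effective_rank(r0: int, alpha: float) -> int:
    """Round r0 * alpha to the nearest integer.

    Assumptions: ties round up, and negative products are clamped to 0.
    """
    return max(0, math.floor(r0 * alpha + 0.5))


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


if __name__ == "__main__":
    r0 = 16  # hypothetical base rank
    # Hypothetical learned scaling factors for three heads in one layer.
    for alpha in (0.0, 0.3, 1.5):
        r = effective_rank(r0, alpha)
        print(alpha, r, lora_param_count(d=128, k=128, r=r))
```

A scaling factor of zero removes the head's update entirely, while a factor above one grows the rank past the base value, which is how a single scalar per head can reallocate parameters across the model.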

Theorems & Definitions (1)

  • Theorem III.1: Convergence of ARD-LoRA