Table of Contents
Fetching ...

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang

TL;DR

LoRASC tackles the expressiveness and generalization gaps in Low-Rank Adaptation (LoRA) for fine-tuning large foundation models. It introduces cascading LoRA learning with a slow-fast update and cascading noisy tuning, enabling a sequence of LoRA experts to be merged into the backbone without increasing inference cost. The approach yields significant improvements across NLP and CV benchmarks, including instruction-following tasks and ImageNet robustness, and is compatible with existing LoRA variants such as LoRA+, Dora, and COLA. Empirically, LoRASC demonstrates enhanced fitting capability and robustness to overfitting, owing to its cascade-based rank expansion and regularization through noise and averaging. Overall, it offers a practical, plug-in method to boost transfer learning performance for large models without added training or inference overhead.

Abstract

Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be release in https://github.com/microsoft/LoRASC very soon.

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

TL;DR

LoRASC tackles the expressiveness and generalization gaps in Low-Rank Adaptation (LoRA) for fine-tuning large foundation models. It introduces cascading LoRA learning with a slow-fast update and cascading noisy tuning, enabling a sequence of LoRA experts to be merged into the backbone without increasing inference cost. The approach yields significant improvements across NLP and CV benchmarks, including instruction-following tasks and ImageNet robustness, and is compatible with existing LoRA variants such as LoRA+, Dora, and COLA. Empirically, LoRASC demonstrates enhanced fitting capability and robustness to overfitting, owing to its cascade-based rank expansion and regularization through noise and averaging. Overall, it offers a practical, plug-in method to boost transfer learning performance for large models without added training or inference overhead.

Abstract

Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be release in https://github.com/microsoft/LoRASC very soon.
Paper Structure (25 sections, 12 equations, 2 figures, 4 tables, 1 algorithm)

This paper contains 25 sections, 12 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Iterative pipeline of $\textit{LoRASC}$. Here, $t$ represents the iteration step, and $BA$ denotes the low-rank learnable vectors in LoRA. The backbone network $W$ always has its gradients turned off, and $\alpha$ is the hyperparameter controlling the pace of the slow-fast update. Our method follows three stages: 1. Fast LoRA expert training, where noise is added to the backbone network, followed by training the fast LoRA on the task data. 2. Slow LoRA expert merging, where a portion of the learned fast LoRA is weighted and merged into the slow LoRA. 3. Update the pretrained model, merging the updated slow LoRA into the backbone network, and prepare for the next iteration.
  • Figure 2: Performance of $\textit{LoRASC}$ compared to LoRA and COLA across various ranks and learning schedules in a subset of text transfer learning tasks. It can be observed that $\textit{LoRASC}$ consistently achieves stable performance improvements across all ranks and learning schedules, particularly at higher ranks and longer epochs, where $\textit{LoRASC}$ can mitigate performance degradation caused by overfitting.