Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape

Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, Xiaolin Huang

TL;DR

The paper targets parameter-efficient fine-tuning of large pretrained models with LoRA, which optimizes low-rank matrices in a restricted parameter space. It introduces Flat-LoRA, which seeks minima that are flat in the full parameter space by minimizing a Bayesian expected loss under designed random weight perturbations, and it keeps memory use low by storing only the random seed and a small number of filter norms. Experiments across NLP, vision, and large-language-model tasks show that Flat-LoRA consistently improves both in-domain and out-of-domain generalization over standard LoRA, often matching or surpassing SAM-based approaches without their extra computational cost. The approach also combines well with other LoRA variants and is robust to distribution shifts, making it a practical enhancement for parameter-efficient fine-tuning.

Abstract

Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computation and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, offers an efficient solution by optimizing only low-rank matrices. Despite recent progress in improving LoRA's performance, the relationship between the LoRA optimization space and the full parameter space is often overlooked. A solution that appears flat in the loss landscape of the LoRA space may still exhibit sharp directions in the full parameter space, potentially compromising generalization. We introduce Flat-LoRA, which aims to identify a low-rank adaptation situated in a flat region of the full parameter space. Instead of adopting the well-established sharpness-aware minimization approach, which incurs significant computation and memory overheads, we employ a Bayesian expectation loss objective to preserve training efficiency. Further, we design a refined random perturbation generation strategy for improved performance and carefully manage memory overhead using random seeds. Experiments across diverse tasks, including mathematical reasoning, coding abilities, dialogue generation, instruction following, and text-to-image generation, demonstrate that Flat-LoRA improves both in-domain and out-of-domain generalization. Code is available at https://github.com/nblt/Flat-LoRA.
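
The seed-based perturbation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy squared loss, the helper names (`noise_from_seed`, `perturbed_step`), the plain isotropic Gaussian noise, and the default `sigma` and `lr` values are all assumptions made for illustration, and the paper's refined perturbation strategy is not reproduced. The sketch perturbs the merged weights `W0 + B A`, takes the gradient for the low-rank factors at the perturbed point, and then regenerates the same noise from the seed to undo the perturbation, so no copy of the noise has to be kept.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def axpy(a, b, scale):
    """Return a + scale * b elementwise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def noise_from_seed(seed, rows, cols, sigma):
    """Regenerate the identical Gaussian matrix from its seed (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]

def perturbed_step(W0, B, A, x, y, seed, sigma=0.05, lr=0.1):
    """One SGD step on the toy loss 0.5 * ||(W0 + B A + eps) x - y||^2 w.r.t. B and A."""
    m, n = len(W0), len(W0[0])
    merged = axpy(W0, matmul(B, A), 1.0)
    # Perturb the merged weights; only `seed` needs to be remembered.
    perturbed = axpy(merged, noise_from_seed(seed, m, n, sigma), 1.0)
    residual = axpy(matmul(perturbed, x), y, -1.0)   # shape (m x 1)
    grad_W = matmul(residual, transpose(x))          # shape (m x n)
    grad_B = matmul(grad_W, transpose(A))            # shape (m x r)
    grad_A = matmul(transpose(B), grad_W)            # shape (r x n)
    # Undo the perturbation by regenerating the same noise from the seed.
    restored = axpy(perturbed, noise_from_seed(seed, m, n, sigma), -1.0)
    return axpy(B, grad_B, -lr), axpy(A, grad_A, -lr), merged, restored
```

Drawing a fresh seed at every step gives a one-sample estimate of the expected loss under the perturbation. Note that `restored` matches `merged` only up to floating-point rounding, since the noise is added and then subtracted.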

Paper Structure

This paper contains 22 sections, 2 theorems, 9 equations, 8 figures, 11 tables.

Key Result

Lemma 3.1

Assume the loss function $L(W)$ is $\alpha$-Lipschitz continuous and $\beta$-smooth w.r.t. $W$ under the $\ell_2$-norm. Then the smoothed function $\mathbb{E}_{(\varepsilon_W)_{i,j} \sim \mathcal{N}(0, \sigma^2)}\, L(W + \varepsilon_W)$ is $\min\left\{\frac{\alpha}{\sigma}, \beta\right\}$-smooth w.r.t. $W$.
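
A one-dimensional example may make the $\frac{\alpha}{\sigma}$ term concrete. The sketch below is an illustration chosen here, not taken from the paper: it uses $L(w) = |w|$, which is $1$-Lipschitz but has a kink at zero (so it has no finite $\beta$, and only the $\frac{\alpha}{\sigma}$ branch of the bound applies). For this loss the Gaussian-smoothed function has the closed-form gradient $\operatorname{erf}\!\left(w / (\sigma\sqrt{2})\right)$ and a second derivative equal to twice the $\mathcal{N}(0, \sigma^2)$ density at $w$; the function names are hypothetical.

```python
import math

def smoothed_abs_grad(w, sigma):
    """d/dw of E|w + eps| with eps ~ N(0, sigma^2): erf(w / (sigma * sqrt(2)))."""
    return math.erf(w / (sigma * math.sqrt(2.0)))

def smoothed_abs_curvature(w, sigma):
    """Second derivative of E|w + eps|: twice the N(0, sigma^2) density at w."""
    density = math.exp(-w * w / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
    return 2.0 * density

def max_curvature(sigma, lo=-5.0, hi=5.0, steps=2001):
    """Largest curvature of the smoothed loss on an evenly spaced grid that includes 0."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(smoothed_abs_curvature(w, sigma) for w in grid)

if __name__ == "__main__":
    alpha = 1.0  # Lipschitz constant of |w|
    for sigma in (0.1, 0.5, 2.0):
        # The peak curvature sits at w = 0 and stays below alpha / sigma.
        print(sigma, max_curvature(sigma), alpha / sigma)
```

The peak curvature is $\sqrt{2/\pi}/\sigma \approx 0.8/\sigma$, attained at $w = 0$, so a larger perturbation scale $\sigma$ yields a smoother objective, in line with the lemma's bound.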

Figures (8)

  • Figure 1: Illustration of LoRA optimization space. LoRA constrains optimization to a lower-dimensional space (blue). A flat minimum in LoRA space (blue curve) may exhibit sharp directions in the full parameter space (red curve).
  • Figure 2: Illustration of LoRA (Left) and Flat-LoRA (Right). By introducing designed random weight perturbations during fine-tuning, Flat-LoRA identifies a low-rank solution that is flat in the loss landscape of the full parameter space. Unlike SAM, it eliminates the need for additional gradient steps and remains memory-efficient by storing only the random seed and a small number of filter norms (less than $1/r$ of the LoRA parameters for rank $r$).
  • Figure 3: Images generated by SDXL fine-tuned with LoRA and Flat-LoRA on 3D icon datasets. Each column uses the same seeds for fair comparison.
  • Figure 4: Performance comparison of LoRA and Flat-LoRA across different corruption levels of CIFAR-100-C. The model is fine-tuned on CIFAR-100 with CLIP ViT-B/32.
  • Figure 5: Performance comparison across different LoRA ranks. Keeping the LoRA alpha fixed at 16, we vary the LoRA ranks among $\{1, 4, 16, 64\}$. The results are averaged over three independent trials.
  • ...and 3 more figures
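
The memory claim in the Figure 2 caption can be checked with a quick count. The sketch below assumes one stored norm per output row ("filter") of an $m \times n$ weight matrix; that reading of "filter norms", the function names, and the layer sizes in the usage note are assumptions made here, not details confirmed by this summary.

```python
def lora_param_count(m, n, r):
    """Trainable parameters of a rank-r LoRA pair B (m x r) and A (r x n)."""
    return r * (m + n)

def filter_norm_count(m):
    """Assumed stored state: one norm per output row of the m x n weight matrix."""
    return m

def norms_below_one_over_r(m, n, r):
    """Check whether the stored norms number fewer than 1/r of the LoRA parameters."""
    # Multiply through by r to stay in integer arithmetic.
    return filter_norm_count(m) * r < lora_param_count(m, n, r)
```

For a hypothetical 4096 x 4096 layer with rank 16, the LoRA factors hold 16 * (4096 + 4096) = 131072 parameters, while the assumed per-row norms number 4096, which is 1/32 of the LoRA parameters. Under this assumption the inequality reduces to m < m + n, so it holds for any rank.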

Theorems & Definitions (2)

  • Lemma 3.1 (cited in the paper as bisla2022low)
  • Proposition 3.2