LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer

TL;DR

LORENZA tackles the memory and generalization challenges of PEFT for LLMs by combining zeroth-order adaptive SAM with memory-efficient, adaptive low-rank gradient updates. It introduces AdaZo-SAM, which estimates SAM's ascent perturbation with randomized finite differences so that each iteration needs only a single gradient computation, and LORENZA, which projects gradients into a low-rank subspace selected by SSRF and then applies a low-rank Adam step. The approach comes with convergence guarantees and is reported to improve accuracy and robustness on GLUE fine-tuning, LLAMA pre-training on C4, and reasoning benchmarks, while substantially reducing memory usage compared to full fine-tuning and existing low-rank methods. It thus offers a practical path to high-performance LLM training and fine-tuning under resource constraints.

Abstract

We study robust parameter-efficient fine-tuning (PEFT) techniques designed to improve accuracy and generalization while operating within strict computational and memory hardware constraints, specifically focusing on large-language models (LLMs). Existing PEFT methods often lack robustness and fail to generalize effectively across diverse tasks, leading to suboptimal performance in real-world scenarios. To address this, we present a new highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM) while requiring only a single-gradient computation in every iteration. This is achieved using a stochastic zeroth-order estimation to find SAM's ascent perturbation. We provide a convergence guarantee for AdaZo-SAM and show that it improves the generalization ability of state-of-the-art PEFT methods. Additionally, we design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM. LORENZA utilizes a randomized SVD scheme to efficiently compute the subspace projection matrix and apply optimization steps onto the selected subspace. This technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, achieving the same reduced memory consumption as gradient-low-rank-projection methods. We provide a convergence analysis of LORENZA and demonstrate its merits for pre-training and fine-tuning LLMs.
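
The single-gradient idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `rge` and `adazo_sam_step`, the forward-difference form of the estimator, the Gaussian probe direction, and all hyperparameter defaults (`mu`, `rho`, `lr`, `beta1`, `beta2`) are assumptions made for the example.

```python
import numpy as np


def rge(loss, w, rng, mu=1e-3):
    """One-sample randomized gradient estimate (assumed forward-difference form).

    Uses two loss evaluations and no backpropagation.
    """
    u = rng.standard_normal(w.shape)           # random probe direction
    return (loss(w + mu * u) - loss(w)) / mu * u


def adazo_sam_step(loss, grad, w, m, v, t, rng,
                   lr=1e-2, rho=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical AdaZo-SAM-style step: zeroth-order ascent, then Adam."""
    # 1) Estimate the SAM ascent direction from function values only.
    s = rge(loss, w, rng)
    w_adv = w + rho * s / (np.linalg.norm(s) + 1e-12)

    # 2) The only true gradient of the iteration, taken at the perturbed point.
    g = grad(w_adv)

    # 3) Standard Adam update with bias correction (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Classical SAM needs one gradient to find the perturbation and a second one at the perturbed point; in this sketch the first is replaced by two loss evaluations, which is the trade the abstract describes.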

Paper Structure

This paper contains 21 sections, 2 theorems, 23 equations, 2 figures, 6 tables, 3 algorithms.

Key Result

Theorem 3.1

Consider a $\beta$-smooth, non-convex function $f$ parametrized by a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$, where $m \leq n$ without loss of generality. Suppose $f$ satisfies $\sup_{\mathbf{W}} \mathbb{E}_\xi\|f(\mathbf{W};\xi)\| \leq C$ for some large $C \in \mathbb{R}_+$, where $\hat{\nabla} f\left(\mathbf{W}_t\right)$ is the RGE (eq. RGE) of $f$ with $q=1$, $\mu$ …

Figures (2)

  • Figure 1: An illustration of how the flatness of different minima can affect test loss. ${\bf W}_1$ and ${\bf W}_3$ lie in sharp regions with high generalization error, while ${\bf W}_2$, found in a flatter region, exhibits lower generalization error [rs16162877].
  • Figure 2: The training process of LORENZA (\ref{alg:LORENZA}). The process begins by selecting a low-rank subspace using the efficient SSRF algorithm (\ref{alg::randomized_range_finder}), visualized here as a 2D plane (blue and orange). Next, a low-rank AdaZo-SAM optimization step (\ref{alg:AdaZo_SAM}) is performed. Specifically, the estimated low-rank ascent direction $\tilde{{\bf S}}_t$ is computed with the RGE method on the 2D subspace. This ascent direction is used to calculate the adversarial gradient ${\bf G}_t$ at the perturbed weights ${\bf W}_t+\rho \frac{\tilde{{\bf S}}_t}{\left\|\tilde{{\bf S}}_t\right\|_2}$, which is then projected onto the 2D subspace as ${\hat{{\bf G}}_t}^{2 \times m} = {{\bf Q}}_t^{2 \times n} {\tilde{{\bf G}}_t}^{n \times m}$. Following this, a low-rank Adam optimization step is applied. After a predetermined number of LORENZA steps, the optimization subspace ${{\bf Q}}_t$ is updated, and the process is repeated.
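
The projection and low-rank Adam step described in the Figure 2 caption can be sketched as below. This is an illustrative NumPy sketch under stated assumptions, not the paper's SSRF or LORENZA algorithm: the subspace is chosen with a basic randomized range finder (Gaussian sketch followed by QR), the shapes follow the caption (gradient $n \times m$, projection $r \times n$), and the names `randomized_range_finder` and `low_rank_adam_step` as well as the hyperparameter defaults are hypothetical. The input `g` stands for the gradient evaluated at the perturbed weights.

```python
import numpy as np


def randomized_range_finder(g, rank, rng):
    """Return Q (rank x n) whose orthonormal rows approximately span range(g).

    g has shape (n, m); assumes rank <= min(n, m).
    """
    omega = rng.standard_normal((g.shape[1], rank))   # Gaussian sketch, m x rank
    basis, _ = np.linalg.qr(g @ omega)                # n x rank, orthonormal columns
    return basis.T


def low_rank_adam_step(w, g, q, m, v, t,
                       lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step whose moment estimates live in the rank-r subspace."""
    g_low = q @ g                                     # project: (r x n)(n x m) -> r x m
    m = beta1 * m + (1.0 - beta1) * g_low
    v = beta2 * v + (1.0 - beta2) * g_low ** 2
    m_hat = m / (1.0 - beta1 ** t)                    # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    step = m_hat / (np.sqrt(v_hat) + eps)             # r x m
    return w - lr * (q.T @ step), m, v                # lift back to n x m
```

The optimizer states `m` and `v` have shape $r \times m$ rather than $n \times m$, which is where the memory saving of gradient low-rank projection comes from. In a training loop, `randomized_range_finder` would be called again only after a fixed number of steps, matching the periodic subspace update in the caption.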

Theorems & Definitions (6)

  • Theorem 3.1: AdaZo-SAM convergence rate
  • Definition 3.2
  • Theorem 3.3: Convergence of LORENZA
  • proof
  • proof
  • Definition 2.1