Sparse Layer Sharpness-Aware Minimization for Efficient Fine-Tuning

Yifei Cheng, Xianglin Yang, Guoxia Wang, Chao Huang, Fei Ma, Dianhai Yu, Xiaochun Cao, Li Shen

TL;DR

This work tackles the high computational cost of Sharpness-Aware Minimization (SAM) during fine-tuning by introducing Sparse-Layer SAM (SL-SAM), which imposes adaptive layerwise sparsity and frames layer selection as a multi-armed bandit problem to choose active layers for both the perturbation and update steps. By maintaining a layer-selection distribution and updating it with an EXP3-based rule guided by gradient norms, SL-SAM achieves a two-sided sparsity that drastically reduces gradient computations while preserving SAM’s generalization benefits; it also provides a formal convergence guarantee with a rate of $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_1 = \mathcal{O}(T^{-1/4})$. Empirically, SL-SAM delivers competitive or state-of-the-art performance across DeiT, RoBERTa, and Llama-3 fine-tuning tasks, while reducing memory and time costs substantially (e.g., roughly 20–25% reductions in GPU memory and epoch time). The approach extends to single-step SAM variants and demonstrates robustness via ablations, making SAM-based fine-tuning feasible for large-scale models in practice.
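
The EXP3-based layer selection mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact rule: the class name `LayerBandit`, the parameters `eta` and `explore`, the reward normalisation, and the sequential sampling without replacement are all hypothetical choices made for the example.

```python
import math
import random


class LayerBandit:
    """EXP3-style sampler with one arm per layer (illustrative sketch only)."""

    def __init__(self, num_layers, eta=0.1, explore=0.1, seed=0):
        self.log_w = [0.0] * num_layers   # log-weights, one per layer
        self.eta = eta                    # bandit step size (assumed value)
        self.explore = explore            # uniform-mixing coefficient (assumed)
        self.rng = random.Random(seed)

    def probs(self):
        # Softmax of the log-weights, mixed with a uniform distribution.
        m = max(self.log_w)
        exp_w = [math.exp(w - m) for w in self.log_w]
        total = sum(exp_w)
        k = len(exp_w)
        return [(1 - self.explore) * e / total + self.explore / k for e in exp_w]

    def sample(self, num_active):
        # Draw `num_active` distinct layers, one at a time, weighted by probs().
        p = self.probs()
        remaining = list(range(len(p)))
        chosen = []
        for _ in range(num_active):
            weights = [p[i] for i in remaining]
            pick = self.rng.choices(remaining, weights=weights, k=1)[0]
            chosen.append(pick)
            remaining.remove(pick)
        return sorted(chosen)

    def update(self, active, grad_norms):
        # Only sampled layers reveal a gradient norm. The reward is divided by
        # the single-draw probability p[i], which is a simplification when
        # several layers are drawn per iteration.
        p = self.probs()
        scale = max(grad_norms[i] for i in active) or 1.0
        for i in active:
            reward = grad_norms[i] / scale    # normalised to [0, 1]
            self.log_w[i] += self.eta * reward / p[i]


bandit = LayerBandit(num_layers=4)
active = bandit.sample(num_active=2)          # two distinct layer indices
# Hypothetical gradient norms: layer 3 has a larger norm than layer 0.
bandit.update([0, 3], {0: 1.0, 3: 4.0})
p = bandit.probs()                            # layer 3 is now the most likely
```

After this single update, layers with larger observed gradient norms receive more probability mass, while the uniform mixing term keeps every layer reachable.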

Abstract

Sharpness-aware minimization (SAM) seeks minima with a flat loss landscape to improve generalization in machine learning tasks, including fine-tuning. However, its extra parameter perturbation step doubles the computation cost, which becomes the bottleneck of SAM in practice. In this work, we propose SL-SAM, an approach that breaks this bottleneck by applying sparsity at the layer level. Our key innovation is to frame the dynamic selection of layers for both the gradient ascent (perturbation) and descent (update) steps as a multi-armed bandit problem. At the beginning of each iteration, SL-SAM samples a subset of the model's layers according to their gradient norms to participate in the backpropagation of the following parameter perturbation and update steps, thereby reducing the computational complexity. We then provide an analysis that guarantees the convergence of SL-SAM. In fine-tuning experiments across several tasks, SL-SAM achieves performance comparable to state-of-the-art baselines, including a #1 rank on LLM fine-tuning. Meanwhile, SL-SAM significantly reduces the ratio of active parameters in backpropagation compared to vanilla SAM (SL-SAM activates 47%, 22%, and 21% of the parameters on the vision, moderate, and large language models, respectively, while vanilla SAM always activates 100%), verifying the efficiency of our proposed algorithm.
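
To make the doubled cost concrete, the standard SAM iteration can be written in its usual first-order form. The step size $\eta$ and the plain gradient-descent update are assumptions made for illustration; the paper's own algorithm may use a different base optimizer.

$$
\min_x \max_{\|\epsilon\|_2 \le \rho} f(x + \epsilon), \qquad
\epsilon_t = \rho \, \frac{\nabla f(x_t)}{\|\nabla f(x_t)\|_2}, \qquad
x_{t+1} = x_t - \eta \, \nabla f(x_t + \epsilon_t).
$$

Each iteration needs two gradient evaluations, one at $x_t$ for the perturbation and one at $x_t + \epsilon_t$ for the update. One way to picture the layerwise sparsity, using a hypothetical 0/1 mask $m_t$ that is 1 on the coordinates of the sampled layers, is

$$
\epsilon_t = \rho \, \frac{m_t \odot \nabla f(x_t)}{\|m_t \odot \nabla f(x_t)\|_2}, \qquad
x_{t+1} = x_t - \eta \, m_t \odot \nabla f(x_t + \epsilon_t),
$$

so that both backward passes only need gradients for the active layers.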

Paper Structure

This paper contains 17 sections, 7 theorems, 21 equations, 4 figures, 7 tables, 4 algorithms.

Key Result

Theorem 1

Suppose $f(x)$ in Algorithm alg1 satisfies Assumptions assu1 and assu2. Let the constant $\gamma \in (0,1]$ and define $\hat{\sigma}^2 = \max\{4\sigma_s^2 + 12d\rho^2 L^2, \frac{L(f(x_1)-f^*)}{\gamma^2 T}\}$. Set the coefficients so that $1-\sqrt{\beta_1} = \sqrt{\frac{L(f(x_1)-f^*)}{\hat{\sigma}^2 T}}$ ...

Figures (4)

  • Figure 1: Average number of model parameters that participate in gradient calculation per iteration vs. task accuracy. We compare SL-SAM with AdaSAM [sun2024adasam] (vanilla SAM with AdamW) and a representative efficient variant, ESAM [du2021efficient]. Points connected by dotted lines achieve comparable performance, while the parameters participating in backpropagation in SL-SAM amount to 47%, 22%, and 21% of those in AdaSAM across the three tasks. The table in the figure shows the GPU memory and wall-clock time savings compared to AdaSAM. See the Experiments section for details.
  • Figure 2: The workflow of our algorithm SL-SAM. Step 1: all the parameter blocks (in blue) are sampled according to the distributions; the active layers participate in this training iteration while the others are frozen. Steps 2 & 3: calculate the gradients for the active layers to perform the parameter perturbation and update steps. Step 4: update the distributions using the gradient norms obtained in the perturbation step.
  • Figure 3: The exact active ratio for each layer throughout the DeiT-Small fine-tuning on CIFAR-10.
  • Figure 4: Test accuracy vs. sparsity of model layers in SL-SAM. Left: CIFAR-10 dataset; Right: CIFAR-100 dataset.
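
The four-step workflow in the Figure 2 caption can be sketched on a toy problem. This is a rough illustration under several assumptions: the loss is a separable quadratic with hypothetical per-block curvatures, the update is plain gradient descent rather than the paper's optimizer, each block is sampled independently, and the moving-average rule in Step 4 is a simple stand-in for the paper's distribution update. The names `sl_sam_step`, `rho`, `lr`, and `eta` are invented for the example.

```python
import math
import random


def grad_block(block, curvature):
    # Gradient of 0.5 * curvature * sum(x^2) with respect to one block.
    return [curvature * x for x in block]


def loss(blocks, curvatures):
    return sum(0.5 * c * sum(x * x for x in b) for b, c in zip(blocks, curvatures))


def sl_sam_step(blocks, curvatures, probs, rng, rho=0.05, lr=0.1, eta=0.5):
    # Step 1: sample active blocks; the others stay frozen this iteration.
    active = [i for i, p in enumerate(probs) if rng.random() < p]
    if not active:
        return blocks, probs
    # Step 2: gradients of the active blocks only, then the ascent perturbation.
    grads = {i: grad_block(blocks[i], curvatures[i]) for i in active}
    norms = {i: math.sqrt(sum(g * g for g in grads[i])) for i in active}
    total = math.sqrt(sum(n * n for n in norms.values())) or 1.0
    perturbed = [list(b) for b in blocks]
    for i in active:
        perturbed[i] = [x + rho * g / total for x, g in zip(blocks[i], grads[i])]
    # Step 3: gradients at the perturbed point, descent on active blocks only.
    new_blocks = [list(b) for b in blocks]
    for i in active:
        g_adv = grad_block(perturbed[i], curvatures[i])
        new_blocks[i] = [x - lr * g for x, g in zip(blocks[i], g_adv)]
    # Step 4: move each active block's sampling probability toward its
    # relative gradient norm (stand-in rule, clipped to [0.1, 1.0]).
    top = max(norms.values()) or 1.0
    new_probs = list(probs)
    for i in active:
        target = norms[i] / top
        new_probs[i] = min(1.0, max(0.1, (1 - eta) * probs[i] + eta * target))
    return new_blocks, new_probs


rng = random.Random(0)
blocks = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]   # three hypothetical layers
curvatures = [1.0, 0.1, 3.0]
probs = [0.5, 0.5, 0.5]
start = loss(blocks, curvatures)
for _ in range(100):
    blocks, probs = sl_sam_step(blocks, curvatures, probs, rng)
end = loss(blocks, curvatures)
```

In this toy setting the frozen blocks cost nothing in either gradient pass, which is the source of the two-sided savings; the lower clip on the probabilities keeps rarely chosen blocks from being starved entirely.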

Theorems & Definitions (12)

  • Theorem 1
  • Corollary 1
  • Remark 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 2 more