Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness aware Minimization

Jiaxin Deng, Junbiao Pang, Baochang Zhang, Tian Wang

TL;DR

This work addresses the high computational cost of Sharpness-Aware Minimization (SAM) by showing that the SAM gradient decomposes into the SGD gradient and a Projection of the Second-order gradient matrix onto the First-order gradient (PSF). The authors introduce Variation-based SAM (vSAM), which samples the PSF adaptively according to how much the PSF varies and reuses the most recent sample during non-sampling iterations; the sampling schedule is driven by the PSF variance and the gradient norm. Empirical results across multiple architectures and datasets show that vSAM reaches accuracy comparable to SAM while training roughly 40% faster, and that its gains extend to quantization-aware training with LSQ. The method thus retains SAM's generalization benefits at substantially lower cost, which makes SAM more practical for real-world training pipelines.
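
One way to see where the decomposition comes from is a first-order Taylor expansion of the SAM gradient. The sketch below assumes the standard SAM perturbation $\boldsymbol{\epsilon} = \rho \, \nabla_\mathbf{w} L(\mathbf{w}) / ||\nabla_\mathbf{w} L(\mathbf{w})||$ with neighborhood radius $\rho$; the symbols $\rho$ and $\boldsymbol{\epsilon}$ are our notation and are not taken from the summary above, so the paper's exact formulation may differ.

$$
\nabla_\mathbf{w} L(\mathbf{w} + \boldsymbol{\epsilon})
\;\approx\; \nabla_\mathbf{w} L(\mathbf{w}) + \nabla_\mathbf{w}^2 L(\mathbf{w}) \, \boldsymbol{\epsilon}
\;=\; \underbrace{\nabla_\mathbf{w} L(\mathbf{w})}_{\text{SGD gradient}}
\;+\; \underbrace{\rho \, \frac{\nabla_\mathbf{w}^2 L(\mathbf{w}) \, \nabla_\mathbf{w} L(\mathbf{w})}{||\nabla_\mathbf{w} L(\mathbf{w})||}}_{\text{PSF}}
$$

Under this reading, the second term is the Hessian applied to the normalized gradient direction, which is why obtaining it costs an extra gradient evaluation at the perturbed point.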

Abstract

Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computational cost compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieves state-of-the-art accuracy, comparable to that of SAM, on diverse network architectures.
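
The idea of sampling the PSF and reusing it can be illustrated with a minimal NumPy sketch on a toy quadratic loss $L(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top A \mathbf{w}$. Everything here is an assumption made for illustration: the function names (`loss_grad`, `psf_sample`, `train`), the radius `rho`, and above all the fixed sampling interval `sample_every`. The abstract describes an adaptive schedule based on PSF variation, whose exact rule is not given here, so a fixed interval is swapped in as a stand-in.

```python
import numpy as np

def loss_grad(A, w):
    """Gradient of the toy quadratic loss L(w) = 0.5 * w^T A w."""
    return A @ w

def psf_sample(A, w, g, rho):
    """PSF estimated as (SAM gradient) - (SGD gradient).

    Costs one extra gradient evaluation at the perturbed point w + eps.
    """
    eps = rho * g / np.linalg.norm(g)  # standard SAM perturbation (assumed)
    return loss_grad(A, w + eps) - g

def train(A, w, rho=0.05, lr=0.1, steps=20, sample_every=4):
    """Gradient descent that refreshes the PSF only every `sample_every` steps.

    NOTE: the fixed interval is a placeholder assumption, not the adaptive
    variation-based schedule described in the abstract.
    """
    psf = np.zeros_like(w)
    n_grad_evals = 0
    for t in range(steps):
        g = loss_grad(A, w)
        n_grad_evals += 1
        if t % sample_every == 0:      # sampling iteration: recompute the PSF
            psf = psf_sample(A, w, g, rho)
            n_grad_evals += 1
        # Non-sampling iterations reuse the most recently sampled PSF.
        w = w - lr * (g + psf)
    return w, n_grad_evals

A = np.diag([1.0, 4.0])                # positive definite Hessian of the toy loss
w0 = np.array([2.0, -1.0])
w_final, evals = train(A, w0)
print(evals)                           # gradient evaluations actually used
```

With these settings the loop spends 25 gradient evaluations over 20 steps, whereas recomputing the perturbed gradient at every step would spend 40. For a quadratic loss the sampled PSF equals `rho * A @ g / ||g||` exactly; for general losses this holds only to first order.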

Paper Structure

This paper contains 12 sections, 2 theorems, 19 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Let $\nabla_\mathbf{w}^2 L(\mathbf{w})$ be a positive definite matrix with $n$ eigenvalues, then the L2-PSF has an upper bound as follows:

Figures (2)

  • Figure 1: Accuracy vs. training speed of SGD, SAM, LookSAM, ESAM, SAF and vSAM (Ours). Each connected line represents a method that trains WideResNet-28-10 and PyramidNet-110 models on CIFAR-100. vSAM substantially accelerates training with almost no reduction in accuracy.
  • Figure 2: The variation trend of $||\nabla L_i^{SGD}||$ and $||\nabla L_i^{PSF}||$ during training. $||\nabla L_i^{SGD}||$ and $||\nabla L_i^{PSF}||$ denote the L2-norm of the SGD gradient and the L2-norm of the PSF, respectively. The trends of $||\nabla L_i^{SGD}||$ and $||\nabla L_i^{PSF}||$ are similar across different models (ResNet-18, WideResNet-28-10, PyramidNet-110).

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2