Revisiting Random Weight Perturbation for Efficiently Improving Generalization

Tao Li, Qinghua Tao, Weihao Yan, Zehao Lei, Yingwen Wu, Kun Fang, Mingzhen He, Xiaolin Huang

TL;DR

This work revisits Random Weight Perturbation (RWP) as a computationally efficient alternative to Adversarial Weight Perturbation (AWP)/SAM for improving generalization in deep networks. It reveals a fundamental trade-off between generalization and convergence in RWP and introduces a mixed-RWP (m-RWP) objective that combines the Bayes loss with the original loss to stabilize training. It further proposes Adaptive Random Weight Perturbation (ARWP) to generate gradient-informed perturbations, and two consolidated methods, m-RWP and m-ARWP, that support parallel gradient updates for substantial speedups. Across CIFAR-10/100 and ImageNet, the proposed methods achieve competitive or superior generalization with much greater efficiency than SAM, with code released for public use.

Abstract

Improving the generalization ability of modern deep neural networks (DNNs) is a fundamental challenge in machine learning. Two branches of methods have been proposed to seek flat minima and improve generalization: one led by sharpness-aware minimization (SAM) minimizes the worst-case neighborhood loss through adversarial weight perturbation (AWP), and the other minimizes the expected Bayes objective with random weight perturbation (RWP). While RWP offers advantages in computation and is closely linked to AWP on a mathematical basis, its empirical performance has consistently lagged behind that of AWP. In this paper, we revisit the use of RWP for improving generalization and propose improvements from two perspectives: i) the trade-off between generalization and convergence and ii) the random perturbation generation. Through extensive experimental evaluations, we demonstrate that our enhanced RWP methods achieve greater efficiency in enhancing generalization, particularly in large-scale problems, while also offering comparable or even superior performance to SAM. The code is released at https://github.com/nblt/mARWP.
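The mixed objective described above can be illustrated with a minimal sketch. It assumes the mixed loss takes the form $\lambda L^{\rm Bayes}(\boldsymbol{w}) + (1-\lambda) L(\boldsymbol{w})$, estimates the Bayes term with a single Gaussian perturbation per step, and uses a toy quadratic loss; the names `loss_and_grad` and `m_rwp_step` and all hyper-parameter values are hypothetical and are not taken from the released code.

```python
import numpy as np

def loss_and_grad(w, A, b):
    """Toy quadratic loss L(w) = 0.5 * ||A w - b||^2 and its gradient (illustrative only)."""
    r = A @ w - b
    return 0.5 * float(r @ r), A.T @ r

def m_rwp_step(w, A, b, rng, lr=0.1, sigma=0.01, lam=0.5):
    """One mixed-RWP update; the Bayes term is estimated from a single random perturbation."""
    eps = rng.normal(0.0, sigma, size=w.shape)      # random weight perturbation
    _, g_perturbed = loss_and_grad(w + eps, A, b)   # gradient at the perturbed weights
    _, g_clean = loss_and_grad(w, A, b)             # gradient at the original weights
    # Neither gradient depends on the other, so the two could be computed in parallel.
    g = lam * g_perturbed + (1.0 - lam) * g_clean   # assumed weighting of the two terms
    return w - lr * g

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2)
initial_loss, _ = loss_and_grad(w, A, b)
for _ in range(200):
    w = m_rwp_step(w, A, b, rng)
final_loss, _ = loss_and_grad(w, A, b)
print(f"loss: {initial_loss:.4f} -> {final_loss:.2e}")
```

Unlike SAM, where the perturbation is built from a first gradient before the second can be evaluated, the perturbation here is drawn independently of any gradient, which is what makes the parallel evaluation possible in this sketch.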

Paper Structure

This paper contains 27 sections, 5 theorems, 29 equations, 5 figures, and 4 tables.

Key Result

Theorem 1

Under the paper's Lipschitz-continuity and smoothness assumptions on the loss $L$, the Bayes objective $L^{\rm Bayes}(\boldsymbol{w})$ is $\min \{\frac{\alpha}{\sigma}, \beta \}$-smooth.
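The $\frac{\alpha}{\sigma}$ term in Theorem 1 can be checked numerically on a one-dimensional example. The sketch below assumes $L(w) = |w|$ (1-Lipschitz, so $\alpha = 1$) and a Gaussian perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2)$; in that case the gradient of $\mathbb{E}|w + \epsilon|$ has the closed form $\mathrm{erf}(w / (\sigma\sqrt{2}))$. Since $|w|$ is not itself smooth, this example only exercises the $\frac{\alpha}{\sigma}$ branch of the bound. The function names are hypothetical.

```python
import math

def bayes_grad_abs(w, sigma):
    """Exact gradient of E|w + eps| for eps ~ N(0, sigma^2): 2*Phi(w/sigma) - 1."""
    return math.erf(w / (sigma * math.sqrt(2.0)))

def max_gradient_slope(sigma, lo=-1.0, hi=1.0, n=20001):
    """Largest finite-difference slope of the smoothed gradient on a uniform grid."""
    h = (hi - lo) / (n - 1)
    grads = [bayes_grad_abs(lo + i * h, sigma) for i in range(n)]
    return max(abs(grads[i + 1] - grads[i]) / h for i in range(n - 1))

alpha = 1.0  # Lipschitz constant of |w|
slopes = {}
for sigma in (0.5, 0.1, 0.05):
    slopes[sigma] = max_gradient_slope(sigma)
    print(f"sigma={sigma}: empirical smoothness {slopes[sigma]:.3f}, "
          f"bound alpha/sigma = {alpha / sigma:.3f}")
```

In this example the empirical smoothness constant scales as $1/\sigma$ and stays below $\frac{\alpha}{\sigma}$, which matches the intuition that a larger perturbation variance yields a smoother objective.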

Figures (5)

  • Figure 1: Expected training loss $\mathbb{E} [L(\boldsymbol{w}^*+\boldsymbol{\epsilon})]$ under different perturbation radii $\|\boldsymbol{\epsilon}\|_2$. The experiments employ a well-trained model $\boldsymbol{w}^*$ obtained with SGD on CIFAR-10 with ResNet-18. Note that the x-axis uses a logarithmic scale.
  • Figure 2: Training performance comparison of RWP and m-RWP. m-RWP significantly improves convergence over RWP and leads to much better performance. The experiments are conducted on CIFAR-100 with ResNet-18. The perturbation variance $\sigma$ is set to 0.01.
  • Figure 3: The mixed loss objective achieves a better trade-off between generalization and convergence. In (a), we conducted multiple runs of RWP and m-RWP with varying perturbation variances ($\sigma$) and recorded the final training losses and generalization errors (difference between training accuracy and test accuracy) for each trial. Our observations reveal that m-RWP achieves an improved trade-off between generalization error and convergence compared to RWP. In (b), m-RWP is capable of utilizing larger perturbation variances to achieve better generalization performance. The experiments are performed on CIFAR-100 with ResNet-18 and the balance coefficient $\lambda$ is set to 0.5.
  • Figure 4: Performance under various hyper-parameter configurations.
  • Figure 5: Loss landscape (top) and the corresponding Hessian spectrum visualization (bottom) of different methods. Models are trained on CIFAR-10 with ResNet-18.
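The quantity plotted in Figure 1, $\mathbb{E} [L(\boldsymbol{w}^*+\boldsymbol{\epsilon})]$ at a fixed radius $\|\boldsymbol{\epsilon}\|_2$, can be estimated by Monte Carlo sampling. The sketch below is only an illustration under stated assumptions: it draws perturbation directions uniformly on the sphere (the sampling scheme used for the figure is not given here), and it substitutes a hypothetical two-parameter quadratic `toy_loss` for the ResNet-18 training loss.

```python
import numpy as np

def expected_loss_at_radius(loss_fn, w_star, radius, rng, n_samples=256):
    """Monte Carlo estimate of E[L(w* + eps)] over directions with ||eps||_2 = radius."""
    total = 0.0
    for _ in range(n_samples):
        direction = rng.normal(size=w_star.shape)             # isotropic random direction
        eps = radius * direction / np.linalg.norm(direction)  # rescale to the target radius
        total += loss_fn(w_star + eps)
    return total / n_samples

def toy_loss(w):
    """Hypothetical stand-in loss with its minimum at the origin."""
    return 0.5 * float(w @ np.diag([4.0, 1.0]) @ w)

rng = np.random.default_rng(0)
w_star = np.zeros(2)            # stand-in for the well-trained weights
radii = np.logspace(-3, 0, 4)   # log-spaced radii, mirroring the logarithmic x-axis
curve = [expected_loss_at_radius(toy_loss, w_star, r, rng) for r in radii]
for r, value in zip(radii, curve):
    print(f"radius={r:.3f}: estimated expected loss {value:.3e}")
```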

Theorems & Definitions (5)

  • Theorem 1: Smoothness of RWP (Bisla et al., 2022)
  • Theorem 2: Convergence of RWP in non-convex setting
  • Theorem 3: Smoothness of m-RWP
  • Theorem 4: Convergence of m-RWP in non-convex setting
  • Lemma 1