Incorruptible Neural Networks: Training Models that can Generalize to Large Internal Perturbations

Philip Jacobson, Ben Feinberg, Suhas Kumar, Sapan Agarwal, T. Patrick Xiao, Christopher Bennett

TL;DR

The paper addresses training neural networks that still generalize under large internal weight perturbations, a concern for analog hardware. It systematically compares SAM and RWP, derives a PAC-Bayes bound to guide perturbed training, and shows that over-regularizing, i.e. training with more noise than is applied at test time ($\sigma_{train} > \sigma_{test}$), can yield more noise-robust minima, while SAM helps mainly in small-noise regimes. A key finding is a vanishing-gradient phenomenon under large perturbations, which can be mitigated by dynamic schedules that ramp the perturbation strength during training, improving both optimization and generalization. Simulations of analog hardware confirm the practical relevance: the methods improve robustness over standard SGD, and the gains from ramped perturbations depend on the device. Together, these results give a general framework for weight-noise robustness, with actionable techniques for analog in-memory computing (AIMC) deployment and beyond.
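The idea of ramping the perturbation strength during training can be illustrated with a minimal sketch. The summary does not specify the shape of the schedule, so the linear ramp, the function name `ramped_sigma`, and the `ramp_fraction` parameter below are all assumptions made for illustration, not the authors' implementation.

```python
def ramped_sigma(epoch, total_epochs, sigma_max, ramp_fraction=0.5):
    """Return the training noise scale for a given epoch.

    Assumed schedule (hypothetical): grow linearly from 0 to sigma_max
    over the first `ramp_fraction` of training, then hold at sigma_max.
    """
    if total_epochs <= 0:
        raise ValueError("total_epochs must be positive")
    # Number of epochs spent ramping; at least one to avoid division by zero.
    ramp_epochs = max(1, int(ramp_fraction * total_epochs))
    progress = min(1.0, epoch / ramp_epochs)
    return sigma_max * progress


# Example: a 10-epoch run that ramps to sigma_max = 0.05 over 5 epochs.
schedule = [ramped_sigma(e, total_epochs=10, sigma_max=0.05) for e in range(10)]
```

Starting with small noise lets early training follow a nearly unperturbed gradient, and the full perturbation is applied only once the ramp completes. Whether a linear ramp is the best shape is not something this sketch addresses.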

Abstract

Flat regions of the neural network loss landscape have long been hypothesized to correlate with better generalization. A closely related but distinct problem is training models that are robust to internal perturbations of their weights, which may be an important need for future low-power hardware platforms. In this paper, we explore the use of two methods, sharpness-aware minimization (SAM) and random-weight perturbation (RWP), to find minima robust to a variety of random corruptions of the weights. We consider the problem from two angles: generalization (how to reduce the noise-robust generalization gap) and optimization (how to maximize optimizer performance under strong perturbations). First, we establish, both theoretically and empirically, that an over-regularized RWP training objective is optimal for noise-robust generalization. For small-magnitude noise, we find that SAM's adversarial objective further improves performance over any RWP configuration, but it performs poorly for large-magnitude noise. We trace this to a vanishing-gradient effect, caused by unevenness in the loss landscape, that affects both SAM and RWP. Lastly, we demonstrate that dynamically adjusting the perturbation strength to match the evolution of the loss landscape improves optimization of these perturbed objectives.
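To make the difference between the two methods concrete, the sketch below contrasts how each chooses the point at which the gradient is evaluated. It is a toy illustration under stated assumptions, not the paper's setup: the quadratic loss, the additive isotropic Gaussian noise for RWP, and the helper names `loss_grad`, `rwp_grad`, `sam_grad`, and `train` are all hypothetical.

```python
import math
import random


def loss_grad(w):
    """Gradient of the toy loss L(w) = 0.5 * sum(w_i^2) (assumed for illustration)."""
    return list(w)


def rwp_grad(w, sigma, rng):
    """RWP: evaluate the gradient at weights perturbed by random Gaussian noise."""
    noisy = [wi + rng.gauss(0.0, sigma) for wi in w]
    return loss_grad(noisy)


def sam_grad(w, rho):
    """SAM: evaluate the gradient after a step of L2 length rho along the ascent direction."""
    g = loss_grad(w)
    # Fall back to 1.0 when the gradient is exactly zero (the perturbation is then zero too).
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    adversarial = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return loss_grad(adversarial)


def train(grad_fn, w, lr=0.1, steps=200):
    """Plain gradient descent using the supplied perturbed-gradient estimator."""
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


rng = random.Random(0)
w0 = [1.0, -2.0]
w_rwp = train(lambda w: rwp_grad(w, sigma=0.05, rng=rng), w0)
w_sam = train(lambda w: sam_grad(w, rho=0.05), w0)
```

In both cases the update is applied to the unperturbed weights; only the point where the gradient is computed differs. RWP samples that point at random, while SAM picks it adversarially. The smooth quadratic used here does not reproduce the vanishing-gradient effect described in the abstract, which is attributed to unevenness in the loss landscape.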

Paper Structure

This paper contains 35 sections, 3 theorems, 11 equations, 16 figures, 12 tables, 1 algorithm.

Key Result

Theorem 4.1

Assume $\Delta L_{\mathcal{D}} > 0$. For any small $\sigma_{train}$, $\sigma_{test}$ where $\sigma_{train} > \sigma_{test}$, the following holds with probability $1-\delta$: where $h: \mathbb{R}_+\rightarrow\mathbb{R}_+$ is a monotonically increasing function.

Figures (16)

  • Figure 1: Plot of noisy (a) training accuracy and (b) test accuracy as a function of the applied $\sigma$ for ResNet-18 trained on Tiny-ImageNet with RWP of varying $\sigma_{train}$.
  • Figure 2: Comparison between optimal RWP and SAM across a variety of CIFAR-100 noise settings.
  • Figure 3: (a) ResNet-18 test accuracy on CIFAR-100 as a function of training epoch, evaluated both with perturbations ($\sigma_{test}=0.05$) and without perturbations. Training is conducted using SGD. (b) Comparison of perturbed test accuracy evolution for SAM with various $\rho$ values. (c) Comparison of perturbed test accuracy evolution for RWP with various $\sigma_{train}$ values.
  • Figure 4: (a) Plot of the update gradient norm $\|\nabla L\|_2$ as a function of training epoch for a ResNet-18 trained on CIFAR-100 using SGD, SAM, and RWP. (b) Plot of the gradient sharpness (corresponding to ascent-direction for SAM or average-direction for RWP) of both SAM and RWP as a function of training loss. (c) Schematic visualization of loss surfaces for large-$\rho$ SAM's (top) and large-$\sigma$ RWP's (bottom) training trajectories.
  • Figure 5: Plot of noisy test accuracy as a function of $\sigma_{test}$ for RWP trained with (a) $\sigma_{train} = 0.01$, (b) $\sigma_{train} = 0.02$, and (c) $\sigma_{train} = 0.03$.
  • ...and 11 more figures

Theorems & Definitions (5)

  • Theorem 4.1
  • Lemma 1.1
  • proof
  • Theorem 1.2
  • proof