Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control

Haochen You, Baojing Liu

TL;DR

This work tackles the rigidity of global gradient clipping by introducing SPAMP, a differentiable gradient-shaping framework that uses per-layer statistics to adaptively modulate the update magnitude $\eta_t \|g_t\|$ via smooth, power-based operators. By unifying warmup, normalization, and clipping under a single objective and allowing per-layer thresholds $\tau_t^{(l)}$, SPAMP provides a principled mechanism for update control that adapts to layerwise geometry and training dynamics. The authors provide both theoretical insights into how SPAMP shapes loss descent and extensive experiments on image classification and language modeling demonstrating faster, more stable convergence, improved gradient norms, and robustness to perturbations and scale. The results suggest that smooth gradient shaping offers a scalable, flexible alternative to rigid thresholding, with practical implications for large-scale deep-network optimization.

Abstract

Gradient clipping is widely used to stabilize deep network training, but its formulation as a hard, fixed threshold limits flexibility and ignores gradient distribution dynamics. We propose SPAMP (Statistical Per-layer Adaptive Modulation and Projection), a unified framework that generalizes clipping into smooth, per-layer gradient shaping. SPAMP tracks local gradient statistics, dynamically estimates thresholds, and applies power-based transformations to modulate update magnitudes in a differentiable manner. This perspective recasts clipping and warmup as dual mechanisms for controlling the effective update scale $\eta_t \|g_t\|$, offering a principled alternative to rigid heuristics. Extensive experiments across image and language tasks demonstrate that SPAMP improves stability, convergence, and robustness over existing methods.
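
To make the contrast between hard clipping and smooth, per-layer shaping concrete, the following is a minimal sketch and not the paper's actual operator, which this summary does not state. It assumes a hypothetical smooth scale factor $(1 + (\|g\|/\tau)^p)^{-1/p}$, which approaches the hard-clipping factor $\min(1, \tau/\|g\|)$ as $p$ grows, and a hypothetical per-layer threshold set to a multiple of an exponential moving average of the layer's gradient norm. The names `smooth_shape` and `LayerThreshold`, and the parameters `p`, `beta`, and `factor`, are illustrative assumptions.

```python
import math


def norm(g):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in g))


def hard_clip(g, tau):
    """Standard norm clipping: rescale g so its norm is at most tau."""
    scale = min(1.0, tau / (norm(g) + 1e-12))
    return [scale * x for x in g]


def smooth_shape(g, tau, p=4.0):
    """Assumed smooth, power-based surrogate for clipping (illustrative only).

    The scale (1 + (n / tau)^p)^(-1/p) is differentiable in n and tends to
    min(1, tau / n) as p grows, so the output norm always stays below tau.
    """
    n = norm(g)
    scale = (1.0 + (n / (tau + 1e-12)) ** p) ** (-1.0 / p)
    return [scale * x for x in g]


class LayerThreshold:
    """Hypothetical per-layer threshold: factor times an EMA of the gradient norm."""

    def __init__(self, beta=0.9, factor=1.5):
        self.beta = beta
        self.factor = factor
        self.ema = None  # initialized from the first observed norm

    def update(self, g):
        n = norm(g)
        if self.ema is None:
            self.ema = n
        else:
            self.ema = self.beta * self.ema + (1.0 - self.beta) * n
        return self.factor * self.ema


# Usage: each layer keeps its own statistics and threshold.
thresholds = {"layer1": LayerThreshold(), "layer2": LayerThreshold()}
grads = {"layer1": [0.3, -0.4], "layer2": [30.0, -40.0]}
shaped = {}
for name, g in grads.items():
    tau = thresholds[name].update(g)
    shaped[name] = smooth_shape(g, tau)
```

Under these assumptions, small gradients pass through almost unchanged while large ones are compressed toward the layer's threshold without the kink that hard clipping introduces at $\|g\| = \tau$.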

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Training loss vs steps on CIFAR-10 and WikiText-103 for various optimization strategies. Our method shows smooth, accelerated convergence.
  • Figure 2: Smoothed update magnitude $\eta_t \|g_t\|$ across training steps (EMA with $\beta = 0.98$).
  • Figure 3: Gradient norm statistics. (Left) Histogram over 50k steps. (Right) Box plot across training stages. Ours consistently yields tighter, more stable norms.
  • Figure 4: Update magnitude analysis showing stable, bounded updates with a concentrated distribution.
  • Figure 5: Robustness under label noise and gradient spikes.
  • ...and 1 more figure