Stochastic Gradient Sampling for Enhancing Neural Networks Training

Juyoung Yun

TL;DR

StochGradAdam is a novel optimizer that extends the Adam algorithm with stochastic gradient sampling to improve computational efficiency while maintaining robust performance. It offers a promising alternative to traditional optimization techniques for deep learning applications.

Abstract

In this paper, we introduce StochGradAdam, a novel optimizer that extends the Adam algorithm with stochastic gradient sampling to improve computational efficiency while maintaining robust performance. StochGradAdam selectively samples a subset of gradients during training, reducing computational cost while preserving the adaptive learning rates and bias corrections of Adam. Our experiments on image classification and segmentation tasks show that StochGradAdam can achieve comparable or superior performance to Adam, even when using fewer gradient updates per iteration. By focusing on key gradient updates, StochGradAdam offers stable convergence and enhanced exploration of the loss landscape while mitigating the impact of noisy gradients. The results suggest that this approach is particularly effective for large-scale models and datasets, providing a promising alternative to traditional optimization techniques for deep learning applications.
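
The sampling idea described in the abstract can be illustrated with a minimal sketch. It assumes that "sampling a subset of gradients" means keeping each gradient entry independently with probability `sample_rate` and zeroing the rest before the usual Adam moment updates and bias corrections; that masking rule, the function name `stoch_grad_adam_step`, and the default hyperparameters are assumptions for illustration, not the paper's reference implementation.

```python
import math
import random


def stoch_grad_adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9,
                         beta2=0.999, eps=1e-8, sample_rate=0.8, rng=random):
    """One Adam-style update over flat lists of floats.

    Assumption: each gradient entry is kept with probability `sample_rate`
    (a Bernoulli mask) and treated as zero otherwise.
    `t` is the 1-based step count used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Sample the gradient entry: keep it or drop it for this step.
        g_s = g if rng.random() < sample_rate else 0.0
        # Standard Adam first and second moment estimates on the sampled gradient.
        m_i = beta1 * m_i + (1.0 - beta1) * g_s
        v_i = beta2 * v_i + (1.0 - beta2) * g_s * g_s
        # Bias corrections, as in Adam.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Under these assumptions, `sample_rate=1.0` reduces the step to plain Adam, while smaller values update the moment estimates from fewer gradient entries per iteration.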

Paper Structure

This paper contains 24 sections, 44 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the distribution of normalized entropy across different optimizers (RMSProp, Adam, and StochGradAdam) at various training epochs (10, 100, 200, and 300). The histograms show how often predictions fall within each range of normalized entropy values, illustrating how the uncertainty in predictions evolves as training progresses.
  • Figure 2: PCA visualization of data processed with different optimizers at distinct training epochs. Each plot captures the distribution of data points in the reduced dimensional space, with color gradients representing normalized entropy.
  • Figure 3: Comparison of test accuracy over 300 epochs on the CIFAR-10 dataset [krizhevsky2009learning] for various neural network architectures: ResNet-56, ResNet-110, and ResNet-152 [he2016deep], MobileNetV2, ViT-8 [dosovitskiy2020image], and VGG-16 [simonyan2014very]. Three optimizers, RMSProp (green) [tieleman2012rmsprop], Adam (blue) [kingma2014adam], and StochGradAdam (red), were used to train each model.
  • Figure 4: A comparative visualization of segmentation results on the oxford_iiit_pet dataset using the Unet-2 [ronneberger2015u] architecture combined with MobileNetV2 [mobilenetv2]. The figure compares results across different optimizers, including StochGradAdam, Adam, and RMSProp.
  • Figure 5: Test accuracy across different sample rates (100%, 80%, 60%, and 20%) for ResNet-56, ResNet-110, and ResNet-152 architectures [he2016deep]. The sample rate refers to the percentage of gradients used during updates. All experiments use Adam and StochGradAdam on CIFAR-10 [krizhevsky2009learning] with a batch size of 256 and a learning rate of 0.001.
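
The normalized entropy shown in Figures 1 and 2 is not defined in this summary. A common definition, assumed here, is the Shannon entropy of a predicted class distribution divided by log K for K classes, so that 0 means a fully confident prediction and 1 means a uniform one. The helper name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1] by log(K).

    Assumption: `probs` holds non-negative class probabilities that sum to 1.
    """
    k = len(probs)
    if k < 2:
        return 0.0  # A single class carries no uncertainty.
    # Zero-probability classes contribute nothing (limit of p * log p is 0).
    h = sum(-p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)
```

Applied to the softmax output of each test example, a function like this would yield the per-example values that such histograms aggregate.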