Compare different SG-Schemes based on large least square problems

Ramkrishna Acharya

TL;DR

This paper analyzes how different stochastic gradient (SG) schemes perform on large least-squares (LS) problems, using a toy LS setup with $\mathbf{X} \in \mathbb{R}^{1000 \times 5}$ and $\mathbf{y} = \mathbf{X}_{\text{norm}}\boldsymbol{\theta} + b + 0.1\boldsymbol{\epsilon}$. It systematically compares constant- and adaptive-learning-rate optimizers (SGD, Momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam) across batch sizes and learning rates, using an MSE-based cost $J(\boldsymbol{\theta})$ and a 90:10 train/validation split. The key findings are that Momentum and Nesterov behave similarly, that losses are smoother at larger batch sizes, that adaptive optimizers (especially Adam) often converge faster or reach lower validation error, and that mini-batch regimes offer a practical trade-off between update speed and gradient noise. The study provides practical guidance for optimizer choice in large LS contexts and shares its code on GitHub for reproducibility.
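The setup summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the function names (`make_toy_data`, `mse`, `minibatch_sgd`), the standard-normal draws for $\mathbf{X}$, $\boldsymbol{\theta}$, $b$, and $\boldsymbol{\epsilon}$, the column-wise standardization used for $\mathbf{X}_{\text{norm}}$, and the hyper-parameter values are all hypothetical choices made for the example. Only plain mini-batch SGD is shown; the other optimizers differ in how the update step uses the gradient.

```python
import numpy as np


def make_toy_data(n=1000, d=5, noise=0.1, seed=0):
    """Build y = X_norm @ theta + b + noise * eps (distributions are assumed)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    # Assumption: "norm" means column-wise standardization.
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    theta_true = rng.normal(size=d)
    b_true = rng.normal()
    y = X_norm @ theta_true + b_true + noise * rng.normal(size=n)
    return X_norm, y, theta_true, b_true


def mse(X, y, theta, b):
    """Mean squared error of the linear model on (X, y)."""
    residual = X @ theta + b - y
    return float(np.mean(residual ** 2))


def minibatch_sgd(X, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Plain mini-batch SGD on the MSE cost with a constant learning rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(epochs):
        perm = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            residual = X[idx] @ theta + b - y[idx]
            # Gradients of mean(residual**2) with respect to theta and b.
            grad_theta = 2.0 * X[idx].T @ residual / len(idx)
            grad_b = 2.0 * residual.mean()
            theta -= lr * grad_theta
            b -= lr * grad_b
    return theta, b


X, y, theta_true, b_true = make_toy_data()
split = int(0.9 * len(X))  # 90:10 train/validation split
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

theta_hat, b_hat = minibatch_sgd(X_train, y_train)
val_mse = mse(X_val, y_val, theta_hat, b_hat)
print("validation MSE:", val_mse)
```

Setting `batch_size=1` gives per-sample SGD and `batch_size=len(X_train)` gives full-batch gradient descent, which is one way to explore the trade-off between update speed and gradient noise mentioned above.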

Abstract

This study reviews popular stochastic gradient-based schemes on large least-squares problems. These schemes, often called optimizers in machine learning, play a crucial role in finding better model parameters. Hence, this study examines such optimizers under different hyper-parameters and analyzes their behavior on least-squares problems. The code that produced the results in this work is available at https://github.com/q-viper/gradients-based-methods-on-large-least-square.

Paper Structure

This paper contains 14 sections, 25 equations, 12 figures, 1 table, 4 algorithms.

Figures (12)

  • Figure 1: MSE vs MAE
  • Figure 2: Nesterov update (Source: G. Hinton’s lecture 6c)
  • Figure 3: Validation MSE while using SGD
  • Figure 4: Validation MSE while using Momentum optimizer
  • Figure 5: Validation MSE while using Nesterov optimizer
  • ...and 7 more figures