Compare different SG-Schemes based on large least square problems

Ramkrishna Acharya

TL;DR

This paper analyzes how different stochastic gradient (SG) based schemes perform on large least squares (LS) problems by evaluating a toy LS setup with $\mathbf{X} \in \mathbb{R}^{1000 \times 5}$ and $\mathbf{y} = \mathbf{X}_{norm}\boldsymbol{\theta} + b + 0.1\boldsymbol{\epsilon}$. It systematically compares constant- and adaptive-learning-rate optimizers (SGD, Momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam) across batch sizes and learning rates, using a loss $J(\boldsymbol{\theta})$ based on the mean squared error (MSE) and a 90:10 train/validation split. Key findings: Momentum and Nesterov behave similarly, with smoother losses at larger batch sizes; adaptive optimizers (especially Adam) often converge faster or reach lower validation error; and mini-batch regimes offer a practical trade-off between update speed and gradient noise. The study provides practical guidance for optimizer choice in large LS contexts and shares code on GitHub for reproducibility.

Abstract

This study reviews popular stochastic gradient-based schemes on large least-squares problems. These schemes, often called optimizers in machine learning, play a crucial role in finding better model parameters. Hence, this study examines such optimizers under different hyperparameters and analyzes them on least-squares problems. The code that produced the results in this work is available at https://github.com/q-viper/gradients-based-methods-on-large-least-square.
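To make the toy setup concrete, the following is a minimal NumPy sketch of the kind of experiment described in the TL;DR, not the code from the linked repository. Several details are assumptions made for illustration: the features are z-score normalized to form $\mathbf{X}_{norm}$, the true parameters and noise are drawn from a standard normal distribution, $J(\boldsymbol{\theta})$ is taken to be the plain mean squared error, and the function names (`make_toy_data`, `mse`, `minibatch_sgd`) and the hyperparameter values are hypothetical.

```python
import numpy as np


def make_toy_data(n=1000, d=5, noise=0.1, seed=0):
    """Build y = X_norm @ theta + b + noise * eps (z-score normalization is an assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    theta_true = rng.normal(size=d)   # assumed: true weights drawn from N(0, 1)
    b_true = rng.normal()             # assumed: true bias drawn from N(0, 1)
    y = X_norm @ theta_true + b_true + noise * rng.normal(size=n)
    return X_norm, y, theta_true, b_true


def mse(X, y, theta, b):
    """Mean squared error, used here as the loss J (assumed form)."""
    residual = X @ theta + b - y
    return float(np.mean(residual ** 2))


def minibatch_sgd(X, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Constant-learning-rate mini-batch SGD on the MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the training set every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ theta + b - y[batch]
            # Gradients of the batch MSE with respect to theta and b.
            theta -= lr * 2.0 * X[batch].T @ residual / len(batch)
            b -= lr * 2.0 * residual.mean()
    return theta, b


X, y, theta_true, b_true = make_toy_data()

# 90:10 train/validation split, as described in the TL;DR.
split = int(0.9 * len(X))
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

theta, b = minibatch_sgd(X_train, y_train)
val_loss = mse(X_val, y_val, theta, b)
print(f"validation MSE: {val_loss:.4f}")
```

Setting `batch_size=1` gives per-sample stochastic updates and `batch_size=len(X_train)` gives full-batch gradient descent, so the same loop can be used to explore the trade-off between update speed and gradient noise mentioned above.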
