Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

Mohammad Partohaghighi, Roummel Marcia, YangQuan Chen

TL;DR

This work tackles the tuning and stability challenges of fractional-order stochastic gradient descent (FOSGD) by introducing 2SEDFOSGD, which adapts the fractional exponent using the Two-Scale Effective Dimension (2SED). The method combines per-layer curvature information from the Fisher Information Matrix with nominal parameter counts to produce a layer-wise, data-driven exponent $\alpha_t^{(j)}$ via $\alpha_t^{(j)} = \alpha_0 - \beta\, d_\zeta^{(j)}(\varepsilon)/d_{\max}$, and further employs Lower 2SED for scalable, per-layer complexity estimates. The authors provide convergence guarantees for convex objectives, show how 2SED bounds can be finite, and validate the approach with autoregressive models under Gaussian and $\alpha$-stable noise as well as MNIST and CIFAR-100 classification, reporting faster convergence and more robust parameter estimates than baseline methods. Collectively, the results demonstrate that dimension-aware fractional memory can enhance stability and efficiency in high-dimensional optimization, with potential impact on deep learning and estimation tasks that benefit from long-range gradient information.
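
As a rough illustration of the exponent rule quoted above, the following Python sketch maps per-layer effective-dimension values to per-layer exponents. It is only a sketch under stated assumptions, not the authors' implementation: the function name `layerwise_exponents` is hypothetical, the per-layer 2SED values are taken as given inputs rather than computed from the Fisher Information Matrix, and $d_{\max}$ is assumed to be the largest of the supplied per-layer values.

```python
import numpy as np

def layerwise_exponents(d_eff, alpha0, beta):
    """Return alpha^{(j)} = alpha0 - beta * d^{(j)} / d_max for each layer j.

    d_eff  : per-layer effective-dimension estimates (assumed precomputed)
    alpha0 : base fractional exponent
    beta   : sensitivity of the exponent to the effective dimension
    """
    d_eff = np.asarray(d_eff, dtype=float)
    # Assumption: d_max is the largest per-layer value in d_eff.
    d_max = d_eff.max()
    if d_max <= 0:
        raise ValueError("effective dimensions must contain a positive value")
    return alpha0 - beta * d_eff / d_max

# Illustrative values only; they are not taken from the paper.
alphas = layerwise_exponents([10.0, 50.0, 100.0], alpha0=0.9, beta=0.4)
print(alphas)
```

Under this rule, layers with a larger effective dimension receive a smaller exponent, and the layer attaining $d_{\max}$ receives exactly $\alpha_0 - \beta$.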

Abstract

Fractional-order stochastic gradient descent (FOSGD) leverages fractional exponents to capture long-memory effects in optimization. However, its utility is often limited by the difficulty of tuning and stabilizing these exponents. We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), which integrates the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to adapt the fractional exponent in a data-driven manner. By tracking model sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the exponent to mitigate oscillations and hasten convergence. Theoretically, this approach preserves the advantages of fractional memory without the sluggish or unstable behavior observed in naïve fractional SGD. Empirical evaluations in Gaussian and $\alpha$-stable noise scenarios using an autoregressive (AR) model, as well as on the MNIST and CIFAR-100 datasets for image classification, highlight faster convergence and more robust parameter estimates compared to baseline methods, underscoring the potential of dimension-aware fractional techniques for advanced modeling and estimation tasks.

Paper Structure

This paper contains 17 sections, 4 theorems, 51 equations, 3 figures, 1 algorithm.

Key Result

Proposition 5.1

For $\mu_t = \frac{\mu_0}{\sqrt{t}}$ (with $\mu_t = \mu_0$ at $t=0$) and $\|g^j(\theta^t)\| \leq G + \sigma$, the iterates satisfy:

Figures (3)

  • Figure 1: Convergence of $a_1$ and $a_2$ under $\alpha$-stable noise.
  • Figure 2: Training accuracy comparison between 2SEDFOSGD and FOSGD on CIFAR-100 (left) and $\alpha_t$.
  • Figure 3: Training accuracy comparison between 2SEDFOSGD and FOSGD on CIFAR-100 and $\alpha_t$.

Theorems & Definitions (13)

  • Definition 1: Fisher Information [datres2024two]
  • Definition 2: Empirical Fisher
  • Definition 3: Normalized Fisher Matrix [datres2024two]
  • Definition 4: Two-Scale Effective Dimension [datres2024two]
  • Definition 5: Caputo Derivative [monje2010fractional]
  • Proposition 5.1: Bounded Iterates
  • proof
  • Lemma 1: Bounding the 2SED Measure
  • proof
  • Lemma 2: Descent Lemma
  • ...and 3 more