Convergence Analysis of Fractional Gradient Descent

Ashwani Aggarwal

TL;DR

This paper analyzes variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings, proving linear convergence for smooth and strongly convex functions and $O(1/T)$ convergence for smooth and convex functions.

Abstract

Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove linear convergence for smooth and strongly convex functions and $O(1/T)$ convergence for smooth and convex functions. Additionally, we prove $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness (Hölder smoothness) that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speedup of fractional gradient descent over standard gradient descent, as well as some preliminary theoretical results explaining this speedup.

Paper Structure (32 sections, 16 theorems, 96 equations, 4 figures)

Key Result

Theorem 4

Choose some $\alpha\in\mathbb{R}$ and let $n = \lceil \alpha \rceil$. Suppose $f:\mathbb{R}\to\mathbb{R}$ is $n$ times differentiable and $f^{(n)}(t)$ is absolutely continuous throughout the interval $[\min(x,c),\max(x,c)]$. Then,

Figures (4)

  • Figure 1: Convergence of descent methods on the function $f(x,y) = 10x^2+y^2$ beginning at $x=1, y=-10$. In all cases, the optimal (not theoretical) step size is used. AT-CFGD is as described in \cite{shin2021caputo} with $x^{(-1)} = 1.5, y^{(-1)} = -10.5$, $\alpha = 1/2$, $\beta = -4/10$. Fractional Descent guided by Gradient is the method discussed in Corollary \ref{cor:beta_unified_low} with $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -\frac{0.0675}{(t+1)^{0.2}}$ in $x_t-c_t = -\lambda_t \nabla f(x_t)$.
  • Figure 2: Learning rates used by the different methods in Figure \ref{fig:convergence}, with the theoretical learning rate given by Corollary \ref{cor:beta_unified_low} added.
  • Figure 3: Comparison of fractional and standard gradient descent methods for $f(x) = x^T {\rm diag}([10,1,1,1,1]) x$ with $x_0 = (1,-10,5,8,-6)$. Hyper-parameters as in Corollary \ref{cor:beta_unified_low} are $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -0.0675$.
  • Figure 4: Comparison of fractional and standard gradient descent methods for $f(x) = x^T {\rm diag}([10,1,7,9,4]) x$ with $x_0 = (1,-10,5,8,-6)$. Hyper-parameters as in Corollary \ref{cor:beta_unified_low} are $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -0.0675$.
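
The standard gradient descent baseline in these figures can be sketched as follows. This is a minimal illustration of the setup in the Figure 1 caption only (the test function, starting point, and "optimal" step size), not the paper's code and not any of the fractional methods. It assumes "optimal step size" means exact line search, which for a diagonal quadratic $f(x) = \sum_i d_i x_i^2$ with gradient $g$ has the closed form $s = g^T g / (2 \sum_i d_i g_i^2)$. The function names are hypothetical.

```python
def f(d, x):
    """Diagonal quadratic f(x) = sum_i d[i] * x[i]**2."""
    return sum(di * xi * xi for di, xi in zip(d, x))


def grad(d, x):
    """Gradient of the diagonal quadratic: 2 * d[i] * x[i] per coordinate."""
    return [2.0 * di * xi for di, xi in zip(d, x)]


def exact_step(d, g):
    """Step size minimizing f along -g (assumed meaning of 'optimal' step)."""
    num = sum(gi * gi for gi in g)
    den = 2.0 * sum(di * gi * gi for di, gi in zip(d, g))
    return num / den if den > 0.0 else 0.0  # zero gradient: stay put


def gradient_descent(d, x0, iters):
    """Run standard gradient descent; return final point and f-value history."""
    x = list(x0)
    history = [f(d, x)]
    for _ in range(iters):
        g = grad(d, x)
        s = exact_step(d, g)
        x = [xi - s * gi for xi, gi in zip(x, g)]
        history.append(f(d, x))
    return x, history


# Setup from the Figure 1 caption: f(x, y) = 10x^2 + y^2, start at (1, -10).
x_final, history = gradient_descent([10.0, 1.0], [1.0, -10.0], 50)
print(history[0], history[-1])
```

The same functions accept the five-dimensional diagonal problems from Figures 3 and 4 by passing the corresponding diagonal and starting point.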

Theorems & Definitions (21)

  • Definition 1: Left Caputo Derivative
  • Definition 2: Right Caputo Derivative
  • Definition 3: Caputo Derivative
  • Theorem 4
  • Theorem 5: Relation between First Derivative and Fractional Derivative
  • Corollary 6
  • Definition 7
  • Definition 8
  • Corollary 9
  • Corollary 10
  • ...and 11 more
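
As a rough numerical illustration of Definition 1, the sketch below evaluates a left Caputo derivative. The definition's formula is not reproduced in this listing, so the code assumes the standard textbook form for $0 < \alpha < 1$ and $x > c$, namely $\frac{1}{\Gamma(1-\alpha)} \int_c^x f'(t)\,(x-t)^{-\alpha}\,dt$; the paper's own notation or normalization may differ. The function name `left_caputo` and its parameters are hypothetical. The substitution $u = (x-t)^{1-\alpha}$ removes the endpoint singularity before composite Simpson's rule is applied.

```python
import math


def left_caputo(df, x, c, alpha, n_panels=200):
    """Left Caputo derivative of order 0 < alpha < 1 at a point x > c.

    df is the ordinary first derivative f'. Assumes the textbook definition
    (1 / Gamma(1 - alpha)) * integral_c^x f'(t) * (x - t)**(-alpha) dt.
    """
    if not (0.0 < alpha < 1.0) or x <= c:
        raise ValueError("sketch only covers 0 < alpha < 1 and x > c")
    if n_panels % 2:
        n_panels += 1  # Simpson's rule needs an even number of panels

    # With u = (x - t)**(1 - alpha), the integral becomes
    # (1 / (1 - alpha)) * integral_0^upper f'(x - u**p) du, with p = 1 / (1 - alpha).
    p = 1.0 / (1.0 - alpha)
    upper = (x - c) ** (1.0 - alpha)
    h = upper / n_panels

    def g(u):
        return df(x - u ** p)

    total = g(0.0) + g(upper)
    for i in range(1, n_panels):
        total += (4.0 if i % 2 else 2.0) * g(i * h)
    integral = total * h / 3.0
    return integral / ((1.0 - alpha) * math.gamma(1.0 - alpha))


# Check against the known closed form for f(t) = t^2 with c = 0:
# the Caputo derivative of order alpha is 2 / Gamma(3 - alpha) * x**(2 - alpha).
approx = left_caputo(lambda t: 2.0 * t, x=1.0, c=0.0, alpha=0.5)
exact = 2.0 / math.gamma(2.5)
print(approx, exact)
```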