Convergence Analysis of Fractional Gradient Descent

Ashwani Aggarwal

TL;DR

This paper analyzes variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings, proving linear convergence for smooth and strongly convex functions and $O(1/T)$ convergence for smooth and convex functions.

Abstract

Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove linear convergence for smooth and strongly convex functions and $O(1/T)$ convergence for smooth and convex functions. Additionally, we prove $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness, Hölder smoothness, that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speedup of fractional gradient descent over standard gradient descent, as well as some preliminary theoretical results explaining this speedup.

Paper Structure (32 sections, 16 theorems, 96 equations, 4 figures)

Introduction
Related Work
Relating Fractional Derivative and Integer Derivative
Smooth and Strongly Convex Optimization
Fractional Gradient Descent Method
Single Dimensional Results
Higher Dimensional Results
Smooth and Convex Optimization
Smooth and Non-Convex Optimization
Fractional Gradient Descent Method
Convergence Results
Finding the Advantage of Fractional Gradient Descent
Experiments
Quadratic Function Analysis
Future Directions
...and 17 more sections

Key Result

Theorem 4

Choose some $\alpha\in\mathbb{R}$ and let $n = \lceil \alpha \rceil$. Suppose $f:\mathbb{R}\to\mathbb{R}$ is $n$ times differentiable and $f^{(n)}(t)$ is absolutely continuous throughout the interval $[\min(x,c),\max(x,c)]$. Then,

Figures (4)

Figure 1: Convergence of descent methods on the function $f(x,y) = 10x^2+y^2$ beginning at $x=1, y=-10$. In all cases, the optimal (not theoretical) step size is used. AT-CFGD is as described in \cite{shin2021caputo} with $x^{(-1)} = 1.5, y^{(-1)} = -10.5$, $\alpha = 1/2$, $\beta = -4/10$. Fractional Descent guided by Gradient is the method discussed in Corollary \ref{cor:beta_unified_low} with $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -\frac{0.0675}{(t+1)^{0.2}}$ in $x_t-c_t = -\lambda_t \nabla f(x_t)$.
Figure 2: Learning rates used by different methods in Figure \ref{fig:convergence}, with the theoretical learning rate given by Corollary \ref{cor:beta_unified_low} added.
Figure 3: Comparison of fractional and standard gradient descent methods for $f(x) = x^T {\rm diag}([10,1,1,1,1]) x$ with $x_0 = (1,-10,5,8,-6)$. Hyper-parameters as in Corollary \ref{cor:beta_unified_low} are $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -0.0675$.
Figure 4: Comparison of fractional and standard gradient descent methods for $f(x) = x^T {\rm diag}([10,1,7,9,4]) x$ with $x_0 = (1,-10,5,8,-6)$. Hyper-parameters as in Corollary \ref{cor:beta_unified_low} are $\alpha = 1/2$, $\beta = -4/10$, $\lambda_t = -0.0675$.
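
As a point of reference for the Figure 1 setup, the following is a minimal sketch of the standard gradient descent baseline only, on $f(x,y) = 10x^2+y^2$ starting from $x=1, y=-10$. It does not implement AT-CFGD or the fractional method of the corollary. Reading "optimal step size" as exact line search along the negative gradient is an assumption made here, and the function names and the iteration count of 50 are illustrative choices, not taken from the paper.

```python
def f(p):
    """Objective from the Figure 1 caption: f(x, y) = 10 x^2 + y^2."""
    x, y = p
    return 10.0 * x * x + y * y


def grad(p):
    """Gradient of f: (20 x, 2 y)."""
    x, y = p
    return (20.0 * x, 2.0 * y)


def exact_line_search_step(g):
    """Step size minimizing f along -g for this quadratic.

    For a quadratic with Hessian H, the minimizer is (g.g) / (g.H g);
    here H = diag(20, 2). Assumption: this is what "optimal" means.
    """
    gx, gy = g
    num = gx * gx + gy * gy
    den = 20.0 * gx * gx + 2.0 * gy * gy
    return num / den if den > 0.0 else 0.0


def gradient_descent(p0, n_iters=50):
    """Run standard gradient descent and record f after every step."""
    p = p0
    values = [f(p)]
    for _ in range(n_iters):
        g = grad(p)
        eta = exact_line_search_step(g)
        p = (p[0] - eta * g[0], p[1] - eta * g[1])
        values.append(f(p))
    return p, values


final_point, values = gradient_descent((1.0, -10.0))
```

Plotting `values` on a logarithmic axis should show the linear convergence expected for a smooth and strongly convex quadratic; any comparison against the fractional methods would require the update rules given in the paper itself.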

Theorems & Definitions (21)

Definition 1: Left Caputo Derivative
Definition 2: Right Caputo Derivative
Definition 3: Caputo Derivative
Theorem 4
Theorem 5: Relation between First Derivative and Fractional Derivative
Corollary 6
Definition 7
Definition 8
Corollary 9
Corollary 10
...and 11 more
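
For orientation on Definition 1, the textbook left Caputo derivative of order $\alpha\in(0,1)$ with terminal $c < x$ is $\frac{1}{\Gamma(1-\alpha)}\int_c^x f'(t)\,(x-t)^{-\alpha}\,dt$; the paper's exact notation and its unified Definition 3 may differ. The sketch below approximates this integral numerically and checks it against the known closed form for $f(t)=t^2$ with $c=0$. The function name `left_caputo`, the substitution $u=(x-t)^{1-\alpha}$ used to remove the endpoint singularity, and the midpoint rule are all choices made for this illustration, not the paper's method.

```python
import math


def left_caputo(df, x, c, alpha, n_steps=2000):
    """Approximate the left Caputo derivative of order alpha in (0, 1) at x > c.

    df is the ordinary first derivative f'. With u = (x - t)^(1 - alpha), the
    integral of f'(t) (x - t)^(-alpha) over [c, x] becomes
    (1 / (1 - alpha)) * integral of f'(x - u^(1 / (1 - alpha))) over
    [0, (x - c)^(1 - alpha)], which has no singularity.
    """
    if not (0.0 < alpha < 1.0):
        raise ValueError("this sketch only covers 0 < alpha < 1")
    if x <= c:
        raise ValueError("the left derivative requires x > c")
    upper = (x - c) ** (1.0 - alpha)
    h = upper / n_steps
    total = 0.0
    for i in range(n_steps):
        u = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += df(x - u ** (1.0 / (1.0 - alpha)))
    return total * h / ((1.0 - alpha) * math.gamma(1.0 - alpha))


# f(t) = t^2, so f'(t) = 2 t; evaluate at x = 2 with c = 0 and alpha = 1/2.
approx = left_caputo(lambda t: 2.0 * t, x=2.0, c=0.0, alpha=0.5)

# Closed form for f(t) = (t - c)^2: Gamma(3) / Gamma(3 - alpha) * (x - c)^(2 - alpha).
exact = math.gamma(3.0) / math.gamma(2.5) * 2.0 ** 1.5
```

Comparing `approx` with `exact` gives a quick sanity check on the quadrature before trying functions that have no closed-form fractional derivative.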
