Analysing heavy-tail properties of Stochastic Gradient Descent by means of Stochastic Recurrence Equations

Ewa Damek; Sebastian Mentemeier

TL;DR

The paper applies the theory of irreducible-proximal matrices to analyse the heavy-tail properties of stochastic gradient descent in linear regression, which places the problem in its natural framework.

Abstract

In recent works on the theory of machine learning, it has been observed that heavy tail properties of Stochastic Gradient Descent (SGD) can be studied in the probabilistic framework of stochastic recursions. In particular, Gürbüzbalaban et al. (arXiv:2006.04740) considered a setup corresponding to linear regression for which iterations of SGD can be modelled by a multivariate affine stochastic recursion $X_k=A_k X_{k-1}+B_k$, for independent and identically distributed pairs $(A_k, B_k)$, where $A_k$ is a random symmetric matrix and $B_k$ is a random vector. In this work, we will answer several open questions of the quoted paper and extend their results by applying the theory of irreducible-proximal (i-p) matrices.
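As a minimal sketch of why SGD for linear regression takes the affine form $X_k=A_k X_{k-1}+B_k$: one mini-batch step on the least-squares loss can be rewritten with $A_k = I - \frac{\eta}{b}\sum_i a_i a_i^\top$ and $B_k = \frac{\eta}{b}\sum_i a_i y_i$. The Gaussian data, the batch size `b`, the step size `eta`, and all function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgd_step(x, a, y, eta):
    """One mini-batch SGD step on the loss 0.5 * mean((a @ x - y)**2)."""
    b = a.shape[0]
    grad = a.T @ (a @ x - y) / b
    return x - eta * grad

def affine_coefficients(a, y, eta):
    """Return (A, B) such that the SGD step equals A @ x + B."""
    b, d = a.shape
    A = np.eye(d) - (eta / b) * (a.T @ a)  # random symmetric matrix
    B = (eta / b) * (a.T @ y)              # random vector
    return A, B

def simulate(d=2, b=5, eta=0.75, steps=1000, seed=0):
    """Iterate the recursion on i.i.d. Gaussian batches (assumed data model).

    Returns the final iterate and the worst relative gap between the
    SGD update and the affine update applied to the same iterate.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    worst_gap = 0.0
    for _ in range(steps):
        a = rng.standard_normal((b, d))  # batch of features
        y = rng.standard_normal(b)       # batch of responses
        A, B = affine_coefficients(a, y, eta)
        x_sgd = sgd_step(x, a, y, eta)
        x_rec = A @ x + B
        gap = np.max(np.abs(x_sgd - x_rec)) / (1.0 + np.max(np.abs(x)))
        worst_gap = max(worst_gap, float(gap))
        x = x_rec
    return x, worst_gap
```

Calling `simulate()` returns the last iterate together with the largest discrepancy between the two forms of the update, which should be at the level of floating-point rounding.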

Paper Structure (14 sections, 14 theorems, 111 equations, 2 figures)

This paper contains 14 sections, 14 theorems, 111 equations, 2 figures.

Introduction
Assumptions, Notations and Preliminaries
Main results
Heavy tail properties
How does the tail index depend on $\xi$?
Checking the assumptions - the model \ref{Rank1Gauss}
Finite Iterations: Proofs of Theorems \ref{th:growmoments} and \ref{thm:tailsRn}
Evaluation of tail index and stationary measure: The proofs of Theorems \ref{lem:uniform measure} and \ref{th:behavioralpha}
Identification and Differentiability of $k(s)=\mathbb{E}|(I-\xi H)e_1|^s$
The shape of the tail index function $\alpha(\xi)$: Proof of Theorem \ref{th:behavioralpha}
Specific models
Hierarchy of assumptions
The Gaussian model
Integrability of $\| A^{-1}\|^{\delta }$

Key Result

Proposition 2.1

Assume that $\mu_A$ satisfies (i-p-nc) and let $s \in I_k$. Then the following holds. The spectral radii $\rho(P^s)$ and $\rho(P^s_*)$ both equal $k(s)$ and there is a unique probability measure $\nu_{s}$ on $S$ and a unique function $r_{s} \in$ […] Further, the function $r_{s}$ is strictly positive. Also, there is a unique probability measure […]

Figures (2)

Figure 1: Contour plot of $h$ as a function of $b$ and $s$, for model \ref{Rank1Gauss} with $d=2$ and $\eta=0.75$. The black line is the contour of $k \equiv 1$. The values of $h$ are truncated at level 2 for better visualization.
Figure 2: Contour plot of $h$ as a function of $\eta$ and $s$, for model \ref{Rank1Gauss} with $d=2$ and $b=5$. The black line is the contour of $k \equiv 1$. The values of $h$ are truncated at level 2 for better visualization.
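The quantity $k(s)=\mathbb{E}|(I-\xi H)e_1|^s$ named in the outline, and the level set $k \equiv 1$ marked in the figures, can be explored numerically. The following is a rough Monte Carlo sketch, not the authors' procedure. It assumes that $H = \frac{1}{b}\sum_{i=1}^{b} a_i a_i^\top$ with standard Gaussian $a_i \in \mathbb{R}^d$ (one reading of the rank-one Gaussian model that this page does not confirm), that $|\cdot|$ is the Euclidean norm, and that the tail index is the positive root of $k(s)=1$. All function names are hypothetical.

```python
import numpy as np

def sample_norms(xi, d=2, b=5, n=200_000, seed=0):
    """Draw n samples of |(I - xi*H) e_1| under the assumed Gaussian model."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, b, d))
    # H e_1 = (1/b) * sum_i a_i * (first coordinate of a_i)
    He1 = np.einsum("nbd,nb->nd", a, a[:, :, 0]) / b
    e1 = np.zeros(d)
    e1[0] = 1.0
    return np.linalg.norm(e1 - xi * He1, axis=1)

def k_hat(s, norms):
    """Empirical estimate of k(s) = E |(I - xi*H) e_1|^s."""
    return float(np.mean(norms ** s))

def tail_index_hat(norms, s_max=50.0, iters=60):
    """Bisection for the positive root of k_hat(s) = 1.

    k_hat is convex in s with k_hat(0) = 1, so a positive root exists
    on (0, s_max] only if the slope at 0 is negative and
    k_hat(s_max) > 1. Returns None when no root is bracketed.
    """
    if np.mean(np.log(norms)) >= 0 or k_hat(s_max, norms) <= 1.0:
        return None
    lo, hi = 0.0, s_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if k_hat(mid, norms) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `tail_index_hat(sample_norms(0.75))` gives an empirical root for $d=2$, $b=5$ and $\xi=0.75$. Empirical high moments are dominated by a few large samples, so such an estimate is noisy and should be treated as indicative only.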

Theorems & Definitions (36)

Proposition 2.1
Proof
Proposition 2.2
Proof
Remark 2.3
Theorem 3.1
Proof
Remark 3.2
Theorem 3.3
Remark 3.4
...and 26 more
