Quantitative Fluctuation Analysis for Continuous-Time Stochastic Gradient Descent via Malliavin Calculus

Solesne Bourguin; Shivam S. Dhama; Konstantinos Spiliopoulos

TL;DR

An explicit rate is derived, in the Wasserstein metric, at which the Stochastic Gradient Descent in Continuous Time (SGDCT) iterates converge to a critical point of the objective function.

Abstract

In this paper, we establish a Quantitative Central Limit Theorem (QCLT) for the Stochastic Gradient Descent in Continuous Time (SGDCT) algorithm, whose parameter updates are governed by a stochastic differential equation. We derive an explicit rate at which the SGDCT iterates converge, in the Wasserstein metric, to a critical point of the objective function. This rate is driven primarily by the magnitude of the learning rate: for a fixed convexity constant of the objective function, smaller learning rates lead to slower convergence. Our approach relies on tools from Malliavin calculus. In particular, we apply a second-order Poincaré inequality and obtain explicit bounds by estimating the first- and second-order Malliavin derivatives separately. Controlling the second-order derivative requires several delicate calculations and a careful sequence of decompositions in order to achieve sharp estimates. We complement the theoretical results with several numerical experiments that illustrate the predicted convergence behavior.
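For orientation, the two objects named in the abstract can be written down in a standard form. The block below recalls the usual definition of the Wasserstein distance and the second-order Poincaré inequality on Wiener space in the form given by Nourdin, Peccati and Reinert. The abstract does not state which version of the inequality the paper applies, so the constant and the choice of norms are an assumption made here for illustration, as are the symbols $\mu$, $\sigma^2$ and $\mathfrak{H}$.

```latex
% Wasserstein distance between the laws of F and N
d_W(F, N) = \sup_{h \in \mathrm{Lip}(1)}
  \left| \mathbb{E}[h(F)] - \mathbb{E}[h(N)] \right| .

% Second-order Poincare inequality (assumed standard form):
% F \in \mathbb{D}^{2,4}, \mathbb{E}[F] = \mu, \operatorname{Var}(F) = \sigma^2 > 0,
% N \sim \mathcal{N}(\mu, \sigma^2)
d_W(F, N) \le \frac{\sqrt{10}}{2\sigma^2}\,
  \mathbb{E}\!\left[ \| D^2 F \|_{\mathrm{op}}^4 \right]^{1/4}
  \mathbb{E}\!\left[ \| D F \|_{\mathfrak{H}}^4 \right]^{1/4} .
```

Read this way, a rate for $d_W$ follows once the fourth moments of the first- and second-order Malliavin derivatives are bounded separately, which matches the two-step strategy described in the abstract.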

Paper Structure (21 sections, 43 theorems, 230 equations, 5 figures)


Introduction
Problem Statement, Assumptions and Main result
Proof of our Main Result: Theorem \ref{T:Main-theorem}
Numerical Examples and Simulation
First-order Malliavin Derivatives
Proofs of Lemmas \ref{L:Integrating-Factor-first-der} and \ref{L:1-der-moments}
Proofs of Lemmas \ref{L:L-1-deri-1} through \ref{L:L-8-deri-1}
Second-order Malliavin Derivatives
Bound associated with the initial condition term: $\mathbb{E} \left[ (\eta^*_{t, r_1 \vee r_2})^{2p}\gamma(X_{r_1}, X_{r_2}, \theta_{r_1}, \theta_{r_2})^{2p} \right]$
Bound associated with the function $g(x,\theta)$ term: $\mathbb{E} \left[ \left( \int_{r_1 \vee r_2}^t \alpha_u \eta^*_{t,u} \Gamma^g (X_u, \theta_u) du \right)^{2p} \right]$
Bound associated with the function $f(x, \theta)$ terms: $\mathbb{E} [( \int_{r_1 \vee r_2}^t \alpha_u \eta^*_{t,u} \Gamma^f (X_u, \theta_u) dW_u)^{2p}]$ and $\mathbb{E} [( \int_{r_1 \vee r_2}^t \alpha_u^2 \eta^*_{t,u} f_{\theta \theta}(X_u,\theta_u) \Gamma^f (X_u, \theta_u) du)^{2p}]$
Rates corresponding to the cases $K_{g_{\theta \theta}}^* = \frac{1}{2C_\alpha} + 2 C_{\bar{g}}$, and $K_{g_{\theta \theta}}^* > \frac{1}{2C_\alpha} + 2 C_{\bar{g}}$
Bounds associated with Pre-limit expectation and variance
Bound for the term $\sqrt{\frac{\bar{\Sigma}}{\operatorname{Var}(\mathsf{F}_t)}} |{\mathbb{E}(\mathsf{F}_t)}|$
Bound for the term $\mathbb{E} \left(|{\mathsf{F}_t}| \right)\left|{1-\sqrt{\frac{\bar{\Sigma}}{\operatorname{Var}(\mathsf{F}_t)}}}\right|$
...and 6 more sections

Key Result

Proposition 2.6

Let $\theta_t$ be the solution of Equation E:Process-theta, and suppose that Assumptions A:f*-growth through A:Learning-rate are satisfied. Then, as $t \to \infty$, we have where, for the solution $\Psi$ of Poisson Equation E:Poisson-equation-prelimit and the functions $\bar{h}(\theta) = \int h(x,\theta)\mu(dx)$, $h(x,\theta) \triangleq \sigma^2 \left[ f_\theta(x, \theta) \sigma^{-2} - \Psi_x(x, \theta) \righ

Figures (5)

Figure 1: X-independent dynamics: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $1100$ sample paths with $t = 5000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. Since $C_{\bar{g}} = 1$, the values of $C_\alpha C_{\bar{g}}$ are $0.43$, $0.72$, $0.78$, and $1.0$. For visualization, in Figure \ref{fig:sub2} we display trajectories only up to $t = 500$.
Figure 2: OU process: We numerically estimate the limiting variance $\bar{\Sigma}$ for three values of $C_\alpha$: $0.045$, $0.0496$, and $0.068$. The remaining parameters are $t = 7000$, $dt = 0.1$, and $\theta^* = 0.031$. Since $C_{\bar{g}} = 1/(2\theta^*) = 1/0.062$, the corresponding values of $C_\alpha C_{\bar{g}}$ are $0.72$, $0.8$, and $1.1$. At $t = 6500$, we obtain the estimates $\bar{\Sigma} \approx 0.0016$, $0.002$, and $0.0028$, respectively. For visualization, we display trajectories only up to $t = 200$.
Figure 3: OU process: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $150$ sample paths and $150$ Monte Carlo runs with $t = 7000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. For visualization, in Figure \ref{fig:sub22} we display trajectories only up to $t = 400$.
Figure 4: Cubic drift: We numerically estimate the limiting variance $\bar{\Sigma}$ for three values of $C_\alpha$: $0.0092$, $0.011$, and $0.016$. The remaining parameters are $t = 2000$, $dt = 0.1$, and $\theta^* = 0.035$. Since $C_{\bar{g}} \approx 0.253 \left( \frac{2}{\theta^*} \right)^{\frac{3}{2}}$, the corresponding values of $C_{\bar{g}} C_\alpha$ are $1.01$, $1.21$, and $1.7$. At $t = 1600$, we obtain the estimates $\bar{\Sigma} \approx 0.0003$, $0.00034$, and $0.00038$, respectively. For visualization, we display trajectories only up to $t = 160$.
Figure 5: Cubic drift: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $100$ sample paths and $100$ Monte Carlo runs with $t = 10000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. For visualization, in Figure \ref{fig:sub16} we display trajectories only up to $t = 300$.
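The kind of experiment summarized in the OU captions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes data dynamics $dX_t = -\theta^* X_t\,dt + dW_t$ with $\sigma = 1$, a model drift $f(x,\theta) = -\theta x$, a learning rate $\alpha_t = C_\alpha/(C_0 + t)$, an Euler-Maruyama discretization, and the fluctuation $\sqrt{t}(\theta_t - \theta^*)$ as a stand-in for $\mathsf{F}_t$. The function names `simulate_sgdct_ou`, `terminal_fluctuations`, and `wasserstein_1d` are hypothetical.

```python
import numpy as np


def simulate_sgdct_ou(theta_star, c_alpha, t_end, dt, theta0=0.5, c0=1.0, seed=0):
    """Euler-Maruyama sketch of SGDCT for an assumed OU model.

    Assumed data dynamics: dX_t = -theta_star * X_t dt + dW_t (sigma = 1).
    Assumed model drift:   f(x, theta) = -theta * x, so f_theta(x, theta) = -x.
    Assumed learning rate: alpha_t = c_alpha / (c0 + t).
    Returns the whole discretized path of theta, including the initial value.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(t_end / dt))
    dw = rng.normal(0.0, np.sqrt(dt), size=n_steps)  # Brownian increments
    theta = np.empty(n_steps + 1)
    theta[0] = theta0
    x = 0.0
    for k in range(n_steps):
        alpha = c_alpha / (c0 + k * dt)
        dx = -theta_star * x * dt + dw[k]  # observed data increment
        # SGDCT update: alpha_t * f_theta * (dX_t - f(X_t, theta_t) dt)
        theta[k + 1] = theta[k] + alpha * (-x) * (dx + theta[k] * x * dt)
        x += dx
    return theta


def terminal_fluctuations(n_paths, theta_star, c_alpha, t_end, dt):
    """One sample of sqrt(t) * (theta_t - theta_star) per independent path."""
    return np.array([
        np.sqrt(t_end)
        * (simulate_sgdct_ou(theta_star, c_alpha, t_end, dt, seed=s)[-1] - theta_star)
        for s in range(n_paths)
    ])


def wasserstein_1d(a, b):
    """Exact W1 distance between two empirical measures of equal sample size."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally sized samples")
    # In one dimension the optimal coupling matches order statistics.
    return float(np.mean(np.abs(a - b)))
```

To mimic the captions, one could draw `terminal_fluctuations(...)` for several values of `t_end`, compare each batch against an equally sized Gaussian sample with the estimated limiting variance using `wasserstein_1d`, and then inspect $\log(W_1(t))/\log(t)$. The sorted-sample formula is used instead of a library call only to keep the sketch dependency-free.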

Theorems & Definitions (95)

Proposition 2.6: Qualitative CLT \cite{siri_spilio_2020}
Theorem 2.8: Quantitative CLT
Remark 2.9
Remark 2.10: Comments on Assumption \ref{A:Tech-Cond}
Remark 2.11: Comments on the multidimensional case
Remark 2.12: Uniform-in-time moments
Proposition 3.1
Proof: Proof of Theorem \ref{T:Main-theorem}
Example 4.1
Example 4.2
...and 85 more
