Quantitative Fluctuation Analysis for Continuous-Time Stochastic Gradient Descent via Malliavin Calculus

Solesne Bourguin, Shivam S. Dhama, Konstantinos Spiliopoulos

TL;DR

An explicit rate, measured in the Wasserstein metric, is derived at which the iterates of Stochastic Gradient Descent in Continuous Time converge to a critical point of the objective function.

Abstract

In this paper, we establish a Quantitative Central Limit Theorem (QCLT) for the Stochastic Gradient Descent in Continuous Time (SGDCT) algorithm, whose parameter updates are governed by a stochastic differential equation. We derive an explicit rate at which the SGDCT iterates converge, in the Wasserstein metric, to a critical point of the objective function. This rate is driven primarily by the magnitude of the learning rate: for a fixed convexity constant of the objective function, smaller learning rates lead to slower convergence. Our approach relies on tools from Malliavin calculus. In particular, we apply a second-order Poincaré inequality and obtain explicit bounds by estimating the first- and second-order Malliavin derivatives separately. Controlling the second-order derivative requires several delicate calculations and a careful sequence of decompositions in order to achieve sharp estimates. We complement the theoretical results with several numerical experiments that illustrate the predicted convergence behavior.
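To make the phrase "parameter updates are governed by a stochastic differential equation" concrete, the following is a minimal Euler-Maruyama sketch of an SGDCT-type update for a scalar Ornstein-Uhlenbeck model. Several ingredients are assumptions that this page does not state: the model drift $f(x,\theta) = -\theta x$, the learning rate $\alpha_t = C_\alpha/(C_0 + t)$, the update rule $d\theta_t = \alpha_t f_\theta(X_t,\theta_t)\sigma^{-2}\left[dX_t - f(X_t,\theta_t)\,dt\right]$, and every parameter value. The function name `simulate_sgdct_ou` is hypothetical and is not the authors' code.

```python
import math
import random


def simulate_sgdct_ou(theta_star=1.0, theta0=0.5, c_alpha=2.0, c0=1.0,
                      sigma=1.0, t_end=200.0, dt=0.01, seed=0):
    """Euler-Maruyama sketch of an SGDCT-type update (assumed form).

    Data process (assumed): dX = -theta_star * X dt + sigma dW.
    Model drift (assumed):  f(x, theta) = -theta * x, so f_theta = -x.
    Learning rate (assumed): alpha_t = c_alpha / (c0 + t).
    Returns the parameter estimate at time t_end.
    """
    rng = random.Random(seed)
    x, theta, t = 0.0, theta0, 0.0
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        # Brownian increment over one step of length dt
        dw = rng.gauss(0.0, math.sqrt(dt))
        # Observed increment of the data process, driven by the true parameter
        dx = -theta_star * x * dt + sigma * dw
        alpha = c_alpha / (c0 + t)
        # theta update: alpha * f_theta * sigma^{-2} * (dX - f(X, theta) dt),
        # evaluated at the start-of-step state (Ito convention)
        theta += alpha * (-x) * (dx + theta * x * dt) / sigma ** 2
        x += dx
        t += dt
    return theta


if __name__ == "__main__":
    print(simulate_sgdct_ou())
```

With these illustrative defaults the returned value should fluctuate around `theta_star`. Repeating the run over many seeds gives a sample whose spread can be compared against a Gaussian limit.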

Paper Structure (21 sections, 43 theorems, 230 equations, 5 figures)

Table of Contents

  1. Introduction
  2. Problem Statement, Assumptions and Main result
  3. Proof of our Main Result: Theorem \ref{T:Main-theorem}
  4. Numerical Examples and Simulation
  5. First-order Malliavin Derivatives
  6. Proofs of Lemmas \ref{L:Integrating-Factor-first-der} and \ref{L:1-der-moments}
  7. Proofs of Lemmas \ref{L:L-1-deri-1} through \ref{L:L-8-deri-1}
  8. Second-order Malliavin Derivatives
  9. Bound associated with the initial condition term: $\mathbb{E} \left[ (\eta^*_{t, r_1 \vee r_2})^{2p}\gamma(X_{r_1}, X_{r_2}, \theta_{r_1}, \theta_{r_2})^{2p} \right]$
  10. Bound associated with the function $g(x,\theta)$ term: $\mathbb{E} \left[ \left( \int_{r_1 \vee r_2}^t \alpha_u \eta^*_{t,u} \Gamma^g (X_u, \theta_u) du \right)^{2p} \right]$
  11. Bound associated with the function $f(x, \theta)$ terms: $\mathbb{E} [( \int_{r_1 \vee r_2}^t \alpha_u \eta^*_{t,u} \Gamma^f (X_u, \theta_u) dW_u)^{2p}]$ and $\mathbb{E} [( \int_{r_1 \vee r_2}^t \alpha_u^2 \eta^*_{t,u} f_{\theta \theta}(X_u,\theta_u) \Gamma^f (X_u, \theta_u) du)^{2p}]$
  12. Rates corresponding to the cases $K_{g_{\theta \theta}}^* = \frac{1}{2C_\alpha} + 2 C_{\bar{g}}$, and $K_{g_{\theta \theta}}^* > \frac{1}{2C_\alpha} + 2 C_{\bar{g}}$
  13. Bounds associated with Pre-limit expectation and variance
  14. Bound for the term $\sqrt{\frac{\bar{\Sigma}}{\operatorname{Var}(\mathsf{F}_t)}} |{\mathbb{E}(\mathsf{F}_t)}|$
  15. Bound for the term $\mathbb{E} \left(|{\mathsf{F}_t}| \right)\left|{1-\sqrt{\frac{\bar{\Sigma}}{\operatorname{Var}(\mathsf{F}_t)}}}\right|$
  16. ...and 6 more sections

Key Result

Proposition 2.6

Let $\theta_t$ be the solution of Equation E:Process-theta and suppose that Assumptions A:f*-growth through A:Learning-rate are satisfied. Then, as $t \to \infty$, we have where, for the solution $\Psi$ of Poisson Equation E:Poisson-equation-prelimit and the functions $\bar{h}(\theta) = \int h(x,\theta)\mu(dx)$, $h(x,\theta) \triangleq \sigma^2 \left[ f_\theta(x, \theta) \sigma^{-2} - \Psi_x(x, \theta) \righ

Figures (5)

  • Figure 1: X-independent dynamics: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $1100$ sample paths with $t = 5000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. Since $C_{\bar{g}} = 1$, the values of $C_\alpha C_{\bar{g}}$ are $0.43$, $0.72$, $0.78$, and $1.0$. For visualization, in Figure \ref{fig:sub2} we display trajectories only up to $t = 500$.
  • Figure 2: OU process: We numerically estimate the limiting variance $\bar{\Sigma}$ for three values of $C_\alpha$: $0.045$, $0.0496$, and $0.068$. The remaining parameters are $t = 7000$, $dt = 0.1$, $\theta^* = 0.031$. Since $C_{\bar{g}} = 1/(2\theta^*) = 1/0.062$, the corresponding values of $C_\alpha C_{\bar{g}}$ are $0.72$, $0.8$, and $1.1$. At $t = 6500$, we obtain the estimates $\bar{\Sigma} \approx 0.0016$, $0.002$, and $0.0028$, respectively. For visualization, we display trajectories only up to $t = 200$.
  • Figure 3: OU process: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $150$ sample paths and $150$ Monte Carlo runs with $t = 7000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. For visualization, in Figure \ref{fig:sub22} we display trajectories only up to $t = 400$.
  • Figure 4: Cubic drift: We numerically estimate the limiting variance $\bar{\Sigma}$ for three values of $C_\alpha$: $0.0092$, $0.011$, and $0.016$. The remaining parameters are $t= 2000, dt = 0.1, \theta^* = 0.035$. Since $C_{\bar{g}} \approx 0.253 \left( \frac{2}{\theta^*} \right)^{\frac{3}{2}}$, the corresponding values of $C_{\bar{g}} C_\alpha$ are $1.01$, $1.21$ and $1.7$. At $t = 1600$, we obtain the estimates $\bar{\Sigma} \approx 0.0003$, $0.00034$, and $0.00038$, respectively. For visualization, we display trajectories only up to $t = 160$.
  • Figure 5: Cubic drift: The quantities $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ and $d_W(\mathsf{F}_t, N)$ are examined over $100$ sample paths and $100$ Monte Carlo runs with $t = 10000$. For notational convenience, we denote the Wasserstein distance by $W_1(t)$ in all figures. For visualization, in Figure \ref{fig:sub16} we display trajectories only up to $t = 300$.
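
The captions above report $d_W(\mathsf{F}_t, N)$ and the ratio $\frac{\log(d_W(\mathsf{F}_t, N))}{\log(t)}$ computed from sample paths. As a rough illustration of how such quantities could be estimated, the sketch below approximates the one-dimensional Wasserstein-1 distance between an empirical sample and a centered Gaussian by comparing sorted samples with Gaussian quantiles. The estimator, the function names `w1_to_gaussian` and `log_ratio`, and the sample sizes are assumptions made for illustration; the page does not say how the authors computed these quantities.

```python
import math
import random
from statistics import NormalDist


def w1_to_gaussian(samples, variance):
    """Approximate W1 between the empirical law of `samples` and N(0, variance).

    In one dimension, W1 equals the integral over u in (0, 1) of
    |F^{-1}(u) - G^{-1}(u)|. Here the integral is approximated by a
    midpoint rule on u, using the sorted samples as empirical quantiles.
    """
    n = len(samples)
    target = NormalDist(mu=0.0, sigma=math.sqrt(variance))
    ordered = sorted(samples)
    total = 0.0
    for i, value in enumerate(ordered):
        # Gaussian quantile at the midpoint of the i-th probability cell
        total += abs(value - target.inv_cdf((i + 0.5) / n))
    return total / n


def log_ratio(w1, t):
    """Empirical exponent log(W1) / log(t), as plotted in the figures."""
    return math.log(w1) / math.log(t)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical stand-in for samples of the fluctuation statistic
    draws = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    w1 = w1_to_gaussian(draws, variance=1.0)
    print(w1, log_ratio(w1, t=5000.0))
```

For a decaying bound of the form $d_W \le C t^{-\beta}$, the ratio returned by `log_ratio` approaches $-\beta$ as $t$ grows, which is presumably why the figures track it alongside the distance itself.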

Theorems & Definitions (95)

  • Proposition 2.6: Qualitative CLT \cite{siri_spilio_2020}
  • Theorem 2.8: Quantitative CLT
  • Remark 2.9
  • Remark 2.10: Comments on Assumption \ref{A:Tech-Cond}
  • Remark 2.11: Comments on the multidimensional case
  • Remark 2.12: Uniform-in-time moments
  • Proposition 3.1
  • Proof of Theorem \ref{T:Main-theorem}
  • Example 4.1
  • Example 4.2
  • ...and 85 more