Non-asymptotic convergence analysis of the stochastic gradient Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with applications to training of ReLU neural networks

Luxu Liang, Ariel Neufeld, Ying Zhang

TL;DR

The paper develops a non-asymptotic convergence theory for the stochastic gradient Hamiltonian Monte Carlo algorithm when the stochastic gradient is allowed to be discontinuous. By splitting the gradient into a locally Lipschitz part and a bounded part, and by employing a continuity-in-average condition, the authors establish explicit Wasserstein-1 and Wasserstein-2 distance bounds between SGHMC iterates and the target invariant measure, with rates that can be made arbitrarily small by choosing a small step size $\eta$. They also derive non-asymptotic bounds on the expected excess risk for the associated nonconvex optimization problems, including training ReLU neural networks, and demonstrate practical applicability through quantile estimation, transfer learning, hedging under asymmetric risk, and real-data experiments. The results are complemented by a detailed proof strategy based on auxiliary processes, moment estimates, and contraction in a semi-metric, providing rigorous guarantees for SGHMC under discontinuous gradients and offering guidance for tuning $\beta$, $\gamma$, and $\eta$ in practice.
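For reference, the Wasserstein distance of order $p$ mentioned above is standardly defined as follows. The symbols $\mu$, $\nu$, $\zeta$, and $\mathcal{C}(\mu,\nu)$ are notation introduced here for illustration and may differ from the paper's. For probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ with finite $p$-th moments, $p \in \{1, 2\}$,

$$
W_p(\mu, \nu) = \left( \inf_{\zeta \in \mathcal{C}(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^p \, \mathrm{d}\zeta(x, y) \right)^{1/p},
$$

where $\mathcal{C}(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$, that is, probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $\mu$ and $\nu$. By Jensen's inequality $W_1 \leq W_2$, so a bound in $W_2$ also controls $W_1$.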

Abstract

In this paper, we provide a non-asymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm to a target measure in Wasserstein-1 and Wasserstein-2 distance. Crucially, compared to the existing literature on SGHMC, we allow its stochastic gradient to be discontinuous. This allows us to provide explicit upper bounds, which can be controlled to be arbitrarily small, for the expected excess risk of non-convex stochastic optimization problems with discontinuous stochastic gradients, including, among others, the training of neural networks with ReLU activation function. To illustrate the applicability of our main results, we consider numerical experiments on quantile estimation and on several optimization problems involving ReLU neural networks relevant in finance and artificial intelligence.
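To make the setting concrete, the following is a minimal sketch of an SGHMC-style recursion applied to a toy quantile-estimation problem, where the stochastic gradient contains an indicator and is therefore discontinuous in $\theta$. It assumes the commonly used Euler-type scheme $\theta_{n+1} = \theta_n + \eta V_n$, $V_{n+1} = V_n - \eta[\gamma V_n + H(\theta_n, X_{n+1})] + \sqrt{2\gamma\eta/\beta}\,\xi_{n+1}$, which may differ in detail from the paper's definition. The function names, the regularised pinball loss, the standard normal data, and all parameter values are hypothetical choices made for illustration, not the paper's experimental setup.

```python
import math
import random


def stochastic_grad(theta, x, q, reg):
    """Stochastic gradient H(theta, x) for an assumed toy objective.

    Objective (assumption): pinball loss q*(x-theta)^+ + (1-q)*(theta-x)^+
    plus a quadratic regulariser reg*theta^2/2. The indicator below makes
    the gradient discontinuous in theta.
    """
    indicator = 1.0 if x <= theta else 0.0
    return indicator - q + reg * theta


def sghmc_quantile(q=0.95, eta=0.01, gamma=1.0, beta=1e4, reg=1e-3,
                   n_iter=100_000, seed=0):
    """Run the assumed SGHMC recursion and return the tail average of theta."""
    rng = random.Random(seed)
    theta, v = 0.0, 0.0
    # Scale of the injected Gaussian noise in the momentum update.
    noise_scale = math.sqrt(2.0 * gamma * eta / beta)
    tail_sum, tail_count = 0.0, 0
    for n in range(n_iter):
        x = rng.gauss(0.0, 1.0)           # fresh data sample X_{n+1}
        g = stochastic_grad(theta, x, q, reg)
        theta_next = theta + eta * v      # position step uses the old momentum
        v = v - eta * (gamma * v + g) + noise_scale * rng.gauss(0.0, 1.0)
        theta = theta_next
        if n >= n_iter // 2:              # average over the second half only
            tail_sum += theta
            tail_count += 1
    return tail_sum / tail_count


if __name__ == "__main__":
    print(sghmc_quantile())
```

With these hypothetical settings, the tail average is expected to land near the 0.95-quantile of the standard normal distribution (about 1.645), up to sampling noise and the small bias introduced by the regulariser. A larger $\beta$ reduces the injected noise, while a smaller $\eta$ reduces discretisation error at the cost of more iterations.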

Paper Structure

This paper contains 24 sections, 17 theorems, 280 equations, 5 figures, and 4 tables.

Key Result

Theorem 3.1

Let Assumptions \ref{asm:A3} to \ref{asm:A5} hold. Then, for any $\beta > 0$, there exist constants $C_1^{\star}, C_2^{\star}, C_3^{\star}, C_4^{\star} > 0$ such that, for every $n \in \mathbb{N}_0$ and $0 < \eta \leq \eta_{\max}$, with $\eta_{\max}$ defined in \eqref{eta_max}, we obtain [...], where $C_1^{\star}, C_2^{\star}, C_3^{\star}$, and $C_4^{\star}$ are made explicit and are summarized in Table 1. In particular, for [...]

Figures (5)

  • Figure 1: Expected excess risk of the optimization problem \ref{op:qe} using the SGHMC and SGLD algorithms (left) and the rate of convergence of the SGHMC algorithm based on 100,000 samples (right).
  • Figure 2: True and predicted values for the original learning task involving \ref{3LFN_eg} (left) and the new learning task involving \ref{2LFN} (right).
  • Figure 3: Plots of test scores $\mathscr{E}_K^{\mathcal{NN}}(s_0)$ for different numbers of assets and hidden sizes under the Black-Scholes-Merton model in the complete market case. The parameter settings are summarized in Table \ref{tab:iid}.
  • Figure 4: Plots of test scores $\mathscr{E}_K^{\mathcal{NN}}(s_0)$ for different numbers of assets and hidden sizes under the Black-Scholes-Merton model in the incomplete market case. The parameter settings are summarized in Table \ref{tab:iid}.
  • Figure 5: Mean squared error for the concrete compressive strength dataset (left) and test accuracy curve for Fashion MNIST (right).

Theorems & Definitions (53)

  • Remark 2.1
  • Remark 2.2
  • Remark 2.3
  • Remark 2.4
  • Theorem 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • ...and 43 more