A weak convergence approach to large deviations for stochastic approximations

Henrik Hult, Adam Lindhe, Pierre Nyquist, Guo-Jhen Wu

TL;DR

The paper establishes a full large-deviation principle for stochastic-approximation algorithms with state-dependent Markovian noise and decreasing step sizes by leveraging a weak-convergence framework. The main result provides a Laplace principle with rate function $I(\varphi)=\int_0^T \frac{1}{h(t)} L(\varphi(t),\dot{\varphi}(t))\,dt$, where $h(t)$ captures the time-scale and $L$ is a local rate function expressed via the family of transition kernels $\rho_x(y,\cdot)$. It also offers a new rate-function representation not relying on a limiting Hamiltonian, connects $L$ to several equivalent representations, and characterizes a time-dependent Hamiltonian $H(t,x,\alpha) = \frac{1}{h(t)} H(x,\alpha h(t))$, highlighting the two-time-scale ergodic structure. Applications include stochastic gradients, persistent-contrastive divergence, and the Wang–Landau algorithm, demonstrating broad applicability to learning and statistical-physics algorithms. The results generalize prior decreasing-step results to state-dependent noise and provide a robust framework for assessing rare-event behavior in complex stochastic-approximation schemes.
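As a consistency check on the two formulas above, here is a short worked derivation. It assumes, which this summary does not state, that $L(x,\cdot)$ is the Legendre transform of a limiting Hamiltonian $H(x,\cdot)$ and that $h(t)>0$. Under those assumptions, substituting $\tilde\alpha = \alpha h(t)$ shows that the Legendre transform of the time-dependent Hamiltonian is exactly the integrand of $I$:

$$
\sup_{\alpha}\big[\langle \alpha,\beta\rangle - H(t,x,\alpha)\big]
= \sup_{\alpha}\Big[\langle \alpha,\beta\rangle - \tfrac{1}{h(t)}\,H\big(x,\alpha h(t)\big)\Big]
= \tfrac{1}{h(t)}\,\sup_{\tilde\alpha}\big[\langle \tilde\alpha,\beta\rangle - H(x,\tilde\alpha)\big]
= \tfrac{1}{h(t)}\,L(x,\beta).
$$

Taking $x=\varphi(t)$ and $\beta=\dot{\varphi}(t)$ and integrating over $[0,T]$ recovers $I(\varphi)$ as written in the summary.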

Abstract

The theory of stochastic approximations forms the theoretical foundation for studying convergence properties of many popular recursive learning algorithms in statistics, machine learning and statistical physics. Large deviations for stochastic approximations provide asymptotic estimates of the probability that the learning algorithm deviates from its expected path, given by a limit ODE, and the large deviation rate function gives insight into the most likely way that such deviations occur. In this paper we prove a large deviation principle for general stochastic approximations with state-dependent Markovian noise and decreasing step size. Using the weak convergence approach to large deviations, we generalize previous results for stochastic approximations and identify the appropriate scaling sequence for the large deviation principle. We also give a new representation for the rate function, in which the rate function is expressed as an action functional involving the family of Markov transition kernels. Examples of learning algorithms that are covered by the large deviation principle include stochastic gradient descent, persistent contrastive divergence and the Wang-Landau algorithm.
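
To make the setting concrete, the following is a minimal sketch of a stochastic-approximation recursion with decreasing step size and state-dependent Markovian noise. It is a toy illustration, not the paper's recursion: the update function `g`, the step size `1/(k+1)`, and the AR(1)-style noise chain are all hypothetical choices made for this example. For each fixed state the noise averages to zero, so the expected path follows the limit ODE $\dot{x} = -x$ and the iterates drift toward 0.

```python
import random


def stochastic_approximation(x0, n_steps, seed=0):
    """Toy recursion X_{k+1} = X_k + eps_{k+1} * g(X_k, Y_{k+1}).

    All modelling choices here are assumptions for illustration only.
    """
    rng = random.Random(seed)
    x = x0
    y = 0.0  # current state of the noise chain
    path = [x]
    for k in range(1, n_steps + 1):
        eps = 1.0 / (k + 1)  # decreasing step size (assumed form)
        # State-dependent Markovian noise (hypothetical): an AR(1) chain
        # whose innovation scale depends on the current iterate x.
        y = 0.5 * y + rng.gauss(0.0, 1.0) / (1.0 + abs(x))
        # g(x, y) = -x + y; averaging out y gives the limit ODE dx/dt = -x.
        x = x + eps * (-x + y)
        path.append(x)
    return path


if __name__ == "__main__":
    path = stochastic_approximation(x0=5.0, n_steps=20000)
    print("start:", path[0], "end:", path[-1])
```

Large-deviation results of the kind described in the abstract quantify how unlikely it is for such a path to stray from the ODE solution; this sketch only simulates the typical behaviour.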

Paper Structure

This paper contains 25 sections, 25 theorems, 289 equations.

Key Result

Theorem 3.1

Let $X^n = \{X^n(t):t\in[0,T]\}$ be the continuous interpolations of $\{X^n_k\}_{k\geq n}$ given by eqn_recursion and take $L$ as in eqn_local_rate_function. Under Assumptions ass:Lipschitz-ass:limith, $I$ is a rate function, and $\{X^n\} _{n \in \mathbb{N}}$ satisfies a Laplace principle with scal

Theorems & Definitions (55)

  • Definition 2.1: Laplace principle
  • Remark 2.3
  • Theorem 3.1: Laplace principle
  • proof
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • proof
  • Remark 3.4
  • Lemma 3.5
  • ...and 45 more