Losing momentum in continuous-time stochastic optimisation

Kexin Jin, Jonas Latz, Chenguang Liu, Alessandro Scagliotti

TL;DR

The paper proposes and analyses a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through stochastic switching. Under convexity assumptions, the dynamical system converges to the global minimiser when the momentum is reduced over time and the subsampling rate goes to infinity.

Abstract

The training of modern machine learning models often consists in solving high-dimensional non-convex optimisation problems that are subject to large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling, which reduces computational cost. Both momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through a stochastic switching. We investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when reducing momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In experiments, we study our scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network in an image classification problem. Our algorithm attains competitive results compared to stochastic gradient descent with momentum.
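
To make the idea of losing momentum concrete, here is a minimal Python sketch. It is not the paper's discretisation scheme, whose exact form is not reproduced on this page. It assumes underdamped dynamics of the form $\mathrm{d}q = p\,\mathrm{d}t$, $m\,\mathrm{d}p = -\nabla\Phi_i(q)\,\mathrm{d}t - \alpha p\,\mathrm{d}t$, treats the friction term implicitly, and resamples the data index $i$ at every step instead of after random waiting times. The function name `sgmp_sketch`, the quadratic potentials $\Phi_i(\theta) = (\theta - y_i)^2/2$, and the step-size schedule are hypothetical choices for illustration. The mass decay $m_k = m_0 (0.995)^k$ is borrowed from the experiment described in the caption of Figure 5.

```python
import random


def sgmp_sketch(data, q0=5.0, p0=0.0, alpha=1.0, m0=1.0, decay=0.995,
                h0=0.1, n_steps=5000, seed=0):
    """Illustrative momentum method with decaying mass (assumed form).

    Each step draws one data point y and uses the subsampled potential
    Phi_i(q) = (q - y)^2 / 2, whose gradient is q - y.
    """
    rng = random.Random(seed)
    q, p = q0, p0
    for k in range(n_steps):
        m = m0 * decay ** k            # mass shrinks geometrically
        h = h0 / (1.0 + 0.01 * k)      # hypothetical decreasing step size
        y = rng.choice(data)           # data subsampling (index switching)
        grad = q - y                   # gradient of the subsampled potential
        # Velocity update with implicit friction:
        # m * (p_new - p) / h = -grad - alpha * p_new
        p = (m * p - h * grad) / (m + h * alpha)
        # Position update uses the new velocity (symplectic-Euler style).
        q = q + h * p
    return q


data = [-1.0, 0.5, 2.0, 2.5]
target = sum(data) / len(data)   # minimiser of the averaged quadratic potential
q_final = sgmp_sketch(data)
print(f"final iterate {q_final:.4f}, minimiser {target:.4f}")
```

As the mass tends to zero, the velocity update reduces to $p = -\nabla\Phi_i(q)/\alpha$, so the iteration turns into plain stochastic gradient descent with step size $h/\alpha$. This mirrors the momentum-to-no-momentum limit mentioned in the abstract.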

Paper Structure (25 sections, 26 theorems, 203 equations, 9 figures, 1 table)

Key Result

Theorem 5

Let $(q^\nu_t,p^\nu_t)_{t\ge 0}$ and $(q_t,p_t)_{t\ge 0}$ solve (eq:AS:pqe) and (eq:AS:pq). Then $(q^\nu_t,p^\nu_t)_{t\ge 0}$ converges weakly to $(q_t,p_t)_{t\ge 0}$ in $\mathcal{C}([0,\infty),X^2)$ as $\nu\to 0$, i.e. for any bounded continuous function $F: \mathcal{C}([0,\infty),X^2) \rightarrow

Figures (9)

  • Figure 1: Plot of the potential $\Phi$.
  • Figure 2: We consider the minimisation of $\bar{\Phi}(\theta) := \theta^2/2$. Let $\alpha = 1$, $m = 1$, $q_0 = 1$, $p_0 = 0$. Then, $q_t = \exp(-t/2)\left( \sin(\sqrt{3}t/2)/\sqrt{3} + \cos(\sqrt{3}t/2)\right)$ $(t \geq 0)$ (solid line) oscillates around the solution and ultimately converges to $\theta^* = 0$. If we choose instead $m(t) = (1+t)^{-1}$ $(t \geq 0)$ (dashed line), we have $q_t = \exp(-t^2/2)$, which does not oscillate, but converges very quickly to $\theta^*$.
  • Figure 3: Schematic comparing gradient flow, underdamped gradient flow, stochastic gradient process, and stochastic gradient-momentum process in terms of the particle mass $m$ and learning rate $\nu$.
  • Figure 4: The plots depict for which combinations of $m,\alpha$ the discrete-time version of \ref{eq:polyak_ode} derived in \ref{eq:semi_impl_rule} manages to overcome the "false minimiser". The blue crosses represent convergence to the global minimiser of \ref{eq:1_dim_ex}, the red ones to the origin. The black curve divides the $m,\alpha$ that do and do not satisfy \ref{eq:rel_escape_ex}. As the step-size $h=\frac{1}{L}$ gets smaller, the theoretical prediction \ref{eq:rel_escape_ex} becomes more accurate. Finally, we observe that the gradient method (that corresponds to $m = 0$) never converges to the global minimiser.
  • Figure 5: Convergence rate comparison. The plot represents the decreasing distance from the true minimiser achieved by the SGD and SGMP introduced in \ref{eq:stoc_meth}. We consider the case where the mass is constant, but the step-size $(h_n)_{n=0}^\infty$ is decreasing (left) and the case where the mass is reduced as $m_k=m_0(0.995)^k$ and the step-size decreases as before (right, see Subsection \ref{sec_losingmomentum}). The experiments are repeated $100$ times (always resampling the potentials), and we report the mean distance achieved by each method, and the corresponding standard deviation.
  • ...and 4 more figures
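
The two closed-form trajectories quoted in the caption of Figure 2 can be checked numerically. The sketch below assumes that the dynamics behind the figure have the form $m(t)\,\ddot{q}_t + \alpha\,\dot{q}_t + \bar{\Phi}'(q_t) = 0$ with $\bar{\Phi}'(q) = q$. The referenced equation is not reproduced on this page, so this form is an assumption. The helper names `q_const`, `q_decay` and `residual` are hypothetical, and the derivatives are approximated by central finite differences.

```python
import math

ALPHA = 1.0  # friction parameter from the caption


def q_const(t):
    """Closed form quoted for constant mass m = 1."""
    w = math.sqrt(3.0) / 2.0
    return math.exp(-t / 2.0) * (math.sin(w * t) / math.sqrt(3.0) + math.cos(w * t))


def q_decay(t):
    """Closed form quoted for decaying mass m(t) = 1 / (1 + t)."""
    return math.exp(-t * t / 2.0)


def residual(q, mass, t, eps=1e-4):
    """Residual of m(t) q'' + alpha q' + q = 0 via central differences."""
    dq = (q(t + eps) - q(t - eps)) / (2.0 * eps)
    d2q = (q(t + eps) - 2.0 * q(t) + q(t - eps)) / eps ** 2
    return mass(t) * d2q + ALPHA * dq + q(t)


times = [0.5 * k for k in range(1, 11)]
max_res_const = max(abs(residual(q_const, lambda t: 1.0, t)) for t in times)
max_res_decay = max(abs(residual(q_decay, lambda t: 1.0 / (1.0 + t), t)) for t in times)

print("max residual, constant mass:", max_res_const)
print("max residual, decaying mass:", max_res_decay)
print("initial values:", q_const(0.0), q_decay(0.0))
```

Both residuals should vanish up to finite-difference error if the assumed equation is the right one. The last line prints the initial positions implied by the two formulas.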

Theorems & Definitions (38)

  • Definition 1: Index process
  • Remark 2: Mass $m$
  • Remark 3: Learning rate and $\beta$
  • Definition 4
  • Theorem 5
  • Theorem 6
  • Remark 7
  • Definition 8
  • Proposition 9
  • Lemma 10: cf. Lemma \ref{lem:dqi}
  • ...and 28 more