Stochastic Gradient Descent with Adaptive Data

Ethan Che, Jing Dong, Xin T. Tong

TL;DR

This paper introduces simple criteria for the adaptively generated data stream to guarantee the convergence of SGD, and shows that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in.

Abstract

Stochastic gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios. Its convergence analysis is relatively well understood under the assumption that the data samples are independent and identically distributed (iid). However, applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy. The adaptively generated data stream involves samples that are non-stationary, no longer independent from each other, and affected by previous decisions. The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case. In this paper, we introduce simple criteria for the adaptively generated data stream to guarantee the convergence of SGD. We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in. Our Lyapunov-function analysis allows one to translate existing stability analysis of stochastic systems studied in operations research into convergence rates for SGD, and we demonstrate this for queueing and inventory management problems. We also showcase how our result can be applied to study the sample complexity of an actor-critic policy gradient algorithm.
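
To make the adaptive-data setting concrete, the following is a minimal toy sketch, not taken from the paper. It assumes a hypothetical scalar system in which the state $x_t$ follows an AR(1) recursion whose stationary mean is the current policy parameter $\theta_t$, and a quadratic loss with target `c`, so that the exact gradient under the stationary distribution is $\theta - c$ while the sample gradient `x - c` is biased whenever the state has not yet mixed. All names (`sgd_adaptive`, `alpha`, `sigma`, `eta0`, `c`) and the dynamics themselves are illustrative assumptions.

```python
import random


def sgd_adaptive(c=5.0, alpha=0.8, sigma=1.0, eta0=0.5, T=20000, seed=0):
    """Toy SGD where the data stream depends on the current parameter.

    Hypothetical dynamics: x_{t+1} = alpha * x_t + (1 - alpha) * theta + noise,
    so under a fixed theta the stationary mean of x is theta.
    Assumed loss: 0.5 * E[(x - c)^2] under that stationary law,
    whose gradient in theta is (theta - c).
    """
    rng = random.Random(seed)
    theta, x = 0.0, 0.0
    theta_sum = 0.0
    for t in range(1, T + 1):
        eta = eta0 / t ** 0.5          # step size eta_t = eta0 * t^{-1/2}
        grad = x - c                   # biased: x is not yet stationary for theta
        theta -= eta * grad            # SGD update on the policy parameter
        # The updated policy changes the environment that produces the next sample.
        x = alpha * x + (1 - alpha) * theta + sigma * rng.gauss(0.0, 1.0)
        theta_sum += theta
    return theta, theta_sum / T        # last iterate and averaged iterate


if __name__ == "__main__":
    last, avg = sgd_adaptive()
    print(f"last iterate: {last:.3f}, averaged iterate: {avg:.3f}")
```

Under these assumed parameters the averaged iterate should settle near `c`. Moving `alpha` toward 1 slows the mixing of the state and, in this toy, makes the gradient noise more persistent, which is the kind of dependence on mixing time the abstract refers to.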

Paper Structure

This paper contains 22 sections, 25 theorems, 267 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions ass:ergodicity0 -- ass:error0 hold. The iterates generated according to eq:genSGD satisfy where $O$ hides a polynomial in $M$ and $L$. If we fix $\eta_t=\eta_0 t^{-1/2}$ for some $\eta_0>0$, and assume $\mathbb{E}\, e_t = O(1/\sqrt{t})$, we can further simplify the bound to

Figures (4)

  • Figure 1: Inventory control with stock-out damping (overage cost $b=10$, underage cost $h=1$, and noise level $\sigma = 1.0$). The dotted line indicates a $1/t$ convergence rate in all plots. (Top) Newsvendor loss gap for the SGD iterates with AR(1) parameter $\alpha = 0.8$. (Top Left) Step-size schedule $\eta_{t} = 2Bt^{-1}$ across batch sizes $B \in \{1, 10, 100\}$. (Top Right) Step-size schedule $\eta_{t} = 2Bt^{-1/2}$ with iterate averaging. (Bottom) Newsvendor loss gap for the SGD iterates with AR(1) parameter $\alpha = 0.9$. (Bottom Left) Step-size schedule $\eta_{t} = 2Bt^{-1}$. (Bottom Right) Step-size schedule $\eta_{t} = 2Bt^{-1/2}$ with iterate averaging.
  • Figure 2: Pricing and capacity sizing in the single server queue. (Left) Last iterate loss gap for the SGD iterates with $\eta_{t} = 1/t$ across batch sizes $B \in \{1, 10, 100\}$. The dotted line displays the convergence rate of $t^{-1}$. (Right) Loss gap for the average iterate with $\eta_{t} = 1/\sqrt{t}$. The dotted line displays the convergence rate of $t^{-1/2}$.
  • Figure 3: Pricing and capacity sizing in the single server queue. (Left) Loss gap for the SGD iterates without projection and without iterate averaging with $\eta_{t} = 1/\sqrt{t}$ across batch sizes $B \in \{1, 10, 100\}$. (Right) Loss gap for the SGD iterates without projection but with iterate averaging and $\eta_{t} = 1/\sqrt{t}$. The dotted line displays the convergence rate of $t^{-1/2}$.
  • Figure 4: Policy gradient for tabular MDPs. The dotted line indicates a $t^{-1/2}$ convergence rate. (Left) Scaled loss gap for the averaged SGD iterates for the reinforcement learning problem across batch sizes $B \in \{1, 10, 100\}$ for 100 randomly generated MDPs with $|\mathcal{S}|=|\mathcal{A}|=5$. Step-size schedule is $\eta_{t} = 2Bt^{-1/2}$. (Right) Scaled loss gap for the averaged SGD iterates for the reinforcement learning problem for 100 randomly generated MDPs with $|\mathcal{S}|=|\mathcal{A}|=10$. Step-size schedule is $\eta_{t} = 2Bt^{-1/2}$.

Theorems & Definitions (26)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Theorem 3
  • Theorem 4
  • Proposition 1
  • Proposition 2
  • Theorem 5
  • Lemma 2
  • ...and 16 more