The surprising efficiency of temporal difference learning for rare event prediction

Xiaoou Cheng, Jonathan Weare

TL;DR

The paper studies efficient policy evaluation for rare-event statistics in finite-state Markov reward processes. It develops a central limit theorem for the least-squares TD (LSTD) estimator and derives a simple, data-dependent bound on the relative asymptotic variance that hinges on connectivity quantities rather than worst-case conditioning. Under mild rare-event assumptions, the bound shows that LSTD requires only polynomially many observed transitions (e.g., $\mathcal{O}(n^3)$) to achieve a target relative accuracy, while MC data requirements grow exponentially with the state-space size. Experiments on a multimodal chain illustrate dramatic practical gains, and the results suggest broad applicability to trajectory-based TD methods for rare-event prediction with potential extensions to continuous spaces and online algorithms.

Abstract

We quantify the efficiency of temporal difference (TD) learning over the direct, or Monte Carlo (MC), estimator for policy evaluation in reinforcement learning, with an emphasis on estimation of quantities related to rare events. Policy evaluation is complicated in the rare event setting by the long timescale of the event and by the need for relative accuracy in estimates of very small values. Specifically, we focus on least-squares TD (LSTD) prediction for finite state Markov chains, and show that LSTD can achieve relative accuracy far more efficiently than MC. We prove a central limit theorem for the LSTD estimator and upper bound the relative asymptotic variance by simple quantities characterizing the connectivity of states relative to the transition probabilities between them. Using this bound, we show that, even when both the timescale of the rare event and the relative accuracy of the MC estimator are exponentially large in the number of states, LSTD maintains a fixed level of relative accuracy with a total number of observed transitions of the Markov chain that is only polynomially large in the number of states.
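To make the LSTD setting concrete, the following is a minimal sketch of tabular LSTD (indicator features, lag $\tau = 1$) for the mean first passage time on a small chain, compared against the exact linear solve. The chain (`make_chain`, a biased random walk), the uniform initial distribution on $D$, and all function names are illustrative assumptions, not the paper's multimodal example or the authors' code.

```python
import numpy as np

def make_chain(n, p_up=0.3):
    """Hypothetical biased random walk on states 0..n-1 (not the paper's chain)."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, min(i + 1, n - 1)] += p_up        # step toward the rare target n-1
        P[i, max(i - 1, 0)] += 1.0 - p_up      # step away (reflect at 0)
    return P

def exact_mfpt(P, D):
    """Exact mean first passage time out of D: solve (I - P_D) u = 1 on D."""
    P_D = P[np.ix_(D, D)]
    return np.linalg.solve(np.eye(len(D)) - P_D, np.ones(len(D)))

def lstd_mfpt(P, D, M, rng):
    """Tabular LSTD from M independent one-step transitions started in D."""
    n = P.shape[0]
    D = np.asarray(D)
    starts = rng.choice(D, size=M)             # assumption: mu is uniform on D
    cum = np.cumsum(P, axis=1)
    cum /= cum[:, -1:]                         # make each row end at exactly 1.0
    draws = rng.random(M)
    nexts = (draws[:, None] > cum[starts]).sum(axis=1)   # inverse-CDF sampling
    C = np.zeros((n, n))
    np.add.at(C, (starts, nexts), 1.0)         # transition counts
    visits = C[D].sum(axis=1)
    # A = sum of phi(x) (phi(x) - phi(y))^T, with phi(y) = 0 once y has left D
    A = np.diag(visits) - C[np.ix_(D, D)]
    b = visits                                 # reward 1 per step spent in D
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n = 8
P = make_chain(n)
D = list(range(n - 1))                         # target set is the single state n-1
u = exact_mfpt(P, D)
u_hat = lstd_mfpt(P, D, 200_000, rng)
print("max relative error:", np.max(np.abs(u_hat - u) / u))
```

With indicator features the LSTD solve reduces to plugging the empirical transition matrix into the same linear system that defines the exact solution, which is why only short transitions, and no complete trajectories reaching the rare target, are needed.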

Paper Structure

This paper contains 12 sections, 5 theorems, 90 equations, and 5 figures.

Key Result

Lemma 1

For any consistent matrix norm, if the restriction of $S_D$ to row and column indices in $D$ is irreducible and aperiodic, then $\lVert (I - S_D^\tau)^{-1} \rVert \geq \mathbf{E}_\nu[T]/\tau$, where $\nu(i) = \lim_{t\rightarrow \infty} \mathbf{P}\left[ X_t = i\, |\, T>t\right]$ is the quasi-stationary distribution.
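The inequality in Lemma 1 can be checked numerically on a small example. The sketch below assumes that $S_D$ is the one-step transition matrix restricted to $D$, that $S_D^\tau$ is its $\tau$-th matrix power, and that $T$ is the first exit time from $D$; these readings, the example chain, and the function names are assumptions made for illustration. It uses the spectral norm, which is a consistent matrix norm.

```python
import numpy as np

def biased_walk_on_D(k, p_up=0.3):
    """Hypothetical substochastic matrix: biased walk killed on leaving D = {0..k-1}."""
    S = np.zeros((k, k))
    for i in range(k):
        S[i, max(i - 1, 0)] += 1.0 - p_up      # step down (reflect at 0)
        if i + 1 < k:
            S[i, i + 1] += p_up                # step up; from k-1 the walk exits D
    return S

def lemma_sides(S_D, tau):
    """Return (norm of (I - S_D^tau)^{-1}, E_nu[T] / tau, Perron eigenvalue)."""
    k = S_D.shape[0]
    evals, evecs = np.linalg.eig(S_D.T)        # left eigenvectors of S_D
    top = np.argmax(evals.real)
    lam = evals[top].real
    nu = np.abs(evecs[:, top].real)
    nu /= nu.sum()                             # quasi-stationary distribution
    # E_nu[T] = nu^T (I - S_D)^{-1} 1, with T counted in steps
    mean_exit = nu @ np.linalg.solve(np.eye(k) - S_D, np.ones(k))
    resolvent = np.linalg.inv(np.eye(k) - np.linalg.matrix_power(S_D, tau))
    return np.linalg.norm(resolvent, 2), mean_exit / tau, lam

S = biased_walk_on_D(8)
for tau in (1, 2, 5, 20):
    lhs, rhs, lam = lemma_sides(S, tau)
    print(tau, lhs, rhs, lhs >= rhs)
```

Under these assumptions the bound follows from two facts: started from $\nu$, the exit time is geometric, so $\mathbf{E}_\nu[T] = 1/(1-\lambda)$ with $\lambda$ the Perron eigenvalue of $S_D$; and any consistent norm of $(I - S_D^\tau)^{-1}$ is at least its spectral radius $1/(1-\lambda^\tau) \geq 1/(\tau(1-\lambda))$.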

Figures (5)

  • Figure 1: Left: the (exact) mean first passage time $u(i)$ with $n=20,\ 40,$ and $80$. Middle: the relative asymptotic variance (solid lines) and the relative empirical MSE (circles) of the LSTD estimator with $\tau=1$. The relative empirical MSEs are obtained with sample sizes $M = 10 n^3$. Right: the (exact) relative asymptotic MSE of the MC estimator.
  • Figure 2: Left: the (exact) committor function $u(i)$ with $n=20,\ 40,$ and $80$. Middle: the relative asymptotic variance (solid lines) and the relative empirical MSE (circles) of the LSTD estimator with $\tau=1$. The relative empirical MSEs are obtained with sample sizes $M = 10 n^3$. Right: the (exact) relative asymptotic MSE of the MC estimator.
  • Figure 3: The bound and the true value of the maximum relative asymptotic variance of the mean first passage time and the committor with varying lag time $\tau$. The number of states is fixed at $n=40$. The relative asymptotic variance bounds for the mean first passage time and the committor are from \ref{eq:avar1} and \ref{eq:avar1cor} respectively.
  • Figure 4: The relative asymptotic variance (solid lines) and the relative empirical MSE (circles) of the LSTD estimators of the mean first passage time and the committor, with $\mu$ being the invariant distribution $p$ conditioned within $D$. Note that the scale is logarithmic. The relative empirical MSEs are obtained with sample sizes $M = 10 n^3$. For $n=80$ the TD estimator fails with high probability and the empirical error is undefined.
  • Figure 5: Left: an undirected graph with edges colored according to the transition probabilities. An edge with darker color corresponds to a transition with higher probability. Right: after pruning edges with low transition probabilities, the minorizing graph stays connected.

Theorems & Definitions (14)

  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Proof of Theorem \ref{thm:avar2}
  • Proof
  • Proof of lower bound of $\|I - S_D\|_2$
  • Proof of \ref{eq: maxETscaling}
  • Proof of \ref{eq: committor2scaling}
  • Proof of \ref{eq:sigmaimain} in Theorem \ref{thm:clt}
  • ...and 4 more