The surprising efficiency of temporal difference learning for rare event prediction
Xiaoou Cheng, Jonathan Weare
TL;DR
The paper studies efficient policy evaluation for rare-event statistics in finite-state Markov reward processes. It proves a central limit theorem for the least-squares temporal difference (LSTD) estimator and derives a simple bound on its relative asymptotic variance that depends on connectivity quantities rather than worst-case conditioning. Under mild rare-event assumptions, the bound shows that LSTD needs only polynomially many observed transitions in the number of states $n$ (e.g., $\mathcal{O}(n^3)$) to reach a target relative accuracy, while the data requirements of the direct Monte Carlo (MC) estimator grow exponentially with the state-space size. Experiments on a multimodal chain illustrate large practical gains, and the results suggest broad applicability to trajectory-based TD methods for rare-event prediction, with potential extensions to continuous state spaces and online algorithms.
Abstract
We quantify the efficiency of temporal difference (TD) learning over the direct, or Monte Carlo (MC), estimator for policy evaluation in reinforcement learning, with an emphasis on estimation of quantities related to rare events. Policy evaluation is complicated in the rare event setting by the long timescale of the event and by the need for \emph{relative accuracy} in estimates of very small values. Specifically, we focus on least-squares TD (LSTD) prediction for finite-state Markov chains, and show that LSTD can achieve relative accuracy far more efficiently than MC. We prove a central limit theorem for the LSTD estimator and upper bound the \emph{relative asymptotic variance} by simple quantities characterizing the connectivity of states relative to the transition probabilities between them. Using this bound, we show that, even when both the timescale of the rare event and the relative error of the MC estimator are exponentially large in the number of states, LSTD maintains a fixed level of relative accuracy with a total number of observed transitions of the Markov chain that is only \emph{polynomially} large in the number of states.
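The contrast between LSTD and MC described above can be illustrated with a minimal sketch. It is not the authors' experiment: the birth-death chain, the parameter values, the one-step sampling scheme, and the function names (`make_chain`, `solve_hitting`, `lstd_tabular`, `mc_estimate`) are all assumptions made for illustration. The sketch uses the standard fact that, with tabular (one-hot) features, LSTD amounts to solving the Bellman equation with the empirical transition matrix. The rare-event quantity here is the probability of reaching the last state before returning to state 0.

```python
import numpy as np

def make_chain(n, p_up):
    """Hypothetical birth-death chain on 0..n-1; states 0 and n-1 absorb."""
    P = np.zeros((n, n))
    P[0, 0] = P[n - 1, n - 1] = 1.0
    for i in range(1, n - 1):
        P[i, i + 1] = p_up
        P[i, i - 1] = 1.0 - p_up
    return P

def solve_hitting(P):
    """Solve u = P u on interior states with u(0) = 0 and u(n-1) = 1."""
    n = P.shape[0]
    interior = np.arange(1, n - 1)
    A = np.eye(n - 2) - P[np.ix_(interior, interior)]
    b = P[interior, n - 1]  # one-step probability of reaching the target
    u = np.zeros(n)
    u[n - 1] = 1.0
    u[interior] = np.linalg.solve(A, b)
    return u

def lstd_tabular(P, m, rng):
    """Tabular LSTD from m one-step transitions per interior state
    (assumed sampling scheme): build the empirical transition matrix
    and solve the same linear system with it."""
    n = P.shape[0]
    P_hat = np.zeros((n, n))
    P_hat[0, 0] = P_hat[n - 1, n - 1] = 1.0
    for i in range(1, n - 1):
        nxt = rng.choice(n, size=m, p=P[i])
        P_hat[i] = np.bincount(nxt, minlength=n) / m
    return solve_hitting(P_hat)

def mc_estimate(P, start, n_traj, rng):
    """Direct MC: run full trajectories until absorption and count the
    fraction that reach n-1. Relies on the birth-death structure."""
    n = P.shape[0]
    hits, steps = 0, 0
    for _ in range(n_traj):
        x = start
        while 0 < x < n - 1:
            x += 1 if rng.random() < P[x, x + 1] else -1
            steps += 1
        hits += int(x == n - 1)
    return hits / n_traj, steps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = make_chain(15, 0.3)  # drift toward state 0 makes the event rare
    exact = solve_hitting(P)[1]
    lstd = lstd_tabular(P, 10_000, rng)[1]  # 13 * 10_000 transitions
    mc, mc_steps = mc_estimate(P, 1, 50_000, rng)
    print(f"exact={exact:.3e}  LSTD={lstd:.3e}  MC={mc:.3e}")
    print(f"MC used {mc_steps} transitions")
```

With these assumed settings the target probability is roughly $10^{-5}$, so the expected number of successful MC trajectories is below one and the MC estimate is often exactly zero, while the LSTD estimate is built from a comparable number of transitions spread over all interior states.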
