On the Statistical Benefits of Temporal Difference Learning
David Cheikhi, Daniel Russo
TL;DR
The paper analyzes the statistical efficiency of temporal difference (TD) learning versus direct Monte Carlo (MC) estimation in a batch setting with finite Markov reward processes (MRPs). It introduces two central quantities: the inverse trajectory pooling coefficient $C(s)$, which governs the gains TD achieves in value estimation, and the trajectory crossing time $H(s,s')$, which bounds TD's error when estimating the difference in value-to-go between two states and can be much smaller than the horizon. The authors prove asymptotic results linking TD's mean-squared error (MSE) to $C(s)$. They also show that TD's benefit for value-to-go differences arises because the errors at the two states are coupled through common crossing states, which effectively truncates the horizon for those estimates. A central limit theorem (CLT) analysis and empirical illustrations (e.g., layered MRPs) show when TD yields substantial improvements and how this depends on trajectory pooling and state representation. Together, these results clarify when and why TD is statistically advantageous in practice and highlight the crucial role of representation in preserving trajectory pooling.
Abstract
Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite-state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure, the problem's trajectory crossing time, which can be much smaller than the problem's time horizon.
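The contrast between direct estimation and TD can be made concrete with a minimal sketch on a toy layered MRP. Everything in it is an illustrative assumption rather than the paper's code: the three-state chain, the two hand-written trajectories, and the helper names `mc_values` and `td_values` are hypothetical. The sketch also relies on the standard tabular result that batch TD(0) converges to the value function of the empirical (certainty-equivalent) MRP, so the TD estimate is obtained by solving that model's Bellman equation instead of iterating updates.

```python
import numpy as np

def mc_values(trajectories, n_states):
    """First-visit Monte Carlo: average the observed return-to-go per state."""
    sums = np.zeros(n_states)
    counts = np.zeros(n_states)
    for traj in trajectories:
        rewards = [r for (_, r, _) in traj]
        seen = set()
        for t, (s, _, _) in enumerate(traj):
            if s in seen:
                continue
            seen.add(s)
            sums[s] += sum(rewards[t:])  # undiscounted return from time t
            counts[s] += 1
    return sums / np.maximum(counts, 1)  # unvisited states default to 0

def td_values(trajectories, n_states):
    """Batch TD(0) fixed point (assumed): solve the empirical MRP's Bellman equation."""
    trans = np.zeros((n_states, n_states))
    rew = np.zeros(n_states)
    counts = np.zeros(n_states)
    for traj in trajectories:
        for s, r, s_next in traj:
            counts[s] += 1
            rew[s] += r
            if s_next is not None:  # None marks termination
                trans[s, s_next] += 1
    visits = np.maximum(counts, 1)
    p_hat = trans / visits[:, None]  # empirical transition matrix
    r_hat = rew / visits             # empirical mean one-step reward
    return np.linalg.solve(np.eye(n_states) - p_hat, r_hat)

# Hypothetical data: states 0 and 1 both lead to the shared state 2,
# which terminates with a noisy reward. Each step is (state, reward, next_state).
trajectories = [
    [(0, 0.0, 2), (2, 1.0, None)],
    [(1, 0.0, 2), (2, 3.0, None)],
]

mc = mc_values(trajectories, 3)  # [1.0, 3.0, 2.0]
td = td_values(trajectories, 3)  # [2.0, 2.0, 2.0]
print("MC:", mc)
print("TD:", td)
```

In this toy dataset, MC estimates the value of state 0 from the single trajectory that starts there, while TD pools both trajectories through the shared state 2. The estimated difference in value between states 0 and 1 is -2 under MC but exactly 0 under TD, because both TD estimates inherit the same error from state 2. This is only a two-trajectory illustration of pooling and error coupling, not a reproduction of the paper's asymptotic results.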
