An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks
Zhifa Ke, Zaiwen Wen, Junyu Zhang
TL;DR
This work presents a non-asymptotic analysis of neural temporal-difference (TD) learning and Q-learning with multi-layer neural networks in reinforcement learning. By introducing a subspace decomposition that separates the range and kernel components of the feature-covariance operator $\Sigma_\pi$, the authors establish an $\tilde{\mathcal{O}}(\varepsilon^{-1})$ sample complexity under Markovian sampling, improving on the prior $\tilde{\mathcal{O}}(\varepsilon^{-2})$ bounds. The results apply to neural TD learning and neural Q-learning, and extend to minimax neural Q-learning in two-player zero-sum Markov games, under weaker regularity conditions than prior analyses. Experiments on OpenAI Gym tasks corroborate the theoretical findings, including width-dependent convergence and spectral properties of $\Sigma_\pi$. The paper contributes a versatile analytic framework with potential applications to broader neural-approximation RL algorithms, including actor-critic methods.
Abstract
Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed, and an improved $\tilde{\mathcal{O}}(\varepsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\varepsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\varepsilon^{-2})$ complexity in the existing literature.
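For readers unfamiliar with the setting, the following is a minimal illustrative sketch of semi-gradient TD(0) with a neural network under Markovian sampling, that is, with samples drawn along a single trajectory. It is not the algorithm analyzed in the paper: it assumes a two-layer ReLU network rather than a general $L$-layer one, estimates state values rather than action values, and omits any projection or regularization step the analysis may require. The toy three-state chain, the step size, and all function names (`init_params`, `value_and_grads`, `td0_step`, `run_neural_td`) are hypothetical choices made for illustration only.

```python
import numpy as np


def init_params(rng, state_dim, width):
    """Two-layer ReLU network f(s) = w2 . relu(W1 s) / sqrt(width) (assumed architecture)."""
    W1 = rng.normal(size=(width, state_dim))
    w2 = rng.choice([-1.0, 1.0], size=width)
    return W1, w2


def value_and_grads(params, s):
    """Return f(s) and its gradients with respect to (W1, w2)."""
    W1, w2 = params
    m = w2.shape[0]
    pre = W1 @ s
    hidden = np.maximum(pre, 0.0)
    value = w2 @ hidden / np.sqrt(m)
    active = (pre > 0).astype(float)  # ReLU derivative
    grad_W1 = np.outer(w2 * active, s) / np.sqrt(m)
    grad_w2 = hidden / np.sqrt(m)
    return value, (grad_W1, grad_w2)


def td0_step(params, s, r, s_next, gamma, lr):
    """One semi-gradient TD(0) update; the bootstrapped target is treated as a constant."""
    v, grads = value_and_grads(params, s)
    v_next, _ = value_and_grads(params, s_next)
    delta = r + gamma * v_next - v  # TD error
    new_params = tuple(p + lr * delta * g for p, g in zip(params, grads))
    return new_params, delta


def run_neural_td(num_steps=2000, width=64, gamma=0.9, lr=0.05, seed=0):
    """Run TD(0) along one trajectory of a toy 3-state Markov reward process (hypothetical)."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])  # transition matrix under a fixed policy
    rewards = np.array([0.0, 0.0, 1.0])
    features = np.eye(3)  # one-hot state encoding
    params = init_params(rng, 3, width)
    state = 0
    for _ in range(num_steps):
        # Markovian sampling: the next sample depends on the current state.
        next_state = rng.choice(3, p=P[state])
        params, _ = td0_step(params, features[state], rewards[state],
                             features[next_state], gamma, lr)
        state = next_state
    estimates = np.array([value_and_grads(params, features[i])[0] for i in range(3)])
    # Exact values solve the Bellman equation V = r + gamma * P V.
    exact = np.linalg.solve(np.eye(3) - gamma * P, rewards)
    return estimates, exact
```

Calling `estimates, exact = run_neural_td()` returns the learned value estimates alongside the exact solution of the Bellman equation for the toy chain, so the two can be compared as `num_steps` or `width` varies. With a constant step size the estimates generally fluctuate around the solution rather than converge to it exactly.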
