An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Zhifa Ke; Zaiwen Wen; Junyu Zhang

TL;DR

This work presents a non-asymptotic analysis of neural temporal-difference and Q-learning with multi-layer neural networks in reinforcement learning. By introducing a subspace decomposition that separates range and kernel components of the feature-covariance operator $\Sigma_\pi$, the authors establish an $\tilde{O}(\varepsilon^{-1})$ sample complexity under Markovian sampling, a significant improvement over prior $\tilde{O}(\varepsilon^{-2})$ bounds. The results apply to neural TD learning, neural Q-learning, and extend to minimax neural Q-learning in two-player zero-sum Markov games, under weaker regularity conditions. Experiments on OpenAI Gym tasks corroborate the theoretical findings, including width-dependent convergence and spectral properties of $\Sigma_\pi$. The paper contributes a versatile analytic framework with potential applications to broader neural-approximation RL algorithms, including actor-critic methods.

Abstract

Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\varepsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\varepsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\varepsilon^{-2})$ complexity in the existing literature.
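For readers who want a concrete picture of the algorithm being analyzed, the following is a minimal NumPy sketch of semi-gradient TD(0) with a multi-layer ReLU network, trained along a single Markovian trajectory. It is an illustration only, not the authors' algorithm: the layer widths, the He-style initialization, the five-state ring chain, the rewards, and the step size are all assumptions chosen for the example, and any projection step the paper may apply to the parameters is omitted.

```python
import numpy as np

def init_params(widths, rng):
    # He-style Gaussian initialization (an assumption; the paper's scaling may differ).
    return [rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def value_and_grads(params, x):
    """Scalar network output V(x) and its gradient w.r.t. every weight matrix."""
    acts = [x]
    for W in params[:-1]:
        acts.append(np.maximum(W @ acts[-1], 0.0))   # hidden ReLU layers
    value = (params[-1] @ acts[-1]).item()            # linear output layer
    grads = [None] * len(params)
    delta = np.ones(1)                                # dV/d(output) = 1
    for l in reversed(range(len(params))):
        grads[l] = np.outer(delta, acts[l])
        if l > 0:
            # Backpropagate through W_l and the ReLU that produced acts[l].
            delta = (params[l].T @ delta) * (acts[l] > 0)
    return value, grads

def td0_step(params, x, reward, x_next, gamma, lr):
    """One semi-gradient TD(0) update; returns new parameters and the TD error."""
    v, grads = value_and_grads(params, x)
    v_next, _ = value_and_grads(params, x_next)       # bootstrap target, not differentiated
    td_error = reward + gamma * v_next - v
    return [W + lr * td_error * g for W, g in zip(params, grads)], td_error

# Hypothetical toy problem: lazy random walk on a ring of 5 states.
rng = np.random.default_rng(0)
n_states, gamma, lr = 5, 0.9, 0.005
features = rng.standard_normal((n_states, 4))
features /= np.linalg.norm(features, axis=1, keepdims=True)
rewards = np.linspace(0.0, 1.0, n_states)
params = init_params([4, 16, 16, 1], rng)

state = 0
for _ in range(1000):
    # Markovian sampling: consecutive samples come from one trajectory.
    next_state = (state + rng.choice([-1, 0, 1])) % n_states
    params, _ = td0_step(params, features[state], rewards[state],
                         features[next_state], gamma, lr)
    state = next_state
```

The update differentiates only the current-state value and treats the bootstrap target as a constant, which is what makes this a semi-gradient method rather than gradient descent on a fixed loss.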

Paper Structure (20 sections, 16 theorems, 115 equations, 1 figure, 1 table, 3 algorithms)

Introduction
Preliminaries
Convergence of Neural Temporal Difference Learning
Basic Settings and Assumptions
An Improved Complexity of Neural TD Learning
Convergence of Minimax Neural Q-Learning
Experiments
Conclusion
Acknowledgements
Details of Section \ref{section:conv}
Proof of \ref{eq:subspace-remaining}
Proof of Theorem \ref{theorem:pi-f}
Proof of Theorem \ref{theorem:total}
Convergence Results of Neural Q-learning
Neural Q-Learning Algorithm
...and 5 more sections

Key Result

Proposition 3.4

Let $\mathcal{R}(\Sigma_\pi)$ and $\mathcal{K}(\Sigma_\pi)$ denote the range space and kernel space of the matrix $\Sigma_\pi$, respectively. Then for any parameter $\boldsymbol{\theta}\in S_\omega$, there exists $\boldsymbol{\theta}_*$ such that [displayed relation missing from the source], which also implies that the projections of $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_*$ onto the subspace $\mathcal{K}(\Sigma_\pi)$ are identical.
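The range/kernel split used in the proposition can be illustrated numerically. The sketch below assumes $\Sigma_\pi$ is a symmetric positive semidefinite matrix, so that its range and kernel are orthogonal complements; the matrix `sigma`, the feature rows `Phi`, and the vectors `theta` and `theta_star` are hypothetical stand-ins, not quantities from the paper. It checks the stated consequence: two parameters that differ only by a vector in $\mathcal{R}(\Sigma_\pi)$ have the same projection onto $\mathcal{K}(\Sigma_\pi)$.

```python
import numpy as np

def range_kernel_projectors(sigma, tol=1e-10):
    """Orthogonal projectors onto the range and kernel of a symmetric PSD matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    keep = eigvals > tol * max(eigvals.max(), 1.0)   # numerically non-zero eigenvalues
    U = eigvecs[:, keep]                             # orthonormal basis of the range
    P_range = U @ U.T
    P_kernel = np.eye(sigma.shape[0]) - P_range      # kernel = orthogonal complement
    return P_range, P_kernel

rng = np.random.default_rng(0)
d, r = 6, 3
Phi = rng.standard_normal((r, d))     # hypothetical feature rows; r < d forces a kernel
sigma = Phi.T @ Phi / r               # rank-deficient covariance-like matrix

P_range, P_kernel = range_kernel_projectors(sigma)
theta = rng.standard_normal(d)
# theta_star differs from theta only by a component inside the range of sigma.
theta_star = theta + P_range @ rng.standard_normal(d)

same_kernel_part = np.allclose(P_kernel @ theta, P_kernel @ theta_star)
```

Because the kernel component is untouched by anything that moves only within the range, any contraction argument has to be carried out on the range component alone, which is the role the decomposition plays in the TL;DR above.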

Figures (1)

Figure 1: Training curves and the ratio of the largest and smallest non-zero singular values of $\Sigma_\pi$ over different network widths $m$.
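The spectral quantity named in the caption, the ratio of the largest to the smallest non-zero singular value, can be computed as in the following sketch. The function name, the relative tolerance used to decide which singular values count as zero, and the example matrix are assumptions for illustration; the paper's own thresholding is not stated here.

```python
import numpy as np

def nonzero_singular_ratio(mat, rel_tol=1e-10):
    """Largest singular value divided by the smallest numerically non-zero one."""
    s = np.linalg.svd(mat, compute_uv=False)   # singular values in descending order
    if s[0] == 0.0:
        raise ValueError("matrix is zero; no non-zero singular values")
    nonzero = s[s > rel_tol * s[0]]
    return nonzero[0] / nonzero[-1]

sigma = np.diag([4.0, 1.0, 0.0])               # hypothetical rank-2 example
ratio = nonzero_singular_ratio(sigma)          # 4.0: the zero singular value is ignored
```

Unlike the ordinary condition number, this ratio stays finite for rank-deficient matrices, which is why it is the natural quantity to track when the matrix has a non-trivial kernel.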

Theorems & Definitions (30)

Proposition 3.4
proof
Proposition 3.5
Theorem 3.6
Theorem 3.7
Theorem 4.2
proof
Lemma 1.1
proof
Lemma 1.2
...and 20 more
