An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Zhifa Ke, Zaiwen Wen, Junyu Zhang

TL;DR

This work presents a non-asymptotic analysis of neural temporal-difference and Q-learning with multi-layer neural networks in reinforcement learning. By introducing a subspace decomposition that separates range and kernel components of the feature-covariance operator $\Sigma_\pi$, the authors establish an $\tilde{O}(\varepsilon^{-1})$ sample complexity under Markovian sampling, a significant improvement over prior $\tilde{O}(\varepsilon^{-2})$ bounds. The results apply to neural TD learning, neural Q-learning, and extend to minimax neural Q-learning in two-player zero-sum Markov games, under weaker regularity conditions. Experiments on OpenAI Gym tasks corroborate the theoretical findings, including width-dependent convergence and spectral properties of $\Sigma_\pi$. The paper contributes a versatile analytic framework with potential applications to broader neural-approximation RL algorithms, including actor-critic methods.

Abstract

Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\varepsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\varepsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\varepsilon^{-2})$ complexity in the existing literature.

Paper Structure

This paper contains 20 sections, 16 theorems, 115 equations, 1 figure, 1 table, and 3 algorithms.

Key Result

Proposition 3.4

Let $\mathcal{R}(\Sigma_\pi)$ and $\mathcal{K}(\Sigma_\pi)$ denote the range space and kernel space of the matrix $\Sigma_\pi$, respectively. Then for any parameter $\boldsymbol{\theta}\in S_\omega$, there exists $\boldsymbol{\theta}_*$ such that which also implies that the projections of $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_*$ onto the subspace $\mathcal{K}(\Sigma_\pi)$ are identical.
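
The closing remark about kernel projections can be read through the standard orthogonal decomposition of a symmetric positive semidefinite matrix. The block below is a sketch under that assumption (natural if $\Sigma_\pi$ is a covariance matrix, but not stated in the excerpt above); the dimension $d$, the projectors $P_{\mathcal{R}}$ and $P_{\mathcal{K}}$, and the pseudoinverse $\Sigma_\pi^{\dagger}$ are notation introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\mathbb{R}^d &= \mathcal{R}(\Sigma_\pi) \oplus \mathcal{K}(\Sigma_\pi),
  \qquad \mathcal{R}(\Sigma_\pi) \perp \mathcal{K}(\Sigma_\pi), \\
P_{\mathcal{R}} &= \Sigma_\pi \Sigma_\pi^{\dagger},
  \qquad P_{\mathcal{K}} = I - \Sigma_\pi \Sigma_\pi^{\dagger}, \\
\boldsymbol{\theta} &= P_{\mathcal{R}}\boldsymbol{\theta} + P_{\mathcal{K}}\boldsymbol{\theta}, \\
P_{\mathcal{K}}\boldsymbol{\theta} = P_{\mathcal{K}}\boldsymbol{\theta}_*
  &\iff \boldsymbol{\theta} - \boldsymbol{\theta}_* \in \mathcal{R}(\Sigma_\pi).
\end{aligned}
```

Under this reading, two parameters with identical kernel projections can differ only along directions in the range space of $\Sigma_\pi$.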

Figures (1)

  • Figure 1: Training curves and the ratio of the largest and smallest non-zero singular values of $\Sigma_\pi$ over different network widths $m$.
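
The ratio named in the Figure 1 caption can be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes NumPy, the helper names `empirical_covariance` and `nonzero_singular_ratio` are hypothetical, the relative tolerance used to decide which singular values count as non-zero is an arbitrary choice, and the random low-rank features merely stand in for whatever features define $\Sigma_\pi$ in the paper.

```python
import numpy as np


def empirical_covariance(features):
    """Return (1/n) * Phi^T Phi for an (n, d) feature matrix Phi.

    Stand-in for a feature-covariance matrix; the paper's Sigma_pi
    is defined by the authors and may be built differently.
    """
    features = np.asarray(features, dtype=float)
    return features.T @ features / features.shape[0]


def nonzero_singular_ratio(matrix, rel_tol=1e-10):
    """Ratio of the largest to the smallest non-zero singular value.

    Singular values at or below rel_tol * (largest singular value) are
    treated as zero. rel_tol is an assumed cutoff, not from the paper.
    """
    # np.linalg.svd returns singular values sorted in descending order.
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        raise ValueError("matrix has no non-zero singular values")
    nonzero = s[s > rel_tol * s[0]]
    return float(nonzero[0] / nonzero[-1])


if __name__ == "__main__":
    # Hypothetical rank-3 features in 8 dimensions, so the covariance
    # has a non-trivial kernel and the cutoff actually matters.
    rng = np.random.default_rng(0)
    features = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 8))
    sigma = empirical_covariance(features)
    print("ratio of extreme non-zero singular values:",
          nonzero_singular_ratio(sigma))
```

Restricting to non-zero singular values keeps the ratio finite when the matrix is rank deficient, which is the case where the range and kernel spaces are both non-trivial.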

Theorems & Definitions (30)

  • Proposition 3.4
  • proof
  • Proposition 3.5
  • Theorem 3.6
  • Theorem 3.7
  • Theorem 4.2
  • proof
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • ...and 20 more