
Gauss-Newton Temporal Difference Learning with Nonlinear Function Approximation

Zhifa Ke, Junyu Zhang, Zaiwen Wen

TL;DR

This work introduces Gauss-Newton Temporal Difference (GNTD) learning for Q-learning with nonlinear function approximation, addressing the double-sampling issue via target networks and a GN-based subproblem update. It establishes finite-time convergence results across linear, neural, and smooth function regimes, with improved sample complexities of $\tilde{O}(\varepsilon^{-1})$ for neural networks and $\tilde{O}(\varepsilon^{-1.5})$ for general smooth functions. The paper also presents an efficient neural implementation based on Kronecker-Factored Approximate Curvature (K-FAC) and Levenberg–Marquardt damping, termed GNTD-KFAC, with experiments showing faster convergence and higher rewards than TD-type baselines and DQN in online and offline RL settings. These results suggest that GN-based updates can improve sample efficiency and stability in nonlinear Q-function approximation, with practical benefits for policy evaluation, offline RL, and continuous control. Overall, GNTD contributes a theoretically grounded, computationally efficient alternative to FQI/TD methods in nonlinear RL.

Abstract

In this paper, a Gauss-Newton Temporal Difference (GNTD) learning method is proposed to solve the Q-learning problem with nonlinear function approximation. In each iteration, our method takes one Gauss-Newton (GN) step to optimize a variant of Mean-Squared Bellman Error (MSBE), where target networks are adopted to avoid double sampling. Inexact GN steps are analyzed so that one can safely and efficiently compute the GN updates by cheap matrix iterations. Under mild conditions, non-asymptotic finite-sample convergence to the globally optimal Q function is derived for various nonlinear function approximations. In particular, for neural network parameterization with ReLU activation, GNTD achieves an improved sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-1})$, as opposed to the $\mathcal{O}(\varepsilon^{-2})$ sample complexity of the existing neural TD methods. An $\tilde{\mathcal{O}}(\varepsilon^{-1.5})$ sample complexity of GNTD is also established for general smooth function approximations. We validate our method via extensive experiments in several RL benchmarks, where GNTD exhibits both higher rewards and faster convergence than TD-type methods.
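To make the idea of "one Gauss-Newton step on a target-network variant of the MSBE" concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a linear model $Q_\theta = \phi^\top\theta$ (so the Jacobian is just the feature matrix), a policy-evaluation setting with fixed next-state features, and a Levenberg-Marquardt style damping term. The function name `gn_td_step`, its parameters, and the toy 3-state chain are all hypothetical choices made for illustration.

```python
import numpy as np

def gn_td_step(theta, theta_target, phi, phi_next, rewards,
               gamma=0.9, beta=1.0, damping=1e-3):
    """One damped Gauss-Newton step on 0.5 * mean((Q_theta - y)^2).

    The bootstrap target y is built from the frozen parameters
    theta_target, so it is treated as a constant (no double sampling).
    Assumption: linear Q, so the Jacobian of Q w.r.t. theta equals phi.
    """
    n = phi.shape[0]
    y = rewards + gamma * phi_next @ theta_target   # frozen target values
    residual = phi @ theta - y                      # Bellman residual vs. target
    jac = phi                                       # Jacobian for a linear model
    grad = jac.T @ residual / n                     # gradient of the surrogate loss
    gn_matrix = jac.T @ jac / n + damping * np.eye(theta.size)
    # Solve the damped GN system instead of forming an explicit inverse.
    return theta - beta * np.linalg.solve(gn_matrix, grad)

# Hypothetical toy problem: deterministic 3-state cycle 0 -> 1 -> 2 -> 0,
# one-hot features, reward 1 when leaving state 0.
gamma = 0.9
phi = np.eye(3)
phi_next = np.roll(np.eye(3), -1, axis=0)  # row s holds the successor's feature
rewards = np.array([1.0, 0.0, 0.0])

theta = np.zeros(3)
for _ in range(300):
    # Sync the target parameters to the current ones before every step.
    theta = gn_td_step(theta, theta.copy(), phi, phi_next, rewards, gamma=gamma)

# Closed-form values of the chain, for comparison with the iterate.
exact = np.linalg.solve(np.eye(3) - gamma * phi_next, rewards)
```

In this toy setting the iterate `theta` can be compared against `exact`. For a neural network the Jacobian would no longer be constant, and the abstract indicates that the GN system is then solved only approximately.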

Paper Structure

This paper contains 21 sections, 17 theorems, 97 equations, 2 figures, 2 tables, 3 algorithms.

Key Result

Theorem 3.1

Suppose Assumptions assumption:stationary and assumption:cov-min-eig hold. For any $\varepsilon\ll \|Q^0-Q^*\|_\mu$, if we set $\beta=\frac{(1-\gamma)\lambda_0}{4}$, the damping rate $\omega\in(0,1)$ for each iteration, then the output $\theta^K$ of w.p. $1-\delta$, where $C_1>0$ is a given constant.

Figures (2)

  • Figure 1: Training curves over 5 random seeds on OpenAI Gym MuJoCo tasks. We use TD and GNTD to handle policy evaluation in the policy gradient (PG) algorithm, respectively. The shaded area captures the standard deviation at each iteration. "Time steps" refers to the number of occurrences in which various algorithms utilize samples of the same batch size.
  • Figure 2: Training curves over 5 random seeds on OpenAI Gym offline discrete tasks. The shaded area captures the standard deviation at each iteration.

Theorems & Definitions (30)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 4.1
  • Lemma 4.2
  • Proof 1
  • Lemma 4.3
  • Proof 2
  • Lemma 4.4
  • Proof 3
  • ...and 20 more