Towards Characterizing Divergence in Deep Q-Learning

Joshua Achiam, Ethan Knight, Pieter Abbeel

TL;DR

This work analyzes why deep Q-learning can diverge by examining the leading-order update under function approximation and bootstrapping, emphasizing the neural tangent kernel's role in stability. It introduces Preconditioned Q-Networks (PreQN), which precondition TD-errors using a minibatch estimate of the NTK to approximate non-expansive updates, potentially eliminating the need for target networks or multiple Q-functions. Empirical NTK analyses link network architecture and activation choice to stability, and PreQN demonstrates competitive performance on MuJoCo benchmarks, albeit with higher computational cost. The study also connects PreQN to natural-gradient Q-learning, offering theoretical and practical insights into stability for deep RL. Overall, the NTK-centered perspective provides a principled direction for architecture and algorithm design to mitigate divergence in DQL.
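The preconditioning idea in the summary above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a linear Q-function (so the gradient of Q at a sample is just its feature vector and the first-order picture is exact), solves the minibatch system `K Z = td_errors` directly, and all names (`features`, `solve`, `preconditioned_step`) are hypothetical.

```python
# Sketch: precondition TD errors with a minibatch kernel estimate.
# Assumption: linear Q(s, a) = theta . phi(s, a), so grad_theta Q = phi.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def preconditioned_step(theta, grads, td_errors, alpha):
    """theta' = theta + alpha * sum_i Z_i * grad_i, where K Z = td_errors."""
    # Minibatch kernel: inner products of per-sample Q-gradients.
    K = [[dot(gi, gj) for gj in grads] for gi in grads]
    Z = solve(K, td_errors)
    step = [alpha * sum(z * g[k] for z, g in zip(Z, grads))
            for k in range(len(theta))]
    return [t + s for t, s in zip(theta, step)]

features = [[1.0, 0.0, 0.5], [0.2, 1.0, 0.0]]  # two sampled (s, a) pairs
theta = [0.1, -0.3, 0.4]
td_errors = [1.0, -2.0]  # y_i - Q(s_i, a_i), assumed already computed
alpha = 0.5

q_before = [dot(theta, phi) for phi in features]
theta_new = preconditioned_step(theta, features, td_errors, alpha)
q_after = [dot(theta_new, phi) for phi in features]
q_change = [a - b for a, b in zip(q_after, q_before)]
```

In this linear setting each sampled Q-value moves by exactly `alpha` times its own TD error, with no cross-talk between samples. For a neural network the same statement would hold only to first order in the step size, and a practical version would likely need a least-squares or regularized solve when the kernel is ill-conditioned.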

Abstract

Deep Q-Learning (DQL), a family of temporal difference algorithms for control, employs three techniques collectively known as the "deadly triad" in reinforcement learning: bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point in our analysis is to consider when the leading-order approximation to the deep-Q update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions). We demonstrate that our algorithm performs above or near state-of-the-art on standard MuJoCo benchmarks from the OpenAI Gym.
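To make the "linear approximation to the Q-value updates" concrete, the sketch below applies one ordinary TD step on a single sample and compares the resulting change in every Q-value with the kernel prediction `alpha * td_error * K(j, i)`, where `K(j, i)` is the inner product of Q-gradients at points `j` and `i`. It assumes a linear Q-function, for which the prediction is exact rather than leading-order; the feature table and variable names are illustrative only.

```python
# Sketch: a TD step on one sample also moves Q at other state-action pairs,
# in proportion to the gradient inner product (the tangent kernel).
# Assumption: linear Q(s, a) = theta . phi(s, a), so grad_theta Q = phi.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td_step(theta, grad_i, td_error, alpha):
    """Plain semi-gradient update using a single sampled transition."""
    return [t + alpha * td_error * g for t, g in zip(theta, grad_i)]

features = {
    "sa0": [1.0, 0.0],  # the sampled pair
    "sa1": [0.8, 0.6],  # overlaps with sa0, so it is dragged along
    "sa2": [0.0, 1.0],  # orthogonal to sa0, so it is unaffected
}
theta = [0.5, -0.5]
alpha, td_error = 0.1, 2.0

theta_new = td_step(theta, features["sa0"], td_error, alpha)

actual_change = {}
predicted_change = {}
for name, phi in features.items():
    actual_change[name] = dot(theta_new, phi) - dot(theta, phi)
    predicted_change[name] = alpha * td_error * dot(phi, features["sa0"])
```

The off-diagonal kernel entries are what let an update at one point overshoot at another. When they are large relative to the diagonal, the combined update can fail to be a contraction, which is the situation the analysis is concerned with.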

Paper Structure

This paper contains 22 sections, 13 theorems, 41 equations, 14 figures, 1 algorithm.

Key Result

Theorem 1

For Q-learning with nonlinear function approximation based on the update in Eq deepq, when the state-action space is finite and the $Q$ function is represented as a vector in $\mathbb{R}^{|S||A|}$, the $Q$-values before and after an update are related by: where $K_{\theta}$ is the $|S||A| \times |S||A|$ matrix of entries given by Eq kernelcomponent, and $D_{\rho}$ is a diagonal matrix with ent

Figures (14)

  • Figure 1: Average row ratio for networks with 2 hidden layers of size 32 (small), 64 (med), 128 (large), and 256 (exlarge), using data from Walker2d-v2. Error bars are standard deviations from 3 random network initializations (with fixed data).
  • Figure 2: Benchmarking PreQN against TD3 and SAC on standard OpenAI Gym MuJoCo environments. Curves are averaged over 7 random seeds. PreQN is stable and performant, despite not using target networks. The PreQN experiments used sin activations; the TD3 and SAC experiments used relu activations.
  • Figure 3: Examining the cosine alignment of actual $Q$-value change with intended $Q$-value change ($\cos(Q'-Q, y -Q)$) for PreQN and TD3 with relu and sin activations. Curves are averaged over 3 random seeds.
  • Figure 4: NTK analysis for randomly-initialized networks with various activation functions, where the NTKs were formed using 1000 steps taken by a rails-random policy in the Ant-v2 gym environment (with the same data used across all trials). Networks are MLPs with widths of $32, 64, 128, 256$ hidden units (small, med, large, exlarge respectively) and $2$ hidden layers. Each bar is the average over 3 random trials (different network initializations).
  • Figure 5: NTK analysis for randomly-initialized networks with various activation functions, where the NTKs were formed using 1000 steps taken by a rails-random policy in the HalfCheetah-v2 gym environment (with the same data used across all trials). Networks are MLPs with widths of $32, 64, 128, 256$ hidden units (small, med, large, exlarge respectively) and $2$ hidden layers. Each bar is the average over 3 random trials (different network initializations).
  • ...and 9 more figures
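The alignment measure named in the Figure 3 caption, $\cos(Q'-Q, y-Q)$, can be computed as follows. This is a small sketch: the function name and the zero-norm convention (returning 0.0) are assumptions made here, not taken from the paper.

```python
import math

def cosine_alignment(q_before, q_after, targets):
    """Cosine of the angle between actual change (Q' - Q) and intended change (y - Q)."""
    actual = [a - b for a, b in zip(q_after, q_before)]
    intended = [y - b for y, b in zip(targets, q_before)]
    num = sum(x * y for x, y in zip(actual, intended))
    den = (math.sqrt(sum(x * x for x in actual))
           * math.sqrt(sum(y * y for y in intended)))
    # Convention assumed here: report 0.0 when either vector is all zeros.
    return num / den if den > 0 else 0.0

q = [0.0, 0.0]
y = [1.0, 2.0]
aligned = cosine_alignment(q, [0.1, 0.2], y)      # moves straight toward y
opposed = cosine_alignment(q, [-0.1, -0.2], y)    # moves directly away from y
orthogonal = cosine_alignment(q, [0.2, -0.1], y)  # moves sideways
```

A value near 1 means the update moved the sampled Q-values in the direction of their targets; values near 0 or below indicate that generalization across samples redirected the update.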

Theorems & Definitions (19)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 2
  • Theorem 3
  • Lemma 4
  • Lemma 4
  • proof
  • Lemma 4
  • ...and 9 more