Towards Characterizing Divergence in Deep Q-Learning
Joshua Achiam, Ethan Knight, Pieter Abbeel
TL;DR
This work analyzes why deep Q-learning can diverge by examining the leading-order update under function approximation and bootstrapping, emphasizing the neural tangent kernel's role in stability. It introduces Preconditioned Q-Networks (PreQN), which precondition TD-errors using a minibatch estimate of the NTK to approximate non-expansive updates, potentially eliminating the need for target networks or multiple Q-functions. Empirical NTK analyses link network architecture and activation choice to stability, and PreQN demonstrates competitive performance on MuJoCo benchmarks, albeit with higher computational cost. The study also connects PreQN to natural-gradient Q-learning, offering theoretical and practical insights into stability for deep RL. Overall, the NTK-centered perspective provides a principled direction for architecture and algorithm design to mitigate divergence in DQL.
Abstract
Deep Q-Learning (DQL), a family of temporal difference algorithms for control, employs three techniques collectively known as the "deadly triad" in reinforcement learning: bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point of our analysis is to consider when the leading-order approximation to the deep-Q update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm that permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or multiple Q-functions). We demonstrate that our algorithm performs above or near state-of-the-art on standard MuJoCo benchmarks from the OpenAI Gym.
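To make the contraction question concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a finite set of state-action pairs and writes the leading-order update as Q' ≈ Q + αKD(TQ − Q), where K is the Gram matrix of parameter gradients (the empirical NTK) and D holds the sampling distribution; this form is an assumption of the sketch rather than a quotation from the text above. As a further simplification, the backup T is a fixed-policy operator TQ = r + γPQ, so the update is linear in Q and its sup-norm contraction factor is the largest absolute row sum of its Jacobian. All function and variable names are hypothetical.

```python
import numpy as np

def empirical_ntk(grads):
    # grads: (n, d) array; row i is the parameter gradient of Q at pair i.
    # The empirical NTK is the Gram matrix of these gradients.
    return grads @ grads.T

def leading_order_jacobian(K, rho, P, gamma, alpha):
    # Jacobian of Q -> Q + alpha * K D (r + gamma P Q - Q) with respect to Q.
    # Assumes a fixed-policy (linear) backup; rho is the sampling distribution.
    n = K.shape[0]
    return np.eye(n) - alpha * K @ np.diag(rho) @ (np.eye(n) - gamma * P)

def sup_norm(M):
    # Operator norm induced by the sup norm: maximum absolute row sum.
    return np.abs(M).sum(axis=1).max()

rng = np.random.default_rng(0)
n, gamma, alpha = 5, 0.9, 0.5

# Random row-stochastic transition matrix and uniform sampling distribution.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
rho = np.full(n, 1.0 / n)

# Tabular case: one parameter per pair, so gradients are one-hot and K = I.
K_tabular = empirical_ntk(np.eye(n))
norm_tabular = sup_norm(leading_order_jacobian(K_tabular, rho, P, gamma, alpha))

# Aggressive generalization: every pair has the same gradient, so K is all ones.
K_dense = empirical_ntk(np.ones((n, 1)))
norm_dense = sup_norm(leading_order_jacobian(K_dense, rho, P, gamma, alpha))

print("tabular sup-norm factor:", norm_tabular)  # below 1: a contraction
print("dense   sup-norm factor:", norm_dense)    # above 1: contraction not guaranteed
```

In the tabular case the factor works out to 1 − αρ(1 − γ) = 0.99, so the linearized update contracts. With the all-ones kernel the factor exceeds 1, which means only that this sufficient condition for stability fails, not that divergence must occur.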
