An Analysis of Quantile Temporal-Difference Learning
Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney
TL;DR
This work provides the first convergence analysis of quantile temporal-difference learning (QTD). It casts QTD as a stochastic approximation to a differential inclusion and links its asymptotics to the fixed points of a family of quantile dynamic programming (QDP) procedures. Under mild conditions, QTD converges almost surely to the set of fixed points of the projected distributional Bellman operators $\boldsymbol{\Pi}^{\boldsymbol{\lambda}} \boldsymbol{T}^{\rho}$, operators that contract in Wasserstein-like metrics. The authors develop a Lyapunov framework and use Marchaud differential inclusions to handle the non-smooth dynamics arising from quantile-based updates, proving boundedness, existence of solutions, and convergence to fixed points. They also derive instance-dependent bounds on fixed-point quality, analyze qualitative artifacts via back-up diagrams, and extend the results to asynchronous QTD, offering theoretical grounding for QTD's empirical successes. Together, these results advance the understanding of distributional RL with quantile representations and inform practical design choices (e.g., number of quantiles, projection schemes).
Abstract
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
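To make the object of study concrete, the following is a minimal sketch of a tabular QTD update for a single sampled transition. It assumes the standard form of the update, with $m$ quantile estimates per state at levels $\tau_i = (2i-1)/(2m)$; the function name `qtd_update`, the list-of-lists table `theta`, and the default values of `gamma` and `alpha` are illustrative choices, not taken from the paper.

```python
def qtd_update(theta, x, r, x_next, gamma=0.9, alpha=0.1):
    """Apply one QTD update at state x from a sampled transition (x, r, x_next).

    theta[s][i] is the current estimate of the tau_i-quantile of the return
    from state s, with tau_i = (2i + 1) / (2m) for i = 0, ..., m - 1
    (zero-based indexing of the usual (2i - 1) / (2m) levels).
    """
    m = len(theta[x])
    # Bootstrapped targets built from the quantile estimates at the next state.
    targets = [r + gamma * theta[x_next][j] for j in range(m)]
    new_row = []
    for i in range(m):
        tau_i = (2 * i + 1) / (2 * m)
        # Fraction of targets lying strictly below the current estimate.
        below = sum(1 for t in targets if t < theta[x][i]) / m
        # Quantile-regression step: move up with weight tau_i, down with 1 - tau_i.
        new_row.append(theta[x][i] + alpha * (tau_i - below))
    theta[x] = new_row
    return theta


# Usage: two states, m = 2 quantiles each; state 1 acts as a terminal state.
theta = [[0.0, 0.0], [0.0, 0.0]]
theta = qtd_update(theta, x=0, r=1.0, x_next=1)
```

In this sketch each estimate moves by at most `alpha` per step, whatever the size of the reward, and the indicator `t < theta[x][i]` makes the update discontinuous in `theta`. That discontinuity is the kind of non-smoothness that motivates the differential-inclusion analysis mentioned in the abstract.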
