An Analysis of Quantile Temporal-Difference Learning

Mark Rowland; Rémi Munos; Mohammad Gheshlaghi Azar; Yunhao Tang; Georg Ostrovski; Anna Harutyunyan; Karl Tuyls; Marc G. Bellemare; Will Dabney

TL;DR

This work provides the first convergence analysis of quantile temporal-difference learning (QTD). It casts QTD as a stochastic approximation to a differential inclusion and links its asymptotic behavior to the fixed points of a family of quantile dynamic programming (QDP) procedures. Under mild conditions, QTD converges almost surely to the set of fixed points of the projected distributional Bellman operators $\Pi^{\lambda} \mathcal{T}^{\pi}$, which are contractions in Wasserstein-type metrics. The authors develop a Lyapunov framework and use Marchaud differential inclusions to handle the non-smooth dynamics arising from quantile-based updates, proving boundedness, existence of solutions, and convergence to fixed points. They also derive instance-dependent bounds on fixed-point quality, analyze qualitative approximation artifacts via back-up diagrams, and extend the results to asynchronous QTD, giving theoretical grounding for QTD’s empirical successes. These results inform practical design choices, such as the number of quantiles and the projection scheme.

Abstract

We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
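The abstract refers to QTD updates without writing one out. The following is a minimal sketch of a single tabular QTD update as it is commonly stated, not the authors' code. The function name `qtd_update`, the list-based representation, and the midpoint quantile levels $\tau_i = (2i-1)/(2m)$ are assumptions made for illustration.

```python
def qtd_update(theta_x, theta_next, reward, gamma, alpha):
    """One QTD update of the m quantile estimates at a state x (sketch).

    theta_x:    current quantile estimates at x (list of m floats)
    theta_next: quantile estimates at the sampled next state x'
    reward:     sampled reward for the transition x -> x'
    """
    m = len(theta_x)
    new_theta = []
    for i, theta_i in enumerate(theta_x):
        # Assumed midpoint quantile levels: 1/(2m), 3/(2m), ..., (2m-1)/(2m).
        tau_i = (2 * i + 1) / (2 * m)
        # Count the target atoms r + gamma * theta(x', j) lying below theta_i.
        below = sum(1 for z in theta_next if reward + gamma * z < theta_i)
        # Averaged quantile-regression step: increments are bounded by alpha
        # and depend on the targets only through indicator functions.
        new_theta.append(theta_i + alpha * (tau_i - below / m))
    return new_theta


# Usage: two quantiles, with all target atoms equal to 1.0.
updated = qtd_update([0.0, 2.0], [0.0, 0.0], reward=1.0, gamma=0.9, alpha=0.1)
print(updated)  # the low estimate moves up, the high estimate moves down
```

Because the step depends on the targets only through indicator functions, the expected update is discontinuous in the estimates, which is the non-smoothness the differential-inclusion analysis is designed to handle.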

Paper Structure (37 sections, 14 theorems, 97 equations, 5 figures, 4 algorithms)

Introduction
Background
Markov Decision Processes
Predicting Expected Returns and the Return Distribution
Monte Carlo and Temporal-Difference Learning
Quantile Temporal-Difference Learning and Quantile Dynamic Programming
Quantile Regression
Quantile Temporal-Difference Learning
Motivating Examples
Quantile Dynamic Programming
Convergence of Quantile Dynamic Programming
Convergence Analysis
Convergence of Quantile Temporal-Difference Learning
The QTD Differential Equation
The QTD Differential Inclusion
...and 22 more sections

Key Result

Proposition 5

The distributional Bellman operator $\mathcal{T}^\pi : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \rightarrow \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is a $\gamma$-contraction with respect to $\bar{w}_\infty$. That is, for all $\eta, \eta' \in \mathscr{P}(\mathbb{R})^\mathcal{X}$, $\bar{w}_\infty(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \leq \gamma \, \bar{w}_\infty(\eta, \eta')$.
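The contraction property can be checked numerically in a toy setting. The sketch below assumes a deterministic two-state MDP with deterministic rewards and equal-weight atomic distributions, for which $w_\infty$ reduces to the largest gap between matched order statistics and $\bar{w}_\infty$ is the maximum of $w_\infty$ over states. All names (`w_inf`, `w_inf_bar`, `bellman_backup`) and the example numbers are hypothetical.

```python
def w_inf(atoms_a, atoms_b):
    """Wasserstein-infinity distance between two equal-weight atomic
    distributions with the same number of atoms."""
    return max(abs(a - b) for a, b in zip(sorted(atoms_a), sorted(atoms_b)))


def w_inf_bar(eta_a, eta_b):
    """Supremum over states of the per-state w_inf distance."""
    return max(w_inf(eta_a[x], eta_b[x]) for x in eta_a)


def bellman_backup(eta, reward, next_state, gamma):
    """Apply T^pi in a deterministic MDP: the return distribution at x is
    the distribution at next_state[x], scaled by gamma and shifted by reward[x]."""
    return {x: [reward[x] + gamma * z for z in eta[next_state[x]]] for x in eta}


gamma = 0.9
reward = {0: 1.0, 1: -2.0}
next_state = {0: 1, 1: 0}
eta_a = {0: [0.0, 1.0, 4.0], 1: [-1.0, 2.0, 3.0]}
eta_b = {0: [0.5, 1.0, 2.0], 1: [0.0, 0.0, 5.0]}

before = w_inf_bar(eta_a, eta_b)
after = w_inf_bar(
    bellman_backup(eta_a, reward, next_state, gamma),
    bellman_backup(eta_b, reward, next_state, gamma),
)
print(before, after)  # after should not exceed gamma * before
```

In this deterministic example the backup scales every matched gap by exactly $\gamma$, so the bound holds with equality; with stochastic transitions the inequality can be strict.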

Figures (5)

Figure 1: The three distinct scenarios that arise in defining quantiles. Firstly, there is a value $z_1$ for which $F_\nu(z_1)=\tau_1$ and at which $F_\nu$ is strictly increasing. Therefore $z_1$ is the unique $\tau_1$-quantile of $\nu$. Next, there is an interval $[z_2, z_2']$ on which $F_\nu$ equals $\tau_2$, therefore all elements in this interval are $\tau_2$-quantiles of $\nu$. Finally, there is no value $z$ such that $F_\nu(z) = \tau_3$, and the unique $\tau_3$-quantile is therefore defined by the infimum part of the definition.
Figure 2: Top: A chain MDP with four states. Each transition yields a normally-distributed reward; from $x_3$, the episode ends. The discount factor is $\gamma = 0.9$. Centre-top: The progress of QTD, run with $m=5$ quantiles, over the course of 10,000 updates. The vertical axis corresponds to the predicted quantile values. Centre-bottom: The true CDF of the return distribution (blue) at each state, along with the final estimate produced by QTD (black), and the approximation produced by the quantiles of the return distribution (grey). Bottom: The PDF of the return distribution (blue) at each state, along with the final quantile approximation produced by QTD (black).
Figure 3: Top left: The example Markov decision process described in Example \ref{ex:2d}. Top right: Example dynamics of QTD with $m=1$ in this environment, when reward distributions are Gaussian. Also included are the directions of expected update, in blue. Bottom left: Example dynamics and expected update directions when reward distributions are Dirac deltas. Bottom right: Example dynamics and expected updates with modified environment transition probabilities.
Figure 4: Top left: Illustration of QDP (dashed purple) and QTD (solid red) on the first MDP from Example \ref{ex:2d}, with Gaussian rewards. Top right: Illustration of QDP and QTD on the second MDP from Example \ref{ex:2d}, with deterministic rewards. Bottom: Values of $\lambda$ and corresponding fixed points of QDP in the final MDP from Example \ref{ex:2d}.
Figure 5: Left: An example MDP. Centre: The fixed point return distribution estimates for state $x_1$ obtained by QDP for $m=2,5,20,100$ (solid purple, dotted blue, dashed green, and dash-dotted orange, respectively) compared to ground truth in solid black. Right: The corresponding local quantile backup diagram at the fixed point for $m=2$, illustrating potential approximation artefacts in QDP fixed points.
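The quantile scenarios described in the caption of Figure 1 can be illustrated on a discrete distribution. The sketch below assumes the usual generalised-inverse definitions, with lower endpoint $\inf\{z : F_\nu(z) \geq \tau\}$ and upper endpoint $\inf\{z : F_\nu(z) > \tau\}$, and requires $0 < \tau < 1$. The helper `quantile_interval` and the example distribution are hypothetical. A discrete distribution only exhibits the flat-CDF and jump scenarios, since its CDF is never strictly increasing.

```python
from fractions import Fraction


def quantile_interval(atoms, probs, tau):
    """Return (lower, upper) such that every z in [lower, upper] is a
    tau-quantile of the discrete distribution, for 0 < tau < 1.

    lower = inf{z : F(z) >= tau}, upper = inf{z : F(z) > tau}.
    """
    lower = upper = None
    cumulative = 0
    for z, p in sorted(zip(atoms, probs)):
        cumulative += p
        if lower is None and cumulative >= tau:
            lower = z
        if cumulative > tau:
            upper = z
            break
    return lower, upper


# Exact arithmetic avoids rounding issues when comparing the CDF with tau.
atoms = [0, 1, 3]
probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

# The CDF is flat at 1/2 between the atoms 1 and 3: an interval of quantiles.
print(quantile_interval(atoms, probs, Fraction(1, 2)))
# The CDF jumps over 1/3 at the atom 1: a unique quantile given by the infimum.
print(quantile_interval(atoms, probs, Fraction(1, 3)))
```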

Theorems & Definitions (25)

Definition 1
Example 2
Example 3
Remark 4
Proposition 5
Proposition 5
Proposition 6
Theorem 7
Remark 9
Definition 10
...and 15 more
