A Finite Sample Analysis of Distributional TD Learning with Linear Function Approximation

Yang Peng, Kaicheng Jin, Liangyu Zhang, Zhihua Zhang

TL;DR

The paper addresses distributional policy evaluation with linear function approximation and proves finite-sample, non-asymptotic rates for a linear-categorical TD algorithm (Linear-CTD). By formulating a linear-categorical projected Bellman equation and applying exponential stability arguments to products of random matrices, it shows that learning the full return distribution is statistically as tractable as learning its mean under linear approximation. Theoretical results include instance-dependent and instance-independent step-size bounds, high-probability guarantees for Markovian data, and a mean-preserving property of Linear-CTD, with a preconditioning technique that removes the $K$-dependence in the sample complexity. Empirical validation corroborates convergence and demonstrates Linear-CTD’s advantages over baseline PMF-based methods, especially as the number of categorical supports $K$ grows. Overall, the work bridges a gap in distributional RL by matching the non-asymptotic efficiency of classic TD learning in the linear-function setting and outlining avenues for further improvement via variance reduction.

Abstract

In this paper, we study the finite-sample statistical rates of distributional temporal difference (TD) learning with linear function approximation. The aim of distributional TD learning is to estimate the return distribution of a discounted Markov decision process for a given policy π. Previous works on statistical analysis of distributional TD learning mainly focus on the tabular case. In contrast, we first consider the linear function approximation setting and derive sharp finite-sample rates. Our theoretical results demonstrate that the sample complexity of linear distributional TD learning matches that of classic linear TD learning. This implies that, with linear function approximation, learning the full distribution of the return from streaming data is no more difficult than learning its expectation (value function). To derive tight sample complexity bounds, we conduct a fine-grained analysis of the linear-categorical Bellman equation and employ the exponential stability arguments for products of random matrices. Our results provide new insights into the statistical efficiency of distributional reinforcement learning algorithms.
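For orientation, the classic linear TD learning baseline that the abstract compares against can be sketched as follows. This is standard linear TD(0) for the value function, not the paper's Linear-CTD update. The 3-state chain, the one-hot features, the step size, and the tail-averaging choice are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def linear_td0(P, r, Phi, gamma, alpha, n_steps, seed=0):
    """Classic linear TD(0) on a Markov reward process with Markovian samples.

    P:   (S, S) transition matrix induced by the evaluated policy
    r:   (S,)   expected one-step rewards
    Phi: (S, d) feature matrix, row s is phi(s)
    Returns the average of the iterates over the second half of the run
    (tail averaging is an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_states, d = Phi.shape
    theta = np.zeros(d)
    tail_sum = np.zeros(d)
    burn_in = n_steps // 2
    s = 0
    for t in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # TD error: r(s) + gamma * phi(s')^T theta - phi(s)^T theta
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * td_error * Phi[s]
        if t >= burn_in:
            tail_sum += theta
        s = s_next
    return tail_sum / (n_steps - burn_in)

# Hypothetical 3-state chain, used only to exercise the update.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([0.0, 0.5, 1.0])
gamma = 0.9
Phi = np.eye(3)  # one-hot features, so the linear fit can be exact

theta_hat = linear_td0(P, r, Phi, gamma, alpha=0.02, n_steps=60_000)
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)  # exact value function
print(np.round(theta_hat, 2), np.round(v_true, 2))
```

With one-hot features the fixed point of the update is the exact value function, which gives a simple sanity check on the estimate.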

Paper Structure

This paper contains 61 sections, 38 theorems, 183 equations, 5 figures, 2 tables.

Key Result

Proposition 3.1

For any $\bm{\eta}\in({\mathscr{P}}^{\mathsf{sign}})^{{\mathcal{S}}}$, ${\bm{\Pi}}_{{\bm{\phi}}, K}^{\pi}\bm{\eta}$ is uniquely given by $\bm{\eta}_{\tilde{{\bm{\theta}}}}$, where $\tilde{{\bm{\theta}}}=\mathsf{vec}(\tilde{{\bm{\Theta}}})$. Here ${\bm{p}}_{\bm{\eta}}(s):={\bm{p}}_{\eta(s)}=(p_k(\eta(s)))_{k=0}^{K-1}$ is the vector that identifies ${\bm{\Pi}}_K\eta(s)$ defined in Eqn. eq
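The categorical projection ${\bm{\Pi}}_K$ referenced above can be illustrated with the standard construction from categorical distributional RL: each point mass is split linearly between its two neighbouring atoms. The sketch below assumes $K+1$ evenly spaced atoms $x_0,\dots,x_K$ and a finitely supported input distribution. The function name, the atom grid, and the example values are hypothetical and not taken from the paper.

```python
import numpy as np

def categorical_projection(values, probs, atoms):
    """Project sum_i probs[i] * delta(values[i]) onto evenly spaced atoms.

    Each point mass is shared between the two nearest atoms in proportion
    to its distance from them. Values outside the grid are clipped.
    """
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    values = np.clip(np.asarray(values, dtype=float), atoms[0], atoms[-1])
    gap = atoms[1] - atoms[0]
    pos = (values - atoms[0]) / gap             # fractional atom index
    lower = np.floor(pos).astype(int)
    lower = np.minimum(lower, len(atoms) - 2)   # keep lower + 1 in range
    w_upper = pos - lower                       # weight sent to the upper atom
    out = np.zeros(len(atoms))
    np.add.at(out, lower, probs * (1.0 - w_upper))
    np.add.at(out, lower + 1, probs * w_upper)
    return out

# Hypothetical grid with K = 10, i.e. atoms 0, 1, ..., 10.
atoms = np.linspace(0.0, 10.0, 11)
values = np.array([2.5, 7.2])
probs = np.array([0.4, 0.6])
p_full = categorical_projection(values, probs, atoms)
p_vec = p_full[:-1]  # first K masses; the last one is 1 - p_vec.sum()
print(np.round(p_full, 3))
```

Because the masses sum to one, the first $K$ entries determine the projected distribution, which is consistent with the index range $k=0,\dots,K-1$ in the vector above. For inputs inside the grid the projection also leaves the mean unchanged.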

Figures (5)

  • Figure 1: Convergence results under varying $K$ for our ${\texttt{Linear-CTD}}$ algorithm with step size $\alpha = 0.01$. These curves exhibit similar trends, demonstrating our algorithm's robustness across different $K$ values.
  • Figure 2: Convergence results under varying $K$ for the baseline algorithm, SSGD with the PMF representation, with step size $\alpha = 0.01$. We remark that when $K = 45$, the program reports errors of inf and nan. In contrast to the results of ${\texttt{Linear-CTD}}$ in Figure 1, the baseline algorithm no longer converges when $K$ is large ($K\geq 44$).
  • Figure 3: Convergence results with different step sizes for the baseline algorithm, SSGD with the PMF representation under fixed $K = 150$. We remark that when we take $\alpha = 8.8\mathrm{e}{-}4$, the program reports errors of inf and nan. The baseline algorithm converges when the step size does not exceed $8.6\mathrm{e}{-}4$, and it does not converge when the step size exceeds $8.7\mathrm{e}{-}4$. Therefore, $\alpha_\infty^{\operatorname{PMF},(150)}\in[8.6\mathrm{e}{-}4, 8.7\mathrm{e}{-}4]$ in this environment.
  • Figure 4: Approximate values of the inverse maximum step size $1/\alpha_\infty^{\operatorname{PMF},(K)}$ under varying $K$. Here we take the average of the upper and lower bounds of $\alpha_\infty^{\operatorname{PMF},(K)}$ provided in Table \ref{table:pmf_K_alpha} as an approximation of $\alpha_\infty^{\operatorname{PMF},(K)}$ and perform quadratic regression of $1/\alpha_\infty^{\operatorname{PMF},(K)}$ on $K$. This fit achieves a mean squared error of $425.85$ and $R^2$ of $0.99996$, which indicates that $1/\alpha_\infty^{\operatorname{PMF},(K)}$ indeed grows quadratically with respect to $K$, aligning with our theoretical results (Lemma \ref{lem:pmf_convergence}).
  • Figure 5: LHS and RHS of Eqn. \ref{eq:eq_in_lem_loss_bound_PMF} in Lemma \ref{lem:loss_bound_PMF} under varying $K$. The left sub-graph corresponds to the LHS, and the right sub-graph corresponds to the RHS. We omit the constants that are independent of $K$. The LHS remains almost unchanged across different $K$, whereas the RHS increases as $K$ becomes larger, indicating that the stretching coefficient of the matrix ${\bm{C}}{\bm{C}}^\top$, which we frequently encounter during the iterative process, grows with $K$ rather than remaining of constant order.
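The fitting procedure described in the Figure 4 caption (bracket midpoint, then quadratic regression of the inverse step size on $K$) can be sketched as below. The bracket values are synthetic placeholders generated from a made-up quadratic, since the paper's table is not reproduced here. The helper name and all numbers are assumptions for illustration only.

```python
import numpy as np

def fit_inverse_step_size(K, alpha_lo, alpha_hi):
    """Quadratic regression of 1/alpha on K, with alpha the bracket midpoint.

    Returns the polynomial coefficients (highest degree first), the mean
    squared error of the fit, and its R^2.
    """
    K = np.asarray(K, dtype=float)
    alpha_mid = (np.asarray(alpha_lo, dtype=float)
                 + np.asarray(alpha_hi, dtype=float)) / 2.0
    y = 1.0 / alpha_mid
    coeffs = np.polyfit(K, y, deg=2)
    resid = y - np.polyval(coeffs, K)
    mse = np.mean(resid ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return coeffs, mse, r2

# Synthetic placeholder brackets (NOT the paper's measurements): the true
# inverse step size is assumed to be 0.05 * K^2 + 2, bracketed within 1%.
K_grid = np.array([25, 50, 75, 100, 125, 150])
inv_true = 0.05 * K_grid ** 2 + 2.0
alpha_lo = 1.0 / (1.01 * inv_true)
alpha_hi = 1.0 / (0.99 * inv_true)

coeffs, mse, r2 = fit_inverse_step_size(K_grid, alpha_lo, alpha_hi)
print(coeffs, mse, r2)
```

A leading coefficient that is clearly positive together with an $R^2$ close to one is the kind of evidence the caption cites for quadratic growth. The quality of the fit on real measurements depends on how tight the step-size brackets are.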

Theorems & Definitions (66)

  • Proposition 3.1
  • Theorem 3.1
  • Proposition 3.2: Approximation Error of $\bm{\eta}_{{\bm{\theta}}^{\star}}$
  • Remark 1: Comparison with Existing Linear Distributional TD Learning Algorithms
  • Remark 2: Linear-CTD is mean-preserving
  • Theorem 4.1: $L^2$ Convergence
  • Remark 3: Comparison with Convergence Rate of Linear-TD
  • Corollary 4.1
  • Theorem 4.2: High-Probability Convergence
  • Remark 4: Markovian Setting
  • ...and 56 more