Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

TL;DR

This work provides the first finite-sample $L^2$ convergence rates for linear TD($\lambda$) with arbitrary (possibly linearly dependent) features, addressing a long-standing gap in RL theory. It introduces a novel Lyapunov function based on the distance to a solution set, together with a general stochastic approximation framework that permits convergence to a set rather than a single point. This enables analysis in both discounted and average-reward MDPs without modifying the algorithm. In the discounted setting, the authors prove explicit convergence rates to a fixed-point set $W_*$ and show that the value estimates $X w_t$ converge to a projected fixed point $\hat v_\pi$, with near-optimal rates (e.g., $\mathcal{O}(\ln t / t)$ for $\xi = 1$). In the average-reward setting, a new feature decomposition $X = X_1 + X_2$ yields a nonempty fixed-point set $\overline W_*$ and a rate bound on the joint error $(\hat J_t - J_\pi)^2 + d(w_t, \overline W_*)^2$, extending finite-sample analyses to arbitrary features. Overall, the results broaden the applicability of TD methods by removing the linear-independence assumption on features, and they provide tools that could extend to other linear RL algorithms and to other modes of convergence.
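
To make the setting concrete, here is a minimal Python sketch of linear TD($\lambda$) with accumulating eligibility traces, run along a single Markovian trajectory with features whose columns are linearly dependent. The two-state chain, rewards, feature values, and the step-size schedule $\alpha_t = \alpha_0 t_0 / (t + t_0)$ are hypothetical choices for illustration, not the paper's experimental setup. Because the feature matrix $X$ has two identical columns, the trace always has two equal entries, so $w_t^{(1)} - w_t^{(2)}$ never changes: the iterates settle on a point of the solution set that depends on the initialization, while only $w^{(1)} + w^{(2)}$, and hence $X w_t$, is pinned down.

```python
import random

# Hypothetical 2-state Markov chain induced by a fixed policy (illustration only).
P = [[0.9, 0.1],
     [0.2, 0.8]]            # P[s][s'] = transition probability
R = [1.0, 0.0]              # expected reward received in state s
PHI = [1.0, 2.0]            # one "base" feature value per state
# Feature rows x(s) with two identical entries: the columns of X are linearly dependent.
X = [[PHI[s], PHI[s]] for s in range(2)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def td_lambda(steps, gamma=0.9, lam=0.5, alpha0=0.005, t0=1000,
              w0=(0.5, -0.5), seed=0):
    """Linear TD(lambda) with accumulating traces along one Markovian trajectory.

    The schedule alpha_t = alpha0 * t0 / (t + t0) is an assumption made for this
    sketch, not necessarily the schedule analyzed in the paper.
    """
    rng = random.Random(seed)
    w = list(w0)
    e = [0.0, 0.0]                                  # eligibility trace
    s = 0
    for t in range(steps):
        s_next = 0 if rng.random() < P[s][0] else 1
        x, x_next = X[s], X[s_next]
        delta = R[s] + gamma * dot(w, x_next) - dot(w, x)     # TD error
        e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]    # e_t = gamma*lam*e_{t-1} + x(S_t)
        alpha = alpha0 * t0 / (t + t0)
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
        s = s_next
    return w


if __name__ == "__main__":
    w = td_lambda(300_000)
    print("w =", w)
    print("w1 - w2 =", w[0] - w[1])          # stays (up to rounding) at its initial value 1.0
    print("value estimates X w =", [dot(w, X[s]) for s in range(2)])
```

For this chain with $\lambda = 0.5$, a hand calculation of the scalar TD($\lambda$) fixed point for the base feature gives roughly 2.91, so `w[0] + w[1]` should end up near that value while `w[0] - w[1]` keeps whatever value the initialization gave it.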

Abstract

Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates were typically established under the assumption of linearly independent features, an assumption that does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.
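
The non-uniqueness, and why a distance to the solution set is the natural error measure, shows up in the smallest possible example. The sketch below is an illustration under stated assumptions: a hypothetical chain, rewards, features, and step size, and the expected TD(0) update ($\lambda = 0$) for brevity rather than the paper's TD($\lambda$) analysis. Duplicating a single feature column makes the fixed points of the expected update $A w + b = 0$ a line $W_* = \{w : w_1 + w_2 = c\}$ rather than a point. Iterating the mean update from two different starts drives $d(w, W_*)$ to zero while landing on different points of $W_*$.

```python
import math

# Hypothetical 2-state chain (illustration only).
P = [[0.9, 0.1], [0.2, 0.8]]   # transition matrix
D = [2 / 3, 1 / 3]              # its stationary distribution (solves d P = d)
R = [1.0, 0.0]                  # expected reward in each state
PHI = [1.0, 2.0]                # base feature; X = [PHI, PHI] duplicates it
GAMMA = 0.9

# With X = [PHI, PHI], X w = PHI * (w1 + w2). The expected TD(0) update A w + b,
# with A = X^T D (gamma P - I) X and b = X^T D r, reduces to
# A = a * [[1, 1], [1, 1]] and b = beta * [1, 1] with the scalars below.
a = sum(D[s] * PHI[s] * (GAMMA * sum(P[s][k] * PHI[k] for k in range(2)) - PHI[s])
        for s in range(2))
beta = sum(D[s] * PHI[s] * R[s] for s in range(2))
c = -beta / a   # solution set W_* = {w : w1 + w2 = c}: a line, not a point


def dist_to_solution_set(w):
    """Euclidean distance d(w, W_*) from w to the line {w : w1 + w2 = c}."""
    return abs(w[0] + w[1] - c) / math.sqrt(2)


def expected_td0(w, alpha=0.1, steps=200):
    """Deterministic mean-path iteration w <- w + alpha * (A w + b)."""
    w = list(w)
    for _ in range(steps):
        g = a * (w[0] + w[1]) + beta   # both entries of A w + b equal g
        w = [w[0] + alpha * g, w[1] + alpha * g]
    return w


if __name__ == "__main__":
    for start in ([3.0, -1.0], [0.0, 0.0]):
        w = expected_td0(start)
        print(start, "->", w, "d(w, W_*) =", dist_to_solution_set(w))
```

Both runs end essentially on $W_*$ (here $c \approx 2.564$), but at different points, because the update never changes $w_1 - w_2$. A rate on $\lVert w_t - w_* \rVert$ for a unique $w_*$ is therefore unavailable, while a rate on $d(w_t, W_*)$ still makes sense.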

Paper Structure

This paper contains 34 sections, 23 theorems, 114 equations, 2 figures, 1 table.

Key Result

Theorem 1

Let Assumptions (assum:markov) and (assu lr) hold and let $\lambda \in [0, 1]$. Then, for sufficiently large $t_0$ and $\alpha$, there exist constants $C_{\text{Thm 1}}$ and $\kappa_{\text{Thm 1}} \doteq \alpha C_{\text{prop:a}} > 1$ such that the iterates $\{w_t\}$ generated by the TD($\lambda$) update (eq:td lambda) satisfy, for a …

Figures (2)

  • Figure 1: Convergence of TD($\lambda$) (eq:td lambda) with $\gamma = 0.9$ and $\alpha_0 \in \{0.005, 0.01\}$. Curves are averaged over 10 runs; shaded regions indicate standard errors but are too small to be visible.
  • Figure 2: Convergence of the average-reward TD update (eq:artd) with $\beta_0 = 0.01$ and $\alpha_0 \in \{0.01, 0.02, 0.1\}$. Curves are averaged over 10 runs; shaded regions indicate standard errors but are too small to be visible.
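
Figure 2 concerns the update labelled eq:artd, presumably the average-reward variant of TD, which uses a second step size $\beta$ for an average-reward estimate. A common form of average-reward linear TD($\lambda$) (assumed here; the exact update in the paper may differ) keeps a scalar estimate $\hat J_t$ alongside $w_t$: $\delta_t = R_{t+1} - \hat J_t + x(S_{t+1})^\top w_t - x(S_t)^\top w_t$, $e_t = \lambda e_{t-1} + x(S_t)$, $w_{t+1} = w_t + \alpha_t \delta_t e_t$, and $\hat J_{t+1} = \hat J_t + \beta_t (R_{t+1} - \hat J_t)$. The minimal sketch below reuses a hypothetical two-state chain with duplicated features; the step-size schedules are hypothetical as well.

```python
import random

# Hypothetical 2-state chain with duplicated features (illustration only).
P = [[0.9, 0.1], [0.2, 0.8]]
R = [1.0, 0.0]                        # expected reward in each state
X = [[1.0, 1.0], [2.0, 2.0]]          # x(s) rows; the two columns are identical


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def average_reward_td_lambda(steps, lam=0.5,
                             alpha=lambda t: 0.01,            # hypothetical schedule
                             beta=lambda t: 1.0 / (t + 100),  # hypothetical schedule
                             seed=0):
    """Average-reward linear TD(lambda): track J_hat and differential weights w."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    e = [0.0, 0.0]          # eligibility trace (no discounting in this setting)
    j_hat = 0.0             # running estimate of the average reward J_pi
    s = 0
    for t in range(steps):
        s_next = 0 if rng.random() < P[s][0] else 1
        r = R[s]
        x, x_next = X[s], X[s_next]
        delta = r - j_hat + dot(w, x_next) - dot(w, x)    # differential TD error
        e = [lam * ei + xi for ei, xi in zip(e, x)]       # e_t = lam * e_{t-1} + x(S_t)
        w = [wi + alpha(t) * delta * ei for wi, ei in zip(w, e)]
        j_hat += beta(t) * (r - j_hat)
        s = s_next
    return j_hat, w


if __name__ == "__main__":
    j_hat, w = average_reward_td_lambda(200_000)
    print("J_hat =", j_hat, "(J_pi = 2/3 for this chain)")
    print("w =", w)
```

For this chain the stationary distribution is $(2/3, 1/3)$, so $J_\pi = 2/3$, and $\hat J_t$ should track that value. The two weight entries stay equal because the duplicated features give the trace two equal entries, so, as in the discounted case, only their sum is determined by the data.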

Theorems & Definitions (45)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Proof
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Proof
  • ...and 35 more