Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version

Masoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn

Abstract

Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods as the discount factor approaches one, by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both the asymptotic bias and the asymptotic covariance; both are shown to remain uniformly bounded as the discount factor approaches one.
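To make the baseline-subtraction idea concrete, below is a minimal sketch of relative TD(0) with linear function approximation on a small Markov reward process. The update form, the names `delta_r` and `varpi`, and the use of the chain's stationary distribution as the baseline (standing in for the empirical distribution mentioned above) are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of relative TD(0) with linear function approximation.
# Assumed update: the usual TD(0) error minus a baseline term
# delta_r * <varpi, value estimate>, with varpi a baseline distribution.
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 6, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transitions
r = rng.standard_normal(n_states)                    # per-state rewards
Phi = rng.standard_normal((n_states, d))             # feature matrix, row = phi(x)
gamma = 0.99                                         # discount factor near one
delta_r = 1.0                                        # baseline weight (assumed)
alpha0, rho = 2.0, 0.65                              # step-size alpha_n = alpha0 * n**(-rho)

# Baseline distribution: the chain's stationary distribution, standing in
# for the empirical distribution of the state process.
w, V = np.linalg.eig(P.T)
varpi = np.real(V[:, np.argmax(np.real(w))])
varpi /= varpi.sum()

theta = np.zeros(d)
x = 0
for n in range(1, 100_001):
    x_next = rng.choice(n_states, p=P[x])
    baseline = delta_r * varpi @ (Phi @ theta)       # varpi-weighted mean value
    td = r[x] + gamma * Phi[x_next] @ theta - Phi[x] @ theta - baseline
    theta += alpha0 * n ** (-rho) * td * Phi[x]
    x = x_next

print("relative TD(0) estimate:", theta)
```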

Paper Structure

This paper contains 12 sections, 7 theorems, 86 equations, and 7 figures.

Key Result

Proposition 2.1

(Bias) Consider the SA recursion $\theta_{n+1} = \theta_n + \alpha_{n+1} f_{n+1}(\theta_n)$, in which $f_{n+1}(\theta) = A(\Phi_{n+1})\,\theta + b(\Phi_{n+1})$, the Markov chain $\boldsymbol{\Phi}$ is unichain and aperiodic, and the steady-state mean $\bar{A} = \mathsf{E}[A(\Phi)]$ is Hurwitz. Suppose the step-size is of the form $\alpha_n = \alpha_0 n^{-\rho}$. Then the asymptotic bias of the parameter estimates admits the explicit characterization stated in the paper, in which $\overline{\Upupsilon} = \mathsf{E}[\Upupsilon_{n+1}^*]$, with the expectation taken in steady state.
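As a concrete instance of the setting in Proposition 2.1, the sketch below simulates a linear SA recursion $\theta_{n+1} = \theta_n + \alpha_{n+1}(A(\Phi_{n+1})\theta_n + b(\Phi_{n+1}))$ driven by a small Markov chain, checks numerically that $\bar{A}$ is Hurwitz, and compares the final estimate with the target $\theta^* = -\bar{A}^{-1}\bar{b}$. The chain, $A(\cdot)$, and $b(\cdot)$ are toy choices, not taken from the paper.

```python
# Minimal sketch of the linear SA recursion in Proposition 2.1 with a
# Markovian driving process and step-size alpha_n = alpha0 * n**(-rho).
import numpy as np

rng = np.random.default_rng(1)
d, n_states = 2, 4
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transitions

# Toy per-state coefficients; the -3*I shift makes the mean A_bar Hurwitz.
A_s = [rng.standard_normal((d, d)) - 3.0 * np.eye(d) for _ in range(n_states)]
b_s = [rng.standard_normal(d) for _ in range(n_states)]

# Stationary distribution of P (left Perron eigenvector).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

A_bar = sum(p * A for p, A in zip(pi, A_s))
b_bar = sum(p * b for p, b in zip(pi, b_s))
assert np.all(np.linalg.eigvals(A_bar).real < 0), "A_bar must be Hurwitz"

theta = np.zeros(d)
x = 0
alpha0, rho = 2.0, 0.65
for n in range(1, 100_001):
    x = rng.choice(n_states, p=P[x])
    theta += alpha0 * n ** (-rho) * (A_s[x] @ theta + b_s[x])

print("SA estimate:", theta)
print("target -A_bar^{-1} b_bar:", -np.linalg.solve(A_bar, b_bar))
```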

Figures (7)

  • Figure 1: 3-state 2-action MDP model, with randomized policy. Variants of TD learning were performed with discount factor $\gamma=0.99$ and step-size $\alpha_n = \alpha_0 n^{-\rho}$ with $\rho = 0.65$ and $\alpha_0 = 2$.
  • Figure 2: Eigenvalues as a function of the discount factor $\gamma$. One eigenvalue approaches zero for the algorithm using $\delta_{\mathsf{r}}=0$.
  • Figure 3: Histograms of the scaled and centered parameter estimates for TD(0) and $\upvarpi$-relative TD(0) in the finite state-action model using Polyak--Ruppert averaging (sketched in code after this list). The red solid curves show the corresponding theoretical Gaussian approximations predicted by the paper's formula for the optimal Polyak--Ruppert asymptotic covariance.
  • Figure 4: Realized bias versus the theoretical prediction.
  • Figure 5: Squared norm of the normalized bias as a function of $\delta_{\mathsf{r}}$. The red dashed line shows the predicted slope.
  • ...and 2 more figures
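As referenced in the Figure 3 entry above, the following is a minimal sketch of Polyak--Ruppert averaging: run the SA iterates with a slowly decaying step-size ($\rho < 1$) and report the running average of the iterates. The toy linear dynamics and i.i.d. noise below stand in for the paper's Markovian setting.

```python
# Minimal sketch of Polyak--Ruppert averaging for a noisy linear recursion.
import numpy as np

rng = np.random.default_rng(2)
d = 2
A = -np.eye(d)                      # toy Hurwitz mean dynamics (assumed)
b = np.ones(d)
alpha0, rho = 2.0, 0.65             # slowly decaying step-size, rho < 1

theta = np.zeros(d)
theta_bar = np.zeros(d)             # Polyak--Ruppert average of the iterates
for n in range(1, 100_001):
    noise = rng.standard_normal(d)             # i.i.d. stand-in for Markovian noise
    theta += alpha0 * n ** (-rho) * (A @ theta + b + noise)
    theta_bar += (theta - theta_bar) / n       # running average

print("averaged estimate:", theta_bar)
print("target -A^{-1} b: ", -np.linalg.solve(A, b))
```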

Theorems & Definitions (7)

  • Proposition 2.1: Bias
  • Theorem 2.2: Relative TD($\lambda$) may be unstable
  • Theorem 2.3: Stability and Variance
  • Lemma 2.4
  • Theorem 2.5: Sensitivity
  • Proposition A.1
  • Lemma A.2