PID Accelerated Temporal Difference Algorithms

Mark Bedaywi; Amin Rakhsha; Amir-massoud Farahmand

TL;DR

This work introduces PID TD Learning and PID Q-Learning for the RL setting, in which only samples from the environment are available, and gives a theoretical analysis of the convergence of PID TD Learning and of its acceleration relative to conventional TD Learning.

Abstract

Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
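To make the control-theoretic idea concrete, the following is a minimal sketch of a PID-style iteration for policy evaluation when the transition matrix is known. It is an illustration under stated assumptions, not the paper's exact algorithm: the update form, the names `kappa_p`, `kappa_i`, `kappa_d`, `alpha`, `beta`, and the two-state chain are all hypothetical choices made here. The sketch treats the Bellman residual as the error signal, adds a decaying running sum of it (integral term) and the difference between successive value estimates (derivative term), and reduces to conventional Value Iteration for policy evaluation when `kappa_p = 1` and the other gains are zero.

```python
import numpy as np

def pid_policy_evaluation(P, r, gamma, kappa_p=1.0, kappa_i=0.0, kappa_d=0.0,
                          alpha=0.05, beta=0.95, num_iters=500):
    """PID-style fixed-point iteration for V = r + gamma * P V (illustrative sketch).

    All gain names and the exact update form are assumptions made for this example.
    """
    n = len(r)
    V = np.zeros(n)       # current value estimate
    V_prev = np.zeros(n)  # previous estimate, used by the derivative term
    z = np.zeros(n)       # decaying running sum of residuals, used by the integral term
    for _ in range(num_iters):
        residual = r + gamma * P @ V - V      # Bellman residual: the "error" signal
        z = beta * z + alpha * residual       # integral term accumulates past residuals
        V_next = V + kappa_p * residual + kappa_i * z + kappa_d * (V - V_prev)
        V_prev, V = V, V_next
    return V

# Hypothetical two-state Markov chain under a fixed policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9

# Exact solution of the linear Bellman equation, for comparison.
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)

V_conventional = pid_policy_evaluation(P, r, gamma)           # plain Value Iteration
V_pd = pid_policy_evaluation(P, r, gamma, kappa_d=0.1)        # adds a derivative term
```

Both calls converge to the same fixed point on this chain, since the extra terms vanish once the residual is zero and the estimates stop changing. Whether a particular choice of gains speeds up or destabilizes the iteration depends on the problem, which is the motivation for the gain adaptation procedure mentioned in the abstract.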

Paper Structure (22 sections, 8 theorems, 67 equations, 7 figures, 3 tables, 2 algorithms)

Introduction
Background
PID Value Iteration
PID TD Learning and PID Q-Learning
Theoretical Guarantees
Convergence Guarantee
Acceleration Result
Gain Adaptation
Empirical Results
Related Work
Conclusion
Proofs for Convergence Results (Section \ref{sec:pid-theory-convergence})
Proofs for Acceleration Results (Section \ref{sec:pid-theory-acceleration})
Proof of Theorem \ref{thm:acceleration-PID-TD}
Proof of Proposition \ref{prop:error-terms-ratio}
...and 7 more sections

Key Result

Theorem 1

Consider a set of controller gains $g$. Let $\{\lambda_i\}$ be the eigenvalues of $A_g^\pi$. If $\mathrm{Re}\{\lambda_i\} < 1$ for all $i$, then under mild assumptions on the learning rate schedule $\mu$ and the sequence $(X_t)$ (Assumptions \ref{ass:lr_schedule} and \ref{ass:balanced_visit}), the functions $V_t$ in PID TD ...

Figures (7)

Figure 1: Comparison of PID TD Learning with Conventional TD Learning in Chain Walk (left) and Cliff Walk (right) with $\gamma = 0.99$. Each curve is averaged over 80 runs. Shaded areas show the standard error.
Figure 2: PID TD Learning with Gain Adaptation in Cliff Walk with $\gamma = 0.999$. (Left) Comparison of the value errors of PID TD Learning and TD Learning. Each curve is averaged over 80 runs. Shaded areas show the standard error. (Right) Evolution of the gains under Gain Adaptation during training.
Figure 3: PID Q-Learning with Gain Adaptation in Chain Walk with $\gamma = 0.999$. (Left) Comparison of the value errors of PID Q-Learning and Q-Learning. Each curve is averaged over 80 runs. Shaded areas show the standard error. (Right) Evolution of the gains under Gain Adaptation during training.
Figure 4: Comparison of PID Accelerated algorithms with the conventional ones for PE (Left) and Control (Right) problems in randomly generated Garnet environments with $\gamma = 0.99$. Each curve is averaged over 80 MDPs, with 80 runs per MDP. Shaded areas show the standard error.
Figure 5: A visualization of Cliff Walk, taken from rakhsha2022operator. The arrows depict the optimal policy.
...and 2 more figures

Theorems & Definitions (19)

Theorem 1: Convergence of PID TD
Theorem 2
Definition 1
Proposition 1
Lemma 1
proof
proof
Definition 2
Definition 3
Lemma 2
...and 9 more
