PID Accelerated Temporal Difference Algorithms

Mark Bedaywi, Amin Rakhsha, Amir-massoud Farahmand

TL;DR

This work introduces PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available, and gives a theoretical analysis of the convergence of PID TD Learning and its acceleration relative to conventional TD Learning.

Abstract

Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
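
To make the control-theoretic idea concrete, here is a minimal sketch of a PID-style policy evaluation iteration when the transition matrix is known (the PID VI setting, not the sample-based PID TD update). The update form, the gain names `kp`, `ki`, `kd`, the smoothing parameters `alpha` and `beta`, and the two-state example are all assumptions made for illustration; none of them are taken from this page. With `kp=1, ki=0, kd=0` the loop reduces to the conventional iteration `V <- r + gamma * P V`, whose error shrinks by roughly a factor of `gamma` per step, which is why a large discount factor makes it slow.

```python
def bellman_residual(P, r, gamma, V):
    """Return T V - V for policy evaluation, where T V = r + gamma * P V."""
    n = len(r)
    return [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) - V[s]
            for s in range(n)]


def pid_policy_evaluation(P, r, gamma, kp=1.0, ki=0.0, kd=0.0,
                          alpha=0.05, beta=0.95, num_iters=500):
    """Hypothetical PID-style iteration (assumed form, for illustration only).

    P term: step along the Bellman residual.
    I term: exponentially weighted running sum z of past residuals.
    D term: difference between the last two value estimates.
    """
    n = len(r)
    V = [0.0] * n
    V_prev = [0.0] * n
    z = [0.0] * n
    for _ in range(num_iters):
        br = bellman_residual(P, r, gamma, V)
        z = [beta * z[s] + alpha * br[s] for s in range(n)]
        V_next = [V[s] + kp * br[s] + ki * z[s] + kd * (V[s] - V_prev[s])
                  for s in range(n)]
        V_prev, V = V, V_next
    return V


def max_error(V, V_ref):
    """Largest absolute difference between two value vectors."""
    return max(abs(a - b) for a, b in zip(V, V_ref))


# Toy two-state chain (made up for this sketch) with a long horizon.
P = [[0.9, 0.1], [0.1, 0.9]]
r = [1.0, 0.0]
gamma = 0.99

# Reference values: the conventional iteration run close to convergence.
V_ref = pid_policy_evaluation(P, r, gamma, num_iters=5000)

# Same iteration budget, with and without a small derivative gain.
err_plain = max_error(pid_policy_evaluation(P, r, gamma), V_ref)
err_pd = max_error(pid_policy_evaluation(P, r, gamma, kd=0.2), V_ref)
print("conventional error after 500 iterations:", err_plain)
print("with kd=0.2, error after 500 iterations:", err_pd)
```

On this toy chain the derivative gain lowers the error for the same number of iterations, but the gain value was hand-picked for this example; poorly chosen gains can slow convergence or cause divergence, which is the motivation for adapting the gains automatically.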

Paper Structure

This paper contains 22 sections, 8 theorems, 67 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Consider a set of controller gains $g$. Let $\{\lambda_i\}$ be the eigenvalues of $A_g^\pi$. If $\mathrm{Re}\{\lambda_i\} < 1$ for all $i$, then under mild assumptions on the learning rate schedule $\mu$ and the sequence $(X_t)$ (Assumptions ass:lr_schedule, ass:balanced_visit), the functions $V_t$ in PID TD

Figures (7)

  • Figure 1: Comparison of PID TD Learning with Conventional TD Learning in Chain Walk (left) and Cliff Walk (right) with $\gamma = 0.99$. Each curve is averaged over 80 runs. Shaded areas show the standard error.
  • Figure 2: PID TD Learning with Gain Adaptation in Cliff Walk with $\gamma = 0.999$. (Left) Comparison of value errors of PID TD Learning with TD Learning. Each curve is averaged over 80 runs. Shaded area shows standard error. (Right) The gains selected by Gain Adaptation over the course of training.
  • Figure 3: PID Q-Learning with Gain Adaptation in Chain Walk with $\gamma = 0.999$. (Left) Comparison of value errors of PID Q-Learning with Q-Learning. Each curve is averaged over 80 runs. Shaded area shows standard error. (Right) The gains selected by Gain Adaptation over the course of training.
  • Figure 4: Comparison of PID Accelerated algorithms with the conventional ones for PE (Left) and Control (Right) problems in randomly generated Garnet environments with $\gamma = 0.99$. Each curve is an average over 80 MDPs, each run 80 times. Shaded area shows standard error.
  • Figure 5: A visualization of Cliff Walk, taken from rakhsha2022operator. The arrows depict the optimal policy.
  • ...and 2 more figures

Theorems & Definitions (19)

  • Theorem 1: Convergence of PID TD
  • Theorem 2
  • Definition 1
  • Proposition 1
  • Lemma 1
  • proof
  • proof
  • Definition 2
  • Definition 3
  • Lemma 2
  • ...and 9 more