True Online TD-Replan(lambda) Achieving Planning through Replaying

Abdulrahman Altahhan

TL;DR

This work introduces True Online TD-Replan($\lambda$), a planning method that extends true online TD($\lambda$) by replaying past experiences online, with the replay density controlled by $\acute{\lambda}$ and the target depth controlled by $\lambda$. It proves that true online TD($\lambda$) is a special case of this framework and provides incremental update rules with $O(n^2)$ complexity, enabling online planning across the full spectrum from no replay to full replay. The authors report better performance than the quadratic-complexity baselines Dyna Planning and TD($\lambda$)-Replan on a 17-state random walk and on a myoelectric cursor-control task, including setups with deep sparse autoencoder features. The results indicate that the TD-Replan family is a robust approach to online planning in environments where experience replay is beneficial, and that it integrates well with deep feature representations for real-time applications. The work also outlines avenues for integration with on-policy/off-policy updates and end-to-end deep reinforcement learning frameworks.

Abstract

In this paper, we develop a new planning method that extends the capabilities of true online TD to allow an agent to efficiently replay all or part of its past experience online, in the order in which it occurred, either at every step or sparsely according to the usual λ parameter. In this new method, which we call True Online TD-Replan(λ), the λ parameter plays a new role in specifying the density of the replay process, in addition to its usual role of specifying the depth of the target's updates. We demonstrate that, for problems that benefit from experience replay, our new method outperforms true online TD(λ), although it is quadratic in complexity due to its replay capabilities. In addition, we demonstrate that our method outperforms other methods of similar quadratic complexity, such as the Dyna Planning and TD(λ)-Replan algorithms. To showcase its capabilities, we test our method on two benchmark environments: a random walk problem that uses simple binary features, and a myoelectric control domain that uses both simple sEMG features and deeply extracted features.
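For orientation, the baseline that the new method extends can be sketched in a few lines. The block below is a minimal sketch of standard true online TD(λ) with linear function approximation (the textbook algorithm, not the paper's TD-Replan update rules, which are not reproduced on this page). The task details are assumptions made for illustration only: a 17-state random walk with one-hot binary features, a start in the middle state, a reward of 1 on exiting to the right and 0 otherwise, and no discounting. The function name and all parameter defaults are hypothetical.

```python
import numpy as np


def true_online_td_lambda(n_states=17, episodes=200, alpha=0.1,
                          lam=0.8, gamma=1.0, seed=0):
    """Baseline true online TD(lambda) on an assumed random-walk task.

    Assumptions (not taken from the paper): one-hot features, start in
    the middle state, reward 1 when leaving on the right, 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    eye = np.eye(n_states)              # row s is the one-hot feature of state s
    w = np.zeros(n_states)              # linear value-function weights

    for _ in range(episodes):
        s = n_states // 2
        x = eye[s]
        z = np.zeros(n_states)          # dutch eligibility trace
        v_old = 0.0

        while True:
            # Random policy: step left or right with equal probability.
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0
            # Terminal states have an all-zero feature vector (value 0).
            x_next = np.zeros(n_states) if done else eye[s_next]

            v = w @ x
            v_next = w @ x_next
            delta = reward + gamma * v_next - v

            # Dutch-trace update, then the true online weight update.
            z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
            w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v_next

            if done:
                break
            s, x = s_next, x_next

    return w


if __name__ == "__main__":
    weights = true_online_td_lambda()
    # Under the assumed rewards, the true value of state i is (i + 1) / 18.
    true_values = np.arange(1, 18) / 18.0
    rmse = np.sqrt(np.mean((weights - true_values) ** 2))
    print(f"RMSE after training: {rmse:.3f}")
```

Each step of this baseline costs O(n) in the number of features and uses every transition once. The method described in the abstract instead replays stored experience, which is where its quadratic cost comes from.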

Paper Structure

This paper contains 12 sections, 2 theorems, 19 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Given a set of n weights ${{\boldsymbol{\theta}}}$ that are due to the forward true online TD($\lambda$)-Replan algorithm shown earlier, we can obtain exactly ${{\boldsymbol{\theta}}}$ incrementally according to the following step updates

Figures (6)

  • Figure 1: Random Walk Task
  • Figure 2: Comparison of true online TD($\lambda$)-Replan(1) with true online TD($\lambda$), as well as TD(0)-Replan and Dyna Planning, on the 17-state Random Walk over the first 10 episodes, averaged over 20 trials, for binary features. This shows the clear edge that our new method has over the other methods despite the simplicity of the problem.
  • Figure 3: Comparison of the RMSE for true online TD($\lambda$)-Replan(1) and true online TD($\lambda$) applied to myoelectric control of a cursor on a screen using normalised sEMG as features, all taken over the first 10 episodes and averaged over 66 trials, with $\alpha$ values that span 0.001 to 0.1 in steps of 0.005. The figure clearly shows that for high $\lambda$ values TD($\lambda$)-Replan(1) is more advantageous than true TD($\lambda$). We note that our algorithm has a wider maximal area and converges more quickly to an optimal performance for a wider range of learning steps $\alpha$, making it more reliable and stable. Note that Dyna Planning has struggled to learn the environment's dynamics due to deep learning mapping the sEMG into a more elaborate but sparse space. On the other hand, TD(0)-Replan has performed relatively well, as expected, but could not outperform true TD(0.9)-Replan onward.
  • Figure 4: RMSE comparison for true online TD($\lambda$)-Replan(1), true online TD($\lambda$), TD(0)-Replan and Dyna Planning. The methods are applied to myoelectric control of a cursor on a screen, where 16 normalised sEMG signals are fed into a Sparse Auto Encoder to extract a more elaborate set of features ($16^2$). All results are taken over the first 10 episodes and averaged over 66 trials, with $\alpha$ values that span $10^{-4}$ to $10^{-3}$ in increments of $3\times10^{-4}$. The figure clearly shows that when using deeply learned features TD($\lambda$)-Replan(1) outperforms true TD($\lambda$) for all $\lambda$ values by a considerable margin.
  • Figure 5: Same as for Fig. 4 but over a wider range of values of $\alpha$. It shows that for this type of features the smaller $\alpha$ values generate better results for both algorithms.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2