Improving planning and MBRL with temporally-extended actions

Palash Chatterjee, Roni Khardon

TL;DR

The paper tackles the inefficiency of planning in continuous-time systems modeled by fine-grained discrete dynamics by introducing temporally-extended actions, where the planner also selects the duration $\delta t$ of each action. It formalizes temporally-extended dynamics with a $\hat{F}_{TE}$ model (or an iterative $F_{IP}$) and adopts a non-stationary multi-armed bandit to automatically pick the action-duration range $\delta t_{\max}$, while learning separate models per arm in an MBRL setting. Empirically, temporally-extended actions yield faster planning, better solutions, and enable solving tasks that are difficult for standard primitive-action planning, with notable gains in planning depth, memory efficiency, and training speed across planning and MuJoCo-based experiments. The work also provides insights into the roles of dual discounts $\gamma_1$ and $\gamma_2$ and the tradeoffs between fixed and dynamic duration ranges, highlighting when flexibility in duration is most beneficial. Overall, the approach advances planning efficiency in long-horizon control by integrating duration as an optimization variable and leveraging MAB-driven auto-tuning within MBRL.
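
As a rough illustration of the iterative view of temporally-extended dynamics, the sketch below applies each (action, duration) pair by repeating a primitive simulation step. It is a minimal sketch under stated assumptions, not the paper's $F_{IP}$: the names `rollout_te` and `point_mass_step`, the toy point-mass dynamics, and the rounding of a duration to a whole number of primitive steps are all hypothetical choices made for this example.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]  # (position, velocity) of a toy 1-D point mass


def point_mass_step(state: State, action: float, dt: float) -> State:
    """Hypothetical primitive dynamics: one semi-implicit Euler step."""
    x, v = state
    v = v + action * dt  # the action is an acceleration
    x = x + v * dt
    return (x, v)


def rollout_te(
    step: Callable[[State, float, float], State],
    state: State,
    plan: Sequence[Tuple[float, float]],
    dt: float,
) -> Tuple[State, int]:
    """Apply (action, duration) pairs by repeating the primitive step.

    Assumption: each duration is rounded to a whole number of primitive
    steps of length dt, with at least one step per decision.
    """
    primitive_steps = 0
    for action, duration in plan:
        repeats = max(1, round(duration / dt))
        for _ in range(repeats):
            state = step(state, action, dt)
        primitive_steps += repeats
    return state, primitive_steps


# Two decisions cover six primitive steps.
final_state, n_steps = rollout_te(
    point_mass_step, (0.0, 0.0), [(1.0, 2.0), (0.0, 1.0)], dt=0.5
)
```

Here the planner searches over two decisions while the simulated trajectory is six primitive steps deep, which is the shallow-search, deep-horizon effect the summary describes.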

Abstract

Continuous-time systems are often modeled using discrete-time dynamics, but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon, which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats, where a policy is learned to determine a discrete action duration. Instead, we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up the simulation of trajectories and, importantly, it allows for deep horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces compounding errors from model learning and improves training time for models. We show that this idea is effective and that the range for action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation, both in planning and in MBRL, shows that our approach yields faster planning, better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
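
To make the bandit-based selection of the duration range concrete, the sketch below treats each candidate $\delta t_{\max}$ as an arm and tracks a recency-weighted value estimate per arm. This summary does not specify the paper's bandit algorithm, so everything here is an assumption for illustration: the class name `RecencyWeightedBandit`, the epsilon-greedy rule, the constant step size (one standard way to cope with non-stationary rewards), and the candidate values 10, 50 and 100.

```python
import random
from typing import List, Sequence


class RecencyWeightedBandit:
    """Epsilon-greedy bandit with a constant step size (hypothetical sketch).

    A constant step size weights recent rewards more heavily, so the
    estimates can follow rewards that drift as the agent keeps learning.
    """

    def __init__(self, arms: Sequence[float], step_size: float = 0.2,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.arms: List[float] = list(arms)  # candidate delta-t-max values
        self.values: List[float] = [0.0] * len(self.arms)
        self.step_size = step_size
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self) -> int:
        """Return the index of the arm to use for the next episode."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))  # explore
        # Exploit; ties resolve to the lowest index.
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, arm: int, reward: float) -> None:
        """Move the chosen arm's estimate toward the observed reward."""
        self.values[arm] += self.step_size * (reward - self.values[arm])


# Greedy usage: pick an arm, run an episode with it, feed back the return.
bandit = RecencyWeightedBandit([10, 50, 100], epsilon=0.0)
first_choice = bandit.select()
bandit.update(2, 1.0)   # pretend the arm with value 100 gave a good return
second_choice = bandit.select()
bandit.update(2, -5.0)  # a later poor return quickly lowers its estimate
third_choice = bandit.select()
```

In an MBRL loop of the kind the abstract describes, the selected arm would fix the duration range for one episode and the episode return would be passed to `update`.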

Paper Structure

This paper contains 22 sections, 6 equations, 8 figures, 9 tables, 4 algorithms.

Figures (8)

  • Figure 1: $A_{\text{STD}}$ requires a large planning horizon $(D_{\text{STD}} \ge 60)$ to succeed in Mountain Car, but $A_{\text{TE}}$ using $\delta t_{\max}=100$ can work with a small planning horizon $(D_{\text{TE}} \ge 4)$.
  • Figure 2: When the number of actions is small (as in a and b), both agents are able to solve the problem. When the number of actions increases (c), $A_{\text{TE}}$ is still able to solve the problem, while $A_{\text{STD}}$ fails due to large memory requirements. The shape of the curves is an artifact of the action space in this environment: a constant action in acceleration space yields curved paths. The path in (c) is composed of 4 actions of different durations, as marked by the colors.
  • Figure 3: Mean and standard deviation of the running averages (window size=10) of scores and number of decisions taken by the agents in classical control and MuJoCo domains across 5 different seeds. Ours(D) selects $\delta t_{\max}$ automatically using the proposed multi-armed bandit framework while Ours(F) uses fixed $\delta t_{\max}$ for every episode. Using temporally-extended actions results in better performance in many environments while requiring fewer decision points.
  • Figure 4: Histogram of action durations, in terms of primitive action repeats, from one episode at the end of training for Ant, Half Cheetah, Hopper and Walker. In contrast to using a fixed frame-skip, using temporally-extended actions allows the agent to take actions of varying durations. Numbers in the title indicate the minimum, average and maximum number of action repeats taken.
  • Figure A1: (a) An example of $A_{\text{TE}}$ solving the cave-mini map $(\gamma_1 = 0.99, \gamma_2 = 1.0, \delta t_{\max}=20)$. (b) An instance of the IPC Multi-hill Mountain Car.
  • ...and 3 more figures