Improving planning and MBRL with temporally-extended actions
Palash Chatterjee, Roni Khardon
TL;DR
Planning in continuous-time systems modeled by fine-grained discrete dynamics is inefficient, because a small simulation step forces a long planning horizon. The paper addresses this with temporally-extended actions, where the planner selects the duration $\delta t$ of each action along with the action itself. It formalizes temporally-extended dynamics with a model $\hat{F}_{TE}$ (or an iterative variant $F_{IP}$) and uses a non-stationary multi-armed bandit to select the action-duration range $\delta t_{\max}$ automatically, learning a separate model per arm in the MBRL setting. Empirically, temporally-extended actions yield faster planning and better solutions, and they solve tasks that are difficult for standard primitive-action planning, with gains in planning depth, memory efficiency, and training speed across planning and MuJoCo-based experiments. The work also examines the roles of the two discount factors $\gamma_1$ and $\gamma_2$ and the tradeoffs between fixed and dynamic duration ranges, highlighting when flexibility in duration is most beneficial.
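To make the bandit-based selection of $\delta t_{\max}$ concrete, here is a minimal sketch. It assumes an epsilon-greedy bandit with a constant step size (a common way to track non-stationary rewards); the summary does not specify the bandit algorithm, so the class name, the hyperparameters, and the use of episode return as the arm reward are all illustrative assumptions.

```python
import random


class RecencyWeightedBandit:
    """Epsilon-greedy bandit with a constant step size (hypothetical helper).

    Each arm is a candidate value of delta_t_max. A constant step size weights
    recent returns more heavily, which suits rewards that drift during training.
    """

    def __init__(self, arms, step_size=0.3, epsilon=0.1, seed=0):
        self.arms = list(arms)                  # candidate delta_t_max values
        self.values = [0.0] * len(self.arms)    # running value estimate per arm
        self.step_size = step_size
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        """Return the index of the arm to use for the next episode."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, arm_index, episode_return):
        """Move the chosen arm's estimate toward the observed return."""
        error = episode_return - self.values[arm_index]
        self.values[arm_index] += self.step_size * error
```

In an MBRL loop one would call `select()` before each episode, plan with the corresponding duration range (and, following the summary, the dynamics model associated with that arm), and then call `update()` with the episode return.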
Abstract
Continuous-time systems are often modeled using discrete-time dynamics, but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon, which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats, where a policy is learned to determine a discrete action duration. Instead, we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up the simulation of trajectories and, importantly, it allows for deep-horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces compounding errors from model learning and improves training time for models. We show that this idea is effective and that the range for action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation, both in planning and in MBRL, shows that our approach yields faster planning and better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
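The following toy sketch illustrates how one planner decision with a duration can cover many primitive simulation steps. The 1-D point-mass dynamics, the reward, the single per-step discount, and all function names are assumptions made for illustration; they are not the models or the discounting scheme used in the paper.

```python
def primitive_step(state, action, dt):
    """Toy 1-D point mass: state = (position, velocity), action = acceleration."""
    pos, vel = state
    vel = vel + action * dt
    pos = pos + vel * dt
    return (pos, vel)


def reward(state):
    """Toy reward (assumption): negative distance of the position to 1.0."""
    pos, _ = state
    return -abs(pos - 1.0)


def temporally_extended_step(state, action, duration, dt=0.01, gamma=0.99):
    """Hold `action` for `duration` seconds by iterating the primitive model.

    Returns the final state, the discounted reward accumulated along the way,
    and the number of primitive steps that the single decision covered.
    """
    n_steps = max(1, round(duration / dt))
    total, discount = 0.0, 1.0
    for _ in range(n_steps):
        state = primitive_step(state, action, dt)
        total += discount * reward(state)
        discount *= gamma
    return state, total, n_steps


# One decision (action=1.0 held for 0.5 s) spans 50 primitive steps,
# so a planner of depth 4 could reach 200 primitive steps into the future.
final_state, extended_reward, covered = temporally_extended_step((0.0, 0.0), 1.0, 0.5)
```

A planner built on this would optimize the pair `(action, duration)` at each decision point, which is what lets a shallow search depth correspond to a deep horizon in primitive actions.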
