Learning Uncertainty-Aware Temporally-Extended Actions

Joongkyu Lee, Seung Joon Park, Yunhao Tang, Min-hwan Oh

TL;DR

This work addresses the instability of naive action repetition in reinforcement learning by introducing Uncertainty-aware Temporal Extension (UTE). UTE uses an ensemble of Q-value heads to quantify uncertainty during action extension, enabling adaptive extension lengths and an uncertainty parameter λ that can promote exploration or risk aversion as appropriate. The method decomposes policy learning into an action policy and an extension policy, leverages multi-step Q-learning, and employs a bandit-based mechanism to adapt λ, with experiments showing improved learning speed and final performance over existing baselines across Chain MDP, Gridworlds, Atari 2600, and Pendulum-v0. The results demonstrate that explicitly modeling and leveraging uncertainty in temporally extended actions mitigates failure modes of naive repetition and yields robust, scalable improvements for temporal abstraction in RL.
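To make the role of λ concrete, the following is a minimal sketch of uncertainty-aware extension selection. It is not the authors' implementation. It assumes that, for a fixed state and action, each ensemble head outputs one Q-estimate per candidate extension length, and that the extension length is chosen by maximising the ensemble mean plus λ times the ensemble standard deviation. The scoring rule, the function name `select_extension_length`, and the data layout are illustrative assumptions; the summary only states that λ can promote exploration or risk aversion.

```python
import statistics


def select_extension_length(head_values, lam):
    """Pick an extension length from ensemble Q-estimates.

    head_values[j - 1] holds one Q-estimate per ensemble head for repeating
    the already-chosen action j times (hypothetical layout).
    lam > 0 favours uncertain extensions (exploration);
    lam < 0 penalises them (uncertainty aversion).
    Returns the 1-based extension length and the per-length scores.
    """
    scores = []
    for per_head in head_values:
        mean = statistics.fmean(per_head)
        std = statistics.pstdev(per_head)  # disagreement between heads
        scores.append(mean + lam * std)    # assumed scoring rule
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores


# Three candidate extension lengths, three ensemble heads (made-up numbers).
EXAMPLE_HEADS = [
    [1.0, 1.0, 1.0],  # j = 1: heads agree
    [0.0, 1.0, 2.0],  # j = 2: same mean, heads disagree
    [0.9, 0.9, 0.9],  # j = 3: heads agree, slightly lower value
]

optimistic_length, _ = select_extension_length(EXAMPLE_HEADS, lam=1.0)
averse_length, _ = select_extension_length(EXAMPLE_HEADS, lam=-1.0)
```

With these made-up numbers, a positive λ selects the extension length on which the heads disagree, while a negative λ falls back to the shortest extension, on which the heads agree.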

Abstract

In reinforcement learning, temporal abstraction in the action space, exemplified by action repetition, is a technique to facilitate policy learning through extended actions. However, a primary limitation in previous studies of action repetition is its potential to degrade performance, particularly when sub-optimal actions are repeated. This issue often negates the advantages of action repetition. To address this, we propose a novel algorithm named Uncertainty-aware Temporal Extension (UTE). UTE employs ensemble methods to accurately measure uncertainty during action extension. This feature allows policies to strategically choose between emphasizing exploration or adopting an uncertainty-averse approach, tailored to their specific needs. We demonstrate the effectiveness of UTE through experiments in Gridworld and Atari 2600 environments. Our findings show that UTE outperforms existing action repetition algorithms, effectively mitigating their inherent limitations and significantly enhancing policy learning efficiency.
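The choice between exploration-seeking and uncertainty-averse behaviour can be automated with a bandit over candidate λ values, as the TL;DR mentions. The sketch below uses plain UCB1 as a stand-in, which may differ from the bandit used in the paper. The class name `LambdaBandit`, the candidate values, and the assumption that episode returns are normalised to [0, 1] are all illustrative.

```python
import math


class LambdaBandit:
    """UCB1 bandit over a fixed set of candidate lambda values (a stand-in)."""

    def __init__(self, candidates, c=1.0):
        self.candidates = list(candidates)
        self.c = c  # width of the exploration bonus
        self.counts = [0] * len(self.candidates)
        self.means = [0.0] * len(self.candidates)

    def select(self):
        """Return the index of the lambda to use for the next episode."""
        # Try every candidate once before applying the UCB rule.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        total = sum(self.counts)

        def upper_bound(arm):
            bonus = self.c * math.sqrt(math.log(total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(range(len(self.candidates)), key=upper_bound)

    def update(self, arm, episode_return):
        """Fold one episode return (assumed to lie in [0, 1]) into the running mean."""
        self.counts[arm] += 1
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]
```

In a training loop, `select()` would be called at the start of each episode, the agent would act with `candidates[arm]` as its λ, and `update()` would receive the resulting normalised return. A non-stationary variant, for example one using a sliding window over recent returns, may be more appropriate when the best λ shifts during training.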

Paper Structure

This paper contains 28 sections, 1 theorem, 16 equations, 14 figures, 13 tables, 1 algorithm.

Key Result

Proposition 1

In a Semi-Markov Decision Process (SMDP), let an option $\omega \in \Omega$ be the action repeating option defined by action $a$ and extension length $j$, i.e. $\omega_{aj} := \langle \mathcal{S}, \bm{1}_a, \beta(h)=\bm{1}_{h=j} \rangle$. For all $\omega \in \Omega$, a policy over option, $\pi_{\ome

Figures (14)

  • Figure 1: Chain MDP
  • Figure 2: $6\times10$ Gridworlds. Agents have to reach a goal state (G) from a starting state (S) by detouring around the lava. Dots represent decision steps with and without temporally-extended actions.
  • Figure 3: Distributions of extension length in Gridworlds.
  • Figure 4: Coverage plots (right) on ZigZag environments. Blue indicates states visited more often; white indicates states rarely or never visited. See Appendix for the expanded version of the figures.
  • Figure 5: Learning curves of UTE with best $\lambda$, UTE with adaptive $\lambda$ and other baseline algorithms on Atari environments. The shaded area represents the standard deviation over 7 random seeds.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Definition 1
  • Proposition 1