An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces

Alex Beeson, David Ireland, Giovanni Montana

TL;DR

This work undertakes a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, it presents the case for a factorised approach and conducts an extensive empirical evaluation of several offline techniques adapted to the factorised setting.

Abstract

Expanding reinforcement learning (RL) to offline domains generates promising prospects, particularly in sectors where data collection poses substantial challenges or risks. Pivotal to the success of transferring RL offline is mitigating overestimation bias in value estimates for state-action pairs absent from data. Whilst numerous approaches have been proposed in recent years, these tend to focus primarily on continuous or small-scale discrete action spaces. Factorised discrete action spaces, on the other hand, have received relatively little attention, despite many real-world problems naturally having factorisable actions. In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own comprising datasets of varying quality and task complexity. Advocating for reproducible research and innovation, we make all datasets available for public use alongside our code base.
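
As a rough illustration of what value-decomposition buys in a factorised action space, the sketch below assumes the mean-of-utilities form usually associated with DecQN, where the global value of an action is the average of one utility per sub-action dimension. The sizes `N` and `n`, the random `utilities` table, and the helper `decomposed_q` are hypothetical stand-ins for a network's per-dimension outputs at a single state, not the authors' code.

```python
from itertools import product

import numpy as np

# Hypothetical sizes: N sub-action dimensions, n discrete sub-actions each.
N, n = 3, 4

# Stand-in for a network's per-dimension utility outputs at one state,
# shape (N, n). Random values are used purely for illustration.
rng = np.random.default_rng(0)
utilities = rng.normal(size=(N, n))


def decomposed_q(utilities, action):
    """Assumed decomposition: global value = mean of per-dimension utilities."""
    return float(np.mean([utilities[i, a] for i, a in enumerate(action)]))


# Greedy action under the decomposition: an independent argmax per dimension.
greedy = tuple(int(a) for a in utilities.argmax(axis=1))

# Brute force over all n**N atomic actions, for comparison only.
brute = max(product(range(n), repeat=N), key=lambda a: decomposed_q(utilities, a))

factorised_outputs = N * n   # grows linearly in N
atomic_outputs = n ** N      # grows exponentially in N
```

Under this assumed form the per-dimension argmax recovers the same greedy action as a search over every atomic action, while the number of values to estimate grows linearly rather than exponentially in the number of sub-action dimensions.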

Paper Structure

This paper contains 33 sections, 19 equations, 9 figures, 12 tables, 4 algorithms.

Figures (9)

  • Figure 1: In this simple example there are $N=3$ sub-action dimensions, each with two actions {$\uparrow$, $\downarrow$}. In-distribution and out-of-distribution actions/sub-actions are highlighted in green and red, respectively. For a particular state, the dataset contains two global actions. Under atomic representation only actions which match those in the dataset are in-distribution. Under factorised representation, individual sub-actions which match those in the dataset are in-distribution. Atomic actions that are out-of-distribution can contain sub-actions that are in-distribution when factorised.
  • Figure 2: Examples of the maze environment with different numbers of actuators. The star represents the target goal location, the red dot the agent, and the arrows the actuators. Adapted from the original figure in chandak2019learning.
  • Figure 3: Comparisons of performance (left) and computation time (right) for DQN-CQL and DecQN-CQL on the "cheetah-run-medium-expert" dataset for varying numbers of bins. As the number of bins increases, DQN-CQL suffers notable drops in performance and increases in computation time, whereas DecQN-CQL is relatively resilient in both areas.
  • Figure 4: Comparison of performance for DQN-CQL and DecQN-CQL on the Maze task with $N=15$ actuators with "random-medium-expert" datasets of varying size. As the number of transitions in the dataset decreases, DQN-CQL suffers more notable drops in performance in comparison to DecQN-CQL.
  • Figure 5: Performance comparison on the Maze task for varying numbers of actuators. For presentation purposes, the prefix "DecQN-" has been omitted for each offline method. In general, all offline RL methods improve over behavioural cloning, with the exception of DecQN-BCQ on the "random-medium-expert" datasets. DecQN without any offline modification performs poorly across the board.
  • ...and 4 more figures
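
The distinction drawn in the Figure 1 caption can be sketched in a few lines. The example below uses the caption's setting of $N=3$ sub-action dimensions with two sub-actions each and a dataset holding two global actions for one state. The specific `dataset_actions` and the helper names are assumptions chosen for illustration; they are not read off the figure itself.

```python
from itertools import product

SUB_ACTIONS = ("up", "down")  # two sub-actions per dimension
N = 3                         # number of sub-action dimensions

# Hypothetical dataset: the two global actions observed for one state.
dataset_actions = [("up", "up", "down"), ("down", "up", "up")]


def atomic_in_distribution(action, dataset):
    """Atomic view: in-distribution only if the whole action was observed."""
    return tuple(action) in set(dataset)


def factorised_in_distribution(action, dataset):
    """Factorised view: check each sub-action against those seen per dimension."""
    seen = [{a[i] for a in dataset} for i in range(len(action))]
    return [action[i] in seen[i] for i in range(len(action))]


all_actions = list(product(SUB_ACTIONS, repeat=N))

# Actions that are in-distribution under the atomic representation.
atomic_id = [a for a in all_actions if atomic_in_distribution(a, dataset_actions)]

# Actions whose every sub-action is in-distribution under the factorised view.
fully_covered = [
    a for a in all_actions if all(factorised_in_distribution(a, dataset_actions))
]

# Out-of-distribution as atomic actions, yet built only from seen sub-actions.
ood_but_covered = [a for a in fully_covered if a not in atomic_id]
```

With this toy dataset, two of the eight atomic actions are in-distribution, whereas four are composed entirely of sub-actions that appear in the data; the remaining two in `ood_but_covered` are the kind of action the caption describes as out-of-distribution atomically but in-distribution once factorised.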