Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction

Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay

TL;DR

The paper addresses the challenge of planning in offline reinforcement learning settings with stochastic, high-dimensional continuous action spaces. It proposes Latent Macro Action Planner (L-MAP), which learns discrete macro-actions via a state-conditioned VQ-VAE and a latent transition model implemented as a Transformer prior, enabling efficient planning in a reduced latent space. Planning is performed with Monte Carlo Tree Search over a preconstructed latent search space, using progressive widening to balance fast, informed decisions with deeper exploration. Empirical results across stochastic MuJoCo, Adroit, AntMaze, and other domains show that L-MAP achieves low decision latency and superior or competitive performance relative to strong model-based and model-free baselines, illustrating robust planning under stochastic dynamics and high-dimensional actions. Overall, L-MAP demonstrates scalable, robust planning for offline RL in complex environments by fusing temporal abstraction with latent-space planning and sampling.

Abstract

Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a separate learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach for planning in complex, stochastic environments with high-dimensional action spaces.
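
To make the discretization step concrete, the following is a minimal sketch of the nearest-code lookup at the core of any VQ-VAE: a continuous encoder output is replaced by the index of its closest codebook vector. It is not the authors' implementation. The function name `quantize`, the toy two-dimensional codebook, and the squared-L2 distance are illustrative assumptions, and the state conditioning and macro-action encoder described above are omitted.

```python
from typing import List, Sequence


def quantize(z: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to z (squared L2 distance)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Nearest-neighbour search over all codes; ties resolve to the lowest index.
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


# Hypothetical codebook with three 2-D codes (values chosen only for illustration).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# A hypothetical encoder output is mapped to a single discrete latent index.
code = quantize((0.9, 1.2), codebook)
```

Planning over such indices rather than over raw continuous action sequences is what makes the search space discrete and small enough for tree search.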

Paper Structure

This paper contains 17 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: (a) Overview of planning over the pre-constructed search space. (b) As the number of MCTS iterations increases (10, 50, 100 from left to right), using a pre-constructed search space with MCTS achieves better performance with lower decision latency.
  • Figure 2: An overview of our VQ-VAE model, which discretizes state-macro-action sequences.
  • Figure 3: Pre-construction of the latent search space by sampling and evaluating latent macro-action codes, caching the top-k candidates, and recursively expanding the planning tree for efficient macro-level planning.
  • Figure 4: Illustration of our MCTS process for macro-level planning. The algorithm iteratively selects actions using the UCT policy, applies progressive widening to balance exploration and exploitation, performs parallel expansion of multiple macro actions and their potential outcomes, and backpropagates estimated Q-values to efficiently explore and refine the planning tree.
  • Figure 5: Results of ablation studies, where the height of each bar is the mean normalized score on high-noise Gym locomotion control tasks.
  • ...and 3 more figures
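
The selection, widening, and backpropagation steps named in the Figure 4 caption can be sketched as follows. This is a generic illustration of UCT with progressive widening, not the paper's implementation: the `Node` class, the widening rule `|children| <= c_pw * N^alpha` (one common variant), and the constants `c_pw`, `alpha`, and `c_uct` are all assumptions, and the parallel expansion and learned prior are left out.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Search-tree node; children are keyed by a discrete latent action index."""
    visits: int = 0
    child_visits: Dict[int, int] = field(default_factory=dict)
    child_q: Dict[int, float] = field(default_factory=dict)


def may_widen(node: Node, c_pw: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive widening: permit a new child only while the number of
    children does not exceed c_pw * visits**alpha (assumed variant)."""
    return len(node.child_visits) <= c_pw * node.visits ** alpha


def uct_select(node: Node, c_uct: float = 1.4) -> int:
    """Pick the child that maximizes Q plus the UCT exploration bonus."""

    def score(action: int) -> float:
        n = node.child_visits[action]
        if n == 0:
            return math.inf  # unvisited children are tried first
        return node.child_q[action] + c_uct * math.sqrt(math.log(node.visits) / n)

    return max(node.child_visits, key=score)


def backup(node: Node, action: int, ret: float) -> None:
    """Update visit counts and the running-mean Q estimate for one child."""
    node.visits += 1
    n = node.child_visits.get(action, 0) + 1
    q = node.child_q.get(action, 0.0)
    node.child_visits[action] = n
    node.child_q[action] = q + (ret - q) / n
```

With these defaults, a node visited twice and holding two children can no longer widen (2 > √2), so the search falls back to `uct_select` among existing children until more visits accumulate. That is the trade-off between committing to known options and admitting new ones that the caption describes.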