Planning from Pixels using Inverse Dynamics Models

Keiran Paster, Sheila A. McIlraith, Jimmy Ba

TL;DR

This work tackles the challenge of planning from high-dimensional pixel observations by learning task-conditioned latent world models that predict action sequences to achieve goals. It introduces GLAMOR, which simultaneously learns an inverse dynamics model and an action prior to factor planning in a latent space, enabling efficient, heuristic-guided search via random shooting. The approach demonstrates strong performance and sample efficiency on diverse visual goal tasks in Atari and the DeepMind Control Suite, outperforming prior model-free methods in many settings. The findings highlight the value of a latent, goal-conditioned planning framework for fast adaptation to new tasks with sparse rewards and suggest avenues for extending to general rewards and stochastic environments.
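The planning idea summarized above can be illustrated with a small sketch. By Bayes' rule, $p(g \mid s, a_{1:k}) \propto p(a_{1:k} \mid s, g) / p(a_{1:k} \mid s)$, so a candidate action sequence can be scored by the log-ratio of an inverse dynamics model to an action prior. The code below is a toy illustration under stated assumptions, not the authors' implementation: the two models are hand-written stand-ins rather than learned networks, candidates are sampled uniformly, and every name (`score_plan`, `random_shooting`, `toy_inverse_logp`, `toy_prior_logp`) is hypothetical.

```python
import math
import random


def score_plan(actions, inverse_logp, prior_logp):
    """Sum over steps of log p(a_t | s, g, a_<t) - log p(a_t | s, a_<t).

    Both callables take (prefix, action) and return a log-probability.
    The state and goal are assumed to be baked into the callables.
    """
    total = 0.0
    for t, action in enumerate(actions):
        prefix = tuple(actions[:t])
        total += inverse_logp(prefix, action) - prior_logp(prefix, action)
    return total


def random_shooting(inverse_logp, prior_logp, num_actions, horizon,
                    num_samples, rng):
    """Sample random action sequences and keep the highest-scoring one.

    Assumption: candidates are drawn uniformly at random, which is the
    simplest form of random shooting.
    """
    best_plan, best_score = None, -math.inf
    for _ in range(num_samples):
        plan = [rng.randrange(num_actions) for _ in range(horizon)]
        score = score_plan(plan, inverse_logp, prior_logp)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score


# Toy stand-ins for the learned models (hypothetical, two discrete actions).
def toy_inverse_logp(prefix, action):
    # Conditioned on reaching the goal, action 1 is much more likely.
    return math.log(0.9 if action == 1 else 0.1)


def toy_prior_logp(prefix, action):
    # Without goal information, both actions are equally likely.
    return math.log(0.5)


plan, score = random_shooting(
    toy_inverse_logp, toy_prior_logp,
    num_actions=2, horizon=3, num_samples=200, rng=random.Random(0),
)
```

In this toy setting the score is highest for sequences made entirely of the action that the goal-conditioned model prefers. Dividing by the prior keeps the planner from favoring actions that are common regardless of the goal.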

Abstract

Learning task-agnostic dynamics models in high-dimensional observation spaces can be challenging for model-based RL agents. We propose a novel way to learn latent world models by learning to predict sequences of future actions conditioned on task completion. These task-conditioned models adaptively focus modeling capacity on task-relevant dynamics, while simultaneously serving as an effective heuristic for planning with sparse rewards. We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance compared to prior model-free approaches.

Paper Structure

This paper contains 29 sections, 1 theorem, 12 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Proposition A.1

Let $a_*(g)$ be an action that maximizes the probability $p(s=g|\cdot)$. If there exist two goals $g, g'$ such that $p(g) > 0$, $p(g') > 0$, and $p(s=g|a_*(g)) > p(s=g|a_*(g')) > 0$, then one-step GCSL does not converge to an optimal policy.

Figures (12)

  • Figure 1: The network architecture for the inverse dynamics model used in GLAMOR. ResNets are used to encode state features and an LSTM predicts the action sequence.
  • Figure 2: Both in Atari and on tasks from the DeepMind Control Suite, GLAMOR outperforms prior methods. The goal achievement rate is averaged over all games / control tasks and over three seeds. See \ref{fig:atari_curves} and \ref{fig:control_curves} in the appendix for more detailed training curves.
  • Figure 3: (a) The agent starts in the center and must travel to the goal tile. Top shows the rate at which the agent eventually achieved the goal and bottom shows the rate at which the agent achieved the goal with the shortest available path. The amount of compute used for planning is shown on the x-axis. As the planning budget increases, both the number of successfully reached goals and the number of goals achieved optimally improve substantially. Brighter means a higher achievement rate. (b) In "naive-end", the agent greedily tries to take a shortest path to the goal for $T$ timesteps and is evaluated at the end. In "plan-end", the agent explicitly constructs a plan to achieve the goal state at the end of its trajectory. GLAMOR (Ours) can choose to terminate its episode early.
  • Figure 4: Using intermediate information to guide the planning process helps GLAMOR achieve more goals than when it only looks at the estimated probability of reaching a goal at the end of the episode.
  • Figure 5: Hyperparameters used to train GLAMOR.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Proposition A.1
  • proof