IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control

Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, Olivier Delalleau

TL;DR

The paper tackles offline model-based reinforcement learning by proposing IQL-TD-MPC, which integrates Implicit Q-Learning with TD-MPC to robustly plan in a latent space trained on offline data. It further introduces a hierarchical Manager–Worker framework where a temporally abstract Manager outputs intent embeddings $g_t$ to augment any offline Worker, enabling longer-horizon reasoning. Empirical results on D4RL show that IQL-TD-MPC is a strong offline planner and that the Manager–Worker setup yields substantial gains on antmaze and maze2d tasks, though performance can deteriorate on fine-grained locomotion. The approach offers a general, modular pathway to boost a range of offline RL algorithms by injecting temporally abstract, plan-guiding signals learned through MPC planning in a latent space.

Abstract

Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.

Paper Structure (18 sections, 2 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Overview of our hierarchical framework. The Manager is a model-based IQL-TD-MPC agent (inspired by IQL, kostrikov2022iql, and TD-MPC, hansen2022tdmpc) that operates on a coarse timescale to generate intent embeddings $g_t$. To do so, the Manager performs Model Predictive Control over $H$ planning steps, each spanning $k$ environment steps (so $kH$ environment steps in total), using a learned policy $\pi^M_{\theta}$, dynamics model $f^M_{\theta}$, reward function $R^M_{\theta}$, and critic $Q^M_{\theta}$. Each intent $g_t$ is concatenated with the state $s_t$ and given to the Worker, which outputs the action $a_t$. The Worker can be any offline RL algorithm.
  • Figure 2: Visualization of an episode of the Behavioral Cloning (BC) agent on the antmaze-large-play-v2 task. On the left, without intent embeddings, the ant gets stuck close to the start of the maze, never reaching the goal. On the right, the ant reaches the goal, guided by the intent embeddings whose decoding is visualized in green. We see that the intent embeddings act as latent-space subgoals.
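
The data flow described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`env_steps_covered`, `augment_state`, `worker_step`) and the `dummy_manager` and `dummy_worker` callables are hypothetical placeholders standing in for a pre-trained IQL-TD-MPC Manager and an off-the-shelf offline RL Worker. The only parts taken from the caption are that $H$ planning steps cover $kH$ environment steps and that the Worker receives the state $s_t$ concatenated with the intent $g_t$.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def env_steps_covered(planning_steps: int, k: int) -> int:
    """Environment steps spanned by H Manager planning steps (k env steps each)."""
    return planning_steps * k


def augment_state(state: Sequence[float], intent: Sequence[float]) -> Vector:
    """Concatenate the state s_t with the intent embedding g_t."""
    return list(state) + list(intent)


def worker_step(
    state: Sequence[float],
    manager: Callable[[Sequence[float]], Vector],
    worker: Callable[[Sequence[float]], Vector],
) -> Vector:
    """One control step: query the Manager for g_t, then act on [s_t, g_t]."""
    g_t = manager(state)
    return worker(augment_state(state, g_t))


# Hypothetical stand-ins (assumptions, not part of the paper): a real Manager
# would obtain g_t by planning in latent space, and a real Worker would be a
# trained offline RL policy.
def dummy_manager(state: Sequence[float]) -> Vector:
    return [0.0, 1.0]


def dummy_worker(augmented_state: Sequence[float]) -> Vector:
    return [sum(augmented_state)]


action = worker_step([1.0, 2.0], dummy_manager, dummy_worker)
```

Because the Worker only ever sees a longer input vector, any offline RL algorithm whose policy accepts a flat state can in principle be used unchanged, which matches the modularity the caption describes.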