IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control
Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, Olivier Delalleau
TL;DR
The paper tackles offline model-based reinforcement learning by proposing IQL-TD-MPC, which integrates Implicit Q-Learning with TD-MPC to plan in a latent space learned from offline data. It further introduces a hierarchical Manager–Worker framework in which a temporally abstract Manager outputs intent embeddings $g_t$ that augment the state input of any offline Worker, enabling longer-horizon reasoning. Empirical results on D4RL show that IQL-TD-MPC is a strong offline planner and that the Manager–Worker setup yields substantial gains on antmaze and maze2d tasks, though performance can deteriorate on fine-grained locomotion tasks. The approach offers a general, modular way to improve a range of offline RL algorithms by injecting temporally abstract, plan-guiding signals obtained through MPC planning in a latent space.
Abstract
Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC Manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.
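The state augmentation described above can be illustrated with a minimal sketch. It assumes the intent embedding is simply concatenated to each observation before the data is handed to the Worker. The function names, the toy Manager, and the per-step recomputation of the intent are hypothetical stand-ins chosen for illustration, not the paper's implementation.

```python
import numpy as np


def augment_with_intents(observations, manager_fn):
    """Concatenate an intent embedding g_t to every observation s_t.

    observations: array of shape (T, obs_dim).
    manager_fn: hypothetical callable mapping one observation to a 1-D
        intent embedding. In the paper this role is played by the
        pre-trained IQL-TD-MPC Manager; here it is a placeholder.
    Returns an array of shape (T, obs_dim + intent_dim).
    """
    augmented = []
    for obs in observations:
        intent = manager_fn(obs)  # assumption: one intent per time step
        augmented.append(np.concatenate([obs, intent]))
    return np.stack(augmented)


def toy_manager(obs):
    """Placeholder Manager: a fixed nonlinear map to a 2-D embedding."""
    return np.tanh(obs[:2])


# Toy "offline dataset" of 6 observations with 3 features each.
observations = np.arange(18, dtype=np.float64).reshape(6, 3)
augmented = augment_with_intents(observations, toy_manager)
```

Under these assumptions, any off-the-shelf offline RL algorithm could be trained on `augmented` in place of `observations` without changes to the algorithm itself, which is the modularity the abstract refers to.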
