GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models
Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana
TL;DR
GOPlan addresses data scarcity and out-of-distribution (OOD) goal generalization in offline goal-conditioned RL (GCRL) with a two-stage, model-based approach. The first stage pretrains a multi-modal prior policy with an advantage-weighted CGAN and learns an ensemble of dynamics models for planning. The second stage fine-tunes the policy through reanalysis, which generates high-quality imagined trajectories for both intra-trajectory and inter-trajectory goals. Planning scores imagined trajectories using exponential weighting and filters them by model uncertainty to keep trajectories within the models' support. Empirically, GOPlan achieves state-of-the-art performance on offline multi-goal navigation and manipulation tasks, performs strongly under small data budgets, and generalizes notably to OOD goals; ablations confirm the importance of the two-stage design, the choice of prior, and uncertainty-aware reanalysis. The work demonstrates that integrating goal-conditioned policy learning with planning in learned dynamics models provides a practical path to robust, data-efficient offline RL in real-world, multi-goal settings.
Abstract
Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
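
The planning step summarized above (exponential weighting of imagined trajectories plus an uncertainty filter) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of ensemble variance as the uncertainty measure, the threshold `max_uncertainty`, the `temperature` parameter, and the choice to average the first actions of the surviving candidates are all assumptions made for illustration.

```python
import math


def ensemble_disagreement(predictions):
    """Assumed uncertainty measure: the largest per-dimension variance
    of the next-state predictions across ensemble members."""
    n = len(predictions)
    dims = len(predictions[0])
    worst = 0.0
    for d in range(dims):
        mean = sum(p[d] for p in predictions) / n
        var = sum((p[d] - mean) ** 2 for p in predictions) / n
        worst = max(worst, var)
    return worst


def exp_weights(returns, temperature=1.0):
    """Normalised weights proportional to exp(return / temperature).
    Subtracting the maximum keeps the exponentials numerically stable."""
    top = max(returns)
    unnorm = [math.exp((r - top) / temperature) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def select_action(candidates, max_uncertainty, temperature=1.0):
    """candidates: list of dicts with keys 'first_action' (list of floats),
    'ret' (imagined return) and 'uncertainty' (e.g. ensemble disagreement).
    Returns the exponentially weighted average of the first actions of the
    candidates that pass the uncertainty filter, or None if none pass."""
    kept = [c for c in candidates if c["uncertainty"] <= max_uncertainty]
    if not kept:
        return None
    weights = exp_weights([c["ret"] for c in kept], temperature)
    dim = len(kept[0]["first_action"])
    return [
        sum(w * c["first_action"][d] for w, c in zip(weights, kept))
        for d in range(dim)
    ]
```

In this sketch, a candidate with a high imagined return but high model disagreement is discarded before weighting, so it cannot pull the selected action outside the region the learned models are assumed to cover.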
