GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana

TL;DR

GOPlan tackles two weaknesses of offline goal-conditioned RL (GCRL), limited data and poor generalization to out-of-distribution (OOD) goals, with a two-stage, model-based approach. The first stage pretrains a multi-modal prior policy with an advantage-weighted conditioned GAN (CGAN) and learns an ensemble of dynamics models for planning. The second stage finetunes the policy by reanalysis, which plans with the learned models to generate high-quality imagined trajectories for both intra-trajectory and inter-trajectory goals. The planner scores imagined trajectories with exponential weighting and discards those with high model uncertainty, so that the retained trajectories stay within the models' support. Empirically, GOPlan achieves state-of-the-art performance on offline multi-goal navigation and manipulation tasks, performs strongly under small data budgets, and generalizes to OOD goals; ablations confirm the importance of the two-stage design, the choice of prior, and uncertainty-aware reanalysis. The results suggest that combining goal-conditioned policy learning with planning in learned dynamics models is a practical path to robust, data-efficient offline RL in multi-goal settings.

Abstract

Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
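
The planning step summarized above (score imagined trajectories with exponential weighting, discard those the learned models disagree on) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the scalar states and actions, the negative-distance return, the ensemble-gap uncertainty measure, and the names `plan`, `temperature`, and `max_disagreement` are all assumptions made for the example.

```python
import math


def ensemble_rollout(models, state, actions):
    """Roll one action sequence through every model; one state list per model."""
    trajectories = []
    for model in models:
        s, visited = state, []
        for a in actions:
            s = model(s, a)
            visited.append(s)
        trajectories.append(visited)
    return trajectories


def disagreement(trajectories):
    """Assumed uncertainty proxy: largest gap between model predictions at any step."""
    return max(max(step) - min(step) for step in zip(*trajectories))


def plan(models, state, goal, candidates, temperature=1.0, max_disagreement=0.5):
    """Exponentially weighted average of first actions over low-uncertainty candidates."""
    scored = []
    for actions in candidates:
        trajectories = ensemble_rollout(models, state, actions)
        if disagreement(trajectories) > max_disagreement:
            continue  # models disagree too much: treat the rollout as unreliable
        final = sum(t[-1] for t in trajectories) / len(trajectories)
        ret = -abs(final - goal)  # assumed return: closer to the goal is better
        scored.append((ret, actions[0]))
    if not scored:
        return None  # every candidate was filtered out
    best = max(ret for ret, _ in scored)
    # Subtracting the best return keeps exp() numerically stable.
    weights = [math.exp((ret - best) / temperature) for ret, _ in scored]
    weighted_sum = sum(w * a for w, (_, a) in zip(weights, scored))
    return weighted_sum / sum(weights)
```

A lower `temperature` concentrates the weight on the best-scoring candidate, while a higher one averages more evenly across candidates. Returning `None` when every candidate is filtered out is one possible convention; a caller could instead fall back to the prior policy's action.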

Paper Structure (35 sections, 8 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: The two-stage framework of GOPlan: pretraining a prior policy and a group of dynamics models, and finetuning policy with imagined trajectories generated by the reanalysis method.
  • Figure 2: An example of modeling the multi-modal behavior policy while maximizing average rewards. The $x$-axis represents the state, and the $y$-axis represents the multi-modal action. (a-1) shows the action distribution of the offline dataset. (a-2) shows the corresponding reward distribution. (b-1) and (b-2) illustrate the action distributions generated by Gaussian and Weighted Gaussian. (c-1) and (c-2) illustrate action distributions from CVAE and Weighted CVAE. (d-1) and (d-2) illustrate action distributions from CGAN and Weighted CGAN.
  • Figure 3: Illustration of intra-trajectory and inter-trajectory reanalysis. There are six scenarios: (a-1) the imagined trajectory is valid and better than the original trajectory; (a-2) the imagined trajectory fails to reach the goal within the same number of steps as the original trajectory; (b-1) a valid imagined trajectory connects the state to an inter-trajectory goal; (b-2) a valid imagined trajectory does not achieve the desired goal; (a-3) and (b-3) the imagined trajectories are invalid because of large uncertainty.
  • Figure 4: Average performance on OOD generalization tasks over 5 random seeds. The error bars depict the upper and lower bounds of the returns within each task group.
  • Figure 5: Stochastic environment evaluation.
  • ...and 4 more figures
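
The six scenarios in the Figure 3 caption amount to a small case analysis, which might be written as the sketch below. It only restates the caption: the function name `classify_reanalysis`, the `max_uncertainty` threshold, and the reading of "better than the original" as "reaches the goal in no more steps than the original" are assumptions, and what is done with each case afterwards is not covered here.

```python
def classify_reanalysis(kind, uncertainty, reached_goal,
                        steps=None, original_steps=None, max_uncertainty=0.5):
    """Map an imagined trajectory to a Figure 3 scenario label.

    kind: "a" for intra-trajectory goals, "b" for inter-trajectory goals.
    steps / original_steps are only needed when kind == "a".
    """
    if uncertainty > max_uncertainty:
        return kind + "-3"  # invalid: model uncertainty too large
    if kind == "a":
        # Assumed reading of "better": goal reached within the original step count.
        better = reached_goal and steps <= original_steps
        return "a-1" if better else "a-2"
    return "b-1" if reached_goal else "b-2"
```

For example, `classify_reanalysis("a", 0.1, True, steps=3, original_steps=5)` yields `"a-1"`, while the same call with an uncertainty of `0.9` yields `"a-3"`.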