Goal-Reaching Policy Learning from Non-Expert Observations via Effective Subgoal Guidance

RenMing Huang, Shaochong Liu, Yunqiang Pei, Peng Wang, Guoqing Wang, Yang Yang, Hengtao Shen

TL;DR

This work develops a diffusion strategy-based high-level policy to generate reasonable subgoals as waypoints, preferring states that more easily lead to the final goal, and learns state-goal value functions to encourage efficient subgoal reaching.

Abstract

In this work, we address the challenging problem of long-horizon goal-reaching policy learning from non-expert, action-free observation data. Unlike fully labeled expert data, our data is more accessible and avoids the costly process of action labeling. Additionally, compared to online learning, which often involves aimless exploration, our data provides useful guidance for more efficient exploration. To achieve our goal, we propose a novel subgoal guidance learning strategy. The motivation behind this strategy is that long-horizon goals offer limited guidance for efficient exploration and accurate state transition. We develop a diffusion strategy-based high-level policy to generate reasonable subgoals as waypoints, preferring states that more easily lead to the final goal. Additionally, we learn state-goal value functions to encourage efficient subgoal reaching. These two components naturally integrate into the off-policy actor-critic framework, enabling efficient goal attainment through informative exploration. We evaluate our method on complex robotic navigation and manipulation tasks, demonstrating a significant performance advantage over existing methods. Our ablation study further shows that our method is robust to observation data with various corruptions.
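
As a rough illustration of how the two components described above could interact, the following minimal sketch uses a toy state-goal value function to pick a waypoint and to shape the exploration reward. Everything in it is an assumption made for illustration: `toy_value` (negative Euclidean distance) stands in for the learned value function, `pick_subgoal` selects from a fixed candidate list rather than sampling from a diffusion-based high-level policy, and the difference-of-values bonus weighted by `beta` is one plausible form of exploration reward, not necessarily the one used in the paper.

```python
import math


def toy_value(state, goal):
    """Hypothetical stand-in for a learned state-goal value function.

    Returns the negative Euclidean distance, so states closer to the
    goal receive a higher value.
    """
    return -math.dist(state, goal)


def pick_subgoal(goal, candidates, value_fn=toy_value):
    """Pick the candidate waypoint from which the final goal looks easiest to reach.

    Assumption: a simple argmax over given candidates replaces the
    diffusion-based high-level policy mentioned in the abstract.
    """
    return max(candidates, key=lambda c: value_fn(c, goal))


def shaped_reward(env_reward, state, next_state, subgoal, beta=1.0, value_fn=toy_value):
    """Add an exploration bonus for progress toward the subgoal.

    Assumption: the bonus is the change in state-subgoal value over one
    transition, scaled by the hypothetical weight `beta`.
    """
    progress = value_fn(next_state, subgoal) - value_fn(state, subgoal)
    return env_reward + beta * progress


if __name__ == "__main__":
    state, goal = (0.0, 0.0), (10.0, 0.0)
    candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    subgoal = pick_subgoal(goal, candidates)
    print("subgoal:", subgoal)
    # A step toward the subgoal earns a positive bonus even though the
    # sparse environment reward is still zero.
    print("bonus toward:", shaped_reward(0.0, state, (0.5, 0.0), subgoal))
    print("bonus away:", shaped_reward(0.0, state, (-0.5, 0.0), subgoal))
```

In this toy setting, a transition that moves toward the chosen waypoint is rewarded before the sparse task reward is ever observed, which is the intuition behind using subgoals for informative exploration.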

Paper Structure

This paper contains 23 sections, 17 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Overview of EGR-PO. (a) Our method is composed of two key learning components: a state-goal value function designed for informative exploration and a high-level policy that generates reasonable subgoals. (b) The two components are integrated into the actor-critic method: the learned state-goal value function provides exploration rewards that encourage meaningful exploration, and the subgoals provide clear guidance signals.
  • Figure 2: We study robotic navigation and manipulation tasks with sparse rewards.
  • Figure 3: Comparison with online learning methods on robotic manipulation and navigation tasks. Shaded regions denote 95% confidence intervals across 5 random seeds. Best viewed in color.
  • Figure 4: Comparison with offline pre-training and online fine-tuning methods. Shaded regions denote 95% confidence intervals across 5 random seeds. Best viewed in color.
  • Figure 5: Visualizations of the agent's exploration behaviors on antmaze-large. The dots are uniformly sampled from the online replay buffer and colored by the training environment step. Each visualization samples 512 points from at most 120K environment steps. The results show that our method achieves higher learning efficiency through informative exploration.
  • ...and 5 more figures