Goal-conditioned Offline Reinforcement Learning through State Space Partitioning

Mianchu Wang; Yue Jin; Giovanni Montana

TL;DR

This work tackles offline goal-conditioned reinforcement learning under sparse rewards and distribution shift by introducing DAWOG, a dual-advantage weighting scheme. It combines a goal-conditioned advantage with a target-region advantage derived from a state-space partition based on the value function, shaping a short-horizon objective toward higher-valued regions. The method comes with a theoretical guarantee that the learned policy is never worse than the behavior policy and demonstrates strong empirical gains across Grid World, AntMaze, and Gym robotics benchmarks. The approach is practical and scalable, and it shows robust performance under hyperparameter variations, with potential extensions to adaptive partitioning and online settings. Overall, DAWOG advances offline GCRL by addressing multi-modality and distribution shift through a principled, partition-based inductive bias.

Abstract

Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets. This is a particularly difficult setup, especially when learning to achieve multiple different goals or outcomes under a given scenario with only sparse rewards. For offline learning of goal-conditioned policies via supervised learning, previous work has shown that an advantage weighted log-likelihood loss guarantees monotonic policy improvement. In this work we argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems. The latter is particularly severe in long-horizon tasks where finding a unique and optimal policy that goes from a state to the desired goal is challenging as there may be multiple and potentially conflicting solutions. To tackle these challenges, we propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias: given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Empirically, we demonstrate that the proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL (DAWOG), outperforms several competing offline algorithms in commonly used benchmarks. Analytically, we offer a guarantee that the learnt policy is never worse than the underlying behaviour policy.
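The weighting idea in the abstract can be illustrated with a minimal sketch of a dual-advantage weighted log-likelihood loss. This is not the authors' implementation: the function name, the array-based interface, and the weight clipping (`max_weight`) are assumptions made for illustration, and the advantages are taken as precomputed inputs.

```python
import numpy as np

def dual_advantage_weighted_nll(log_probs, adv_goal, adv_region,
                                beta=1.0, beta_tilde=1.0, max_weight=10.0):
    """Weighted negative log-likelihood over a batch of (s, a, g) samples.

    log_probs  : log pi_theta(a | s, g) for each offline sample
    adv_goal   : goal-conditioned advantage estimates
    adv_region : target-region advantage estimates
    All names and the clipping value are hypothetical.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    w = (beta * np.asarray(adv_goal, dtype=float)
         + beta_tilde * np.asarray(adv_region, dtype=float))
    # Exponential weights; clipping is an assumption for numerical stability.
    weights = np.minimum(np.exp(w), max_weight)
    # Samples with higher combined advantage contribute more to the loss.
    return -np.mean(weights * log_probs)
```

With both advantages set to zero, every weight equals one and the loss reduces to ordinary behaviour cloning, which is the "no action weighting" baseline.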

Paper Structure (26 sections, 3 theorems, 47 equations, 14 figures, 1 table, 1 algorithm)

Introduction
Related work
Preliminaries
Methods
Target region advantage function
The DAWOG algorithm
Policy improvement guarantees
Experimental results
Tasks and datasets
Grid World
AntMaze navigation
Gym robotics
Implementation details
Competing methods
Performance comparisons and analysis
...and 11 more sections

Key Result

Proposition 1

DAWOG learns a policy $\pi_\theta$ to minimize the KL-divergence from $\tilde{\pi}_{dual}(a \mid s, g) = \exp\big(w + N(s, g)\big)\, \pi_b(a \mid s, g)$, where $w = \beta A^{\pi_b}(s, a, g) + \tilde{\beta} \tilde{A}^{\pi_b}(s, a, G(s,g))$, $G(s,g)$ is the target region, and $N(s, g)$ is a normalizing factor ensuring that $\sum_{a \in \mathcal{A}} \tilde{\pi}_{dual}(a \mid s, g)=1$.
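For a single state-goal pair with a small discrete action set, the target distribution can be computed explicitly. The sketch below assumes the exponentially reweighted form $\tilde{\pi}_{dual} \propto \exp(w)\,\pi_b$, which is consistent with the normalization condition in the proposition. The function name and inputs are hypothetical, and the advantages are assumed to be given.

```python
import numpy as np

def dual_target_distribution(pi_b, adv_goal, adv_region,
                             beta=1.0, beta_tilde=1.0):
    """Return (pi_dual, N) for one (s, g) pair over a discrete action set.

    pi_b       : behaviour policy probabilities, one entry per action
    adv_goal   : goal-conditioned advantage per action
    adv_region : target-region advantage per action
    N is the log-normalizer chosen so that pi_dual sums to one.
    """
    pi_b = np.asarray(pi_b, dtype=float)
    w = (beta * np.asarray(adv_goal, dtype=float)
         + beta_tilde * np.asarray(adv_region, dtype=float))
    w_max = w.max()
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = pi_b * np.exp(w - w_max)
    total = unnorm.sum()
    pi_dual = unnorm / total
    # N satisfies sum_a exp(w + N) * pi_b = 1.
    N = -(np.log(total) + w_max)
    return pi_dual, N

# Two actions, uniform behaviour policy; the first action has the higher weight.
pi_dual, N = dual_target_distribution([0.5, 0.5], [np.log(3.0), 0.0], [0.0, 0.0])
```

In this example the unnormalized weights are in ratio 3:1, so the reweighted distribution shifts probability mass toward the first action while staying on the support of the behaviour policy.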

Figures (14)

Figure 1: Visualization of trajectories (in blue) across various maze environments. These trajectories are produced by policies trained through supervised learning using different action weighting schemes: no action weighting (left), goal-conditioned advantage weighting (middle), and dual-advantage weighting (right). The task involves an agent (represented as an ant) navigating from a starting position (orange circle) to an end goal (red circle). Branching points near the circles highlight areas where the multi-modality issue is pronounced. Our proposed dual-advantage weighting scheme significantly mitigates this issue. The green circle indicates the optimal path, while the red circle marks a suboptimal route.
Figure 2: Comparison of normalized weights from various weighting schemes. Referring to Figure 1, the red circles demarcate optimal and sub-optimal areas given the target. The histograms in this figure illustrate that the dual-advantage scheme more effectively differentiates states in the optimal area from those in the sub-optimal area, allocating higher weights to the 'optimal' area states.
Figure 3: Illustration of the two advantage functions used by DAWOG for a simple navigation task. First, a goal-conditioned advantage is learned using only relabeled offline data. Then, a target-region advantage is obtained by partitioning the states according to their goal-conditioned value function, identifying a target region, and rewarding actions leading to this region in the smallest possible number of steps. DAWOG updates the policy to imitate the offline data through an exponential weighting factor that depends on both advantages.
Figure 4: An illustration of goal-conditioned state space partitions for two simple Grid World navigation tasks. In each instance, the desired goal is represented by a red circle. In these environments, each state simply corresponds to a position on the grid and, in the top row, is color-coded according to its goal-conditioned value. In the lower row, states sharing similar values have been merged to form a partition. For any given state, the proposed target region advantage up-weights actions that move the agent directly towards a neighboring region with higher value.
Figure 5: Training curves for different tasks using different algorithms, each one implementing a different weighting scheme: dual-advantage, no advantage, only goal-conditioned advantage, and only the target region advantage. The solid line and the shaded area respectively present the mean and the standard deviation computed from $4$ independent runs.
...and 9 more figures
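The value-based partition described in the captions of Figures 3 and 4 can be sketched as follows. This is one plausible reading, not the paper's exact construction: it assumes equal-width value bins, and that the target region of a state is the next higher-valued bin. Both function names are hypothetical.

```python
import numpy as np

def partition_by_value(values, num_regions):
    """Assign each state a region index in [0, num_regions - 1].

    values : goal-conditioned value estimates V(s, g) for a fixed goal g
    Equal-width binning over the observed value range is an assumption.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All states share the same value, so they form a single region.
        return np.zeros(values.shape, dtype=int)
    scaled = (values - lo) / (hi - lo)
    # The highest-valued state would fall in bin num_regions, so cap the index.
    return np.minimum((scaled * num_regions).astype(int), num_regions - 1)

def target_region(region_index, num_regions):
    """Index of the neighbouring higher-valued region (capped at the top region)."""
    return np.minimum(np.asarray(region_index) + 1, num_regions - 1)

regions = partition_by_value([0.0, 0.1, 0.5, 0.9, 1.0], num_regions=4)
targets = target_region(regions, num_regions=4)
```

States in the top region keep that region as their target, which corresponds to the region containing the goal under this binning.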

Theorems & Definitions (9)

Definition 1
Definition 2
Definition 3
Definition 4
Proposition 1
proof
Proposition 2
Proposition 3
proof
