Goal-Conditioned Reinforcement Learning from Sub-Optimal Data on Metric Spaces

Alfredo Reichlin; Miguel Vasco; Hang Yin; Danica Kragic

TL;DR

Goal: learn optimal goal-reaching behavior from severely sub-optimal offline data under sparse rewards, deterministic transitions, and invertible actions.

Approach: MetricRL learns a distance-monotonic latent map $\phi: S \to Z$ so that $\tilde{V}(s) = \gamma^{d_Z(\phi(s), \phi(s_g))} r_g$ informs a weighted imitation policy, avoiding TD targets and out-of-distribution issues. The paper proves that, when $\phi$ is distance monotonic, any policy greedy on $\tilde{V}$ is optimal in these MDPs.

Contributions: a formal definition of distance monotonicity, a practical loss $\mathcal{L}_\theta(D)$ for learning distance-monotonic (DM) representations, a provable optimality guarantee for the resulting greedy policy, and empirical results across standard offline goal-reaching tasks, including image observations handled via a super-state trick.

Findings: MetricRL consistently recovers near-optimal policies from severely sub-optimal offline data and scales to high-dimensional observations, outperforming baselines such as CQL, BCQ, BEAR, IQL, and QRL.

Impact: a principled metric-space framework for tackling distribution shift in offline RL for goal-reaching tasks, with implications for safer, scalable deployment in real-world robotics and control problems.
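The value estimate $\tilde{V}(s)=\gamma^{d_Z(\phi(s),\phi(s_g))} r_g$ and the greedy policy built on it can be illustrated on a toy problem. The sketch below is not the authors' implementation: the 1-D corridor, the identity embedding `phi`, and the constants `GAMMA`, `R_GOAL`, `GOAL`, and `N_STATES` are all assumptions chosen for illustration, and no representation is learned.

```python
# Hypothetical 1-D corridor: states 0..5, goal at state 5, actions -1 / +1.
GAMMA = 0.9      # assumed discount factor
R_GOAL = 1.0     # assumed sparse reward at the goal
GOAL = 5
N_STATES = 6

def phi(s):
    """Assumed embedding: identity map into a 1-D latent space."""
    return (float(s),)

def d_z(z1, z2):
    """Euclidean distance in the latent space."""
    return sum((a - b) ** 2 for a, b in zip(z1, z2)) ** 0.5

def v_tilde(s, goal=GOAL):
    """V~(s) = gamma ** d_Z(phi(s), phi(s_g)) * r_g."""
    return GAMMA ** d_z(phi(s), phi(goal)) * R_GOAL

def step(s, a):
    """Deterministic transition, clipped at the corridor ends."""
    return min(max(s + a, 0), N_STATES - 1)

def greedy_action(s):
    """Pick the action whose successor has the highest V~."""
    return max((-1, +1), key=lambda a: v_tilde(step(s, a)))

def rollout(s, max_steps=20):
    """Follow the greedy policy until the goal or the step budget is reached."""
    path = [s]
    while s != GOAL and len(path) <= max_steps:
        s = step(s, greedy_action(s))
        path.append(s)
    return path

print(rollout(0))  # [0, 1, 2, 3, 4, 5]
```

Because latent distance equals the number of steps to the goal in this toy embedding, moving greedily on `v_tilde` follows a shortest path. No TD backup is involved: the value is read directly off the latent distance.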

Abstract

We study the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning under sparse rewards, invertible actions and deterministic transitions. To mitigate the effects of distribution shift, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes. We introduce distance monotonicity as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
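To make "weighted imitation learning" concrete, here is a minimal tabular sketch. The text above does not spell out the weighting scheme, so the exponentiated-advantage weight used here is the standard advantage-weighted form, swapped in as an assumption. The chain environment, the latent distance `abs(s - GOAL)`, the temperature `BETA`, and the toy `dataset` are likewise hypothetical.

```python
import math
from collections import defaultdict

GAMMA, R_GOAL, GOAL = 0.9, 1.0, 3   # assumed constants
BETA = 0.05                         # assumed temperature for the weights

def v_tilde(s):
    # Assumed latent distance to the goal on a hypothetical 1-D chain.
    return GAMMA ** abs(s - GOAL) * R_GOAL

def imitation_weight(s, s_next):
    # Advantage-style score under V~ with zero intermediate reward (assumption).
    advantage = GAMMA * v_tilde(s_next) - v_tilde(s)
    return math.exp(advantage / BETA)

def fit_tabular_policy(dataset):
    # Accumulate weights per (state, action) and keep the heaviest action.
    scores = defaultdict(lambda: defaultdict(float))
    for s, a, s_next in dataset:
        scores[s][a] += imitation_weight(s, s_next)
    return {s: max(acts, key=acts.get) for s, acts in scores.items()}

# Sub-optimal data: in every state the behavior policy mostly moves away
# from the goal (action -1), with one goal-directed transition (action +1).
dataset = [
    (0, -1, 0), (0, -1, 0), (0, +1, 1),
    (1, -1, 0), (1, -1, 0), (1, +1, 2),
    (2, -1, 1), (2, -1, 1), (2, +1, 3),
]
policy = fit_tabular_policy(dataset)
```

Plain behavior cloning on this dataset would copy the majority action `-1` in every state. Under the assumed weighting, the rarer goal-directed transitions dominate, and every action the fitted policy selects was observed in the data.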

Paper Structure (26 sections, 1 theorem, 16 equations, 16 figures, 1 table, 1 algorithm)

Introduction
Preliminaries and Assumptions
Method
Distance Monotonicity
Value Function Approximation
MetricRL
Practical Implementation
Results
Discussion
Metric Space Ablation
Limitations
Related Work
Conclusions
Appendix
Sparse Goal-Conditioned Value Functions in Isometric Spaces
...and 11 more sections

Key Result

Theorem 3.2

If the MDP is deterministic, sparse, and goal-conditioned, then any policy that is greedy with respect to $\tilde{V}$ is optimal if $\phi$ is distance monotonic.
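Distance monotonicity can be checked empirically on a small graph, in the spirit of the "ratio of distance monotonic triplets" reported in Figure 3. Definition 3.1 is not reproduced on this page, so the sketch below assumes one plausible reading: whenever a state `s2` is strictly closer to `s1` than `s3` in shortest-path steps, it should also be strictly closer in the latent space. The graph `adj` and the two embeddings `raw` and `unrolled` are hypothetical.

```python
from collections import deque
from itertools import permutations
import math

def shortest_paths(adj):
    """All-pairs step counts on an unweighted graph via breadth-first search."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return dist

def monotonic_ratio(adj, emb):
    """Fraction of ordered triplets whose latent ordering matches the graph ordering."""
    d_s = shortest_paths(adj)
    ok = total = 0
    for s1, s2, s3 in permutations(adj, 3):
        if d_s[s1][s2] < d_s[s1][s3]:
            total += 1
            ok += math.dist(emb[s1], emb[s2]) < math.dist(emb[s1], emb[s3])
    return ok / total

# Hypothetical U-shaped corridor 0-1-2-3: the two ends are adjacent in the
# plane but three steps apart along the corridor.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
raw = {0: (0.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 1.0), 3: (1.0, 0.0)}
unrolled = {s: (float(s),) for s in adj}  # corridor straightened into a line

print(monotonic_ratio(adj, unrolled))  # 1.0
print(monotonic_ratio(adj, raw) < 1.0)  # True
```

The raw planar coordinates violate the assumed condition because Euclidean distance places the two ends of the corridor as close together as neighboring cells. The unrolled embedding satisfies it for every triplet.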

Figures (16)

Figure 1: Average reward on Minigrid DoorKey [MinigridMiniworld23] as a function of the expected reward present in the offline dataset. We contribute MetricRL (red line), a novel goal-conditioned offline RL agent able to learn near-optimal behavior from severely sub-optimal datasets.
Figure 2: We explore a form of symmetry in representation learning for goal-conditioned offline reinforcement learning: we learn a metric space in which Euclidean distances between the representations of states ($z, z', z''$) are related to the value function of the agent. We call our approach MetricRL. In the Minigrid DoorKey environment, moving greedily to adjacent states translates to the optimal policy (red line) to reach the goal (in green). Our objective is to preserve the local structure of adjacent representations (orange arrows, left) while maximizing the separation between non-adjacent ones (blue arrows, left).
Figure 3: Optimizing Equation \ref{eq:loss} increases the ratio of distance monotonic triplets (blue curve) on Maze2D (Large). Distance monotonicity is also correlated with an increase in the average return of the agent (orange curve).
Figure 4: Estimated value function for different methods using a dataset collected from a random policy in Maze2D Large. We highlight (in red) how MetricRL is the only method able to correctly assign low values to states that are close in the Euclidean space but not in terms of distance to the goal. Additionally, we highlight (in orange) that our proposed distance monotonicity in complex topologies is not equivalent to isometries, yet we are still able to recover provably optimal policies (as we discuss in Section \ref{sec:method:metric}). All values are normalized.
Figure 5: Visualization of the two-dimensional latent space of MetricRL in the DoorKey environment when considering state features (left) and image (right) observations. We observe that the addition of a super-state (red star on the right figure) for image observations results in a significant change in the structure of the embedded graph as each set of states with a different (and visible) goal gets separated (orange dots). Nonetheless, in both cases, the optimal policy still follows a geodesic in the graph: from the starting state (blue) the agent needs to pick up the key (yellow) to open the locked door (orange) and move to the goal state (green).
...and 11 more figures

Theorems & Definitions (3)

Definition 3.1
Theorem 3.2
proof
