Goal-Conditioned Reinforcement Learning from Sub-Optimal Data on Metric Spaces
Alfredo Reichlin, Miguel Vasco, Hang Yin, Danica Kragic
TL;DR
Goal: learn optimal goal-reaching behavior from severely sub-optimal offline data under sparse rewards, deterministic transitions, and invertible actions.
Approach: MetricRL learns a distance-monotonic (DM) latent map ${\phi:S\to Z}$ so that the value estimate ${\tilde{V}(s)=\gamma^{d_Z(\phi(s),\phi(s_g))} r_g}$ can weight an imitation-learning policy, which avoids TD targets and out-of-distribution issues. The paper proves that, under distance monotonicity, any policy greedy on ${\tilde{V}}$ is optimal in these MDPs.
Contributions: a formal definition of distance monotonicity; a practical loss ${\mathcal{L}_\theta(D)}$ for learning DM representations; a provable optimality guarantee for the resulting greedy policy; and strong empirical results across standard offline goal-conditioned tasks, including image observations handled via a super-state trick.
Findings: MetricRL consistently recovers near-optimal policies from severely sub-optimal offline data and scales to high-dimensional observations, outperforming baselines such as CQL, BCQ, BEAR, IQL, and QRL.
Impact: a principled metric-space framework for tackling distribution shift in offline RL for goal-reaching tasks, with implications for safer, more scalable deployment in real-world robotics and control problems.
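To make the value estimate ${\tilde{V}(s)=\gamma^{d_Z(\phi(s),\phi(s_g))} r_g}$ concrete, here is a minimal Python sketch on a toy one-dimensional chain. It is an illustration under stated assumptions, not the authors' implementation: `phi` is a hand-written stand-in for the learned map, $d_Z$ is assumed to be the Euclidean distance, and the names `approx_value`, `greedy_action`, `step`, and `ACTIONS` are hypothetical. It covers only the value estimate and the greedy choice, not the weighted imitation step or the loss ${\mathcal{L}_\theta(D)}$.

```python
import numpy as np

def approx_value(z, z_goal, gamma=0.95, r_goal=1.0):
    """V~(s) = gamma ** d_Z(phi(s), phi(s_g)) * r_g, with d_Z assumed Euclidean."""
    dist = np.linalg.norm(np.asarray(z, dtype=float) - np.asarray(z_goal, dtype=float))
    return (gamma ** dist) * r_goal

def greedy_action(state, goal, actions, step, phi, gamma=0.95):
    """Return the action whose successor state has the highest approximate value."""
    z_goal = phi(goal)
    values = [approx_value(phi(step(state, a)), z_goal, gamma) for a in actions]
    return actions[int(np.argmax(values))]

# Toy deterministic chain: states are integers, and the two actions undo each other.
ACTIONS = [+1, -1]

def step(s, a):
    return s + a

def phi(s):
    # Hypothetical stand-in for a learned map: on this chain the latent
    # distance |s - s_g| equals the true number of steps to the goal.
    return np.array([float(s)])

state, goal = 0, 3
path = [state]
while state != goal:
    state = step(state, greedy_action(state, goal, ACTIONS, step, phi))
    path.append(state)

print(path)
```

On this chain the latent distance coincides with the true step count, so the greedy rollout moves straight toward the goal. With a learned map, the same greedy rule would depend on how well the representation preserves the ordering of distances.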
Abstract
We study the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning under sparse rewards, invertible actions and deterministic transitions. To mitigate the effects of distribution shift, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes. We introduce distance monotonicity as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
