How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression

Yecheng Jason Ma; Jason Yan; Dinesh Jayaraman; Osbert Bastani

TL;DR

GoFAR casts offline goal-conditioned RL (GCRL) as state-occupancy matching and solves it through the dual of an $f$-divergence objective. Training needs no hindsight relabeling and optimizes the value and policy networks without interleaving, which yields a finite-sample performance guarantee; the same objectives can be repurposed to learn a goal-conditioned planner for zero-shot transfer. Empirically, GoFAR outperforms prior baselines across six offline GCRL tasks, remains robust under stochasticity, succeeds on a real dexterous manipulation task, and enables cross-robot planning. The approach offers a principled route to offline skill learning and hierarchical control in real-world robotics.

Abstract

Offline goal-conditioned reinforcement learning (GCRL) promises general-purpose skill learning in the form of reaching diverse goals from purely offline datasets. We propose $\textbf{Go}$al-conditioned $f$-$\textbf{A}$dvantage $\textbf{R}$egression (GoFAR), a novel regression-based offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a state-occupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal. In contrast to prior approaches, GoFAR does not require any hindsight relabeling and enjoys uninterleaved optimization for its value and policy networks. These distinct features give GoFAR much better offline performance and stability, as well as a statistical performance guarantee that is unattainable for prior methods. Furthermore, we demonstrate that GoFAR's training objectives can be repurposed to learn an agent-independent goal-conditioned planner from purely offline source-domain data, which enables zero-shot transfer to new target domains. Through extensive experiments, we validate GoFAR's effectiveness in various problem settings and tasks, significantly outperforming the prior state of the art. Notably, on a real robotic dexterous manipulation task, while no other method makes meaningful progress, GoFAR acquires complex manipulation behavior that successfully accomplishes diverse goals.

Paper Structure (51 sections, 10 theorems, 67 equations, 11 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Problem Formulation
Goal-Conditioned $f$-Advantage Regression
Algorithm
Optimal Goal-Weighting Property
Uninterleaved Optimization and Performance Guarantee
Goal-Conditioned Planning
Experiments
Offline GCRL
Robustness in Stochastic Offline GCRL Settings
Real-World Robotic Dexterous Manipulation
Zero-Shot Transfer Across Robots
Conclusion
...and 36 more sections

Key Result

Proposition 4.1

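Given any $r(s;g)$, for each $g$ in the support of $p(g)$, define $p(s;g) = \frac{e^{r(s;g)}}{Z(g)}$, where $Z(g) := \int e^{r(s;g)} ds$ is the normalizing constant. Then the following equality holds:

$$-D_{\mathrm{KL}}\big(d^{\pi}(s;g) \,\|\, p(s;g)\big) + C = (1-\gamma)\, J(\pi) + \mathcal{H}\big(d^{\pi}(s;g)\big),$$

where $d^{\pi}(s;g)$ is the goal-conditioned state occupancy of $\pi$, $J(\pi)$ is the GCRL objective with reward $r(s;g)$, $\mathcal{H}$ denotes entropy, and $C := \mathbb{E}_{g \sim p(g)}[\log Z(g)]$.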
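Proposition 4.1 can be sanity-checked numerically on a toy problem. The sketch below is not the authors' code: it assumes a single fixed goal (so $C = \log Z$), a small discrete state space (so the integral in $Z(g)$ becomes a sum), and a random Markov chain standing in for the dynamics induced by some policy. The function names `occupancy_and_return` and `check_identity` are hypothetical. It compares $-D_{\mathrm{KL}}(d^{\pi} \| p) + C$ against $(1-\gamma) J(\pi) + \mathcal{H}(d^{\pi})$, where $d^{\pi}$ is the normalized discounted state occupancy.

```python
import math
import random


def occupancy_and_return(P, mu0, r, gamma, horizon=2000):
    """Roll a Markov chain forward and accumulate the normalized discounted
    state occupancy d(s) = (1 - gamma) * sum_t gamma^t * mu_t(s) and the
    discounted return J = sum_t gamma^t * E_{mu_t}[r(s)]."""
    n = len(mu0)
    mu = list(mu0)
    d = [0.0] * n
    J = 0.0
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            d[s] += (1.0 - gamma) * discount * mu[s]
            J += discount * mu[s] * r[s]
        # One step of the chain: mu_{t+1}(t) = sum_s mu_t(s) * P[s][t].
        mu = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
        discount *= gamma
    return d, J


def check_identity(n_states=4, gamma=0.9, seed=0):
    """Return both sides of the identity for one random toy instance."""
    rng = random.Random(seed)

    def rand_dist(k):
        # Strictly positive weights so every log below is well defined.
        w = [rng.random() + 1e-3 for _ in range(k)]
        z = sum(w)
        return [x / z for x in w]

    P = [rand_dist(n_states) for _ in range(n_states)]  # assumed policy-induced dynamics
    mu0 = rand_dist(n_states)                            # assumed initial-state distribution
    r = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]  # reward r(s; g) for the fixed goal

    d, J = occupancy_and_return(P, mu0, r, gamma)

    Z = sum(math.exp(x) for x in r)        # normalizing constant Z(g)
    p = [math.exp(x) / Z for x in r]       # target distribution p(s; g)
    kl = sum(ds * math.log(ds / ps) for ds, ps in zip(d, p))
    entropy = -sum(ds * math.log(ds) for ds in d)

    lhs = -kl + math.log(Z)                # -KL(d || p) + C
    rhs = (1.0 - gamma) * J + entropy      # (1 - gamma) * J(pi) + H(d)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = check_identity()
    print(f"lhs={lhs:.12f} rhs={rhs:.12f} gap={abs(lhs - rhs):.2e}")
```

The two sides should agree up to floating-point error, since the truncation at `horizon` steps leaves a discounted tail that is negligible for the assumed `gamma=0.9`. With several goals, the same check would average both sides over $g \sim p(g)$.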

Figures (11)

Figure 1: GoFAR schematic illustration.
Figure 2: GCRL can be thought of as imitating an expert agent that can teleport to goals.
Figure 3: D'Claw (left); Cross-Embodiment transfer source (middle) and target (right) domains.
Figure 4: Offline GCRL ablation studies. While GoFAR is robust to hindsight relabeling, removing it is highly detrimental to all baselines.
Figure 5: Stochastic environment evaluation. GoFAR is more robust to stochastic environments due to its lack of hindsight goal relabeling.
...and 6 more figures

Theorems & Definitions (21)

Proposition 4.1
Proposition 4.2
Proposition 4.3
Theorem 4.1
Definition A.1: $f$-divergence
Definition A.2: Fenchel conjugate
Definition A.3
Proposition B.1
proof
Lemma B.1
...and 11 more
