SMORe: Score Models for Offline Goal-Conditioned Reinforcement Learning

Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum

TL;DR

SMORe tackles offline goal-conditioned RL by recasting GCRL as an occupancy-matching problem and deriving a discriminator-free dual objective via convex duality. It learns unnormalized scores S(s,a,g) that quantify how important taking action a in state s is for reaching goal g, so no density estimation or discriminator is needed. Using a mixture-occupancy framework and expectile-based offline constraints, SMORe achieves strong performance across robot manipulation and locomotion tasks, including those with high-dimensional image observations, and remains robust to environment stochasticity and reduced expert coverage. The method leverages existing datasets without hand-engineered rewards or discriminators, which makes it relevant to learning generalist agents from offline data.

Abstract

Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
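
The TL;DR mentions expectile-based offline constraints without spelling them out. As a rough illustration only, the sketch below shows generic expectile regression, the asymmetric squared loss used by IQL-style offline RL methods. How SMORe applies it to its scores is not stated in this summary, so the function names, `tau` value, and toy data here are assumptions rather than the authors' implementation.

```python
def expectile_loss(residuals, tau=0.7):
    """Mean asymmetric squared loss |tau - 1(u < 0)| * u^2 over residuals u."""
    total = 0.0
    for u in residuals:
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(residuals)


def fit_expectile(samples, tau=0.7, lr=0.1, steps=2000):
    """Estimate the tau-expectile of `samples` by gradient descent on the loss above."""
    m = sum(samples) / len(samples)  # the mean is the 0.5-expectile
    for _ in range(steps):
        grad = 0.0
        for x in samples:
            u = x - m
            weight = tau if u > 0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * (x - m)^2 w.r.t. m
        m -= lr * grad / len(samples)
    return m


# Hypothetical toy values standing in for in-dataset targets.
returns = [0.0, 1.0, 2.0, 3.0, 10.0]
print(fit_expectile(returns, tau=0.5))  # stays at the sample mean
print(fit_expectile(returns, tau=0.9))  # pulled toward the largest value
```

With `tau` above 0.5 the fit leans toward the upper end of the samples while never exceeding the largest one, which is why expectiles are a common way to approximate a maximum using only values seen in the dataset.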

Paper Structure

This paper contains 36 sections, 6 theorems, 57 equations, 4 figures, 16 tables, 1 algorithm.

Key Result

proposition 1

Consider a stochastic MDP, a stochastic policy $\pi$, and a sparse reward function $r(s,a,g)=\mathbb{E}_{s'\sim p(\cdot|s,a)}\left[\mathbb{I}(\phi(s')=g,\, q^{\texttt{train}}(g)>0)\right]$, where $\mathbb{I}$ is an indicator function. Define a soft goal transition distribution to be $q(s,a,g) \propto$ … where $\mathcal{H}$ denotes the entropy, $\alpha$ is a temperature parameter and $C$ is the partiti…
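
To make the sparse reward $r(s,a,g)$ concrete, here is a minimal Monte Carlo sketch of the expectation over next states. The toy chain environment, the identity goal mapping, and the helper names (`toy_step`, `toy_phi`, `toy_q_train`) are hypothetical and chosen only for illustration. The exact-equality check on $\phi(s')=g$ suits discrete goals; continuous goal spaces would need a tolerance, which the excerpt does not specify.

```python
import random


def sparse_goal_reward(state, action, goal, sample_next_state, phi, q_train,
                       n_samples=5000, seed=0):
    """Monte Carlo estimate of E_{s'}[ 1(phi(s') == goal and q_train(goal) > 0) ]."""
    if q_train(goal) <= 0:
        return 0.0  # goals outside the training goal distribution never pay out
    rng = random.Random(seed)
    hits = sum(
        phi(sample_next_state(state, action, rng)) == goal
        for _ in range(n_samples)
    )
    return hits / n_samples


# Hypothetical toy chain: the action succeeds with probability 0.8, else the state is unchanged.
def toy_step(state, action, rng):
    return state + action if rng.random() < 0.8 else state


def toy_phi(state):
    return state  # identity goal mapping


def toy_q_train(goal):
    return 0.2 if 0 <= goal <= 4 else 0.0  # uniform over five training goals


estimate = sparse_goal_reward(2, 1, 3, toy_step, toy_phi, toy_q_train)
unsupported = sparse_goal_reward(9, 1, 10, toy_step, toy_phi, toy_q_train)
```

In this toy setup `estimate` approximates the probability that one step from state 2 lands on goal 3, while `unsupported` is zero because goal 10 has no mass under the training goal distribution.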

Figures (4)

  • Figure 1: Illustration of the SMORe objective where $\beta^c=1-\beta$: SMORe matches a mixture distribution of current policy and offline data to a mixture of the goal-transition distribution and offline data in order to find the optimal goal reaching policy.
  • Figure 2: SMORe is robust in stochastic environments. With increasing noise, SMORe still outperforms prior methods.
  • Figure 3: Evaluation on simulated manipulation tasks with image observations. The left image shows the starting state at the top and the goal at the bottom for evaluation tasks. The error bars show the standard deviation with 5 random seeds. SMORe is competitive or outperforms prior methods on all the tasks we considered.
  • Figure : SMORe
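
The Figure 1 caption describes two mixtures that SMORe matches against each other. The sketch below shows one way to sample from such mixtures. It assumes that $\beta$ weights the policy (or goal-transition) component and $\beta^c = 1-\beta$ weights the offline data; the caption does not state which weight goes where, and the sampler names are hypothetical stand-ins.

```python
import random


def sample_mixture(sample_a, sample_b, beta, rng):
    """Draw from beta * A + (1 - beta) * B by first picking a component."""
    if rng.random() < beta:
        return sample_a(rng)
    return sample_b(rng)


# Hypothetical stand-ins that just tag where each sample came from.
def policy_visitation(rng):
    return ("policy", rng.random())


def goal_transition(rng):
    return ("goal", rng.random())


def offline_data(rng):
    return ("offline", rng.random())


rng = random.Random(0)
beta = 0.3  # assumed mixing weight, not a value from the paper
# Left-hand mixture: current policy visitation mixed with offline data.
left = [sample_mixture(policy_visitation, offline_data, beta, rng) for _ in range(10000)]
# Right-hand mixture: goal-transition distribution mixed with offline data.
right = [sample_mixture(goal_transition, offline_data, beta, rng) for _ in range(10000)]
policy_fraction = sum(tag == "policy" for tag, _ in left) / len(left)
```

Both mixtures share the offline-data component, so samples from the dataset appear on each side of the matching objective.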

Theorems & Definitions (10)

  • proposition 1
  • theorem 1
  • proposition 1
  • proof
  • proposition 2
  • proof
  • theorem 1
  • proof
  • theorem 2
  • proof