Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning

Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer

TL;DR

The paper addresses multi-goal reinforcement learning by introducing a policy-mixture framework that jointly maximizes return and disperses visitation across goal states, formalized through ${\mathcal{Z}}({\overline{\pi}}) = J_\gamma({\overline{\pi}}) + {\mathcal{I}}^{{\mathcal{S}}^+}(d[{\overline{\pi}}])$ and optimized via a Frank–Wolfe-based DDGC algorithm. At each iteration, a batch RL subroutine (FQI) learns a new policy from mixture-derived rewards, progressively expanding the mixture to cover more goals while preserving high return; theoretical guarantees bound the sub-optimality with respect to the objective. The approach is extended to continuous spaces using FQI/FAC and practical extensions like exploratory sampling and a goal buffer to robustly discover and retain goal states. Empirical results on synthetic MDPs and Brax/JaxGCRL benchmarks demonstrate that DDGC achieves near-optimal return with substantially improved diversity over goal states, validating the value of explicitly balancing return and dispersion in goal coverage.

Abstract

Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or a few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return, which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity to encourage exploration in search of an optimal policy, which may not necessarily lead to a dispersed marginal state distribution over rewarding states. Other RL algorithms that match a target distribution assume the latter to be available a priori. This may be infeasible in large-scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL, in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed, based on the current policy mixture, at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.
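
To make the policy-mixture idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting with four states, hand-picked per-policy state distributions, and a Frank-Wolfe-style step size of `1 / (k + 1)`; all of these are illustrative choices. The dispersion measure used here, the entropy terms summed over goal states only, is borrowed from the "partial entropy" metric named in the Figure 3 caption as a stand-in, since the exact form of the paper's dispersion term is not given in this summary.

```python
import math

def mixture_distribution(weights, dists):
    """Marginal state distribution of a policy mixture: d = sum_i w_i * d_i."""
    n = len(dists[0])
    return [sum(w * d[s] for w, d in zip(weights, dists)) for s in range(n)]

def partial_entropy(d, goal_states):
    """Entropy terms summed over goal states only (assumed dispersion proxy)."""
    return -sum(d[s] * math.log(d[s]) for s in goal_states if d[s] > 0)

def frank_wolfe_step(weights, eta):
    """Shrink existing weights by (1 - eta) and give the new policy weight eta."""
    return [(1 - eta) * w for w in weights] + [eta]

# Hypothetical toy data: state 0 is a non-goal state, states 1..3 are goals.
# Each policy concentrates its goal mass on a single, different goal.
goals = [1, 2, 3]
d_a = [0.1, 0.9, 0.0, 0.0]
d_b = [0.1, 0.0, 0.9, 0.0]
d_c = [0.1, 0.0, 0.0, 0.9]

weights, dists = [1.0], [d_a]
before = partial_entropy(mixture_distribution(weights, dists), goals)

for new_d in (d_b, d_c):
    eta = 1.0 / (len(weights) + 1)  # assumed step size; keeps the mixture uniform
    weights = frank_wolfe_step(weights, eta)
    dists.append(new_d)

mixed = mixture_distribution(weights, dists)
after = partial_entropy(mixed, goals)
goal_mass = sum(mixed[s] for s in goals)  # total mass on goals is unchanged by mixing here
```

In this toy case the total mass on goal states is the same before and after mixing, while the partial entropy over goals increases, which is the kind of trade-off the abstract describes: dispersion over goals without giving up return.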

Paper Structure

This paper contains 22 sections, 4 theorems, 52 equations, 5 figures, 2 tables.

Key Result

Lemma 2.1

(Proved in the appendix.) The following properties hold for ${\overline{\Pi}}$ and ${\mathcal{Z}}$:

Figures (5)

  • Figure 1: (a) A continuing deterministic MDP with a start state and 3 goal states (highlighted). Assume that all states with no outgoing edges are sink states i.e. any action in these states leads back to the same state. State $S_{01}$ is critical where enforcing high stochasticity can lead to low returns. However, it is perfectly fine to enforce highest entropy on action selection at state $S_{12}$. (b) An algorithm that encourages exploration may still end up learning a high return policy with low stochasticity which only reaches a subset of goal states. (c) A policy that encourages exploration by promoting stochastic policies will enforce high stochasticity at all states (including critical states) which may lead to policies with suboptimal expected returns. (d) A desirable policy is the one which reaches a diverse set of goal states without compromising on the expected return.
  • Figure 2: For the MDP in Figure 1, we compare the discounted marginal state distribution induced by different algorithms over the 7 states in the MDP. Each bar indicates the induced probability mass over a state. Green bars indicate discounted marginal state probability of different goal states and red bars indicate those of the non-goal states.
  • Figure 3: (a) Plot of discounted return by learnt policies; (b) Plot of partial entropy of discounted marginal state distribution over discretized state space of learnt policies; (c) Plot of modified partial gini criterion of discounted marginal state distribution over discretized state space of learnt policies. In each of the plots, values are averaged over 5 runs and normalized for each environment. Higher value is better for all 3 metrics visualized in the plots above.
  • Figure 4: (a) A continuing deterministic MDP with a start state, and highlighted states as the goal states. The terminal states act as sink states i.e. any action taken in these states leads back to the same state. (b) Marginal state distribution with the highest expected return as it is concentrated on the closest goal state to avoid higher discounting penalty. (c) Marginal state distribution that is well dispersed across goal states but compromises on the expected return by incurring larger discounting penalty.
  • Figure 5: (a) A continuing deterministic MDP with a start state, and highlighted states as the goal states. (b) Marginal state distribution with the highest expected return as it is concentrated densely on the goal state by avoiding visitation on non-goal state. (c) Marginal state distribution that is obtained for $\gamma=1$ by optimizing for the DDGC objective is well dispersed across goal states but compromises on expected return by visiting one non-goal state.

Theorems & Definitions (9)

  • Lemma 2.1
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.1
  • proof
  • proof
  • proof
  • proof
  • proof