ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning

Issa Nakamura; Tomoya Yamanokuchi; Yuki Kadokawa; Jia Qu; Shun Otsub; Ken Miyamoto; Shotaro Miwa; Takamitsu Matsubara

Abstract

Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) updates the policy using an approximation of the value function estimated via contrastive learning, achieving higher sample efficiency than conventional methods. However, because CRL treats visited states as pseudo-goals during learning, it can accurately estimate the value function only for a limited set of goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples, which aims to augment hard-to-visit state samples during on-policy exploration, and 2) learning a consistent embedding space, which uses augmented states as auxiliary information to regularize the embedding space by reformulating the objective function of the embedding space based on mutual information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, which permits accurate value estimation for hard-to-visit goals. Further details can be found on the project page: https://issa-n.github.io/projectPage_ViSA/
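
To make the contrastive value estimate mentioned in the abstract concrete, the following is a minimal sketch of an InfoNCE-style critic loss over a batch, written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name contrastive_critic_loss, the inner-product similarity, and the softmax form of the loss are assumptions chosen for brevity. Row i of goal_emb is treated as the visited state (positive) for anchor i, and the other rows act as random states (negatives).

```python
import numpy as np

def contrastive_critic_loss(anchor_emb, goal_emb):
    """InfoNCE-style loss over a batch (illustrative; names are hypothetical).

    anchor_emb: (B, D) embeddings of state-action pairs (anchors).
    goal_emb:   (B, D) embeddings of visited states; row i is the positive
                for anchor i, and every other row serves as a negative.
    """
    # Pairwise similarity between every anchor and every candidate goal.
    logits = anchor_emb @ goal_emb.T                      # shape (B, B)
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-probability.
    return -np.mean(np.diag(log_softmax))
```

Minimizing this loss pulls each anchor toward its own visited state and pushes it away from the visited states of other anchors, so the learned similarity can only be trusted for goals that resemble states actually visited. That limitation is the one the abstract attributes to CRL.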

Paper Structure (30 sections, 19 equations, 7 figures, 1 algorithm)

Introduction
Related Work
Mitigating Sample Bias in GCRL
Regularization of Embedded Spaces for CL
Preliminaries
Contrastive Learning
Goal-Conditioned RL as a Goal-reaching Problem
Contrastive Reinforcement Learning
Proposed Method
Generating Augmented State Samples
Reformulation of Objective Function
Learning Consistent Embedding Space
Estimation of Joint Mutual Information $J$
Estimation of Unique Mutual Information $U$
Experiments
...and 15 more sections

Figures (7)

Figure 1: Overview of ViSA. The approach has two components: 1) generating augmented state samples to augment hard-to-visit state samples during on-policy exploration and 2) learning a consistent embedding space, which uses augmented states as auxiliary information to regularize the embedding space. The embedding space is trained to pull a visited state closer to a state-action pair (anchor) and push a random state farther away, encoding goal reachability from the anchor. ViSA suppresses estimation bias by considering the relative distances of augmented states and augmented random states with respect to the anchor, enabling accurate value estimation and action selection for diverse goals in CRL.
Figure 2: Framework of ViSA: (a) Generating augmented state samples. Using the augmentation distribution $p(s_a \mid s_v)$, hard-to-visit states from on-policy rollouts are artificially augmented. Crosses indicate samples. The dotted region is the inherently reachable state space, the blue region contains samples from the visited state distribution $p(s_v \mid s,a)$, and the red region includes additional samples from $p(s_a \mid s_v)$. (b) Learning a consistent embedding space. To regularize the embedding space and prevent overfitting to visited states $s_v$, we use not only the anchors $(s,a)$ and visited states $s_v^+, s_v^-$ from the previous CRL method but also augmented states $s_a^+, s_a^-$ as auxiliary information. The objective function of the embedding space is reformulated based on mutual information factorization, and the encoders $\psi, \phi, \hat{\phi}$ are trained to maximize the mutual information $I_{SaFE}((s,a); s_v)$ estimated using augmented states $s_a$. Mutual information terms inside dashed boxes are used to compute $I_{SaFE}((s,a); s_v)$, where red and green text indicate upper and lower bounds, respectively.
Figure 3: Experimental environments
Figure 4: Learning results from robot tasks. Plots show success rate during learning for each task. Solid lines indicate the mean over three trials, and shaded areas represent variance.
Figure 5: Visited state distributions and learning results for a naive sample-diversification CRL method. (a) Visualization of visited state distributions. CRL is modified by adjusting the discount factor $\gamma$ to collect visited state samples $s_v$ more broadly from on-policy rollouts. For visualization, sampling probabilities are normalized so that the maximum value is 1. (b) Learning curves of success rates. Solid lines show mean success rates over three trials for the proposed method and baselines, and shaded regions indicate variance.
...and 2 more figures
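
The Figure 5 caption refers to adjusting the discount factor $\gamma$ so that visited states $s_v$ are collected more broadly from a rollout. A minimal sketch of that idea follows, assuming the visited state is drawn at a future time offset with probability proportional to $\gamma^{k-1}$ for offset $k$, truncated at the end of the trajectory. The function name sample_visited_state_index, the truncation, and the renormalization are assumptions for illustration; the paper's exact sampling procedure is not given on this page.

```python
import numpy as np

def sample_visited_state_index(t, horizon, gamma, rng):
    """Sample a future time index t' with t < t' < horizon (hypothetical helper).

    The offset k = t' - t is drawn with probability proportional to
    gamma**(k - 1), renormalized over the steps left in the trajectory.
    """
    offsets = np.arange(1, horizon - t)        # candidate offsets 1 .. horizon-t-1
    if offsets.size == 0:
        raise ValueError("t must be smaller than horizon - 1")
    probs = gamma ** (offsets - 1)             # geometric weights
    probs = probs / probs.sum()                # renormalize after truncation
    return int(t + rng.choice(offsets, p=probs))

# Usage: a larger gamma spreads samples further along the trajectory.
rng = np.random.default_rng(0)
near = [sample_visited_state_index(0, 100, 0.5, rng) for _ in range(2000)]
far = [sample_visited_state_index(0, 100, 0.99, rng) for _ in range(2000)]
```

With a small $\gamma$ the samples concentrate just after the anchor, while a $\gamma$ close to 1 spreads them over the whole rollout. This only reweights states the policy already reached; it cannot produce states outside the rollout, which is why the caption calls it a naive form of sample diversification.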
