C-Learning: Learning to Achieve Goals via Recursive Classification

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

TL;DR

The paper reframes goal-conditioned reinforcement learning as predicting and controlling the future state distribution via a binary classifier, rather than relying on rewards. It introduces C-learning, an off-policy bootstrapping approach that converts classifier outputs into a density over future states and optimizes policies toward commanded goals. The authors prove convergence properties of the off-policy C-learning updates and compare with Q-learning and hindsight relabeling, showing more accurate density estimates and competitive task performance. Experiments across gridworld and continuous-control tasks, including Sawyer manipulation, demonstrate robustness, reduced hyperparameter sensitivity (no need for a goal-sampling ratio), and practical scalability.

Abstract

We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation makes hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. Moreover, our proposed method is competitive with prior goal-conditioned RL methods.
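
The Bayes' rule step mentioned in the abstract can be illustrated on a toy discrete problem. The sketch below is not taken from the paper: it assumes three hypothetical states, a classifier trained on equal numbers of positives (drawn from the conditional future-state distribution) and negatives (drawn from the marginal), and made-up probability tables `p_cond` and `p_marg`. Under those assumptions the Bayes-optimal classifier output C satisfies p_cond = C / (1 - C) * p_marg, so the density can be read back from the classifier.

```python
# Toy illustration under stated assumptions (hypothetical numbers, not from the paper):
# positives come from p(s_future | s, a), negatives from the marginal p(s_future),
# and both classes are sampled equally often.

def bayes_optimal_classifier(p_cond, p_marg):
    """Probability that each candidate state was drawn from the conditional."""
    return [c / (c + m) for c, m in zip(p_cond, p_marg)]


def density_from_classifier(c_probs, p_marg):
    """Invert the classifier with Bayes' rule: p_cond = C / (1 - C) * p_marg."""
    return [c / (1.0 - c) * m for c, m in zip(c_probs, p_marg)]


p_cond = [0.7, 0.2, 0.1]    # assumed conditional future-state distribution
p_marg = [0.25, 0.25, 0.5]  # assumed marginal future-state distribution

c_probs = bayes_optimal_classifier(p_cond, p_marg)
recovered = density_from_classifier(c_probs, p_marg)
```

Here `recovered` matches `p_cond` up to floating-point error, which is the sense in which a classifier can stand in for a direct density estimate. With a learned, imperfect classifier the recovered values would only approximate the true density.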

Paper Structure

This paper contains 40 sections, 6 theorems, 44 equations, 12 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

Let policy $\pi(\mathbf{a_t} \mid \mathbf{s_t})$, dynamics function $p(\mathbf{s_{t+1}} \mid \mathbf{s_t}, \mathbf{a_t})$, and marginal distribution $p(\mathbf{s_{t+}})$ be given. If a classifier $C_\theta$ is the Bayes-optimal classifier, then it satisfies the following identity for all states $\mathb

Figures (12)

  • Figure 1: Testing Hypotheses about Q-learning: (Left) As predicted, Q-values often sum to less than 1. (Right) The performance of Q-learning is sensitive to the relabeling ratio. Our analysis predicts that the optimal relabeling ratio is approximately $\lambda = \frac{1}{2}(1 + \gamma)$. C-learning (dashed orange) does not require tuning this ratio and outperforms Q-learning, even when the relabeling ratio for Q-learning is optimally chosen.
  • Figure 2: Predicting the Future: C-learning makes accurate predictions of the expected future state across a range of tasks and discount values. In contrast, learning a 1-step dynamics model and unrolling that model results in high error for large discount values.
  • Figure 3: Goal-conditioned RL: C-learning is competitive with prior goal-conditioned RL methods across a suite of benchmark tasks, without requiring careful tuning of the relabeling distribution.
  • Figure 4: Q-learning is sensitive to the relabeling ratio. Our analysis predicts the optimal relabeling ratio.
  • Figure 5: We use C-learning and Q-learning to predict the future state distribution. (Left) In the on-policy setting, both the Monte Carlo and TD versions of C-learning achieve significantly lower error than Q-learning. (Right) In the off-policy setting, the TD version of C-learning achieves lower error than Q-learning, while Monte Carlo C-learning performs poorly, as expected.
  • ...and 7 more figures
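
The Figure 1 caption gives the predicted optimal relabeling ratio as $\lambda = \frac{1}{2}(1 + \gamma)$. As a quick arithmetic check, the sketch below evaluates that formula for a few discount values. The function name and the chosen discounts are illustrative assumptions, not values reported in the paper.

```python
def predicted_relabeling_ratio(gamma):
    """Evaluate lambda = (1 + gamma) / 2 from the Figure 1 caption."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is assumed to be a discount factor in [0, 1)")
    return 0.5 * (1.0 + gamma)


# Hypothetical discount values chosen only for illustration.
ratios = {gamma: predicted_relabeling_ratio(gamma) for gamma in (0.5, 0.9, 0.99)}
```

For these inputs the ratio moves from 0.75 toward 1 as the discount approaches 1, so the predicted ratio always lies between one half and one.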

Theorems & Definitions (15)

  • Definition 1
  • Remark 1
  • Remark 2
  • Definition 2
  • Remark 3
  • Lemma 1: C-learning Bellman Equation
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 5 more