Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities

Marco Bagatella, Thomas Rupf, Georg Martius, Andreas Krause

TL;DR

Soft Forward-Backward Representations (Soft FB) extend zero-shot RL beyond linear rewards to arbitrary differentiable General Utilities by learning a family of entropy-regularized policies whose occupancies admit a low-rank representation $M^z=F_z^\top B$. At training time, Soft FB optimizes a maximum-entropy objective, yielding a policy family $\{\pi_z\}_{z\in\mathcal{Z}}$ via $\pi_z \propto \exp(F_z^\top z + M^z \mathcal{H}_{\pi_z})$, and at test time performs zero-order search over the low-dimensional embedding $z$ to maximize a downstream utility $f(M^\pi)$. The authors prove that max-entropy solutions to linear rewards are recovered, and that the retrieved policy set is expressive enough to approximate any differentiable utility to arbitrary precision, enabling zero-shot optimization of GU. Practically, they propose a reparameterization to map $z$ to a bounded sphere, and train forward/backward representations with sample-based objectives, optionally using an explicit flow-based model for the successor measure to improve inference. Empirically, Soft FB matches existing forward-backward methods on linear tasks but significantly outperforms them on non-deterministic objectives (e.g., pure exploration, stochastic imitation), with strong scalability to high-dimensional control benchmarks like the DeepMind Control Suite when coupled with flow-based measure models.
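The test-time step described above, zero-order search over $z$ to maximize $f(M^\pi)$, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: a small random tabular MDP, a hypothetical fixed feature tensor `features` standing in for the learned $z$-conditioned policy, exact occupancies computed by a linear solve in place of a learned measure model, and occupancy entropy (a pure-exploration objective) as the utility.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, dim, gamma = 6, 3, 4, 0.9

# Hypothetical toy MDP: random kernel P[s, a, s'] and a uniform start distribution.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = np.full(n_states, 1.0 / n_states)
# Hypothetical fixed features standing in for a learned z-conditioned policy.
features = rng.normal(size=(n_states, n_actions, dim))

def policy(z):
    """Softmax policy pi_z(a|s) with logits features[s, a] . z."""
    logits = features @ z
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def state_occupancy(pi):
    """Normalized discounted state occupancy (1 - gamma) mu0^T (I - gamma P_pi)^{-1}."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    A = np.eye(n_states) - gamma * P_pi
    return (1.0 - gamma) * np.linalg.solve(A.T, mu0)

def utility(d):
    """Example general utility: entropy of the occupancy (not linear in d)."""
    return float(-np.sum(d * np.log(d + 1e-12)))

def zero_order_search(n_candidates=256):
    """Random search: score sampled embeddings and keep the best one."""
    candidates = np.vstack([np.zeros(dim), rng.normal(size=(n_candidates, dim))])
    scores = np.array([utility(state_occupancy(policy(z))) for z in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores[best], scores[0]

z_best, best_score, baseline_score = zero_order_search()
```

No gradient of the utility is needed: each candidate embedding is scored only through its occupancy, which is why the search stays cheap when $z$ is low-dimensional. In this sketch `baseline_score` is the utility of the uniform policy at $z=0$, included as a reference candidate.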

Abstract

Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.

Paper Structure

This paper contains 28 sections, 6 theorems, 27 equations, 8 figures, and 7 tables.

Key Result

Theorem 3.1

(Touati & Ollivier, 2021) For an arbitrary bounded reward vector $R \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, if both the low-rank decomposition (eq:low_rank_decomposition) and the greedy-policy condition (eq:greedy_policy) hold for all $z \in \mathcal{Z}$, then $\pi_{BR}$ is optimal with respect to $R$: $M^{\pi_{BR}}R = \max_{\pi} M^\pi R$.
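The optimality statement $M^{\pi}R = \max_{\pi'} M^{\pi'} R$ rests on the fact that Q-values are linear in the successor measure, $Q^\pi = M^\pi R$. The sketch below checks this numerically on a small random tabular MDP. It is an illustration under stated assumptions, not the FB algorithm itself: the MDP is hypothetical, the successor measure is computed exactly as $(I - \gamma T_\pi)^{-1}$ rather than through a learned factorization $F_z^\top B$, and the optimal policy is found by plain policy iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical kernel P[s, a, s']
R = rng.uniform(size=S * A)                 # reward vector in R^{|S||A|}

def successor_measure(pi):
    """Unnormalized successor measure M^pi = (I - gamma T_pi)^{-1} over (s, a) pairs."""
    # T_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a']
    T = np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)
    return np.linalg.inv(np.eye(S * A) - gamma * T)

def greedy(q):
    """Deterministic policy that is greedy with respect to the Q-vector q."""
    pi = np.zeros((S, A))
    pi[np.arange(S), q.reshape(S, A).argmax(axis=1)] = 1.0
    return pi

def policy_iteration(max_iters=1000):
    """Alternate evaluation (Q = M^pi R) and greedy improvement until stable."""
    pi = np.full((S, A), 1.0 / A)
    for _ in range(max_iters):
        new_pi = greedy(successor_measure(pi) @ R)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi

pi_star = policy_iteration()
q_star = successor_measure(pi_star) @ R  # M^{pi*} R, one value per (s, a) pair
```

Comparing `q_star` against `successor_measure(pi) @ R` for arbitrary policies `pi` shows the elementwise maximum in the theorem: no other policy attains a larger entry.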

Figures (8)

  • Figure 1: We propose Soft FB, a soft version of the Forward-Backward algorithm which solves maximum entropy RL instances to retrieve a richer set of stochastic policies, and searches them to optimize general utilities at test-time.
  • Figure 2: Geometric interpretation of $z$ after reparameterization: the stochasticity of $\pi_z$ grows with $\|z\|$.
  • Figure 3: Qualitative evaluation of Soft FB in a didactic environment. White dots are samples from policies $\pi_z$ over a 2D action space, and the color map represents learned unregularized Q-values $Q_R^z$ for each action ($F_\theta(s_0, a, z)^\top z$). From left to right, we infer task embeddings $z$ for a goal-reaching task, and scale them linearly. The policies conditioned on $z$ become more deterministic as its norm increases. The same visualization for FB can be found in the appendix (app:collapse).
  • Figure 4: Quantitative results over several General RL objectives in a didactic environment. The $x$-axis and $y$-axis represent, respectively, offline performance estimates, and ground-truth performance in the environment. Each dot represents a policy sampled from each method across 3 seeds; for each seed, a darker dot marks the best policy according to offline evaluation. Horizontal lines represent the mean performance over points with the respective color. The policies captured by Soft FB (right) are more expressive, and the top policies according to offline evaluation outperform, on average, those trained by FB (left). Explicit measure models (top) are more accurate.
  • Figure 5: Zero-shot cumulative returns (in blue) and step-wise policy entropy (in orange) of Soft FB for different levels of entropy regularization in DMC, averaged over linear tasks. As entropy regularization decreases, returns generally improve, eventually matching the performance of FB (in grey), or surprisingly exceeding it in quadruped. Shaded areas represent $95\%$ CIs over 5 seeds.
  • ...and 3 more figures
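
The behavior described in the Figure 3 caption, where linearly scaling a task embedding makes the policy more deterministic, mirrors a general property of maximum-entropy RL: scaling the reward against a fixed entropy temperature sharpens the soft-optimal policy. The sketch below illustrates only that generic property with tabular soft value iteration. The random MDP, the temperature of 1, and the identification of "scaling $z$" with "scaling the reward" are assumptions made for illustration; the reparameterized embedding of Figure 2 is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical kernel P[s, a, s']
R = rng.uniform(size=(S, A))                # hypothetical reward table

def soft_policy(reward, n_iters=500):
    """Soft value iteration at temperature 1; returns pi(a|s) = softmax(Q_soft)."""
    q = np.zeros((S, A))
    for _ in range(n_iters):
        m = q.max(axis=1)
        # Soft state value: V(s) = log sum_a exp Q(s, a), computed stably.
        v = m + np.log(np.exp(q - m[:, None]).sum(axis=1))
        q = reward + gamma * (P @ v)
    logits = q - q.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def mean_entropy(pi):
    """Average per-state policy entropy in nats."""
    return float(-(pi * np.log(pi + 1e-12)).sum(axis=1).mean())

# Larger reward scale relative to the fixed temperature gives a sharper policy.
for scale in (0.01, 1.0, 100.0):
    print(scale, mean_entropy(soft_policy(scale * R)))
```

At very small scales the policy approaches uniform, with entropy near $\log |\mathcal{A}|$; at very large scales it approaches the greedy policy.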

Theorems & Definitions (12)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 1.1
  • proof
  • Lemma 1.1
  • proof
  • Theorem 1.1
  • proof
  • ...and 2 more