Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities

Marco Bagatella; Thomas Rupf; Georg Martius; Andreas Krause

TL;DR

Soft Forward-Backward Representations (Soft FB) extend zero-shot RL beyond linear rewards to arbitrary differentiable General Utilities by learning a family of entropy-regularized policies whose occupancies admit a low-rank representation $M^z=F_z^\top B$. At training time, Soft FB optimizes a maximum-entropy objective, yielding a policy family $\{\pi_z\}_{z\in\mathcal{Z}}$ via $\pi_z \propto \exp(F_z^\top z + M^z \mathcal{H}_{\pi_z})$, and at test time performs zero-order search over the low-dimensional embedding $z$ to maximize a downstream utility $f(M^\pi)$. The authors prove that max-entropy solutions to linear rewards are recovered, and that the retrieved policy set is expressive enough to approximate any differentiable utility to arbitrary precision, enabling zero-shot optimization of GU. Practically, they propose a reparameterization to map $z$ to a bounded sphere, and train forward/backward representations with sample-based objectives, optionally using an explicit flow-based model for the successor measure to improve inference. Empirically, Soft FB matches existing forward-backward methods on linear tasks but significantly outperforms them on non-deterministic objectives (e.g., pure exploration, stochastic imitation), with strong scalability to high-dimensional control benchmarks like the DeepMind Control Suite when coupled with flow-based measure models.

Abstract

Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.
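The test-time procedure described above, zero-order search over a compact embedding $z$ to maximize a utility of the occupancy measure, can be illustrated on a toy tabular problem. The sketch below is not the authors' implementation: the feature tensor `features` is a hypothetical stand-in for the learned forward representation, occupancies are computed exactly from a known transition model rather than read off $F_z^\top B$, plain random search is assumed as the zero-order method, and state-occupancy entropy is used as one example of a non-linear utility (pure exploration).

```python
import numpy as np


def occupancy(P, pi, mu0, gamma):
    """Normalized discounted state-action occupancy of policy pi, shape (S, A)."""
    n_states = pi.shape[0]
    # State-to-state transition kernel induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # d^T = (1 - gamma) * mu0^T (I - gamma * P_pi)^{-1}
    d = (1.0 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, mu0)
    return d[:, None] * pi


def softmax_policy(features, z):
    """Hypothetical stochastic policy family: pi_z(a|s) proportional to exp(features[s, a] @ z)."""
    logits = features @ z
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def state_entropy(occ):
    """Example general utility: entropy of the state occupancy (non-linear in occ)."""
    d = occ.sum(axis=1)
    return float(-np.sum(d * np.log(d + 1e-12)))


def zero_order_search(utility, P, features, mu0, gamma, n_candidates, rng):
    """Random search over z: sample candidates, keep the one with the highest utility."""
    best_z, best_value = None, -np.inf
    for _ in range(n_candidates):
        z = rng.normal(size=features.shape[-1])
        value = utility(occupancy(P, softmax_policy(features, z), mu0, gamma))
        if value > best_value:
            best_z, best_value = z, value
    return best_z, best_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, dim = 6, 3, 4  # toy sizes, chosen arbitrarily
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition model, shape (S, A, S)
    features = rng.normal(size=(S, A, dim))
    mu0 = np.full(S, 1.0 / S)
    z, value = zero_order_search(state_entropy, P, features, mu0, 0.95, 200, rng)
    print("best utility:", value, "upper bound log(S):", np.log(S))
```

Because the search only evaluates the utility at sampled embeddings, no gradient of the utility with respect to the policy is needed, which is what lets such a scheme avoid iterative policy optimization at test time.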

Paper Structure (28 sections, 6 theorems, 27 equations, 8 figures, 7 tables)

Introduction
Background
Forward-Backward Representations and General Utilities
Soft Forward-Backward Representations
Core algorithm
Guarantees
Practical algorithm
Inference
Experiments
Qualitative evaluation
Quantitative evaluation
High-dimensional evaluation
Related Works
Conclusion
Theoretical results and proofs
...and 13 more sections

Key Result

Theorem 3.1

(Touati & Ollivier, 2021) For an arbitrary bounded reward vector $R \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, if both the low-rank decomposition and the greedy-policy condition hold for all $z \in \mathcal{Z}$, then $\pi_{BR}$ is optimal with respect to $R$: $M^{\pi_{BR}}R = \max_{\pi} M^\pi R$.
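The two conditions the theorem relies on are not reproduced on this page. As a hedged reminder, assuming the standard tabular forward-backward formulation and writing $z_R = BR$, the conditions and the one-line argument behind the result can be sketched as follows; the exact statement in the paper may differ, for instance in how the data distribution enters $B$.

```latex
% Assumed forms of the two conditions (low-rank decomposition, greedy policy):
M^{\pi_z} = F_z^\top B,
\qquad
\pi_z(s) \in \arg\max_{a} F(s, a, z)^\top z
\qquad \text{for all } z \in \mathcal{Z}.

% With z_R = BR, the Q-function of \pi_{z_R} for reward R is
Q_R^{\pi_{z_R}} = M^{\pi_{z_R}} R = F_{z_R}^\top B R = F_{z_R}^\top z_R .

% Hence \pi_{z_R} is greedy with respect to its own Q-function,
% which satisfies the Bellman optimality equation, so \pi_{z_R} is optimal for R.
```

This is also why the quantity $F(s, a, z)^\top z$ can be read as an unregularized Q-value for the task embedded by $z$.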

Figures (8)

Figure 1: We propose Soft FB, a soft version of the Forward-Backward algorithm which solves maximum entropy RL instances to retrieve a richer set of stochastic policies, and searches them to optimize general utilities at test-time.
Figure 2: Geometric interpretation of $z$ after reparameterization: the stochasticity of $\pi_z$ grows with $\|z\|$.
Figure 3: Qualitative evaluation of Soft FB in a didactic environment. White dots are samples from policies $\pi_z$ over a 2D action space, and the color map represents learned unregularized Q-values $Q_R^z$ for each action ($F_\theta(s_0, a, z)^\top z$). From left to right, we infer task embeddings $z$ for a goal-reaching task, and scale them linearly. The policies conditioned on $z$ become more deterministic as its norm increases. The same visualization for FB can be found in the appendix.
Figure 4: Quantitative results over several General RL objectives in a didactic environment. The $x$-axis and $y$-axis represent, respectively, offline performance estimates and ground-truth performance in the environment. Each dot represents a policy sampled from each method across 3 seeds; for each seed, a darker dot marks the best policy according to offline evaluation. Horizontal lines represent the mean performance over points with the respective color. The policies captured by Soft FB (right) are more expressive, and the top policies according to offline evaluation outperform, on average, those trained by FB (left). Explicit measure models (top) are more accurate.
Figure 5: Zero-shot cumulative returns (in blue) and step-wise policy entropy (in orange) of Soft FB for different levels of entropy regularization in DMC, averaged over linear tasks. As entropy regularization decreases, returns generally improve, eventually matching the performance of FB (in grey), or surprisingly exceeding it in quadruped. Shaded areas represent $95\%$ CIs over 5 seeds.
...and 3 more figures

Theorems & Definitions (12)

Theorem 3.1
Remark 3.2
Theorem 4.1
Theorem 4.2
Theorem 1.1
proof
Lemma 1.1
proof
Theorem 1.1
proof
...and 2 more
