Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities
Marco Bagatella, Thomas Rupf, Georg Martius, Andreas Krause
TL;DR
Soft Forward-Backward Representations (Soft FB) extend zero-shot RL beyond linear rewards to arbitrary differentiable general utilities (GU) by learning a family of entropy-regularized policies whose occupancies admit a low-rank representation $M^z=F_z^\top B$. At training time, Soft FB optimizes a maximum-entropy objective, yielding a policy family $\{\pi_z\}_{z\in\mathcal{Z}}$ via $\pi_z \propto \exp(F_z^\top z + M^z \mathcal{H}_{\pi_z})$. At test time, it performs zero-order search over the low-dimensional embedding $z$ to maximize a downstream utility $f(M^\pi)$. The authors prove that Soft FB recovers the max-entropy solutions for linear rewards, and that the retrieved policy set is expressive enough to optimize any differentiable utility to arbitrary precision, which enables zero-shot optimization of GU. Practically, they propose a reparameterization that maps $z$ to a bounded sphere and train the forward and backward representations with sample-based objectives, optionally using an explicit flow-based model of the successor measure to improve inference. Empirically, Soft FB matches existing forward-backward methods on linear tasks and significantly outperforms them on non-deterministic objectives (e.g., pure exploration, stochastic imitation). When coupled with flow-based measure models, it scales to high-dimensional control benchmarks such as the DeepMind Control Suite.
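The test-time step described above can be illustrated with a minimal sketch of zero-order (gradient-free) search over embeddings constrained to a sphere. This is not the authors' implementation: `toy_occupancy` is a hypothetical stand-in for the learned occupancy model (a softmax over states, chosen only so the example runs), the utility is state-occupancy entropy as a pure-exploration example, and all function names, the perturbation scale `sigma`, and the candidate counts are assumptions made for illustration.

```python
import numpy as np


def project_to_sphere(z, radius):
    """Rescale z so that it lies on the sphere of the given radius."""
    return radius * z / np.linalg.norm(z)


def toy_occupancy(z, B):
    """Hypothetical stand-in for a learned occupancy model.

    Maps an embedding z (shape (d,)) to a distribution over n states
    using a fixed matrix B (shape (d, n)). NOT the Soft FB model.
    """
    logits = B.T @ z
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()


def entropy_utility(occ):
    """Pure-exploration utility: entropy of the occupancy (non-linear in occ)."""
    return -np.sum(occ * np.log(occ + 1e-12))


def zero_order_search(utility, occupancy_fn, dim, radius,
                      n_candidates=256, n_rounds=10, sigma=0.5, seed=0):
    """Random local search over z on a sphere; uses utility values only."""
    rng = np.random.default_rng(seed)
    best_z = project_to_sphere(rng.standard_normal(dim), radius)
    best_val = utility(occupancy_fn(best_z))
    for _ in range(n_rounds):
        # Perturb the incumbent, project back to the sphere, keep improvements.
        for eps in rng.standard_normal((n_candidates, dim)):
            z = project_to_sphere(best_z + sigma * radius * eps, radius)
            val = utility(occupancy_fn(z))
            if val > best_val:
                best_z, best_val = z, val
    return best_z, best_val
```

A call such as `zero_order_search(entropy_utility, lambda z: toy_occupancy(z, B), dim=4, radius=2.0)` returns the best embedding found and its utility. Because the search only evaluates $f$ on candidate occupancies, $f$ does not need to be linear, and no gradient through the policy is required.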
Abstract
Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.
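To make the distinction between the two problem classes concrete, they can be written side by side. The notation below is an assumption made for illustration: it supposes a discrete state-action space and a normalized occupancy measure $M^\pi$, with $r$ a reward function and $M^{\ast}$ a hypothetical target occupancy; the paper's own definitions may differ in detail.

$$
\text{standard RL:}\quad \max_{\pi}\ \langle M^\pi, r\rangle = \max_{\pi} \sum_{s,a} M^\pi(s,a)\, r(s,a),
\qquad
\text{general utilities:}\quad \max_{\pi}\ f(M^\pi).
$$

Two common examples of utilities that are not linear in the occupancy are an entropy objective for pure exploration and a divergence objective for distribution matching:

$$
f_{\text{explore}}(M) = -\sum_{s,a} M(s,a)\,\log M(s,a),
\qquad
f_{\text{match}}(M) = -D_{\mathrm{KL}}\!\left(M \,\|\, M^{\ast}\right).
$$

The linear case is recovered by choosing $f(M) = \langle M, r\rangle$, which is why the general-utility setting contains standard RL as a special case.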
