Off-Policy Reinforcement Learning with High Dimensional Reward
Dong Neuck Lee, Michael R. Kosorok
TL;DR
The paper broadens distributional reinforcement learning to rewards valued in infinite-dimensional Banach spaces, proving that the distributional Bellman operator remains a contraction under the 1-Wasserstein metric and that high-dimensional returns can be faithfully approximated in finite-dimensional Euclidean spaces. It introduces two key components: a theoretical approximation framework that projects Banach-space-valued returns onto finite dimensions with controllable error, and a practical algorithm that uses hypercube-based discretization and the distributional Bellman operator to learn the distribution of returns and optimize a user-defined utility $\phi$. Theoretical contributions include contraction results, Banach-space approximation theorems, and stabilized convergence under Wasserstein metrics, while simulations demonstrate accurate distribution estimation and effective policy search under both known and unknown dynamics. The work enables solving complex, multi-objective, and potentially infinite-dimensional reward problems, with practical implications for domains requiring flexible utilities and high-/infinite-dimensional reward representations.
Abstract
Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the distribution of returns with the distributional Bellman operator in a Euclidean space, leading to highly flexible choices for utility. This paper establishes robust theoretical foundations for DRL. We prove the contraction property of the Bellman operator even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we demonstrate that the behavior of high- or infinite-dimensional returns can be effectively approximated using a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems which have been previously intractable using conventional reinforcement learning approaches.
